so far...

Tue Nov 23 01:02:56 CET 1999

Hello Computer Taxonomists!  I have a bunch of comments on previous
posts here:

First Mauro J. Cavalcanti wrote:

> As an attempt to start a discussion, I am posting this to ask: do we all
> really agree that there should be a new standard for descriptive data
> based on XML, as a substitute for DELTA (as well as NEXUS and XDF)?

I was going to reply, but then Leigh Dodds wrote...

> Perhaps we should start by building a list of requirements and then
> measuring DELTA, NEXUS, etc against them to see whether they meet
> them.

...which seemed sensible enough to me too.

> So, what do people see as the basic requirements for this kind
> of format?
> 
> - ease of use (i.e. authoring)
> - ease of processing (parsing, validating, reading, converting)
> - ease of sharing (i.e. distribution)
> - open-ness (i.e. proprietary/non-proprietary)
> - ease of extensibility (i.e. ability to add more information cleanly at a
> later date)
> - internationalization
> - unambiguity of data representation
> - unlimited size of data sets? (i.e. any limitation on character names,
> lengths, item names, numbers, etc)

My guess is all of the above?  And XML would provide them too, no?

> What types of data need to be modelled by the format? (this can
> be a post-requirements gathering step, but some consideration needs
> to be given early on to measure the 'success' of the current
> formats)

As far as Delta goes, I think that the data are fairly close to what
taxonomists have been encoding into traditional taxonomic descriptions. 
'Success' is pretty good.

Nexus can include character/state data but also other things, such as ways
to describe phylogenetic trees in the so-called "New Hampshire" format: 

  ((raccoon:19.19959,bear:6.80041):0.84600,((sea_lion:11.99700, 
  seal:12.00300):7.52973,((monkey:100.85930,cat:47.14069):20.59201, 
  weasel:18.87953):2.09460):3.87382,dog:25.46154); 

Nexus is actually an extensible format so almost anything could be
included. 
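
For anyone who hasn't met the "New Hampshire" notation, here is a rough
sketch in Python of how the tree above can be read into nested tuples. 
It assumes plain labels and branch lengths only (no quoted names, no
comments), and parse_newick and leaf_names are names I made up for the
illustration, not part of Nexus or any existing package.

```python
def parse_newick(text):
    """Parse a New Hampshire string into nested (name, length, children) tuples."""
    # Drop whitespace and line breaks, then the terminating semicolon.
    s = "".join(text.split()).rstrip(";")
    pos = 0

    def node():
        nonlocal pos
        children = []
        if pos < len(s) and s[pos] == "(":
            pos += 1                      # consume "("
            children.append(node())
            while s[pos] == ",":
                pos += 1                  # consume ","
                children.append(node())
            pos += 1                      # consume ")"
        # Optional label (empty for unnamed internal nodes).
        start = pos
        while pos < len(s) and s[pos] not in ",():":
            pos += 1
        name = s[start:pos]
        # Optional branch length after ":".
        length = None
        if pos < len(s) and s[pos] == ":":
            pos += 1
            start = pos
            while pos < len(s) and s[pos] not in ",()":
                pos += 1
            length = float(s[start:pos])
        return (name, length, children)

    return node()


def leaf_names(tree):
    """Return the terminal taxon names, left to right."""
    name, _length, children = tree
    if not children:
        return [name]
    return [n for child in children for n in leaf_names(child)]


TREE = """((raccoon:19.19959,bear:6.80041):0.84600,((sea_lion:11.99700,
seal:12.00300):7.52973,((monkey:100.85930,cat:47.14069):20.59201,
weasel:18.87953):2.09460):3.87382,dog:25.46154);"""

tree = parse_newick(TREE)
print(leaf_names(tree))
```

The point is only that the notation is a small recursive grammar, so a
hand-written parser stays short; a real one would need to handle the
cases skipped here.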

Unfortunately there still doesn't seem to be an online description of
Nexus, though the standard was printed in the journal Taxon a couple of
years ago. 

I sense that application developers who use Nexus are pretty happy with
the format, since it was a collaborative effort to begin with, and that the
Nexus community would not see a great benefit in adopting a new data
format.  Anybody else have this impression? 

I know nothing about XDF.  Mauro once pointed out this reference though: 

  White, R.J. & Allkin, R. 1992. A language for the definition and exchange
  of biological data sets. Mathl. Comput. Modelling, Vol. 16, No. 6/7,
  pp. 199-223.

> - what types of characters? (real, integer, text, etc)

Yep.  Also cyclical data (months, seasons), colors, lengths, shapes -- I'm
afraid this list could be expanded quite a bit if you go beyond
primitive types. 
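
To show why cyclical data deserve a type of their own, here is a small
Python sketch.  The function names and the coding of months as 1..12 are
my own assumptions for the example, not anything defined by DELTA or
Nexus.

```python
def linear_distance(a, b):
    """Distance between two values treated as plain integers."""
    return abs(a - b)


def cyclic_distance(a, b, period):
    """Shortest distance between two values on a cycle of the given period."""
    d = abs(a - b) % period
    return min(d, period - d)


# Assumed coding: months numbered 1 (January) to 12 (December).
DECEMBER, JANUARY = 12, 1

print(linear_distance(DECEMBER, JANUARY))      # 11
print(cyclic_distance(DECEMBER, JANUARY, 12))  # 1
```

Treated as an integer, a December flowering time looks as far as possible
from a January one; treated as cyclical, they are neighbours.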

> - what types of data (text, images, other formats?)

I would suggest that any MIME content-type could potentially be useful,
including video/mpeg, audio/wav, etc. 

Don Kirkup wrote:

> For example, how could retrieval of descriptive information across a
> department or institution be facilitated? Is it because of the sheer
> flexibility in how the characters can be defined using DELTA, that querying
> across projects is difficult to say the least, unless the character set is
> global that is (and therefore with a lot of redundancy)?

I agree.  XML only solves the syntax problem.  The semantics problem can't
be solved as easily with technology, but would rely either on community
agreement, or on extensive mappings between various ontologies.  I think.

> Can character definitions be constrained without making too tight a
> straight-jacket for oneself? Might it be possible to represent taxon
> character descriptors by for example an entity/property/value (eg.
> leaf/shape/ovate) based schema?

There is value in straightjackets, especially of the schema type.  For
example, one could make a perfectly decent database using a single table
looking like this: 

<pre>
 +---+------------+-------------+-----+-----------+-------+
 |id | table_name | column_name | row | value     | type  |
 +---+------------+-------------+-----+-----------+-------+
 | 1 | city       | population  | 22  | 2,000,000 | int   |
 +---+------------+-------------+-----+-----------+-------+
 | 2 | city       | name        | 22  | Boston    | char  |
 +---+------------+-------------+-----+-----------+-------+
 | 3 | person     | name        | 34  | Mr. Magoo | char  |
 +---+------------+-------------+-----+-----------+-------+
 | etc...
 +---
</pre>

But by avoiding the straightjacket of a fixed schema, we've made our
queries *really* tough to formulate. I think that this is a classic
tradeoff in any modelling methodology. 
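
To make the tradeoff concrete, here is a sketch using Python's sqlite3
module that asks the same question ("which cities have more than a
million people?") of the single generic table and of an ordinary fixed
table.  The table name "generic", the column name row_id (used instead of
"row" to stay clear of SQL keywords) and the storage of the population
without commas are my assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The single generic table from above.
    CREATE TABLE generic (
        id INTEGER PRIMARY KEY,
        table_name TEXT, column_name TEXT,
        row_id INTEGER, value TEXT, type TEXT
    );
    INSERT INTO generic VALUES
        (1, 'city',   'population', 22, '2000000',   'int'),
        (2, 'city',   'name',       22, 'Boston',    'char'),
        (3, 'person', 'name',       34, 'Mr. Magoo', 'char');

    -- The same city data under a fixed schema.
    CREATE TABLE city (id INTEGER PRIMARY KEY, name TEXT, population INTEGER);
    INSERT INTO city VALUES (22, 'Boston', 2000000);
""")

# Fixed schema: one table, one condition.
fixed = conn.execute(
    "SELECT name FROM city WHERE population > 1000000"
).fetchall()

# Generic table: a self-join per attribute, plus a cast because
# every value is stored as text.
generic = conn.execute("""
    SELECT n.value
    FROM generic AS n
    JOIN generic AS p
      ON p.table_name = n.table_name AND p.row_id = n.row_id
    WHERE n.table_name = 'city'
      AND n.column_name = 'name'
      AND p.column_name = 'population'
      AND CAST(p.value AS INTEGER) > 1000000
""").fetchall()

print(fixed)
print(generic)
```

Both queries give the same answer, but the generic one needs another
self-join for every extra attribute in the question, and the database can
no longer check types for us.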

> This on the face of it would seem to map onto an XML element/attribute/value
> schema pretty well. Would that help define more closely how we construct
> characters and maybe even prove universally applicable for all character
> types?

Great question. Leigh? Anyone? :-)

> Could one constrain further by expressing within the schema the hierarchical
> relationships between the elements (eg 'blade' and 'petiole' as child
> elements of 'leaf') or would the introduction of terminology into the
> 'standard' be a step too far?

This would be The Lexicon, no?  It does seem to go beyond the task of
making a file format for data transfer.  But it gets my vote nevertheless. 
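
Here is one way Don's leaf/shape/ovate idea, with 'blade' and 'petiole'
nested under 'leaf', might look, sketched in Python with the standard
xml.etree.ElementTree module.  The element names, the "unit" attribute,
the taxon and the measurements are all invented for illustration; none of
this is a proposed standard.  Note also that ElementTree only understands
a limited subset of XPath.

```python
import xml.etree.ElementTree as ET

# A made-up description: entities as elements, properties as child
# elements, values as text content.
DOC = """
<description taxon="Hypothetical species">
  <leaf>
    <shape>ovate</shape>
    <blade><length unit="cm">7.5</length></blade>
    <petiole><length unit="cm">1.2</length></petiole>
  </leaf>
</description>
"""

root = ET.fromstring(DOC)

# entity/property lookup: leaf/shape
shape = root.findtext("leaf/shape")

# The hierarchy lets the same property name ("length") hang off
# different parts without ambiguity.
blade_length = float(root.findtext("leaf/blade/length"))

# Which parts of the leaf have a recorded length?
parts_with_length = [part.tag for part in root.findall("leaf/*[length]")]

print(shape, blade_length, parts_with_length)
```

The syntax is easy enough; the hard part is agreeing that everybody calls
it 'petiole', which is the lexicon question again.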

Jean-Marc Vanel wrote:

> - The main existing formats, if we count by the number of species
> covered, are not DELTA, or other Computer Taxonomists's inventions, they are the
> semi-formal formats of the existing floras in plain text, which contain
> the vast majority of existing species;

Perhaps not even semi-formal?  (I'm reminded of Richard Montague's
"English as a Formal Language" paper which presents a viewpoint that is
still popular in some formal semantic theories.) 

..."Computer Taxonomists" -- ouch!

> - the priority is to put this existing material in XML (this can be done
> using standard parsing techniques);

A bit optimistic?  I think that even defining a context free grammar for
these data and constructing a lexicon will not get around some of the
classic problems of anaphora and ambiguities arising from multiple
quantifiers.  However, I grant that it can probably be done -- somehow
and sort of.  :-) 

(You've probably seen Andrew Taylor's work? 
   http://www.cse.unsw.edu.au/~andrewt/papers/nlp_vlkb95/nlp_vlkb95.html)

> - we can then : - make queries using all the power of XPath

A new W3C recommendation -- Hurrah!  But not a general enough inference
engine for RDF, I suspect? 

>     - make all kinds of RDF statements on the species from the outside,
> e.g.  an entomologist can, in a butterfly description, or any context,
> indicate  that it is a pollinator for some plant species (see Annex)

I'm a big fan of RDF and I've experimented with it myself.  Gregor
Hagedorn also mentioned RDF in the context of XML during the recent TDWG
meeting.  So it's great to see it come up again!  That said, for the
purposes of this group, I do have some reservations about whether RDF is
mainstream enough yet to base a community effort on.  More likely, the
XML community will use XML-Schema or something else more like relational
database technology and less like predicate logic. 
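
For those who haven't looked at RDF, the underlying model is just a set
of (subject, predicate, object) statements.  Here is a toy sketch of
Jean-Marc's pollinator example in plain Python.  It uses no RDF library
and no RDF/XML syntax, and the example.org URIs, the "pollinates" and
"rank" terms and the match function are all hypothetical.

```python
# Hypothetical URIs; anyone can make statements about a species
# "from the outside" as long as it has a stable identifier.
BUTTERFLY = "http://example.org/taxa/butterfly-1"
PLANT = "http://example.org/taxa/plant-1"
POLLINATES = "http://example.org/terms/pollinates"
RANK = "http://example.org/terms/rank"

triples = {
    (BUTTERFLY, POLLINATES, PLANT),
    (BUTTERFLY, RANK, "species"),
    (PLANT, RANK, "species"),
}


def match(subject=None, predicate=None, obj=None):
    """Return all statements matching the pattern; None is a wildcard."""
    return sorted(
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    )


# "What pollinates this plant?"
pollinators = [s for (s, _p, _o) in match(predicate=POLLINATES, obj=PLANT)]
print(pollinators)
```

Pattern matching like this is the predicate-logic flavour I mean; it is a
different way of thinking from tables and columns.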

> These techniques are in the straight line of the World Wide Web
> Consortium  Recommendations.

Yes, but you have to admit that this is one recommendation that hasn't
been terribly popular.  The recommendation was made in February and has
really not seen much enthusiastic adoption, so I think we should approach
this one with caution.  I say this because, for example:

 - Zero books available (at least from amazon.com) so far.

 - W3C's Timeline: News, Events, and Publications lists nothing since
   March 1999

 - IBM has apparently dropped their RDF Parser

 - Even the current maintainer of the W3C's SiRPAC parser, Sergey Melnik,
   has doubts about RDF saying:

  "RDF 1.0 has a number of legacy, heritage or flawed features, that make
  both the specification and implementation intransparent. To make my
  criticism a bit more constructive, consider as examples the following
  issues:

  "- aboutEachPrefix: if you create an RDF model using RDF/XML that
  contains aboutEachPrefix, and serialize it back, the intended semantics
  is lost, since this aboutEachPrefix is not reflected in the model.

  "- xml:lang does not appear in the model either and is therefore also a
  bug in the specs. Either a new triple has to be appended to the model,
  or xml:lang should be ignored.

  "- there is no principle difference between rdf:ID and rdf:about. There
  would be one if you appended rdf:isDefinedBy to every resource defined
  by rdf:ID. Not in the model - no semantics.

  "-...

  "Before such issues have been resolved, I'd be very careful about
  motivating people to implement something. Currently I am a courtesy
  maintainer of SiRPAC. However, to be frank, I'm not willing to
  rewrite/adapt it as long as the specs are flawed."

  From  http://lists.w3.org/Archives/Public/www-rdf-interest/1999Nov/0068.html

> Using such  RDF statements, it is also possible to add sets of
> characters à la DELTA.

Yes.  The attraction for me here is that not only would we not have
to write parsers, we wouldn't have to write inference engines either.
I've been playing around with SiLRI (Simple Logic-based RDF Interpreter)
and I think it's terrific.  However, as I said before, it is not clear
that these kinds of tools will become mainstream.  (I hope they do though)

> So I propose to decouple the description from the specific taxonomic issues,
> such as finding keys applying to specific subsets of species, minimal
> sets  of characters, etc. This is good design since no consensus can be
> quickly  achieved on taxonomic issues, since researchers want to keep their
> freedom.
>
> It seems that what the majority of botanists (and others biologists and
> the humanity in general) needs the most urgently, is a usable World Flora.
> 
> Moreover, having reference descriptions available as URI (Uniform
> Resource Identifiers) on the Web, allows different taxonomists researchers to
> link  their formal descriptions to a common reference, and to become
> interoperable.

I pretty much agree with you on all fronts (ask anybody) but I worry that
what you describe could be a massive undertaking and may be beyond the
scope of what we can practically accomplish here.  It would be terrific to
see it done though, especially with RDF.

OK, enough rambling from me -- best to all,

-Noel

p.s. Tim Berners-Lee says a lot about RDF.  Especially good is this:

 "Why the RDF model is different from the XML model"
 http://www.w3.org/DesignIssues/RDF-XML.html