Hello Computer Taxonomists! I have a bunch of comments on previous posts here:
First Mauro J. Cavalcanti wrote:
As an attempt to start a discussion, I am posting this to ask: do we all really agree that there should be a new standard for descriptive data based on XML, as a substitute for DELTA (as well as NEXUS and XDF)?
I was going to reply, but then Leigh Dodds wrote...
Perhaps we should start by building a list of requirements and then measuring DELTA, NEXUS, etc against them to see whether they meet them.
...which seemed sensible enough to me too.
So, what do people see as the basic requirements for this kind of format?
- ease of use (i.e. authoring)
- ease of processing (parsing, validating, reading, converting)
- ease of sharing (i.e. distribution)
- open-ness (i.e. proprietary/non-proprietary)
- ease of extensibility (i.e. ability to add more information cleanly at a
later date)
- internationalization
- unambiguity of data representation
- unlimited size of data sets? (i.e. any limitation on character names,
lengths, item names, numbers, etc)
My guess is all of the above? And XML would provide them too, no?
What types of data need to be modelled by the format? (this can be a post-requirements gathering step, but some consideration needs to be given early on to measure the 'success' of the current formats)
As far as Delta goes, I think that the data are fairly close to what taxonomists have been encoding into traditional taxonomic descriptions. 'Success' is pretty good.
Nexus can include character/state data but also other things, such as ways to describe phylogenetic trees in the so-called "New Hampshire" format:
((raccoon:19.19959,bear:6.80041):0.84600,((sea_lion:11.99700, seal:12.00300):7.52973,((monkey:100.85930,cat:47.14069):20.59201, weasel:18.87953):2.09460):3.87382,dog:25.46154);
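For anyone who hasn't met the format: a tree string like the one above is just nested parentheses with optional name:length labels, so a small recursive-descent reader is enough to pull it apart. Here is a minimal sketch in Python. The names parse_newick and tip_distances are my own, and the sketch ignores quoted labels and bracketed comments, which real files may contain.

```python
# The tree string quoted above, split across lines only for readability.
TREE = (
    "((raccoon:19.19959,bear:6.80041):0.84600,"
    "((sea_lion:11.99700, seal:12.00300):7.52973,"
    "((monkey:100.85930,cat:47.14069):20.59201, weasel:18.87953):2.09460):3.87382,"
    "dog:25.46154);"
)


def parse_newick(text):
    """Parse a tree string into nested (name, branch_length, children) tuples."""
    text = "".join(text.split())  # whitespace carries no meaning here
    pos = 0

    def peek():
        return text[pos] if pos < len(text) else ""

    def read_until(stops):
        nonlocal pos
        start = pos
        while pos < len(text) and text[pos] not in stops:
            pos += 1
        return text[start:pos]

    def parse_node():
        nonlocal pos
        children = []
        if peek() == "(":  # internal node: comma-separated list of subtrees
            pos += 1
            children.append(parse_node())
            while peek() == ",":
                pos += 1
                children.append(parse_node())
            if peek() != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1
        name = read_until(":,();")  # empty string for unnamed internal nodes
        length = None
        if peek() == ":":
            pos += 1
            length = float(read_until(",();"))
        return (name, length, children)

    root = parse_node()
    if peek() != ";":
        raise ValueError("tree string must end with ';'")
    return root


def tip_distances(node, so_far=0.0):
    """Map each terminal taxon to the summed branch length from the root."""
    name, length, children = node
    so_far += length or 0.0
    if not children:
        return {name: so_far}
    result = {}
    for child in children:
        result.update(tip_distances(child, so_far))
    return result


root = parse_newick(TREE)
distances = tip_distances(root)
for taxon, d in distances.items():
    print(f"{taxon:10s} {d:.5f}")
```

The point is only that the tree syntax is easy to parse; whether an XML rendering of the same nesting would buy us anything extra is a separate question.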
Nexus is actually an extensible format so almost anything could be included.
Unfortunately there still doesn't seem to be an online description of Nexus, though the standard was printed in the journal Taxon a couple of years ago.
I sense that application developers who use Nexus are pretty happy with the format, since it was a collaborative effort to begin with, and that the Nexus community would not see a great benefit in adopting a new data format. Anybody else have this impression?
I know nothing about XDF. Mauro once pointed out this reference though:
White, R.J. & Allkin, R. 1992. A language for the definition and exchange of biological data sets. Mathl. Comput. Modelling, Vol. 16, No. 6/7, pp. 199-223.
- what types of characters? (real, integer, text, etc)
Yep. Also cyclical data (months, seasons), colors, lengths, shapes -- I'm afraid this list could be expanded quite a bit if you go beyond primitive types.
- what types of data (text, images, other formats?)
I would suggest that any MIME content-type could potentially be useful, including video/mpeg, audio/wav, etc.
Don Kirkup wrote:
For example, how could retrieval of descriptive information across a department or institution be facilitated? Is it because of the sheer flexibility in how the characters can be defined using DELTA, that querying across projects is difficult to say the least, unless the character set is global that is (and therefore with a lot of redundancy)?
I agree. XML only solves the syntax problem. The semantics problem can't be solved as easily with technology, but would rely either on community agreement, or on extensive mappings between various ontologies. I think.
Can character definitions be constrained without making too tight a straight-jacket for oneself? Might it be possible to represent taxon character descriptors by for example an entity/property/value (eg. leaf/shape/ovate) based schema?
There is value in straightjackets, especially of the schema type. For example, one could make a perfectly decent database using a single table looking like this:
<pre>
+---+------------+-------------+-----+-----------+-------+
|id | table_name | column_name | row | value     | type  |
+---+------------+-------------+-----+-----------+-------+
| 1 | city       | population  | 22  | 2,000,000 | int   |
+---+------------+-------------+-----+-----------+-------+
| 2 | city       | name        | 22  | Boston    | char  |
+---+------------+-------------+-----+-----------+-------+
| 3 | person     | name        | 34  | Mr. Magoo | char  |
+---+------------+-------------+-----+-----------+-------+
| etc...
+---
</pre>
But by avoiding the straightjacket of a fixed schema, we've made our queries *really* tough to formulate. I think that this is a classic tradeoff in any modelling methodology.
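To make the tradeoff concrete, here is a rough sketch using Python's built-in sqlite3 module. It loads the three rows from the single-table design next to an ordinary fixed-schema city table and asks both the same question: which cities have more than a million people? The table name generic is my own, I have renamed the row column to row_id to steer clear of SQL keywords, and the population is stored without its commas.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The "one table for everything" design; every value is stored as text.
conn.execute(
    "CREATE TABLE generic (id INTEGER PRIMARY KEY, table_name TEXT, "
    "column_name TEXT, row_id INTEGER, value TEXT, type TEXT)"
)
conn.executemany(
    "INSERT INTO generic VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "city", "population", 22, "2000000", "int"),
        (2, "city", "name", 22, "Boston", "char"),
        (3, "person", "name", 34, "Mr. Magoo", "char"),
    ],
)

# The same city held in a conventional fixed schema.
conn.execute("CREATE TABLE city (id INTEGER PRIMARY KEY, name TEXT, population INTEGER)")
conn.execute("INSERT INTO city VALUES (22, 'Boston', 2000000)")

# Fixed schema: the question is a one-liner.
fixed_result = conn.execute(
    "SELECT name FROM city WHERE population > 1000000"
).fetchall()

# Single table: one self-join per extra column involved, plus a cast,
# because names and populations live in different rows of the same table.
generic_result = conn.execute(
    """
    SELECT n.value
    FROM generic AS n
    JOIN generic AS p
      ON p.table_name = n.table_name AND p.row_id = n.row_id
    WHERE n.table_name = 'city'
      AND n.column_name = 'name'
      AND p.column_name = 'population'
      AND CAST(p.value AS INTEGER) > 1000000
    """
).fetchall()

print(fixed_result)
print(generic_result)
```

Both queries return the same answer, but the second one grows by another self-join for every additional column it touches, which is what I mean by "really tough to formulate".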
This on the face of it would seem to map onto an XML element/attribute/value schema pretty well. Would that help define more closely how we construct characters and maybe even prove universally applicable for all character types?
Great question. Leigh? Anyone? :-)
Could one constrain further by expressing within the schema the hierarchical relationships between the elements (e.g. 'blade' and 'petiole' as child elements of 'leaf'), or would the introduction of terminology into the 'standard' be a step too far?
This would be The Lexicon, no? It does seem to go beyond the task of making a file format for data transfer. But it gets my vote nevertheless.
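Just to have something concrete to argue about, here is one way Don's two ideas might look together: elements for the entities (with blade and petiole nested inside leaf), attributes for the properties, and attribute values for the values. This is only a sketch. The element and attribute names, the taxon names and the measurements are all invented, and the queries use the limited XPath subset in Python's standard xml.etree.ElementTree module rather than a full XPath engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical descriptions: element = entity, attribute = property.
DOC = """
<descriptions>
  <taxon name="Taxon A">
    <leaf>
      <blade shape="ovate" length_mm="45"/>
      <petiole length_mm="12"/>
    </leaf>
  </taxon>
  <taxon name="Taxon B">
    <leaf>
      <blade shape="linear" length_mm="80"/>
    </leaf>
  </taxon>
</descriptions>
"""

root = ET.fromstring(DOC)

# Which taxa have an ovate leaf blade? The path mirrors the hierarchy.
ovate_taxa = [
    taxon.get("name")
    for taxon in root.findall("taxon")
    if taxon.find("leaf/blade[@shape='ovate']") is not None
]


def triples(taxon):
    """Flatten one taxon into (entity path, property, value) triples."""
    found = []

    def walk(elem, path):
        for prop, value in elem.attrib.items():
            found.append((path, prop, value))
        for child in elem:
            walk(child, path + "/" + child.tag)

    for child in taxon:
        walk(child, child.tag)
    return found


taxon_a = root.find("taxon[@name='Taxon A']")
print(ovate_taxa)
print(triples(taxon_a))
```

Note that the nesting of blade inside leaf is exactly the bit of terminology that would have to be agreed on; the syntax costs nothing, the agreement is the hard part.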
Jean-Marc Vanel wrote:
- The main existing formats, if we count by the number of species
covered, are not DELTA, or other Computer Taxonomists' inventions; they are the semi-formal formats of the existing floras in plain text, which contain the vast majority of existing species;
Perhaps not even semi-formal? (I'm reminded of Richard Montague's "English as a Formal Language" paper which presents a viewpoint that is still popular in some formal semantic theories.)
..."Computer Taxonomists" -- ouch!
- the priority is to put this existing material in XML (this can be done
using standard parsing techniques);
A bit optimistic? I think that even defining a context free grammar for these data and constructing a lexicon will not get around some of the classic problems of anaphora and ambiguities arising from multiple quantifiers. However, I grant that it can probably be done -- somehow and sort of. :-)
(You've probably seen Andrew Taylor's work? http://www.cse.unsw.edu.au/~andrewt/papers/nlp_vlkb95/nlp_vlkb95.html)
- we can then : - make queries using all the power of XPath
A new W3C recommendation -- Hurrah! But not a general enough inference engine for RDF, I suspect?
- make all kinds of RDF statements on the species from the outside,
e.g. an entomologist can, in a butterfly description, or any context, indicate that it is a pollinator for some plant species (see Annex)
I'm a big fan of RDF and I've experimented with it myself. Gregor Hagedorn also mentioned RDF in the context of XML during the recent TDWG meeting. So it's great to see it come up again! That said, for the purposes of this group, I have some reservations about whether RDF is mainstream enough yet to base a community effort on. More likely, the XML community will use XML-Schema or something else more like relational database technology and less like predicate logic.
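To show what "statements from the outside" amounts to without committing to any particular RDF toolkit, here is a toy sketch of the underlying triple model in plain Python. The example.org URIs and the pollinates and leafShape terms are invented for illustration. This is not an RDF parser, only subject/predicate/object tuples with wildcard matching.

```python
# Hypothetical URI prefixes; nothing here resolves to a real resource.
FLORA = "http://example.org/flora/"
FAUNA = "http://example.org/fauna/"
TERMS = "http://example.org/terms/"

statements = [
    # A botanist's statement about the plant itself.
    (FLORA + "plant-x", TERMS + "leafShape", "ovate"),
    # An entomologist's statement, made elsewhere, that points at the same plant.
    (FAUNA + "butterfly-y", TERMS + "pollinates", FLORA + "plant-x"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every argument that is not None."""
    return [
        t
        for t in triples
        if subject in (None, t[0])
        and predicate in (None, t[1])
        and obj in (None, t[2])
    ]


# "What pollinates plant-x?" -- answered without touching the plant's own record.
pollinators = [
    s for s, _, _ in match(statements, predicate=TERMS + "pollinates", obj=FLORA + "plant-x")
]
print(pollinators)
```

The attraction is that the second statement can be added by someone who never sees the first, as long as both use the same identifier for the plant.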
These techniques are directly in line with the World Wide Web Consortium Recommendations.
Yes, but you have to admit that this is one recommendation that hasn't been terribly popular. The recommendation was made in February and has really not seen much enthusiastic adoption, so I think we should approach this one with caution. I say this because, for example:
- Zero books available (at least from amazon.com) so far.
- W3C's Timeline: News, Events, and Publications lists nothing since March 1999
- IBM has apparently dropped their RDF Parser
- Even the current maintainer of the W3C's SiRPAC parser, Sergey Melnik, has doubts about RDF saying:
"RDF 1.0 has a number of legacy, heritage or flawed features, that make both the specification and implementation intransparent. To make my criticism a bit more constructive, consider as examples the following issues:
"- aboutEachPrefix: if you create an RDF model using RDF/XML that contains aboutEachPrefix, and serialize it back, the intended semantics is lost, since this aboutEachPrefix is not reflected in the model.
"- xml:lang does not appear in the model either and is therefore also a bug in the specs. Either a new triple has to be appended to the model, or xml:lang should be ignored.
"- there is no principle difference between rdf:ID and rdf:about. There would be one if you appended rdf:isDefinedBy to every resource defined by rdf:ID. Not in the model - no semantics.
"-...
"Before such issues have been resolved, I'd be very careful about motivating people to implement something. Currently I am a courtesy maintainer of SiRPAC. However, to be frank, I'm not willing to rewrite/adapt it as long as the specs are flawed."
From http://lists.w3.org/Archives/Public/www-rdf-interest/1999Nov/0068.html
Using such RDF statements, it is also possible to add sets of characters à la DELTA.
Yes. The attraction for me here is that not only would we not have to write parsers, we wouldn't have to write inference engines either. I've been playing around with SiLRI (Simple Logic-based RDF Interpreter) and I think it's terrific. However, as I said before, it is not clear that these kinds of tools will become mainstream. (I hope they do though)
So I propose to decouple the description from the specific taxonomic issues, such as finding keys applying to specific subsets of species, minimal sets of characters, etc. This is good design since no consensus can be quickly achieved on taxonomic issues, since researchers want to keep their freedom.
It seems that what the majority of botanists (and other biologists, and humanity in general) needs most urgently is a usable World Flora.
Moreover, having reference descriptions available as URIs (Uniform Resource Identifiers) on the Web allows different taxonomic researchers to link their formal descriptions to a common reference, and to become interoperable.
I pretty much agree with you on all fronts (ask anybody) but I worry that what you describe could be a massive undertaking and may be beyond the scope of what we can practically accomplish here. It would be terrific to see it done though, especially with RDF.
OK, enough rambling from me -- best to all,
-Noel
p.s. Tim Berners-Lee says a lot about RDF. Especially good is this:
"Why the RDF model is different from the XML model" http://www.w3.org/DesignIssues/RDF-XML.html