It's How the Data will be Used that Counts
I've been giving Kevin's approach some thought and have the following comments:
Kevin's original information flow model is too simplistic. A more realistic model would be something like:
Text Descriptions |                         | Text Descriptions
Phylogenetics     |---> Structured Data --->| Phylogenetics
Specimens         |                         | Identification Tools (many)
The sources are much more varied and are often group-specific. For example, invertebrates have very few good-quality text descriptions: most are old, are written in a range of languages (English, French, German, etc.) and vary greatly in style and quality. The majority of invertebrates are also currently undescribed (it is common for 80% of the taxa in a revision to be new).
Similarly, the outputs required vary greatly and in ways hard to predict. While text descriptions would seem to be a common requirement, they are in some ways "legacy" and may become less important in the future as applications (and users) become more sophisticated. We need to make sure we keep this range of uses in mind at all times.
The Structured Data is the format we're talking about here. Don't know what that will look like yet (but see below).
I think it's important to realise that applications will be used to move among these in almost all cases and only very rarely will people manipulate the data directly. It's also important to remember that XSLT is nothing more than an application. Some comments seem to imply that XSLT is part of XML, but it isn't. For example, from a BioLink perspective (and DELTA and probably LucID) the above model will need to be extended by adding:
Text Descriptions |                         | Text Descriptions
Phylogenetics     |---> Structured Data --->| Phylogenetics
Specimens         |            |            | Identification Tools (many)
                               |
                               |<-- | BioLink
                               |--> | DELTA
                                    | LucID Builder
That is, applications will import Structured Data, manipulate it and spit it out again. Because of this, I don't really think the details of the model matter too much; what matters is that it is rich enough to represent all data of interest.
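As a rough illustration of that import / manipulate / export cycle, here is a minimal Python sketch. Everything in it is an assumption made for the example: the <character> and <state> element names are placeholders rather than an agreed format, the function names are hypothetical, and the added state "entire" is made-up sample data. It does not reflect how BioLink, DELTA or LucID actually work.

```python
import xml.etree.ElementTree as ET

# Hypothetical structured data; the element names are placeholders,
# not an agreed format.
SOURCE = '<character name="leaf margin"><state>serrate</state></character>'


def import_data(xml_text):
    """Parse structured data into an in-memory tree."""
    return ET.fromstring(xml_text)


def add_state(character, state_text):
    """Example manipulation: record one more state for a character."""
    state = ET.SubElement(character, "state")
    state.text = state_text
    return character


def export_data(root):
    """Serialise the tree back to text for the next application."""
    return ET.tostring(root, encoding="unicode")


tree = import_data(SOURCE)
add_state(tree, "entire")  # illustrative value only
round_tripped = export_data(tree)
print(round_tripped)
```

The point of the sketch is only that the application never cares how the data is laid out on disk, provided the format can carry everything it needs to read and write.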
********
I've also been thinking about Kevin's latest example:

"Leaf margins serrate with forward-pointing teeth"

<feature name="leaf">
  <feature name="margin">
    <feature name = "teething shape">
      <value>serrate</value>
    </feature>
    <feature name = "teeth orientation">
      with <value>forward-pointing</value> teeth
    </feature>
  </feature>
</feature>
First, it seems to me that "feature" is what taxonomists call "character" and "value" is "state". Being a traditionalist I'll switch back to this common terminology:

"Leaf margins serrate with forward-pointing teeth"

<character name="leaf">
  <character name="margin">
    <character name = "teething shape">
      <state>serrate</state>
    </character>
    <character name = "teeth orientation">
      <state>forward-pointing</state>
    </character>
  </character>
</character>
A couple of points:
Kevin suggests a rule "<state>s cannot have <character>s as siblings". But a <state> with sibling <character>s is what DELTA calls a dependency: the state controls the characters that sit beside it. Allowing this would seem to be a good thing (and may be very important).
Kevin's representation is too focused on text descriptions. A more complete representation might be:

<character name="leaf">
  <state>present</state>
  <character name="leaf margin">
    <state>serrate</state>
    <character name = "tooth orientation">
      <state>forward-pointing</state>
    </character>
  </character>
</character>
This allows us to directly extract:
leaf = present
leaf margin = serrate
tooth orientation = forward-pointing
This will be important for both identification tools and phylogenetics. Trying to recover this information from Kevin's representation should be possible but will require a number of assumptions be made about the data. This representation also captures dependencies (although this is an advanced feature we shouldn't be talking about yet).
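For what it's worth, the extraction can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name extract is hypothetical, and the rule that a nested character depends on the states sitting beside it in its parent is one reading of the DELTA-style dependency idea, not a settled definition.

```python
import xml.etree.ElementTree as ET

# The nested character/state example from the text.
DESCRIPTION = """
<character name="leaf">
  <state>present</state>
  <character name="leaf margin">
    <state>serrate</state>
    <character name="tooth orientation">
      <state>forward-pointing</state>
    </character>
  </character>
</character>
"""


def extract(character):
    """Return (pairs, dependencies) for a <character> and its descendants.

    pairs:        (character name, state) tuples
    dependencies: (dependent character, controlling character, controlling state)
    """
    pairs = []
    dependencies = []
    name = character.get("name")
    # findall() with a bare tag name looks at direct children only.
    states = [s.text.strip() for s in character.findall("state")]
    pairs.extend((name, state) for state in states)
    for child in character.findall("character"):
        # Assumption: a nested character is controlled by its sibling states.
        for state in states:
            dependencies.append((child.get("name"), name, state))
        child_pairs, child_deps = extract(child)
        pairs.extend(child_pairs)
        dependencies.extend(child_deps)
    return pairs, dependencies


pairs, dependencies = extract(ET.fromstring(DESCRIPTION))
for name, state in pairs:
    print(f"{name} = {state}")
for dependent, controller, state in dependencies:
    print(f"{dependent} depends on {controller} = {state}")
```

No guessing about names or implied states is needed here, because each character carries its own full name and its own explicit state.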
In my original, DELTA-centric model I used a <description> tag to try to capture the text description information separately from the <state> information. My thinking was that these two requirements/approaches/viewpoints are too distinct to cram together without falling into the same trap as the current DELTA Standard (which is a lowest-common-denominator approach).
The problem here is that the phrase "Leaf margins serrate with forward-pointing teeth" concerns 3 characters (leaf, margin and teeth) and 3 states, 1 implied and 2 expressed (present, serrate and forward-pointing). The characters are dependent, so the context contains significant information: we know that 'teeth' have something to do with 'serrate', which has something to do with 'leaf margins', and the leaves must be present because we're describing them. There's a lot of logic involved in parsing this. I can't think of a simple way of representing all this complex information without separating it at some level. Kevin's suggestion represents the text description and mine the underlying data, but neither works well for the other.
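To make the mismatch concrete, here is a hedged Python sketch of what recovering character = state pairs from Kevin's feature/value markup might involve. The function name recover and both rules it applies are assumptions invented for this example (they are not part of Kevin's proposal): a character name is built by joining the names of the enclosing features, and a feature that has sub-features but no value of its own is taken to be "present".

```python
import xml.etree.ElementTree as ET

# Kevin's example, using his original feature/value terminology.
KEVIN = """
<feature name="leaf">
  <feature name="margin">
    <feature name="teething shape">
      <value>serrate</value>
    </feature>
    <feature name="teeth orientation">
      with <value>forward-pointing</value> teeth
    </feature>
  </feature>
</feature>
"""


def recover(feature, path=()):
    """Guess (character, state) pairs from text-oriented markup."""
    path = path + (feature.get("name"),)
    # Assumption 1: the character name is the joined names of all ancestors.
    name = " ".join(path)
    values = [v.text.strip() for v in feature.findall("value")]
    results = [(name, value) for value in values]
    children = feature.findall("feature")
    # Assumption 2: a container feature with no value of its own is "present".
    if not values and children:
        results.append((name, "present"))
    for child in children:
        results.extend(recover(child, path))
    return results


for name, state in recover(ET.fromstring(KEVIN)):
    print(f"{name} = {state}")
```

Even with those two guesses, the recovered names ("leaf margin teething shape" and so on) come from the wording of the description rather than from a character list, and the connecting words "with" and "teeth" are simply thrown away. Going the other way, from the data representation back to the sentence, has the opposite problem.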
Two steps forward, one step back. Sorry about that.
Thanks, Steve