It's How the Data will be Used that Counts

Tue Dec 4 11:31:31 CET 2001

I've been giving Kevin's approach some thought and have the following
comments:

Kevin's original information flow model is too simplistic.  A more realistic
model would be something like:

    Text Descriptions |                         | Text Descriptions
        Phylogenetics |---> Structured Data --->| Phylogenetics
            Specimens |                         | Identification Tools (many)

The sources are much more varied and are often group-specific.  For example,
invertebrates have very few good-quality text descriptions: most are old,
are written in a range of languages (English, French, German, etc.) and vary
greatly in style and quality.  The majority of invertebrates are also
currently undescribed (finding 80% new taxa during a revision is common).

Similarly, the outputs required vary greatly and in ways hard to predict.
While text descriptions would seem to be a common requirement, they are in
some ways "legacy" and may become less important in the future as
applications (and users) become more sophisticated.  We need to make sure we
keep this range of uses in mind at all times.

The Structured Data is the format we're talking about here.  I don't know
what that will look like yet (but see below).

I think it's important to realise that applications will be used to move
among these in almost all cases and only very rarely will people manipulate
the data directly.  It's also important to remember that XSLT is nothing
more than an application.  Some comments seem to imply that XSLT is part of
XML, but it isn't.  For example, from a BioLink perspective (and DELTA and
probably LucID) the above model will need to be extended by adding:

    Text Descriptions |                         | Text Descriptions
        Phylogenetics |---> Structured Data --->| Phylogenetics
            Specimens |         |               | Identification Tools (many)
                                |
                                |<-- | BioLink
                                |--> | DELTA
                                     | LucID Builder

That is, applications will import Structured Data, manipulate it and spit it
out again.  Because of this, I don't think the details of the model matter
too much; what matters is that it is rich enough to represent all data of
interest.
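
To make the import/manipulate/export cycle concrete, here is a minimal
sketch in Python.  It assumes the <character>/<state> element names from the
example later in this message; the real Structured Data format is not yet
decided, and round_trip is a hypothetical helper, not part of BioLink, DELTA
or LucID.

```python
import xml.etree.ElementTree as ET

# Assumed input: element names are borrowed from the example later in this
# message, not from any agreed format.
SOURCE = """<character name="leaf">
  <state>present</state>
</character>"""


def round_trip(xml_text: str) -> str:
    """Import structured data, change it, and export it again."""
    root = ET.fromstring(xml_text)                      # import
    margin = ET.SubElement(root, "character", {"name": "leaf margin"})
    ET.SubElement(margin, "state").text = "serrate"     # manipulate
    return ET.tostring(root, encoding="unicode")        # export


result = round_trip(SOURCE)
print(result)
```

The application never needs to care how the data is laid out on disk, only
that what it reads and writes can carry everything it knows.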

********

I've also been thinking about Kevin's latest example:

"Leaf margins serrate with forward-pointing teeth"
<feature name="leaf">
  <feature name="margin">
    <feature name = "teething shape">
        <value>serrate</value>
    </feature>
    <feature name = "teeth orientation">with
        <value>forward-pointing</value>teeth
    </feature>
  </feature>
</feature>

First, it seems to me that "feature" is what taxonomists call "character"
and "value" is "state".  Being a traditionalist I'll switch back to this
common terminology:

"Leaf margins serrate with forward-pointing teeth"
<character name="leaf">
  <character name="margin">
    <character name = "teething shape">
        <state>serrate</state>
    </character>
    <character name = "teeth orientation">
        <state>forward-pointing</state>
    </character>
  </character>
</character>

A couple of points:

Kevin suggests a rule "<state>s cannot have <character>s as siblings".
But a <state> with a sibling <character> is what DELTA calls a Dependency:
it represents the state that controls a character.  Allowing this would seem
to be a good thing (and may be very important).

Kevin's representation is too focused on text descriptions.  A more complete
representation might be:

<character name="leaf">
  <state>present</state>
  <character name="leaf margin">
    <state>serrate</state>
    <character name = "tooth orientation">
      <state>forward-pointing</state>
    </character>
  </character>
</character>

This allows us to directly extract:

leaf = present
leaf margin = serrate
tooth orientation = forward-pointing

This will be important for both identification tools and phylogenetics.
Trying to recover this information from Kevin's representation should be
possible but will require a number of assumptions to be made about the data.
This representation also captures dependencies (although this is an advanced
feature we shouldn't be talking about yet).
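
As a rough illustration of the extraction described above, the following
Python sketch walks the nested representation and collects both the
character = state pairs and the dependencies.  The walk function and the
tuple layout for dependencies are my own assumptions for illustration; they
are not taken from DELTA or any existing tool.

```python
import xml.etree.ElementTree as ET

NESTED = """
<character name="leaf">
  <state>present</state>
  <character name="leaf margin">
    <state>serrate</state>
    <character name="tooth orientation">
      <state>forward-pointing</state>
    </character>
  </character>
</character>
"""


def walk(character, pairs, dependencies, controller=None):
    """Collect (character, state) pairs and controller/dependent links."""
    name = character.get("name")
    # Only direct <state> children belong to this character.
    states = [s.text.strip() for s in character.findall("state") if s.text]
    pairs.extend((name, state) for state in states)
    if controller is not None:
        # Assumed layout: (controlling character, its states, dependent).
        dependencies.append((controller[0], controller[1], name))
    for child in character.findall("character"):
        walk(child, pairs, dependencies, controller=(name, tuple(states)))


pairs, dependencies = [], []
walk(ET.fromstring(NESTED), pairs, dependencies)

for name, state in pairs:
    print(f"{name} = {state}")
```

Each nested character is recorded as depending on the states of its parent,
so "leaf margin" only applies when "leaf" is "present".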

In my original, DELTA-centric model I used a <description> tag to try to
capture the text description information separately from the <state>
information.  My thinking was that these two
requirements/approaches/viewpoints are too distinct to cram together without
falling into the same trap as the current DELTA Standard (which is a
lowest-common-denominator approach).

The problem here is that the phrase "Leaf margins serrate with
forward-pointing teeth" concerns 3 characters (leaf, margin and teeth) and 3
states: 1 implied (present) and 2 expressed (serrate and forward-pointing).
The characters are dependent, so the context contains significant
information: we know that 'teeth' have something to do with 'serrate', which
has something to do with 'leaf margins', and that the leaves are present
because we're describing them.  There's a lot of logic involved in parsing
this.  I can't think of a simple way of representing all this complex
information without separating it at some level.  Kevin's suggestion
represents the text description and mine the underlying data, but neither
works well for the other.
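
To show the kind of assumptions needed to pull data out of Kevin's
representation (using my character/state terminology), here is a hedged
Python sketch.  It assumes that a full character name is the path of
enclosing names joined by spaces, and that any character without a <state>
of its own is implicitly "present".  Both rules, and the flatten function
itself, are guesses for illustration only.

```python
import xml.etree.ElementTree as ET

KEVIN = """
<character name="leaf">
  <character name="margin">
    <character name = "teething shape">
        <state>serrate</state>
    </character>
    <character name = "teeth orientation">
        <state>forward-pointing</state>
    </character>
  </character>
</character>
"""


def flatten(character, path=()):
    """Return (character name, state) rows under the stated assumptions."""
    path = path + (character.get("name"),)
    full_name = " ".join(path)  # assumption: name = path of ancestor names
    states = [s.text.strip() for s in character.findall("state") if s.text]
    if states:
        rows = [(full_name, state) for state in states]
    else:
        # Assumption: a purely structural character is implied "present".
        rows = [(full_name, "present")]
    for child in character.findall("character"):
        rows.extend(flatten(child, path))
    return rows


rows = flatten(ET.fromstring(KEVIN))
for name, state in rows:
    print(f"{name} = {state}")
```

Note that this yields "leaf margin = present" and "leaf margin teething
shape = serrate" rather than "leaf margin = serrate", so getting to the
underlying data would still need a further, group-specific mapping.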

Two steps forward, one step back.  Sorry about that.

Thanks, Steve