It's How the Data will be Used that Counts

Mon Dec 3 22:11:42 CET 2001

Wow, Steve, we are virtually in perfect agreement! A first!
I have one small quibble (you actually raised it yourself,
though) and some suggestions.

Steve Shattuck writes:
 > ...
 > Kevin's representation is too focused on text descriptions.  A more complete
 > representation might be:
 >
 > <character name="leaf">
 >   <state>present</state>
 >   <character name="leaf margin">
 >     <state>serrate</state>
 >     <character name = "tooth orientation">
 >       <state>forward-pointing</state>
 >     </character>
 >   </character>
 > </character>
 >
 >
 > This allows us to directly extract:
 >
 > leaf = present
 > leaf margin = serrate
 > tooth orientation = forward-pointing
 >

 > The problem here is that the phrase "Leaf margins serrate with
 > forward-pointing teeth" concerns 3 characters (leaf, margin and teeth) and 1
 > implied and 2 expressed states (present, serrate and forward pointing) with
 > the characters being dependent (and therefore the context containing
 > significant information - we know that 'teeth' have something to do with
 > 'serrate' which has something to do with 'leaf margins' - the leaves being
 > present because we're describing them).

For this reason, it is probably unnecessary to represent the state
`present' at all, provided the semantics could reasonably require that
a feature which is absent is never described. Is that a reasonable
requirement?

For example, in that case, extracting from the XML all taxa which
have a leaf character in their description is no harder for
lacking <state>present</state>. In fact it is a trifling bit easier.
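
A minimal sketch of that extraction, in Python with the standard
xml.etree.ElementTree module. The <taxa>/<taxon> wrapper, the taxon
names, and the function name are assumptions made up for illustration;
only <character> and <state> come from this thread. Presence is read
off the existence of the element itself, so no `present' state is
consulted.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: the <taxa>/<taxon> wrapper and the taxon
# names "A" and "B" are assumptions, not part of the proposed format.
SAMPLE = """
<taxa>
  <taxon name="A">
    <character name="leaf">
      <character name="leaf margin">
        <state>serrate</state>
      </character>
    </character>
  </taxon>
  <taxon name="B">
    <character name="stem">
      <state>woody</state>
    </character>
  </taxon>
</taxa>
"""

def taxa_with_character(xml_text, character_name):
    """Return names of taxa whose description mentions the character."""
    root = ET.fromstring(xml_text)
    found = []
    for taxon in root.iter("taxon"):
        # A <character> with this name at any nesting depth counts as
        # present; there is no <state>present</state> to look for.
        if any(c.get("name") == character_name
               for c in taxon.iter("character")):
            found.append(taxon.get("name"))
    return found

print(taxa_with_character(SAMPLE, "leaf"))  # ['A']
```

Under the rule that absent features are never described, taxon B simply
has no leaf element, and the query needs no extra test on state values.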

The desire for a `present' state perhaps comes from table-based
character-by-state organization where it could be hard to distinguish
whether the character is absent from the taxon or absent from the data.
That distinction can be made moot here, perhaps.

`presence' may be the only such state though.

 >There's a lot of logic involved in
 > parsing this.  I can't think of a simple way of representing all this
 > complex information without separating it at some level.

If we assume that there is a rigorous semantics to the effect that
syntactically nested characters are always logically nested (and here
we may need to return to "feature" if "character" is too dear to
overload), is there a problem with this:

<character name="leaf">
  <character name="margin">
    <character name="teeth">
      <character name="orientation">
        <state>forward-pointing</state>
      </character>
    </character>
  </character>
</character>

To me, the main thing that this kind of model implies is the need, in
some cases, to provide a thesaurus, e.g. to provide advice that if a
character (here `margin') has a subcharacter `teeth' then it may be
described as `serrate'. Is that bad?  Or would a purely textual
description which just said "Leaf margins with forward-pointing teeth"
be deemed wrong absent the word "serrate"?
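
One way such a thesaurus might be consulted, as a rough sketch in
Python. The THESAURUS table, its single entry, and the function name
are hypothetical; nothing here is an agreed vocabulary. The idea is
only that a (character, subcharacter) pair found in the nesting can be
looked up to suggest a state term for the outer character.

```python
import xml.etree.ElementTree as ET

# Hypothetical thesaurus: (character, subcharacter) -> a state term
# that may be used to describe the outer character.
THESAURUS = {("margin", "teeth"): "serrate"}

NESTED = """
<character name="leaf">
  <character name="margin">
    <character name="teeth">
      <character name="orientation">
        <state>forward-pointing</state>
      </character>
    </character>
  </character>
</character>
"""

def implied_states(element, thesaurus):
    """Yield (character, suggested state) for nested pairs in the thesaurus."""
    for child in element.findall("character"):
        key = (element.get("name"), child.get("name"))
        if key in thesaurus:
            yield element.get("name"), thesaurus[key]
        # Recurse so that deeper nestings are checked as well.
        yield from implied_states(child, thesaurus)

root = ET.fromstring(NESTED)
print(list(implied_states(root, THESAURUS)))  # [('margin', 'serrate')]
```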

 >Kevin's suggestion
 > represents the text description and mine the underlying data,

agreed

 >but neither works well for the other.

I think parsing and understanding natural language is harder than
producing natural language from structure. So I think there is less
work to go from data to description than the other way around.
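
To illustrate the data-to-description direction, here is a small
Python sketch that walks nested <character> elements and emits one
crude phrase per <state>. The function name and the output wording are
assumptions; real descriptive prose would need grammar and a
vocabulary, but the traversal itself is short.

```python
import xml.etree.ElementTree as ET

NESTED = """
<character name="leaf">
  <character name="margin">
    <character name="teeth">
      <character name="orientation">
        <state>forward-pointing</state>
      </character>
    </character>
  </character>
</character>
"""

def describe(element, path=()):
    """Return one 'path: state' phrase for each <state> under element."""
    path = path + (element.get("name"),)
    phrases = []
    for state in element.findall("state"):
        # The chain of enclosing character names supplies the context.
        text = (state.text or "").strip()
        phrases.append(f"{' '.join(path)}: {text}")
    for child in element.findall("character"):
        phrases.extend(describe(child, path))
    return phrases

print(describe(ET.fromstring(NESTED)))
# ['leaf margin teeth orientation: forward-pointing']
```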

 >
 > Two steps forward, one step back.  Sorry about that.

Nah, at most 1/2 step back. Is your middle name Zeno?

 >
 > Thanks, Steve
 >

Bob Morris