Wow, Steve, we are virtually in perfect agreement! A first! I have one small quibble (you actually raised it yourself, though) and some suggestions.
Steve Shattuck writes:
... Kevin's representation is too focused on text descriptions. A more complete representation might be:
<character name="leaf">
  <state>present</state>
  <character name="leaf margin">
    <state>serrate</state>
    <character name = "tooth orientation">
      <state>forward-pointing</state>
    </character>
  </character>
</character>
This allows us to directly extract:
leaf = present
leaf margin = serrate
tooth orientation = forward-pointing
The problem here is that the phrase "Leaf margins serrate with forward-pointing teeth" concerns 3 characters (leaf, margin and teeth) and 1 implied and 2 expressed states (present, serrate and forward-pointing), with the characters being dependent (and therefore the context containing significant information - we know that 'teeth' have something to do with 'serrate' which has something to do with 'leaf margins' - the leaves being present because we're describing them).
For this reason, it is probably unnecessary to represent the state `present' at all, provided the semantics could reasonably require that a feature which is absent is never described. Is that a reasonable requirement?
For example, in such a case, extracting from the XML all taxa which have a leaf character in their description is not any harder for lacking <state>present</state>. In fact it is a trifling bit easier.
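A minimal sketch of that extraction, assuming the descriptions are wrapped in <taxa>/<taxon> elements (those wrapper elements, the taxon names, and the "stem" character are invented for illustration; only <character> and <state> come from this thread). Presence is inferred purely from the character being described at all:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: <taxa>, <taxon>, the taxon names and the
# "stem" character are assumptions, not part of the proposed markup.
SAMPLE = """
<taxa>
  <taxon name="Taxon A">
    <character name="leaf">
      <character name="leaf margin">
        <state>serrate</state>
      </character>
    </character>
  </taxon>
  <taxon name="Taxon B">
    <character name="stem">
      <state>woody</state>
    </character>
  </taxon>
</taxa>
"""

def taxa_with_character(xml_text, character_name):
    """Return names of taxa whose description mentions the character.

    No <state>present</state> is needed: a character that appears
    anywhere in the taxon's description is taken to be present.
    """
    root = ET.fromstring(xml_text)
    return [
        taxon.get("name")
        for taxon in root.iter("taxon")
        if any(c.get("name") == character_name
               for c in taxon.iter("character"))
    ]

print(taxa_with_character(SAMPLE, "leaf"))
```

The query is a plain "does this element occur" test, with no state value to compare against.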
The desire for a `present' state perhaps comes from table-based character-by-state organization where it could be hard to distinguish whether the character is absent from the taxon or absent from the data. That distinction can be made moot here, perhaps.
`presence' may be the only such state though.
There's a lot of logic involved in parsing this. I can't think of a simple way of representing all this complex information without separating it at some level.
If we assume that there is a rigorous semantics to the effect that syntactically nested characters are always logically nested (and here we may need to return to "feature" if "character" is too dear to overload), is there a problem with this:
<character name="leaf">
  <character name="margin">
    <character name="teeth">
      <character name = "orientation">
        <state>forward-pointing</state>
      </character>
    </character>
  </character>
</character>
To me, the main thing that this kind of model implies is the need, in some cases, to provide a thesaurus, e.g. to provide advice that if a character (here `margin') has a subcharacter `teeth' then it may be described as `serrate'. Is that bad? Or would a purely textual description which just said "Leaf margins with forward-pointing teeth" be deemed wrong absent the word "serrate"?
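One way such a thesaurus might look, purely as a sketch: a lookup keyed on (character, subcharacter) pairs that suggests a descriptive term. The table layout and the function name are hypothetical; the single entry is the margin/teeth/serrate example above.

```python
# Hypothetical thesaurus: (character, subcharacter) -> descriptive term.
# Only the margin/teeth/serrate entry comes from the discussion.
THESAURUS = {
    ("margin", "teeth"): "serrate",
}

def suggested_terms(path):
    """Given nested character names, outermost first, return the terms
    the thesaurus suggests for each parent/child pair along the path."""
    return [
        THESAURUS[pair]
        for pair in zip(path, path[1:])
        if pair in THESAURUS
    ]

print(suggested_terms(["leaf", "margin", "teeth", "orientation"]))
```

The thesaurus only advises; the data itself never has to carry the word "serrate".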
Kevin's suggestion represents the text description and mine the underlying data,
agreed
but neither works well for the other.
I think natural language parsing and understanding is harder than natural language production from structure. So I think there is less work to go from data to description than the other way around.
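To illustrate the data-to-description direction, here is a rough sketch that walks the nested markup and emits a crude phrase. The function names and the output wording are my own assumptions; real prose generation would need more (word order, the thesaurus, plurals), but the traversal itself is short:

```python
import xml.etree.ElementTree as ET

NESTED = """
<character name="leaf">
  <character name="margin">
    <character name="teeth">
      <character name="orientation">
        <state>forward-pointing</state>
      </character>
    </character>
  </character>
</character>
"""

def flatten(element, path=()):
    """Yield (path, state) pairs, where path is the tuple of nested
    character names leading to each <state>."""
    path = path + (element.get("name"),)
    for child in element:
        if child.tag == "state":
            yield path, child.text.strip()
        elif child.tag == "character":
            yield from flatten(child, path)

def describe(xml_text):
    """Produce a crude description: one 'path: state' clause per state."""
    root = ET.fromstring(xml_text)
    return "; ".join(
        f"{' '.join(path)}: {state}" for path, state in flatten(root)
    )

print(describe(NESTED))
```

Going the other way, from "Leaf margins serrate with forward-pointing teeth" back to this structure, has no comparably short sketch.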
Two steps forward, one step back. Sorry about that.
Nah, at most 1/2 step back. Is your middle name Zeno?
Thanks, Steve
Bob Morris