At 02:15 PM 28-11-01 +1100, Kevin Thiele wrote:
Issues
I've taken the route of marking up a textual description, using a minimum of tags. It seems to me that a description comprises a series of features with values. I've used mixed markup because I wanted to have the minimum tagging and make maximum use of the text. That is, in cases like:
<Feature Name="Life form"><Value>shrub</Value></Feature>
I've made use of the fact that "shrub" occurs in the text by enclosing it in <Value> tags. In cases like:
<Feature><Name>Leaves</Name>
I'm using a <Name> tag rather than a Name attribute.
This may make processing considerably more difficult, but it seems to me that there are some advantages. An alternative would be to first process the description into a DELTA-like structure and then mark that up. I chose not to go that route because I still have the dream that one day we'll have a method of semi-automatically marking up the content of natural language into something like the above. The other advantage of this structure, I believe, is that it maintains all the complexity and nuance of the original natural language (which can be retrieved merely by removing all the tags). I prefer this to the DELTA process (take a natural-language description -> atomise it -> reconstruct semi-natural language by putting the atoms back together again in some fashion).
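As a rough illustration of the "removing all the tags" point, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. The <Description> wrapper and the sample sentence are assumptions made up for the example; they are not taken from the actual marked-up description.

```python
import xml.etree.ElementTree as ET

# Hypothetical mixed-markup fragment: the <Description> wrapper and the
# wording are invented for this sketch.
marked_up = (
    '<Description>'
    '<Feature Name="Life form"><Value>Shrub</Value></Feature>. '
    '<Feature><Name>Leaves</Name> <Value>ovate</Value></Feature>.'
    '</Description>'
)

root = ET.fromstring(marked_up)

# Concatenating every text node in document order is equivalent to
# deleting the tags and keeping the prose.
plain_text = "".join(root.itertext())

# Names carried as attributes are not part of the prose; they are extra
# information supplied by whoever did the markup.
attribute_names = [f.get("Name") for f in root.iter("Feature") if f.get("Name")]

print(plain_text)
print(attribute_names)
```

In this sketch the prose survives intact, while "Life form" exists only in the markup, which is the difference between the Name attribute and the <Name> tag.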
My position on this is rather different, and I would prefer to see characters defined a bit more rigorously.
As an example, suppose that the description of one taxon is given as:
<Feature Name="Life form"><Value>shrub</Value></Feature>
A second as:
<Feature Name="Life form"><Value>bush</Value></Feature>
And a third as:
<Feature Name="Life form"><Value>small woody plant</Value></Feature>
And perhaps a fourth as:
<Feature Name="Growth form"><Value>shrub</Value></Feature>
Are these equivalent? There's certainly no obvious way to associate them, aside from knowing that "bush" and "shrub" mean pretty much the same thing in some dialects of English. I think the goal of trying to extract meaning by inserting a few tags into existing textual descriptions relies too heavily on a level of artificial intelligence that we're still unlikely to see for a couple of decades.
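To make the problem concrete, here is a small Python sketch, assuming the four fragments above are compared by simple string matching. The synonym tables are hypothetical and hand-built; whether "small woody plant" really should be treated as "shrub" is exactly the kind of judgement that somebody has to make and record.

```python
import xml.etree.ElementTree as ET

# The four fragments from the examples above.
fragments = [
    '<Feature Name="Life form"><Value>shrub</Value></Feature>',
    '<Feature Name="Life form"><Value>bush</Value></Feature>',
    '<Feature Name="Life form"><Value>small woody plant</Value></Feature>',
    '<Feature Name="Growth form"><Value>shrub</Value></Feature>',
]

def feature_pair(xml_text):
    """Return (feature name, value) exactly as written in the markup."""
    feature = ET.fromstring(xml_text)
    return feature.get("Name"), feature.findtext("Value")

pairs = [feature_pair(f) for f in fragments]

# Without any shared definitions, all four look different to a program.
distinct_raw = set(pairs)

# Hypothetical controlled vocabulary, invented for this sketch.
FEATURE_SYNONYMS = {"Growth form": "Life form"}
VALUE_SYNONYMS = {"bush": "shrub", "small woody plant": "shrub"}

def normalise(pair):
    """Map a raw (name, value) pair onto the controlled vocabulary."""
    name, value = pair
    return FEATURE_SYNONYMS.get(name, name), VALUE_SYNONYMS.get(value, value)

distinct_normalised = {normalise(p) for p in pairs}

print(len(distinct_raw))
print(distinct_normalised)
```

The four descriptions only collapse into one once the synonym tables exist, and those tables amount to a rigorous character definition held outside the text.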
But pondering on this made me think about why Kevin and I often disagree on some of these issues. If I may greatly over-simplify our viewpoints, his view is (I think) that a formal description ought to be a document from which data can be extracted; whereas I tend to view a formal description more as a database from which text can be generated.
So what ARE we after: the free-flowing flexibility of a document, or the rigour and precision of a database? Fortunately, there is a fair bit of middle ground. It's worth noting that XML was developed as a markup language for documents, but that its primary usage to date has been as a sort of portable and relatively light-weight data container. Still, I think that finding the right balance between flexibility and rigour is going to be a major challenge in this exercise.
Features are nested:
<Feature><Name>Leaves</Name> <Feature><Name>Shape</Name></Feature>
</Feature>
Is this allowable? In XML-Spy I can create a Schema OK for this document. It's also well-formed.
It's certainly allowable in XML. Is it desirable in a description? Possibly, but I think it relates to my question from yesterday about how hierarchies of characters ought to be expressed. Nesting is a very good way to express a single hierarchy, but not for handling alternative hierarchies. Which do people want?
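A sketch of the distinction, in Python with the standard library only. The first half reads the single hierarchy implied by the nested fragment above; the second half shows one possible alternative, in which characters are kept flat and any number of hierarchies refer to them by identifier. The identifiers, character names and groupings in the second half are all hypothetical.

```python
import xml.etree.ElementTree as ET

# The nested fragment from the message above.
nested = (
    "<Feature><Name>Leaves</Name> "
    "<Feature><Name>Shape</Name></Feature>"
    "</Feature>"
)

def feature_paths(feature, prefix=()):
    """Yield the path of names from the root down to each nested feature."""
    path = prefix + (feature.findtext("Name"),)
    yield path
    for child in feature.findall("Feature"):
        yield from feature_paths(child, path)

# Nesting fixes exactly one parent for each feature.
paths = list(feature_paths(ET.fromstring(nested)))

# Hypothetical alternative: a flat character list, with each hierarchy
# stored separately as groups of character identifiers.
characters = {"c1": "Leaf shape", "c2": "Petal shape", "c3": "Leaf colour"}
hierarchies = {
    "by organ": {"Leaves": ["c1", "c3"], "Flowers": ["c2"]},
    "by property": {"Shape": ["c1", "c2"], "Colour": ["c3"]},
}

def groups_for(char_id):
    """Return the group a character belongs to in each hierarchy."""
    return {
        hierarchy: group
        for hierarchy, groups in hierarchies.items()
        for group, members in groups.items()
        if char_id in members
    }

print(paths)
print(groups_for("c1"))
```

With nesting, "Shape" can sit under "Leaves" or under some general "Shape" grouping, but not both; with the flat layout the same character appears in as many hierarchies as are wanted, at the cost of no longer mirroring the running text.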
Eric Zurcher
CSIRO Livestock Industries
Canberra, ACT
Australia
E-mail: Eric.Zurcher@pi.csiro.au