Document vs. database

29 Nov 2001

      At 02:15 PM 28-11-01 +1100, Kevin Thiele wrote:
...
Issues
I've taken the route of marking up a textual description, using a minimum of
tags. It seems to me that a description comprises a series of features with
values. I've used mixed markup because I wanted to have the minimum tagging
and make maximum use of the text. That is, in cases like:
<Feature Name="Life form"><Value>shrub</Value></Feature>
I've made use of the fact that "shrub" occurs in the text by enclosing it in
<Value> tags. In cases like:
<Feature><Name>Leaves</Name>
I'm using a <Name> tag rather than a Name attribute.
This may make processing considerably more difficult, but it seems to me
that there are some advantages. An alternative would be to first process the
description into a DELTA-like structure then mark up that. I chose not to go
that route because I still have the dream that one day we'll have a method
of semi-automatically marking up the content of natural language into
something like the above. The other advantage of this structure I believe is
that it maintains all the complexity and nuance of the original natural
language (which can be retrieved merely by removing all the tags). I prefer
this to the DELTA-process (take a natural-language description -> atomise
it -> reconstruct semi-natural-language by putting the atoms back together
again in some fashion).
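
To make the "removing all the tags" point concrete, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. The sample sentence and the <Description> wrapper element are invented for illustration; only the Feature/Name/Value tags come from the message above. Note that a Name attribute (as opposed to a <Name> element) is not part of the text and so disappears when the tags are stripped.

```python
import xml.etree.ElementTree as ET

# Hypothetical mixed-content description. The wording and the
# <Description> wrapper are assumptions, not taken from a real treatment.
marked_up = (
    '<Description>A small '
    '<Feature Name="Life form"><Value>shrub</Value></Feature>'
    ' with <Feature><Name>leaves</Name> <Value>ovate</Value></Feature>.'
    '</Description>'
)

def strip_tags(xml_text: str) -> str:
    """Return the character data of the document with all markup removed."""
    root = ET.fromstring(xml_text)
    # itertext() yields element text and tail text in document order.
    return "".join(root.itertext())

plain = strip_tags(marked_up)
print(plain)
```

The attribute value "Life form" does not appear in the recovered text, which is the practical difference between the attribute style and the <Name> element style.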
My position on this is rather different, and I would prefer to see
characters defined a bit more rigorously.

As an example, suppose that the description of one taxon is given as:

   <Feature Name="Life form"><Value>shrub</Value></Feature>

A second as:

   <Feature Name="Life form"><Value>bush</Value></Feature>

And a third as:

   <Feature Name="Life form"><Value>small woody plant</Value></Feature>

And perhaps a fourth as:

   <Feature Name="Growth form"><Value>shrub</Value></Feature>

Are these equivalent? There's certainly no obvious way to associate them,
aside from knowing that "bush" and "shrub" mean pretty much the same thing
in some dialects of English. I think the goal of trying to extract meaning
by inserting a few tags into existing textual descriptions relies too
heavily on a level of artificial intelligence that we're still unlikely to
see for a couple of decades.
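
A small Python sketch of the problem, using the four fragments above. Compared as raw (name, value) pairs they are all different; they only coincide once an explicit controlled vocabulary is applied. The two synonym tables are purely hypothetical, and whether "small woody plant" should really be treated as "shrub" is exactly the kind of judgement that someone has to make and record.

```python
import xml.etree.ElementTree as ET

descriptions = [
    '<Feature Name="Life form"><Value>shrub</Value></Feature>',
    '<Feature Name="Life form"><Value>bush</Value></Feature>',
    '<Feature Name="Life form"><Value>small woody plant</Value></Feature>',
    '<Feature Name="Growth form"><Value>shrub</Value></Feature>',
]

def as_pair(xml_text):
    """Extract (feature name, value) from one Feature fragment."""
    feature = ET.fromstring(xml_text)
    return feature.get("Name"), feature.findtext("Value")

raw_pairs = [as_pair(d) for d in descriptions]
distinct_raw = set(raw_pairs)  # plain string comparison

# Hypothetical controlled vocabulary: these mappings are assumptions
# made for illustration, not an agreed standard.
FEATURE_SYNONYMS = {"growth form": "life form"}
VALUE_SYNONYMS = {"bush": "shrub", "small woody plant": "shrub"}

def normalise(pair):
    """Map a raw pair onto the controlled terms, where a mapping exists."""
    name, value = pair[0].lower(), pair[1].lower()
    return FEATURE_SYNONYMS.get(name, name), VALUE_SYNONYMS.get(value, value)

distinct_normalised = {normalise(p) for p in raw_pairs}
print(len(distinct_raw), len(distinct_normalised))
```

Without the tables, software sees four unrelated statements; with them, it sees one. The tables are where the rigour lives, and free text alone does not supply them.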

But pondering on this made me think about why Kevin and I often disagree on
some of these issues. If I may greatly over-simplify our viewpoints, his
view is (I think) that a formal description ought to be a document from
which data can be extracted; whereas I tend to view a formal description
more as a database from which text can be generated.

So what ARE we after: the free-flowing flexibility of a document, or the
rigour and precision of a database? Fortunately, there is a fair bit of
middle ground. It's worth noting that XML was developed as a markup
language for documents, but that its primary usage to date has been as a
sort of portable and relatively light-weight data container. Still, I think
that finding the right balance between flexibility and rigour is going to
be a major challenge in this exercise.
...
Features are nested:
<Feature><Name>Leaves</Name>
 <Feature><Name>Shape</Name></Feature>
</Feature>
Is this allowable? In XML-Spy I can create a Schema OK for this document.
It's also well-formed.
It's certainly allowable in XML. Is it desirable in a description?
Possibly, but I think it relates to my question from yesterday about
how hierarchies of characters ought to be expressed. Nesting is a very good way
to express a single hierarchy, but not for handling alternative
hierarchies. Which do people want?
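
As a rough Python sketch of the difference: nested <Feature> elements give each character exactly one path from the root, whereas alternative hierarchies need the characters kept flat with separate parent links for each hierarchy. The hierarchy names ("by organ", "by property") and the "Leaf shape" character are invented for illustration.

```python
import xml.etree.ElementTree as ET

nested = """
<Feature><Name>Leaves</Name>
 <Feature><Name>Shape</Name></Feature>
</Feature>
"""

def feature_paths(feature, prefix=()):
    """Yield the single path from the root to each nested Feature."""
    path = prefix + (feature.findtext("Name").strip(),)
    yield path
    for child in feature.findall("Feature"):
        yield from feature_paths(child, path)

paths = list(feature_paths(ET.fromstring(nested)))
print(paths)

# Alternative (hypothetical): flat characters, with one parent table
# per named hierarchy, so a character can sit in several hierarchies.
HIERARCHIES = {
    "by organ": {"Leaf shape": "Leaves"},
    "by property": {"Leaf shape": "Shape"},
}

def parents_of(character):
    """Return the parent of a character in each hierarchy (None if absent)."""
    return {name: links.get(character) for name, links in HIERARCHIES.items()}

print(parents_of("Leaf shape"))
```

In the nested form "Shape" can only ever live under "Leaves"; in the flat form the same character can be grouped by organ in one view and by property in another.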

Eric Zurcher
CSIRO Livestock Industries
Canberra, ACT Australia
E-mail: Eric.Zurcher@pi.csiro.au