Document vs. database

Thu Nov 29 15:32:33 CET 2001

At 02:15 PM 28-11-01 +1100, Kevin Thiele wrote:
>Issues
>
>I've taken the route of marking up a textual description, using a minimum of
>tags. It seems to me that a description comprises a series of features with
>values. I've used mixed markup because I wanted to have the minimum tagging
>and make maximum use of the text. That is, in cases like:
>
><Feature Name="Life form"><Value>shrub</Value></Feature>
>
>I've made use of the fact that "shrub" occurs in the text by enclosing it in
><Value> tags. In cases like:
>
><Feature><Name>Leaves</Name>
>
>I'm using a <Name> tag rather than a Name attribute
>
>This may make processing considerably more difficult, but it seems to me
>that there are some advantages. An alternative would be to first process the
>description into a DELTA-like structure then mark up that. I chose not to go
>that route because I still have the dream that one day we'll have a method
>of semi-automatically marking up the content of natural language into
>something like the above. The other advantage of this structure I believe is
>that it maintains all the complexity and nuance of the original natural
>language (which can be retrieved merely by removing all the tags). I prefer
>this to the DELTA-process (take a natural-language description -> atomise
>it -> reconstruct semi-natural-language by putting the atoms back together
>again in some fashion).

My position on this is rather different, and I would prefer to see
characters defined a bit more rigorously.

As an example, suppose that the description of one taxon is given as:

   <Feature Name="Life form"><Value>shrub</Value></Feature>

A second as:

   <Feature Name="Life form"><Value>bush</Value></Feature>

And a third as:

   <Feature Name="Life form"><Value>small woody plant</Value></Feature>

And perhaps a fourth as:

   <Feature Name="Growth form"><Value>shrub</Value></Feature>

Are these equivalent? There's certainly no obvious way to associate them,
aside from knowing that "bush" and "shrub" mean pretty much the same thing
in some dialects of English. I think the goal of trying to extract meaning
by inserting a few tags into existing textual descriptions relies too
heavily on a level of artificial intelligence that we're still unlikely to
see for a couple of decades.
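
To make the problem concrete, here is a minimal sketch in Python (standard
library only) of what a program sees when it reads the four fragments above.
The helper name feature_pairs is hypothetical and only for illustration; it
is not part of any proposed standard.

```python
import xml.etree.ElementTree as ET

# The four fragments quoted above, one per taxon.
FRAGMENTS = [
    '<Feature Name="Life form"><Value>shrub</Value></Feature>',
    '<Feature Name="Life form"><Value>bush</Value></Feature>',
    '<Feature Name="Life form"><Value>small woody plant</Value></Feature>',
    '<Feature Name="Growth form"><Value>shrub</Value></Feature>',
]


def feature_pairs(fragments):
    """Return a (feature name, value text) pair for each fragment."""
    pairs = []
    for xml_text in fragments:
        feature = ET.fromstring(xml_text)
        pairs.append((feature.get("Name"), feature.findtext("Value")))
    return pairs


pairs = feature_pairs(FRAGMENTS)
# Plain string comparison is all a program has to go on here.
distinct = set(pairs)
print(len(distinct))  # 4
```

All four pairs come out as distinct, so nothing in the markup itself tells a
program that the descriptions might be saying the same thing. Some agreed
vocabulary of feature names and values would be needed before they could be
compared.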

But pondering on this made me think about why Kevin and I often disagree on
some of these issues. If I may greatly over-simplify our viewpoints, his
view is (I think) that a formal description ought to be a document from
which data can be extracted; whereas I tend to view a formal description
more as a database from which text can be generated.
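
The two directions can be sketched in a few lines of Python. This is only an
illustration under stated assumptions: the <Description> wrapper element, the
example sentence, and the RECORD layout are all invented here, and only the
<Feature>/<Value> markup comes from Kevin's message.

```python
import xml.etree.ElementTree as ET

# Document view: prose with light markup. The sentence and the
# <Description> wrapper are invented for this sketch.
DOCUMENT = (
    '<Description>A spreading '
    '<Feature Name="Life form"><Value>shrub</Value></Feature>'
    ' of dry slopes.</Description>'
)

root = ET.fromstring(DOCUMENT)

# Removing all the tags recovers the original natural language.
plain_text = "".join(root.itertext())

# Extracting data means collecting the marked-up features.
extracted = {f.get("Name"): f.findtext("Value") for f in root.iter("Feature")}

# Database view: start from a record (hypothetical layout) and generate text.
RECORD = {"Life form": "shrub"}
generated = "; ".join(f"{name}: {value}" for name, value in RECORD.items())

print(plain_text)  # A spreading shrub of dry slopes.
print(extracted)   # {'Life form': 'shrub'}
print(generated)   # Life form: shrub
```

The document keeps the nuance of the prose, but only the tagged parts are
available as data; the record is fully available as data, but the generated
text is only as rich as the template that produces it.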

So what ARE we after: the free-flowing flexibility of a document, or the
rigour and precision of a database? Fortunately, there is a fair bit of
middle ground. It's worth noting that XML was developed as a markup
language for documents, but that its primary usage to date has been as a
sort of portable and relatively light-weight data container. Still, I think
that finding the right balance between flexibility and rigour is going to
be a major challenge in this exercise.

>Features are nested:
>
><Feature><Name>Leaves</Name>
>  <Feature><Name>Shape</Name></Feature>
></Feature>
>
>Is this allowable? In XML-Spy I can create a Schema OK for this document.
>It's also well-formed.

It's certainly allowable in XML. Is it desirable in a description?
Possibly, but I think it relates to my question from yesterday about
how hierarchies of characters ought to be expressed. Nesting is a very good way
to express a single hierarchy, but not to handle alternative
hierarchies. Which do people want?
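
One possible way to support alternative hierarchies, sketched here in Python
as an assumption rather than a proposal, is to keep the characters in a flat
list and store each hierarchy separately, referring to characters by
identifier. The identifiers and the groups "Flowers" and "Colour" are
hypothetical; only "Leaves" and "Shape" come from the example above.

```python
# Flat set of character identifiers (hypothetical names).
CHARACTERS = {"leaf_shape", "leaf_colour", "flower_colour"}

# Each hierarchy is stored on its own and points at characters by identifier,
# so the same character can sit under a different parent in each one.
HIERARCHIES = {
    "by_organ": {
        "Leaves": ["leaf_shape", "leaf_colour"],
        "Flowers": ["flower_colour"],
    },
    "by_property": {
        "Shape": ["leaf_shape"],
        "Colour": ["leaf_colour", "flower_colour"],
    },
}


def parents_of(character):
    """Return the parent group of a character in every hierarchy."""
    result = {}
    for hierarchy_name, groups in HIERARCHIES.items():
        for group, members in groups.items():
            if character in members:
                result[hierarchy_name] = group
    return result


print(parents_of("leaf_shape"))
# {'by_organ': 'Leaves', 'by_property': 'Shape'}
```

With nested <Feature> elements, "Shape" can only live inside "Leaves"; with
separate hierarchies, the same character appears under "Leaves" in one
arrangement and under "Shape" in another.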

Eric Zurcher
CSIRO Livestock Industries
Canberra, ACT Australia
E-mail: Eric.Zurcher at pi.csiro.au