Re: Document vs. database

29 Nov 2001

      ----- Original Message -----
From: "Eric Zurcher" <Eric.Zurcher@PI.CSIRO.AU>
To: <TDWG-SDD@USOBI.ORG>
Sent: Thursday, November 29, 2001 3:32 PM
Subject: Document vs. database

| At 02:15 PM 28-11-01 +1100, Kevin Thiele wrote:
| >Issues
| >
| >I've taken the route of marking up a textual description, using a minimum
| >of tags. It seems to me that a description comprises a series of features
| >with values. I've used mixed markup because I wanted to have the minimum
| >tagging and make maximum use of the text. That is, in cases like:
| >
| ><Feature Name="Life form"><Value>shrub</Value></Feature>
| >
| >I've made use of the fact that "shrub" occurs in the text by enclosing it
| >in <Value> tags. In cases like:
| >
| ><Feature><Name>Leaves</Name>
| >
| >I'm using a <Name> tag rather than a Name attribute
| >
| >This may make processing considerably more difficult, but it seems to me
| >that there are some advantages. An alternative would be to first process
| >the description into a DELTA-like structure then mark up that. I chose not
| >to go that route because I still have the dream that one day we'll have a
| >method of semi-automatically marking up the content of natural language
| >into something like the above. The other advantage of this structure I
| >believe is that it maintains all the complexity and nuance of the original
| >natural language (which can be retrieved merely by removing all the tags).
| >I prefer this to the DELTA-process (take a natural-language description ->
| >atomise it -> reconstruct semi-natural-language by putting the atoms back
| >together again in some fashion).
|
| My position on this is rather different, and I would prefer to see
| characters defined a bit more rigorously.
|
| As an example, suppose that the description of one taxon is given as:
|
|    <Feature Name="Life form"><Value>shrub</Value></Feature>
|
| A second as:
|
|    <Feature Name="Life form"><Value>bush</Value></Feature>
|
| And a third as:
|
|    <Feature Name="Life form"><Value>small woody plant</Value></Feature>
|
| And perhaps a fourth as:
|
|    <Feature Name="Growth form"><Value>shrub</Value></Feature>
|
| Are these equivalent? There's certainly no obvious way to associate them,
| aside from knowing that "bush" and "shrub" mean pretty much the same thing
| in some dialects of English. I think the goal of trying to extract meaning
| by inserting a few tags into existing textual descriptions relies too
| heavily on a level of artificial intelligence that we're still unlikely to
| see for a couple of decades.

Good point, and there will clearly be a need for some type of processing
somewhere along the line to catch and allow validation of these situations.
But this is a processing issue, not a structural one. There's nothing in the
DELTA data structure to prevent

#1. Life form/
    1. shrub/
    2. bush/
    3. small woody plant/

#2. Growth form/
    1. shrub/
    2. smallish twiggy plant/

and I argue there should be nothing in the structure of the SDD standard to
prevent it either. People simply need to be aware of the issue. But the SDD
does, I agree, need to be structured in such a way that validation of these
situations is made possible. The first-cut structure that I proposed does
not allow much eyeball validation of this type - is this a problem?
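
As a rough illustration of the kind of processing that could support this
validation, here is a minimal Python sketch. It assumes each description is
wrapped in a hypothetical <Description> element (not part of any proposal in
this thread) and simply lists the values used under each feature name, plus
any value that turns up under more than one feature, so that a human can
eyeball likely synonyms. It does not decide whether "bush" and "shrub" are
equivalent; it only surfaces the candidates.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample data: one description per taxon, using the
# Feature/Value markup quoted above. The <Description> wrapper and the
# taxon labels are assumptions made for this sketch.
descriptions = {
    "taxon A": '<Description><Feature Name="Life form"><Value>shrub</Value></Feature></Description>',
    "taxon B": '<Description><Feature Name="Life form"><Value>bush</Value></Feature></Description>',
    "taxon C": '<Description><Feature Name="Life form"><Value>small woody plant</Value></Feature></Description>',
    "taxon D": '<Description><Feature Name="Growth form"><Value>shrub</Value></Feature></Description>',
}

def collect_vocabulary(docs):
    """Return (values used per feature name, feature names used per value)."""
    values_by_feature = defaultdict(set)
    features_by_value = defaultdict(set)
    for xml_text in docs.values():
        root = ET.fromstring(xml_text)
        for feature in root.iter("Feature"):
            name = feature.get("Name")
            for value in feature.iter("Value"):
                text = (value.text or "").strip()
                values_by_feature[name].add(text)
                features_by_value[text].add(name)
    return values_by_feature, features_by_value

values_by_feature, features_by_value = collect_vocabulary(descriptions)

# One line per feature, for eyeball validation of near-synonymous values.
for name in sorted(values_by_feature):
    print(name, "->", sorted(values_by_feature[name]))

# Values that occur under more than one feature name hint at duplicate features.
shared = {v: sorted(f) for v, f in features_by_value.items() if len(f) > 1}
print("values under more than one feature:", shared)
```

A listing like this is the XML counterpart of reading down a DELTA character
list: the structure permits the duplication, and a separate report makes it
visible.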

| But pondering on this made me think about why Kevin and I often disagree
| on some of these issues. If I may greatly over-simplify our viewpoints, his
| view is (I think) that a formal description ought to be a document from
| which data can be extracted; whereas I tend to view a formal description
| more as a database from which text can be generated.
|
| So what ARE we after: the free-flowing flexibility of a document, or the
| rigour and precision of a database? Fortunately, there is a fair bit of
| middle ground. It's worth noting that XML was developed as a markup
| language for documents, but that its primary usage to date has been as a
| sort of portable and relatively light-weight data container. Still I think
| that finding the right balance between flexibility and rigour is going to
| be a major challenge in this exercise.

My view I suppose is that part of the power of XML (something that has never
been possible in the past) is that it allows precise retrieval of atomised
data from a fairly free-form document. XML blurs the distinction between
database and text. Before XML a text string like:

Rigid, spreading shrub to c. 1m high and wide; stems glabrous. Leaves soon
deciduous.....

was highly intractable to computer processing. This intractability required
us to manually and substantially restructure into e.g. a DELTA file before
we could really do much with it. But there are problems I think with the
DELTA process vis-a-vis natural language. In the relatively common case that
we begin with a natural language description, the process is:

natural language ----(1)----> DELTA ----(2)----> natural language (etc)

where (1) is human processing and (2) is largely computer (CONFOR)
processing. Two related problems here. Firstly, I often don't much like the
output of CONFOR (some people say we humans should put up with the
limitations computers force on us, but call me old-fashioned, I rebel at
that {see also Asimov's Second Law of Robotics}). The second problem is that
step 2 is unidirectional - if I edit the output natural language I break the
process (the edits are volatile, next time I invoke CONFOR they get
overwritten).

So I'm trying to explore the new possibilities opened up by XML to produce a
minimally restructured but highly parsable document.
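
To make the idea concrete, here is a minimal Python sketch using the standard
library's xml.etree.ElementTree. The marked-up string is a hypothetical
tagging of part of the example sentence above, wrapped in an assumed
<Description> element; it is not a proposed SDD structure. The sketch shows
both directions: pulling out the atomised feature/value data, and recovering
the original natural language by dropping the tags.

```python
import xml.etree.ElementTree as ET

# Hypothetical mixed markup: free text with one feature tagged in place.
marked_up = (
    '<Description>Rigid, spreading '
    '<Feature Name="Life form"><Value>shrub</Value></Feature>'
    ' to c. 1m high and wide; stems glabrous.</Description>'
)

root = ET.fromstring(marked_up)

# Direction 1: precise retrieval of atomised data from the document.
features = {
    feature.get("Name"): [value.text for value in feature.iter("Value")]
    for feature in root.iter("Feature")
}

# Direction 2: the natural language is recovered by removing all the tags,
# i.e. by concatenating every text fragment in document order.
plain_text = "".join(root.itertext())

print(features)
print(plain_text)
```

Because the text itself is the stored form, an edit to the prose is not
overwritten by a later regeneration step, which is the point of difference
from the DELTA/CONFOR round trip described above.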

I'm happy to admit that I may be off-beam here. I just reckon we shouldn't
discount the possibility.

Existing data structures allow 2-level hierarchies e.g.

#1. Leaf shape/
    1. ovate/
    2. elliptic/

#2. Flower colour/
    1. blue/
    2. red/

I'm simply suggesting allowing n levels:

<Leaves>
    <shape>
        <ovate>
        <elliptic>
<Flowers>
    <colour>
        <blue>
        <red>

I can't decide if this is a trivial or fundamental difference.
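
One way to probe the question is to check how easily an n-level structure
collapses back to the familiar two levels. The Python sketch below is only an
illustration: it assumes the nested example is written as well-formed XML
(empty elements for the states, and a hypothetical <Description> root), walks
the tree to list every root-to-leaf path, and then treats the last step of
each path as the state and everything before it as the character.

```python
import xml.etree.ElementTree as ET

# The n-level example above, rewritten as well-formed XML for this sketch.
nested = """
<Description>
  <Leaves><shape><ovate/><elliptic/></shape></Leaves>
  <Flowers><colour><blue/><red/></colour></Flowers>
</Description>
"""

def leaf_paths(element, prefix=()):
    """Yield the tag path (excluding the root) to every childless element."""
    children = list(element)
    if not children:
        yield prefix
        return
    for child in children:
        yield from leaf_paths(child, prefix + (child.tag,))

root = ET.fromstring(nested)
paths = list(leaf_paths(root))

# Collapse each n-level path to two levels: character name, then state.
two_level = {}
for path in paths:
    character = " ".join(path[:-1])
    two_level.setdefault(character, []).append(path[-1])

for character, states in two_level.items():
    print(character, "->", states)
```

In this small case the collapse is mechanical, which suggests the two forms
carry the same information here; whether that holds once the same state can
sit under more than one branch is a separate matter.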

| Nesting is a very good way
| to express a single hierarchy, but not for handling alternative
| hierarchies.

In DELTA, Lucid etc we express alternate hierarchies in alternate documents.
Do we need to do better than this?

Cheers - k