----- Original Message -----
From: "Eric Zurcher" Eric.Zurcher@PI.CSIRO.AU
To: TDWG-SDD@USOBI.ORG
Sent: Thursday, November 29, 2001 3:32 PM
Subject: Document vs. database
| At 02:15 PM 28-11-01 +1100, Kevin Thiele wrote:
| >Issues
| >
| >I've taken the route of marking up a textual description, using a minimum of
| >tags. It seems to me that a description comprises a series of features with
| >values. I've used mixed markup because I wanted to have the minimum tagging
| >and make maximum use of the text. That is, in cases like:
| >
| ><Feature Name="Life form"><Value>shrub</Value></Feature>
| >
| >I've made use of the fact that "shrub" occurs in the text by enclosing it in
| ><Value> tags. In cases like:
| >
| ><Feature><Name>Leaves</Name>
| >
| >I'm using a <Name> tag rather than a Name attribute
| >
| >This may make processing considerably more difficult, but it seems to me
| >that there are some advantages. An alternative would be to first process the
| >description into a DELTA-like structure then mark up that. I chose not to go
| >that route because I still have the dream that one day we'll have a method
| >of semi-automatically marking up the content of natural language into
| >something like the above. The other advantage of this structure I believe is
| >that it maintains all the complexity and nuance of the original natural
| >language (which can be retrieved merely by removing all the tags). I prefer
| >this to the DELTA-process (take a natural-language description -> atomise
| >it -> reconstruct semi-natural-language by putting the atoms back together
| >again in some fashion).
|
| My position on this is rather different, and I would prefer to see
| characters defined a bit more rigorously.
|
| As an example, suppose that the description of one taxon is given as:
|
| <Feature Name="Life form"><Value>shrub</Value></Feature>
|
| A second as:
|
| <Feature Name="Life form"><Value>bush</Value></Feature>
|
| And a third as:
|
| <Feature Name="Life form"><Value>small woody plant</Value></Feature>
|
| And perhaps a fourth as:
|
| <Feature Name="Growth form"><Value>shrub</Value></Feature>
|
| Are these equivalent? There's certainly no obvious way to associate them,
| aside from knowing that "bush" and "shrub" mean pretty much the same thing
| in some dialects of English. I think the goal of trying to extract meaning
| by inserting a few tags into existing textual descriptions relies too
| heavily on a level of artificial intelligence that we're still unlikely to
| see for a couple of decades.
Good point, and there will clearly be a need for some type of processing somewhere along the line to catch and allow validation of these situations. But this is a processing issue, not a structural one. There's nothing in the DELTA data structure to prevent
#1. Life form/ 1. shrub/ 2. bush/ 3. small woody plant/
#2. Growth form/ 1. shrub/ 2. smallish twiggy plant/
and I argue there should be nothing in the structure of the SDD standard to prevent it either. People simply need to be aware of the issue. But the SDD does, I agree, need to be structured in such a way that validation of these situations is made possible. The first-cut structure that I proposed does not allow much eyeball validation of this type - is this a problem?
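To make the validation point concrete, here is a minimal sketch (in Python, purely as an illustration; nothing here is part of any proposed SDD structure) of the kind of processing that could support eyeball validation. It tallies every feature name and value across a set of marked-up descriptions so that near-synonyms end up side by side for a human to review. The taxon labels and the function name tally_values are hypothetical; the four fragments are the ones Eric quotes above.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample: the four fragments from Eric's example,
# keyed by made-up taxon labels.
descriptions = {
    "taxon A": '<Feature Name="Life form"><Value>shrub</Value></Feature>',
    "taxon B": '<Feature Name="Life form"><Value>bush</Value></Feature>',
    "taxon C": '<Feature Name="Life form"><Value>small woody plant</Value></Feature>',
    "taxon D": '<Feature Name="Growth form"><Value>shrub</Value></Feature>',
}


def tally_values(descriptions):
    """Map feature name -> value -> list of taxa that use that value."""
    tally = defaultdict(lambda: defaultdict(list))
    for taxon, xml_text in descriptions.items():
        root = ET.fromstring(xml_text)
        # iter() also visits the root element, so a bare <Feature> is handled.
        for feature in root.iter("Feature"):
            name = feature.get("Name")
            for value in feature.findall("Value"):
                tally[name][(value.text or "").strip()].append(taxon)
    return tally


tally = tally_values(descriptions)

# Print a listing a person can scan for likely synonyms.
for name in sorted(tally):
    print(name)
    for value in sorted(tally[name]):
        print("   ", value, "<-", ", ".join(tally[name][value]))
```

The listing does not decide whether "bush" and "shrub" are the same thing; it only puts them next to each other so that a person can.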
| But pondering on this made me think about why Kevin and I often disagree on
| some of these issues. If I may greatly over-simplify our viewpoints, his
| view is (I think) that a formal description ought to be a document from
| which data can be extracted; whereas I tend to view a formal description
| more as a database from which text can be generated.
|
| So what ARE we after: the free-flowing flexibility of a document, or the
| rigour and precision of a database? Fortunately, there is a fair bit of
| middle ground. It's worth noting that XML was developed as a markup
| language for documents, but that its primary usage to date has been as a
| sort of portable and relatively light-weight data container. Still I think
| that finding the right balance between flexibility and rigour is going to
| be a major challenge in this exercise.
My view I suppose is that part of the power of XML (something that has never been possible in the past) is that it allows precise retrieval of atomised data from a fairly free-form document. XML blurs the distinction between database and text. Before XML a text string like:
Rigid, spreading shrub to c. 1m high and wide; stems glabrous. Leaves soon deciduous.....
was highly intractable to computer processing. This intractability required us to manually and substantially restructure it into e.g. a DELTA file before we could really do much with it. But there are problems I think with the DELTA process vis-a-vis natural language. In the relatively common case that we begin with a natural language description, the process is:
natural language ----(1)-----> DELTA -------(2)-------->natural language (etc)
where (1) is human processing and (2) is largely computer (CONFOR) processing. Two related problems here. Firstly, I often don't much like the output of CONFOR (some people say we humans should put up with the limitations computers force on us, but call me old-fashioned, I rebel at that {see also Asimov's Second Law of Robotics}). The second problem is that step 2 is unidirectional - if I edit the output natural language I break the process (the edits are volatile; next time I invoke CONFOR they get overwritten).
So I'm trying to explore the new possibilities opened up by XML to produce a minimally restructured but highly parsable document.
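As a rough illustration of "minimally restructured but highly parsable", the sketch below (Python, assumed only for the example) takes the first sentence of the description above with a few tags dropped in, pulls out feature/value pairs, and then recovers the original wording by discarding the tags. The particular tagging and the <Description> wrapper are assumptions made up for this example, not a proposed markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical mixed markup of the sample sentence; the tag placement
# and the <Description> wrapper are illustrative assumptions.
marked_up = (
    "<Description>Rigid, spreading "
    '<Feature Name="Life form"><Value>shrub</Value></Feature>'
    " to c. 1m high and wide; "
    "<Feature><Name>stems</Name> <Value>glabrous</Value></Feature>"
    ".</Description>"
)

root = ET.fromstring(marked_up)

# Atomised data: a feature's name may be an attribute or a <Name> child.
pairs = []
for feature in root.iter("Feature"):
    name = feature.get("Name") or feature.findtext("Name")
    values = [value.text for value in feature.findall("Value")]
    pairs.append((name, values))

# Natural language: join every text fragment in document order,
# which amounts to removing all the tags.
plain_text = "".join(root.itertext())

print(pairs)
print(plain_text)
```

Both views come from the same document, so editing the wording does not get overwritten by a later regeneration step; the cost is that the extraction code has to cope with names appearing either as attributes or as elements.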
I'm happy to admit that I may be off-beam here. I just reckon we shouldn't discount the possibility.
| >Features are nested:
| >
| ><Feature><Name>Leaves</Name>
| > <Feature><Name>Shape</Name></Feature>
| ></Feature>
| >
| >Is this allowable? In XML-Spy I can create a Schema OK for this document.
| >It's also well-formed.
|
| It's certainly allowable in XML. Is it desirable in a description?
| Possibly, but I think it relates to my question from yesterday about
| how hierarchies of characters ought to be expressed. Nesting is a very good way
| to express a single hierarchy, but not for handling alternative
| hierarchies. Which do people want?
Existing data structures allow 2-level hierarchies e.g.
#1. Leaf shape/ 1. ovate/ 2. elliptic/
#2. Flower colour/ 1. blue/ 2. red/
I'm simply suggesting allowing n levels:
<Leaves>
    <shape>
        <ovate>
        <elliptic>
<Flowers>
    <colour>
        <blue>
        <red>
I can't decide if this is a trivial or fundamental difference.
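One way to get a feel for the difference is to see how easily an n-level nesting collapses back to the familiar two levels. The sketch below (Python, as an illustration only) walks nested <Feature> elements of the kind quoted above and writes each value-bearing feature out as a DELTA-like line, using the chain of enclosing names as the character name. The sample document, the <Description> wrapper, the <Value> children and the function name flatten are all assumptions for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical nested description; names and values are made up.
doc = """
<Description>
  <Feature><Name>Leaves</Name>
    <Feature><Name>Shape</Name>
      <Value>ovate</Value><Value>elliptic</Value>
    </Feature>
  </Feature>
  <Feature><Name>Flowers</Name>
    <Feature><Name>Colour</Name>
      <Value>blue</Value><Value>red</Value>
    </Feature>
  </Feature>
</Description>
"""


def flatten(element, path=()):
    """Yield (path, values) for each Feature, at any depth, that has Values."""
    for feature in element.findall("Feature"):
        name = feature.get("Name") or feature.findtext("Name")
        here = path + (name,)
        values = [value.text for value in feature.findall("Value")]
        if values:
            yield here, values
        # Recurse so that any number of levels is handled the same way.
        yield from flatten(feature, here)


rows = list(flatten(ET.fromstring(doc)))

# Write each row in a DELTA-like two-level form.
for number, (path, values) in enumerate(rows, start=1):
    states = " ".join(f"{i}. {v}/" for i, v in enumerate(values, start=1))
    print(f"#{number}. {' '.join(path)}/ {states}")
```

Going from n levels down to two is mechanical in this sketch; going the other way (recovering "Leaves" and "Shape" from "Leaf shape") is not, which may be one argument that the difference is more than trivial.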
| Nesting is a very good way
| to express a single hierarchy, but not for handling alternative
| hierarchies.
In DELTA, Lucid etc we express alternate hierarchies in alternate documents. Do we need to do better than this?
Cheers - k