What I'm trying to do

Mon Sep 4 08:54:45 CEST 2000

Jim's and Bryan's comments are exactly where we want to be at the moment,
methinks.

I've realised over the weekend, thinking about Eric Zurcher's criticisms,
what it is I'm trying to do with Draft Spec Mark 2 (some people are a bit
slow).

In Lucid and DELTA we enforce a great deal of structure on the data file:

*CHARACTER LIST
#1. Distribution by State (or known provenance)/
       1. South Australia - south-eastern (south of the line from Port Augusta
          to Broken Hill)/
       2. New South Wales (including Jervis Bay - A.C.T.)/
       3. Australian Capital Territory/
       4. Victoria/
       5. Tasmania/
...etc

If the * or the # or a / is left off or put in the wrong place, the whole
thing falls over. Only one very narrowly constrained data format can be a
valid file. This is primarily done for ease of processing, and was a
perfectly reasonable constraint for DELTA and Lucid to enforce, since what
they were trying to do was create a data file for their particular program.
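
To make that brittleness concrete, here is a toy sketch in Python. It is not how DELTA or Lucid actually read their files: the function name `parse_character_list` and the rule of one character or state per line are my assumptions for illustration only (real DELTA entries can wrap over several lines). It accepts the fragment above only when the *, the # and every / are exactly where it expects them.

```python
import re

def parse_character_list(text):
    """Strictly parse a toy DELTA-like character list (hypothetical helper).

    Assumes one character or state per line; any deviation raises ValueError.
    """
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or lines[0] != "*CHARACTER LIST":
        raise ValueError("missing '*CHARACTER LIST' directive")
    characters = []
    for ln in lines[1:]:
        if not ln.endswith("/"):
            raise ValueError(f"missing terminating '/': {ln!r}")
        char = re.fullmatch(r"#(\d+)\.\s+(.+)/", ln)
        state = re.fullmatch(r"(\d+)\.\s+(.+)/", ln)
        if char:
            # A '#' line starts a new character.
            characters.append(
                {"id": int(char.group(1)), "name": char.group(2), "states": []}
            )
        elif state and characters:
            # A numbered line is a state of the most recent character.
            characters[-1]["states"].append(state.group(2))
        else:
            raise ValueError(f"unrecognised line: {ln!r}")
    return characters

sample = """*CHARACTER LIST
#1. Distribution by State (or known provenance)/
       3. Australian Capital Territory/
       4. Victoria/
       5. Tasmania/
"""
parsed = parse_character_list(sample)
```

Drop the leading # or any trailing / from `sample` and the call raises instead of returning anything, which is the "whole thing falls over" behaviour in miniature.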

The standard I'm working towards ALLOWS this degree of formality, but
doesn't ENFORCE it. In the standard, for a file to be valid input for Lucid
or DELTA, it would need to conform to higher-order structure as imposed by
those programs. But not all descriptive data out there is in DELTA or Lucid
format, as Bryan notes, and we need to be inclusive of other types of data
as well (by far the majority of which is natural-language legacy data). If
all descriptive data needs to be encoded to a strong specification, it will
never be so encoded (that's one reason, I think, why DELTA has failed as a
global specification).

So I'm trying to create a spec where both this:

<DOCUMENT Name = "d1">

 <ITEM_PROPERTIES>
  <ITEM ID = "1" NAME = "Gouania exilis"/>
  <ITEM ID = "2" NAME = "Gouania australiana"/>
 </ITEM_PROPERTIES>

 <ELEMENT_PROPERTIES>
  <ELEMENT ID = "1" >
   <ELEMENT_NAME> Flower colour </ELEMENT_NAME>
   <VALUE_LIST>
    <VALUE ID = "1">
     <VALUE_NAME> "green" </VALUE_NAME>
    </VALUE>
    <VALUE ID = "2">
     <VALUE_NAME> "yellow" </VALUE_NAME>
    </VALUE>
   </VALUE_LIST>
   </ELEMENT>
 </ELEMENT_PROPERTIES>

 <DESCRIPTION Name = "Gouania exilis">
  <ELEMENT>
   <ELEMENT_ID> 1 </ELEMENT_ID>
  </ELEMENT>
  <VALUE_ID> 1 </VALUE_ID>
  <QUALIFIER> rarely </QUALIFIER>
 </DESCRIPTION>

</DOCUMENT>

... and this ...

<DOCUMENT Name = "d2">

Viola eminens K. Thiele & Prober, sp. nov.

<DESCRIPTION Name = "Viola eminens">
<ELEMENT Name = "Longevity"> <VALUE>Perennial</VALUE> </ELEMENT> <ELEMENT Name =
"Life form"> <VALUE> herb </VALUE> </ELEMENT> spreading by stolons; rootstock
sometimes somewhat swollen and bulbous at the stem bases. Stems contracted
so that the leaves form rosettes, never elongate with caulescent leaves.
<ELEMENT Name = "Leaves">Leaves <ELEMENT Name = "lamina"><ELEMENT Name =
"Shape"><VALUE>broad-reniform</VALUE></ELEMENT>, the largest (10-)12-15(-25)
mm long, (20-)25-35(-45) mm wide, 1.5-3.2 times wider than long, usually
with a broad basal sinus; lamina with 9-20 +/- prominent teeth, glabrous or
with scattered unicellular hairs on the upper surface, +/- concolorous
bright green </ELEMENT>; petioles 2-8 cm long; stipules narrowly triangular,
usually with several small, glandular teeth on each side.</ELEMENT>
.........etc
</DESCRIPTION>
</DOCUMENT>

are valid documents.

Now there's a lot of blob text in d2, but it's still a description, and
surely the very simple markup has added enormous value to it. If we find
this document through a web search we at least know that it includes a
description of Viola eminens, and that this description includes a statement
about the life form of that species. This is a huge advance on only knowing
that we've found a document that contains the words "Viola" and "eminens".
Isn't that what XML is all about?
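
As a rough sketch of what a harvester could pull out of such a document, the Python below uses the standard library's xml.etree.ElementTree. It assumes a cleaned-up, well-formed version of d2 (every ELEMENT carries a Name attribute and the ampersand is escaped), which d2 as typed above is not yet; the function name `summarise` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed well-formed variant of the start of d2 (not the draft syntax itself).
D2 = """<DOCUMENT Name="d2">
Viola eminens K. Thiele &amp; Prober, sp. nov.
<DESCRIPTION Name="Viola eminens">
<ELEMENT Name="Longevity"><VALUE>Perennial</VALUE></ELEMENT>
<ELEMENT Name="Life form"><VALUE> herb </VALUE></ELEMENT> spreading by stolons;
rootstock sometimes somewhat swollen and bulbous at the stem bases.
</DESCRIPTION>
</DOCUMENT>"""

def summarise(xml_text):
    """Return {described item: [names of marked-up elements]} (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    summary = {}
    for desc in root.iter("DESCRIPTION"):
        # Collect every named ELEMENT inside this description, in document order.
        names = [el.get("Name") for el in desc.iter("ELEMENT") if el.get("Name")]
        summary[desc.get("Name")] = names
    return summary

summary = summarise(D2)
```

Here `summary` tells us the document describes Viola eminens and says something about its longevity and life form, without the harvester having to understand any of the surrounding blob text.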

Further, it seems to me that imposing only the most basic formalities (that
a description is about something, that it starts at <DESCRIPTION> and ends
with </DESCRIPTION>, and that it may optionally include some structured
statements of the form <ELEMENT></ELEMENT> etc) can allow at least parts of
d2 to be mapped (or reformatted) to d1 surprisingly easily (once we've got
the syntax right, which isn't yet the case with d1 and d2, he hastens to
add).
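
A minimal sketch of such a mapping, again in Python with xml.etree.ElementTree and again assuming a well-formed variant of d2 rather than the draft syntax as typed. The function name `to_structured` and the way IDs are numbered are my inventions for illustration; the idea is only that each ELEMENT with a VALUE directly inside it becomes a d1-style (element ID, value ID) statement, and everything else is left behind as free text.

```python
import xml.etree.ElementTree as ET

# Assumed well-formed, abridged variant of d2.
D2 = """<DOCUMENT Name="d2">
<DESCRIPTION Name="Viola eminens">
<ELEMENT Name="Longevity"><VALUE>Perennial</VALUE></ELEMENT>
<ELEMENT Name="Life form"><VALUE> herb </VALUE></ELEMENT> spreading by stolons.
<ELEMENT Name="Leaves">Leaves <ELEMENT Name="lamina"><ELEMENT Name="Shape"><VALUE>broad-reniform</VALUE></ELEMENT>, usually with a broad basal sinus</ELEMENT>; petioles 2-8 cm long.</ELEMENT>
</DESCRIPTION>
</DOCUMENT>"""

def to_structured(xml_text):
    """Map marked-up statements to d1-style ID tables (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    elements = {}      # element name -> {"id": int, "values": {value name: id}}
    descriptions = []  # one entry per DESCRIPTION
    for desc in root.iter("DESCRIPTION"):
        statements = []
        for el in desc.iter("ELEMENT"):
            value = el.find("VALUE")  # direct children only
            if value is None or not (value.text or "").strip():
                continue  # no structured value here: stays as blob text
            entry = elements.setdefault(
                el.get("Name"), {"id": len(elements) + 1, "values": {}}
            )
            value_name = value.text.strip()
            value_id = entry["values"].setdefault(
                value_name, len(entry["values"]) + 1
            )
            statements.append((entry["id"], value_id))
        descriptions.append({"item": desc.get("Name"), "statements": statements})
    return elements, descriptions

elements, descriptions = to_structured(D2)
```

Note that this sketch flattens the nesting: "Shape" loses the fact that it sat inside "lamina" inside "Leaves", so a real mapping would need to carry that path along somehow.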

It seems to me that creating a standard like this would be more valuable
than simply XMLifying the DELTA or Lucid data file structure.

So that's what I'm trying to do, I think.

Cheers - k