TDWG-SDD XML proposals

Wed Nov 1 10:29:08 CET 2000

Kevin Thiele writes:
 > Date:         Wed, 1 Nov 2000 16:32:28 +1100
 > From: Kevin Thiele <kevin.thiele at PI.CSIRO.AU>
 > To: TDWG-SDD at usobi.org
 > Subject:      Re: TDWG-SDD XML proposals
 >
 > Jim wrote:
 >
 > > Anyway, it looks as though we are settling on an architecture allows both
 >  > discursive descriptions accommodated by 'narrative' elements, and for the
 >  > compulsive obsessives, a nested set of 'features' that can have
 >  > 'feature_values', 'qualifers' and associated 'narrative'...  Right?
 >  >
 >  > And that the competing architecture of a list of 'feature_values' that
 > can
 >  > be present, absent, unknown, doubtful, present by misinterpretation,
 > absent
 >  > by misinterpretation, imaginary, etc. has been given the flick?
 >
 > There are two things mixed in here, surely.
 >
 > 1. Yes, I think an architecture that allows both discursive (read "natural
 > language" if you will) and coded (DELTA-like) descriptions is feasible. The
 > principal difference between discursive and coded forms is that one uses
 > idrefs and the other doesn't - so as Bob notes, once the XMLHeads work out
 > how to use idrefs here we will hopefully be able to accommodate the coded
 > form.

Well, only part of this intersects the distinction that we are working
against, though the non-intersection is certainly an important one. The
distinction we are presently thinking about is purely syntactic. It is
whether the XML data representation is tree structured (a non-linear
tree) and without idrefs, or is linearly structured and with idrefs.
In fact all trees can be flattened into lists, if by nothing else than
adding references to the mix. It's easy to be convinced of this: trees
can be represented in the conventional linear memory supplied with
most computers, so it is clearly possible (and not that hard). What we
want to get right is not whether this can be done, since we know it
can, but specifically whether XML's reference mechanisms will capture
the requirements of the intended stakeholders in the enterprise. We
hope they do, because there is a BIG advantage to using XML: there is
an abundance of tools, protocols, databases and services available to
those who adopt it.
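
To make the flattening claim concrete, here is a minimal sketch in
Python (standard library only). It turns a nested set of FEATUREs into
a flat list in which each FEATURE carries a reference to its parent.
The attribute names "id" and "parent" and the function name flatten are
assumptions made up for this illustration; they are not part of any
TDWG-SDD draft, and in real XML they would be declared as ID and IDREF
in a DTD or Schema.

```python
import xml.etree.ElementTree as ET

# Sample data invented for this sketch; only the element names
# FEATURE, FEATURE_VALUE and DESCRIPTION come from the discussion.
NESTED = """
<DESCRIPTION>
  <FEATURE name="worker">
    <FEATURE name="scape">
      <FEATURE_VALUE>lacks erect hair</FEATURE_VALUE>
    </FEATURE>
    <FEATURE name="color">
      <FEATURE_VALUE>concolorous brown</FEATURE_VALUE>
    </FEATURE>
  </FEATURE>
</DESCRIPTION>
"""


def flatten(nested_xml):
    """Return a DESCRIPTION whose FEATUREs are siblings linked by references."""
    root = ET.fromstring(nested_xml)
    flat = ET.Element("DESCRIPTION")
    counter = 0

    def visit(feature, parent_id):
        nonlocal counter
        counter += 1
        feature_id = f"f{counter}"
        attrs = {"id": feature_id, "name": feature.get("name", "")}
        if parent_id is not None:
            # Hypothetical idref-style attribute pointing at the parent.
            attrs["parent"] = parent_id
        row = ET.SubElement(flat, "FEATURE", attrs)
        for child in feature:
            if child.tag == "FEATURE":
                # Nested FEATUREs become siblings in the flat list.
                visit(child, feature_id)
            else:
                # Values and narrative stay attached to their own FEATURE.
                copy = ET.SubElement(row, child.tag, dict(child.attrib))
                copy.text = child.text

    for top in root.findall("FEATURE"):
        visit(top, None)
    return flat


if __name__ == "__main__":
    print(ET.tostring(flatten(NESTED), encoding="unicode"))
```

Nothing is lost in the flat form: the tree can be rebuilt by following
the parent references, which is the sense in which the choice between
the two forms is purely syntactic.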

The question of discursive in the sense of natural language doesn't
directly enter our present efforts, though we've been thinking a
little about it. Gregor Hagedorn has implemented an approach in which
his database is the source of the natural language syntax, in that it
stores verbs that a given description can access to produce something
that draws, at worst, bemused smiles from a human reader. This may be
a reasonable approach for discursive representations in XML too, but
we didn't attempt to capture it. Rather, we forced some of our sample
into our model by changing the English grammar a little. For example,
text that was originally part of a diagnosis whose discourse was

Workers: scapes and tibiae lack erect hair

we marked as

<FEATURE name="scapes and tibiae erect hair">
  <FEATURE_VALUE>lack</FEATURE_VALUE>
</FEATURE>

A naive discursive rendering of this would probably emerge as

scapes and tibiae erect hair lack.

Of course, Gregor might be happy about that sentence order. :-)
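
For what it's worth, the naive rendering above amounts to nothing more
than concatenating the name attribute with the value text. A minimal
sketch in Python, where the function name render_naive and the choice
to join multiple values with commas are assumptions for illustration
only:

```python
import xml.etree.ElementTree as ET

MARKUP = """<FEATURE name="scapes and tibiae erect hair">
  <FEATURE_VALUE>lack</FEATURE_VALUE>
</FEATURE>"""


def render_naive(feature_xml):
    """Render one FEATURE as 'name value(s).' with no grammar at all."""
    feature = ET.fromstring(feature_xml)
    values = [
        value.text.strip()
        for value in feature.findall("FEATURE_VALUE")
        if value.text
    ]
    return f"{feature.get('name')} {', '.join(values)}."


if __name__ == "__main__":
    print(render_naive(MARKUP))  # scapes and tibiae erect hair lack.
```

Anything better than this needs knowledge that is not in the markup,
such as the stored verbs in Gregor's approach.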

For me, what is lurking under the covers here is this: I don't think
that high quality human readability of the naked markup is that
important. I think it is the job of application software to render
markup palatable to humans if that is the problem, or to other
software if that is the problem.

 >
 > 2. In either case, there is still a need for a limited set of qualifiers for
 > feature values (ie present, uncertain, present by misinterpretation etc).
 > This will surely simply be another layer of complexity that will be added to
 > the core once it's robust.
 >
 > This is partially implemented already (using two rather curious qualifiers):
 > - <FEATURE name="color">
 >   <FEATURE_VALUE QUALIFIER="highly">varies between colonies</FEATURE_VALUE>
 >   <FEATURE_VALUE QUALIFIER="maybe">concolorous yellow-brown</FEATURE_VALUE>
 >   <FEATURE_VALUE>bicolored with darker head</FEATURE_VALUE>
 >   <FEATURE_VALUE>concolorous brown</FEATURE_VALUE>
 >   </FEATURE>

We extracted "highly" and "maybe" from the original. This raises an
interesting point. It may well be good to recommend a specific set of
qualifiers. That allows more intelligent applications. For example
(and neglecting that "highly variable" and "frequently variable" are
probably not acceptable as nearly equivalent), it might be easier to
guide identification software if "frequently" is uniformly used where
appropriate, thus perhaps identifying some feature as initially more
(or less) important. However, (a) an author might not be completely
satisfied with that qualifier and (b) an author may be satisfied but
prefer something else in natural language discourse. Probably this
entails recommending a mechanism for specifying alternatives to the
'official' qualifier. Those mechanisms could guide applications that
were aware of them. For example

<FEATURE_VALUE QUALIFIER="frequently" RENDER_QUALIFIER_AS="highly">
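
An application aware of such a mechanism might treat the two attributes
along the following lines. This is only a sketch in Python: the
function names are hypothetical, the fallback rule (use QUALIFIER for
display when no alternative is given) is my assumption, and the element
content is borrowed from the color example quoted above.

```python
import xml.etree.ElementTree as ET

SAMPLE = (
    '<FEATURE_VALUE QUALIFIER="frequently" RENDER_QUALIFIER_AS="highly">'
    "varies between colonies"
    "</FEATURE_VALUE>"
)


def qualifier_for_software(value):
    """The 'official' qualifier, e.g. for guiding identification software."""
    return value.get("QUALIFIER")


def qualifier_for_display(value):
    """The author's preferred wording, falling back to the official one."""
    return value.get("RENDER_QUALIFIER_AS") or value.get("QUALIFIER")


def render_value(value):
    """Render a FEATURE_VALUE for a human reader."""
    qualifier = qualifier_for_display(value)
    text = (value.text or "").strip()
    return f"{qualifier} {text}" if qualifier else text


if __name__ == "__main__":
    element = ET.fromstring(SAMPLE)
    print(qualifier_for_software(element))  # frequently
    print(render_value(element))            # highly varies between colonies
```

An application that knows nothing about RENDER_QUALIFIER_AS still sees
the recommended qualifier, so the alternative costs it nothing.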

 >
 > Presumably, Bob's comments:
 > >As we represent it, TDD0.3 has only a few classes of objects:
 > >FEATURE a complex type with a name attribute and containing a string-based
 > FEATURE_VALUE,
 > >NARRATIVEs and, recursively, other FEATUREs.
 > >DESCRIPTION a container of FEATUREs and NARRATIVES.
 > >NARRATIVE a string-based type suitable for extension to more complex
 > markup.
 >
 > >Our goal was the production and application of a Schema for TDD0.3, and
 > consequently in the
 > >sample applications we have forced some things into the model which are
 > likely, ultimately,
 > >to have their own Schema. The notable examples are publishing artifacts
 > such as markup for
 > >literature references, and some common scientific vocabulary deserving its
 > own tagging standards.
 >
 > means that many things will eventually be moved out of the rather
 > featureless FEATURE and DESCRIPTION objects into more purpose-built
 > structures.
 >
 > Cheers - k
 >