Kevin Thiele writes:
Date: Wed, 1 Nov 2000 16:32:28 +1100
From: Kevin Thiele kevin.thiele@PI.CSIRO.AU
To: TDWG-SDD@usobi.org
Subject: Re: TDWG-SDD XML proposals
Jim wrote:
Anyway, it looks as though we are settling on an architecture that allows both discursive descriptions, accommodated by 'narrative' elements, and, for the compulsive obsessives, a nested set of 'features' that can have 'feature_values', 'qualifiers' and associated 'narrative'... Right?
And that the competing architecture of a list of 'feature_values' that can be present, absent, unknown, doubtful, present by misinterpretation, absent by misinterpretation, imaginary, etc. has been given the flick?
There are two things mixed in here, surely.
- Yes, I think an architecture that allows both discursive (read "natural language" if you will) and coded (DELTA-like) descriptions is feasible. The principal difference between discursive and coded forms is that one uses idrefs and the other doesn't - so as Bob notes, once the XMLHeads work out how to use idrefs here we will hopefully be able to accommodate the coded form.
Well, only part of this intersects the distinction that we are working with, though the part that doesn't is certainly important. The distinction we are presently thinking about is purely syntactic: whether the XML data representation is tree-structured (a non-linear tree) and without idrefs, or linearly structured and with idrefs. In fact all trees can be flattened into lists, if by nothing else than adding references to the mix. It's easy to be convinced of this: trees can be represented in the conventional linear memory supplied with most computers, so it is clearly possible (and not that hard). What we want to get right is not whether this can be done, since we know it can, but specifically whether XML's reference mechanisms will capture the requirements of the intended stakeholders in the enterprise. We hope they do, because there is a BIG advantage to using XML: there is an abundance of tools, protocols, databases and services available to those who adopt it.
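To make the flattening argument concrete, here is a minimal sketch in Python. It assumes a hypothetical nested sample (the "workers" grouping is invented for illustration) and hypothetical "id" and "parent" fields standing in for XML ID/IDREF attributes; it is not the TDD0.3 representation itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical nested sample; the "workers" grouping is an assumption.
NESTED = """
<DESCRIPTION>
  <FEATURE name="workers">
    <FEATURE name="scapes and tibiae erect hair">
      <FEATURE_VALUE>lack</FEATURE_VALUE>
    </FEATURE>
    <FEATURE name="color">
      <FEATURE_VALUE>concolorous brown</FEATURE_VALUE>
    </FEATURE>
  </FEATURE>
</DESCRIPTION>
"""

def flatten(root):
    """Return a flat list of records; nesting is replaced by parent references."""
    records = []

    def visit(elem, parent_id):
        for child in elem.findall("FEATURE"):
            my_id = "f%d" % (len(records) + 1)  # hypothetical id scheme
            values = [v.text for v in child.findall("FEATURE_VALUE")]
            records.append({"id": my_id, "parent": parent_id,
                            "name": child.get("name"), "values": values})
            visit(child, my_id)

    visit(root, None)
    return records

def rebuild(records):
    """Rebuild the nested tree from the flat list by following the references."""
    root = ET.Element("DESCRIPTION")
    by_id = {}
    for rec in records:  # parents always precede children in the flat list
        parent = root if rec["parent"] is None else by_id[rec["parent"]]
        feature = ET.SubElement(parent, "FEATURE", name=rec["name"])
        for value in rec["values"]:
            ET.SubElement(feature, "FEATURE_VALUE").text = value
        by_id[rec["id"]] = feature
    return root

flat = flatten(ET.fromstring(NESTED))
for rec in flat:
    print(rec)
```

Flattening the rebuilt tree gives back the same list, which is the sense in which nothing is lost by moving between the two forms. Whether XML's own reference mechanisms express this comfortably is the open question.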
The question of discursive in the sense of natural language doesn't directly enter our present efforts, though we've been thinking a little about it. Gregor Hagedorn has implemented an approach in which his database is the source of the natural language syntax, in that it stores verbs that a given description can access to produce something that produces, at worst, bemused smiles from a human reader. This may be a reasonable approach for discursive representations in XML too, but we didn't attempt to capture it. Rather, we forced some of our sample into our model by changing the English grammar a little. For example, text that was originally part of a diagnosis and read
Workers: scapes and tibiae lack erect hair
we marked as
<FEATURE name="scapes and tibiae erect hair">
  <FEATURE_VALUE>lack</FEATURE_VALUE>
</FEATURE>
A naive discursive rendering of this would probably emerge as
scapes and tibiae erect hair lack.
Of course, Gregor might be happy about that sentence order. :-)
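A minimal sketch of such a naive renderer, in Python, using the markup above. The function name render_naive is hypothetical, and the rule (feature name followed by its values) is only the simplest thing that could work, not anyone's actual implementation.

```python
import xml.etree.ElementTree as ET

MARKUP = """
<FEATURE name="scapes and tibiae erect hair">
  <FEATURE_VALUE>lack</FEATURE_VALUE>
</FEATURE>
"""

def render_naive(feature):
    """Emit the feature name followed by its values, with no grammar at all."""
    values = [v.text.strip() for v in feature.findall("FEATURE_VALUE")]
    return "%s %s." % (feature.get("name"), ", ".join(values))

sentence = render_naive(ET.fromstring(MARKUP))
print(sentence)
```

Anything better than this word order needs extra information, such as stored verbs or templates, which the markup alone does not carry.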
For me, what is lurking under the covers here is this: I don't think that high quality human readability of the naked markup is that important. I think it is the job of application software to render markup palatable to humans if that is the problem, or to other software if that is the problem.
- In either case, there is still a need for a limited set of qualifiers for feature values (i.e. present, uncertain, present by misinterpretation etc.). This will surely simply be another layer of complexity that will be added to the core once it's robust.
This is partially implemented already (using two rather curious qualifiers):
<FEATURE name="color">
  <FEATURE_VALUE QUALIFIER="highly">varies between colonies</FEATURE_VALUE>
  <FEATURE_VALUE QUALIFIER="maybe">concolorous yellow-brown</FEATURE_VALUE>
  <FEATURE_VALUE>bicolored with darker head</FEATURE_VALUE>
  <FEATURE_VALUE>concolorous brown</FEATURE_VALUE>
</FEATURE>
We extracted "highly" and "maybe" from the original. This raises an interesting point. It may well be good to recommend a specific set of qualifiers, since that allows more intelligent applications. For example (and neglecting that "highly variable" and "frequently variable" are probably not acceptable as nearly equivalent), it might be easier to guide identification software if "frequently" is uniformly used where appropriate, thus perhaps identifying some feature as initially more (or less) important. However, (a) an author might not be completely satisfied with that qualifier, and (b) an author may be satisfied but prefer something else in natural language discourse. Probably this entails recommending a mechanism for specifying alternatives to the 'official' qualifier. Those mechanisms could guide applications that were aware of them. For example:
<FEATURE_VALUE QUALIFIER="frequently" RENDER_QUALIFIER_AS="highly">
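As a rough sketch of how an application aware of this mechanism might behave, the Python below reasons with QUALIFIER and displays RENDER_QUALIFIER_AS when the author supplied one. The attribute names come from the example above; the function names and the fallback rule are assumptions.

```python
import xml.etree.ElementTree as ET

MARKUP = """
<FEATURE name="color">
  <FEATURE_VALUE QUALIFIER="frequently" RENDER_QUALIFIER_AS="highly">varies between colonies</FEATURE_VALUE>
  <FEATURE_VALUE QUALIFIER="maybe">concolorous yellow-brown</FEATURE_VALUE>
  <FEATURE_VALUE>concolorous brown</FEATURE_VALUE>
</FEATURE>
"""

def official_qualifier(value):
    """Qualifier an application should reason with (None if absent)."""
    return value.get("QUALIFIER")

def display_qualifier(value):
    """Qualifier to show a reader: the author's alternative if one is given."""
    return value.get("RENDER_QUALIFIER_AS", value.get("QUALIFIER"))

def summarize(feature):
    """List (official, displayed, text) for each value of a feature."""
    rows = []
    for value in feature.findall("FEATURE_VALUE"):
        rows.append((official_qualifier(value),
                     display_qualifier(value),
                     value.text))
    return rows

rows = summarize(ET.fromstring(MARKUP))
for official, shown, text in rows:
    print(official, "|", shown, "|", text)
```

Identification software would look only at the first column, while a natural-language renderer would use the second, so the author's wording survives without weakening the controlled vocabulary.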
Presumably, Bob's comments:
As we represent it, TDD0.3 has only a few classes of objects:
FEATURE: a complex type with a name attribute and containing a string-based FEATURE_VALUE, NARRATIVEs and, recursively, other FEATUREs.
DESCRIPTION: a container of FEATUREs and NARRATIVEs.
NARRATIVE: a string-based type suitable for extension to more complex markup.
Our goal was the production and application of a Schema for TDD0.3, and consequently in the sample applications we have forced some things into the model which are likely, ultimately, to have their own Schema. The notable examples are publishing artifacts such as markup for literature references, and some common scientific vocabulary deserving its own tagging standards.
means that many things will eventually be moved out of the rather featureless FEATURE and DESCRIPTION objects into more purpose-built structures.
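For readers who think in objects rather than Schema, the three classes Bob lists can be sketched in Python as below. The field names, the use of lists for values, and the "workers" grouping in the sample are assumptions; only the class roles come from the quoted description.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Narrative:
    """String-based type, open to extension with richer markup."""
    text: str

@dataclass
class Feature:
    """Named feature holding values, narratives and, recursively, features."""
    name: str
    values: List[str] = field(default_factory=list)
    narratives: List[Narrative] = field(default_factory=list)
    features: List["Feature"] = field(default_factory=list)

@dataclass
class Description:
    """Container of features and narratives."""
    features: List[Feature] = field(default_factory=list)
    narratives: List[Narrative] = field(default_factory=list)

def count_features(features):
    """Count features at every nesting level."""
    return sum(1 + count_features(f.features) for f in features)

# Sample built from the examples in this message; the nesting is assumed.
desc = Description(
    narratives=[Narrative("Workers: scapes and tibiae lack erect hair")],
    features=[Feature(name="workers", features=[
        Feature(name="scapes and tibiae erect hair", values=["lack"]),
        Feature(name="color", values=["concolorous brown"]),
    ])],
)
print(count_features(desc.features))
```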
Cheers - k