Kevin has offered us not just 1, but 4, different document models for comment. That's a lot to consider, and I won't attempt to offer detailed criticism on each alternative. But from my point of view, the first several appear far too "weak" to be of much use. In particular, I think it is essential that there be a mechanism for specifying a universe of the character and state values (or, mapping roughly from Kevin's terminology, "elements" and "values", respectively) which may appear within a description. That is, we need to be able to define (or, rather, provide a mechanism for the dataset designer to define) the equivalent of a DELTA "character list". This is needed to enforce consistency and to disallow nonsensical constructs. For example, a portion of Document 1 reads:
<STATEMENT>
  <ITEM> <ITEM_NAME> Gouania exilis </ITEM_NAME> </ITEM>
  <ELEMENT> <ELEMENT_NAME> Flower colour </ELEMENT_NAME> </ELEMENT>
  <VALUE> green </VALUE>
  <QUALIFIER> rarely </QUALIFIER>
</STATEMENT>
There needs to be a way to preclude a nonsensical entry like:
<STATEMENT>
  <ITEM> <ITEM_NAME> Gouania exilis </ITEM_NAME> </ITEM>
  <ELEMENT> <ELEMENT_NAME> Flour colur </ELEMENT_NAME> </ELEMENT>
  <VALUE> Puerto Rico </VALUE>
  <QUALIFIER> anchovies, please! </QUALIFIER>
</STATEMENT>
I might also note that I strongly question the way the above is organized. I think a rearrangement better expressing the relationships (but still not really addressing the problem of a lack of meaningful validation) would be more along the lines of:
<ITEM>
  <ITEM_NAME> Gouania exilis </ITEM_NAME>
  <ELEMENT>
    <ELEMENT_NAME> Flower colour </ELEMENT_NAME>
    <VALUE> green <QUALIFIER> rarely </QUALIFIER> </VALUE>
  </ELEMENT>
</ITEM>
And I'd be inclined to make a bit more use of attributes, rather than element content, though as Bob Morris points out, that is largely (though not entirely) a matter of stylistic convention.
I don't really see any great difficulties in implementing some sort of "character list" within the XML syntax. "Document 4" appears to be heading in that direction, but doesn't (in my opinion) go quite far enough. What I think we need is a Schema definition that would allow a validating parser to detect obvious data errors, and assist editing software in enforcing "correctness" of the data. I've recently been looking through the description of XML Schema, and it seems to have the expressive power needed for this sort of thing.
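To make the idea a bit more concrete, here is a rough sketch of the kind of check a "character list" would make possible. It is written in Python with the standard library, not in XML Schema itself, so it only illustrates the principle and is not a proposal for the actual mechanism. The character list, its permitted states, the set of qualifiers and the function name are all invented for the example; only the tag names and the two sample statements come from the Document 1 fragments quoted earlier (with the closing ITEM_NAME tag written so that the XML is well formed).

```python
import xml.etree.ElementTree as ET

# Hypothetical character list: each character name maps to its permitted states.
# A real one would be supplied by the dataset designer.
CHARACTER_LIST = {
    "Flower colour": {"green", "white", "yellow"},
}

# Hypothetical set of permitted qualifiers.
QUALIFIERS = {"rarely", "usually", "sometimes"}

GOOD = """<STATEMENT>
  <ITEM> <ITEM_NAME> Gouania exilis </ITEM_NAME> </ITEM>
  <ELEMENT> <ELEMENT_NAME> Flower colour </ELEMENT_NAME> </ELEMENT>
  <VALUE> green </VALUE>
  <QUALIFIER> rarely </QUALIFIER>
</STATEMENT>"""

BAD = """<STATEMENT>
  <ITEM> <ITEM_NAME> Gouania exilis </ITEM_NAME> </ITEM>
  <ELEMENT> <ELEMENT_NAME> Flour colur </ELEMENT_NAME> </ELEMENT>
  <VALUE> Puerto Rico </VALUE>
  <QUALIFIER> anchovies, please! </QUALIFIER>
</STATEMENT>"""


def check_statement(xml_text):
    """Return a list of problems found in one <STATEMENT>; empty if none."""
    stmt = ET.fromstring(xml_text)
    problems = []

    name = (stmt.findtext("ELEMENT/ELEMENT_NAME") or "").strip()
    value = (stmt.findtext("VALUE") or "").strip()
    qualifier = stmt.findtext("QUALIFIER")  # None when no qualifier is given

    if name not in CHARACTER_LIST:
        # The state cannot be judged if the character itself is unknown.
        problems.append(f"unknown character: {name!r}")
    elif value not in CHARACTER_LIST[name]:
        problems.append(f"state {value!r} not allowed for {name!r}")

    if qualifier is not None and qualifier.strip() not in QUALIFIERS:
        problems.append(f"unknown qualifier: {qualifier.strip()!r}")

    return problems


if __name__ == "__main__":
    for label, text in (("good", GOOD), ("bad", BAD)):
        print(label, check_statement(text))
```

The point is only that, once the permitted characters and states are declared somewhere, rejecting the "anchovies" entry becomes a mechanical lookup. Whether that declaration lives in a schema or elsewhere is a separate question.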
Cheers,
Eric Zurcher
CSIRO Division of Entomology
Canberra, Australia
E-mail: ericz@ento.csiro.au