Thought and a Question
ldodds at INGENTA.COM
Fri Nov 30 09:27:35 CET 2001
I'll start with the question:
Am I right in assuming that the process is to propose and agree on a
set of challenges, and then present different solutions for solving them?
If so, how should those solutions be presented? Kevin and Kerry
have both posted XML documents that meet Challenge 1, and
Steve Shattuck and I have also provided XML formats that
can encode similar data.
How will we judge these formats as better or worse? I can imagine
producing a near infinite number of different XML vocabularies
that can encode the same data (+/- attributes, different tag
names, nesting, cross-referencing, etc, etc).
I'd like to suggest, again, that the solutions for the challenges
should be presented as simple data models. Syntax can come later
(and may fall naturally out of the model).
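
To illustrate what I mean by putting the model first, here is a minimal
sketch in Python. Everything in it is an assumption made up for the
example: the class names (CharacterState, Description), both XML
vocabularies, and the sample taxon and character are hypothetical and are
not taken from any of the formats posted to the list. It only shows that
two differently shaped documents can reduce to the same model instance.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass(frozen=True)
class CharacterState:
    """One observation: a character (e.g. 'leaf shape') and its state."""
    character: str
    state: str


@dataclass(frozen=True)
class Description:
    """A taxon name plus the set of character states recorded for it."""
    taxon: str
    states: frozenset


def from_attribute_style(xml_text: str) -> Description:
    # Hypothetical vocabulary A: the data is carried in attributes.
    root = ET.fromstring(xml_text)
    states = frozenset(
        CharacterState(el.get("name"), el.get("state"))
        for el in root.findall("character")
    )
    return Description(root.get("taxon"), states)


def from_element_style(xml_text: str) -> Description:
    # Hypothetical vocabulary B: the same data carried in nested elements.
    root = ET.fromstring(xml_text)
    states = frozenset(
        CharacterState(el.findtext("name"), el.findtext("value"))
        for el in root.findall("feature")
    )
    return Description(root.findtext("name"), states)


# Two invented documents with different tag names and nesting.
doc_a = (
    '<description taxon="Example taxon">'
    '<character name="leaf shape" state="ovate"/>'
    '</description>'
)
doc_b = (
    '<item><name>Example taxon</name>'
    '<feature><name>leaf shape</name><value>ovate</value></feature>'
    '</item>'
)

# Both syntaxes load into equal model instances.
same = from_attribute_style(doc_a) == from_element_style(doc_b)
print(same)
```

If the model is agreed first, a comparison like the one on the last lines
becomes the test of whether a proposed syntax is adequate, and arguments
about tag names or attributes versus elements become secondary.
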
Now the thought:
Kevin has pointed out that there is a large amount of taxonomic
data which is simply free text, and it is a worthwhile goal to
consider how to repurpose this information.
However, I think that is a separate work item from defining
a format for capturing that data.
I'd actually interpreted Challenge 1 to mean "represent the data contained
in these natural language descriptions", and was ignoring (for the
present) how that data would be extracted (manual markup of
the text description or direct data entry into a suitable tool).
Extracting information from free text is still far from being
an automated process. If there is still going to be a manual element
involved for some time, it seems better to have a standard and
supporting tools to capture _new_ data in a rigorous format, while
considering the markup and processing of 'legacy' data as a separate
work item.
To approach this slightly differently, the example markup that Kevin
has suggested adding to free text descriptions would be interpreted
according to a standard data model :)
I hope this isn't interpreted as an attempt to derail the good progress
being made (nice to see some discussion again! :) but in the 'XML world'
we're seeing problems surface because of incompatibilities between
specifications that have arisen due to slightly different interpretations
of the XML data model. This is because the model came after the
syntax.
Leigh Dodds, Research Group, Ingenta | "Pluralitas non est ponenda
http://weblogs.userland.com/eclectic | sine necessitate"
http://www.xml.com/pub/xmldeviant | -- William of Ockham