I'll start with the question:
Am I right in assuming that the process is to propose and agree on a set of challenges, and then present different solutions for solving these?
If so, how should those solutions be presented? Kevin and Kerry have both posted XML documents that meet Challenge 1, and Steve Shattuck and I have also provided XML formats that can encode similar data.
How will we judge these formats as better or worse? I can imagine producing a near-infinite number of different XML vocabularies that can encode the same data (+/- attributes, different tag names, nesting, cross-referencing, etc.).
I'd like to suggest, again, that the solutions for the challenges should be presented as simple data models. Syntax can come later (and may fall naturally out of the model).
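To make the point concrete, here is a minimal Python sketch. Everything in it is made up for illustration: the `Description` model, both XML vocabularies, and the sample taxon and characters are hypothetical and are not taken from Challenge 1 or from any of the formats posted so far. It only shows that two differently shaped documents can reduce to the same model instance, so the model is the thing worth comparing.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


# Hypothetical model: a taxon name plus a set of (character, state) pairs.
@dataclass(frozen=True)
class Description:
    taxon: str
    characters: frozenset  # of (character, state) tuples


# Hypothetical vocabulary A: values carried in attributes.
XML_A = """<description taxon="Example species">
  <char name="leaf shape" state="ovate"/>
  <char name="petal colour" state="white"/>
</description>"""

# Hypothetical vocabulary B: different tag names, values in nested elements.
XML_B = """<taxon>
  <name>Example species</name>
  <feature><label>leaf shape</label><value>ovate</value></feature>
  <feature><label>petal colour</label><value>white</value></feature>
</taxon>"""


def from_a(text):
    """Read vocabulary A into the shared model."""
    root = ET.fromstring(text)
    pairs = frozenset((c.get("name"), c.get("state")) for c in root.findall("char"))
    return Description(root.get("taxon"), pairs)


def from_b(text):
    """Read vocabulary B into the shared model."""
    root = ET.fromstring(text)
    pairs = frozenset(
        (f.findtext("label"), f.findtext("value")) for f in root.findall("feature")
    )
    return Description(root.findtext("name"), pairs)


# Both syntaxes yield an equal model instance.
print(from_a(XML_A) == from_b(XML_B))
```

If the model is agreed first, choosing between A and B becomes a question of convenience rather than of what data can be represented.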
--
Now the thought:
Kevin has pointed out that there is a large amount of taxonomic data which is simply free text, and it is a worthwhile goal to consider how to repurpose this information.
However, I think that is a separate work item from defining a format for capturing that data.
I'd actually interpreted Challenge 1 to mean "represent the data contained in these natural language descriptions", and was ignoring (for the present) how that data would be extracted (manual markup of the text description or direct data entry into a suitable tool).
Extracting information from free text is still far from being an automated process. If there is still going to be a manual element involved for some time, it seems better to have a standard and supporting tools to capture _new_ data in a rigorous format, while considering the markup and processing of 'legacy' data as a separate issue.
To approach this slightly differently, the example markup that Kevin has suggested adding to free text descriptions would be interpreted according to a standard data model :)
--
I hope this isn't interpreted as an attempt to derail the good progress being made (nice to see some discussion again! :) but in the 'XML world' we're seeing problems surface because of incompatibilities between specifications that have arisen due to slightly different interpretations of the XML data model. This is because the model came after the syntax.
L.
--
Leigh Dodds, Research Group, Ingenta | "Pluralitas non est ponenda
http://weblogs.userland.com/eclectic |  sine necessitate"
http://www.xml.com/pub/xmldeviant    |  -- William of Ockham