kevin.thiele at PI.CSIRO.AU
Wed Nov 24 12:05:05 CET 1999
By golly what a start! Great stuff!
Having read this morning the small flurry of contributions from Mauro,
Leigh, Don, Jean-Marc and Noel, it's obvious that there are many ideas out
there just waiting to burst forth. Perhaps we need a bit of structure now
(or perhaps later).
I'd suggest that there are several issues that can be used to structure the
discussion:
1. WHAT'S THE INITIAL LIMIT FOR WHAT WE'RE TRYING TO DO?
We could try to describe the universe that we're operating in by choosing
which of the following is most true.
a) We want an interchange format for current (and future) descriptive
These include LucID, DELTA's CONFOR, IntKey etc (the programs), XID,
Linnaeus, and others (we should perhaps fill in this list sometime). The
idea of an interchange format would be simple - each program would continue
to store data in its own native format, but would implement import/export
functions to the common standard so users would not be stuck in one format.
The format should be at least a sum of all the types of data stored by each.
Note that under this model if a user stored data using one program's data
editor (e.g. the DELTA Builder) then exported to another program (e.g. the
LucID Player) there will be data loss (since not all data stored will be
useable by the recipient) and data inadequacy (since the recipient may
support data that can't be stored by the storer). But having an exchange
standard will increase pressure from users on all developers to do some
competitive bootstrapping, and perhaps also encourage someone to develop a
direct builder for the exchange format.
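To make the data-loss point in a) concrete, here is a minimal Python sketch of a round trip through a common format. The program names (builder_a, player_b), the field names and the capability sets are all invented for illustration; they do not describe what DELTA, LucID or any other real program actually stores.

```python
# Hypothetical sketch: every name below is an assumption made for illustration.
COMMON_FIELDS = {"states", "numeric_range", "comment", "media_link"}

# What each (hypothetical) program can store natively.
SUPPORTED = {
    "builder_a": {"states", "numeric_range", "comment"},
    "player_b": {"states", "media_link"},
}


def export_to_common(record, source):
    """Write out only the fields the source program stores and the standard knows."""
    allowed = SUPPORTED[source] & COMMON_FIELDS
    return {k: v for k, v in record.items() if k in allowed}


def import_from_common(common, target):
    """Return what the target can use, what it must drop, and what it never received."""
    usable = {k: v for k, v in common.items() if k in SUPPORTED[target]}
    lost = sorted(set(common) - set(usable))            # data loss
    missing = sorted(SUPPORTED[target] - set(common))   # data inadequacy
    return usable, lost, missing


record = {
    "states": ["red", "white"],
    "numeric_range": (2.0, 5.5),
    "comment": "rarely white",
}
common = export_to_common(record, "builder_a")
usable, lost, missing = import_from_common(common, "player_b")
print("usable:", usable)
print("lost on import:", lost)
print("never supplied:", missing)
```

Here "lost" corresponds to data the recipient cannot use, and "missing" to data the recipient supports but the storer could not provide.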
b) We want a new data format for storing, parsing and manipulating
descriptive data, without reference to existing programs.
This would perhaps not differ from a) above, except that it would start more
from scratch. It could perhaps subsume a) if such a format could also be
used as an interchange format. But given the suggestion from Jean-Marc:
>- The main existing formats, if we count by the number of species
>are not DELTA, or other Computer Taxonomists's inventions, they are the
>semi-formal formats of the existing floras in plain text, which contain
>vast majority of existing species;
>- the priority is to put this existing material in XML (this can be done
>using standard parsing techniques);
it would perhaps not serve the function of a) and those of us with existing
programs would not advance beyond square 1.
If we go this route a whole new suite of software tools will be needed for
handling this, turnkey systems at the very least to put tools in the hands
of ordinary users who don't know and don't want to know the intricacies of XML.
c) We want to go the Universe and subsume all existing descriptive data
formats into one superformat.
This seems to be behind the comments about subsuming NEXUS. I think we
should tread warily here. My impression is that users of NEXUS are more than
happy with their format, there is no pressure for change coming from there,
and we could merrily try to subsume NEXUS but find that the NEXUS community
would ignore our efforts (thank you very much, I'm quite all right). Are the
Maddisons/Swoffords/Donoghues of the world listening? What do you think?
d) any other statements?
2. WHAT ARE THE BASELINE REQUIREMENTS FOR THE SYSTEM?
I agree completely that step 1, whatever our goal, must be a formal
requirements analysis. Perhaps this would best proceed by the group
contributing to an evolving list (or is there a better way?). We could use
Leigh's as a starting point:
REQUIREMENTS FOR A DESCRIPTIVE DATA FORMAT
1. - ease of use (i.e. authoring)
2. - ease of processing (parsing, validating, reading, converting)
3. - ease of sharing (i.e. distribution)
4. - open-ness (i.e. proprietary/non-proprietary)
5. - ease of extensibility (i.e. ability to add more information cleanly at a
later date)
6. - internationalization
7. - unambiguity of data representation
8. - unlimited size of data sets? (i.e. any limitation on character names,
lengths, item names, numbers, etc)
I'd vote that it should be non-proprietary (i.e. belonging to no one
developer - that's been one limitation on DELTA as a "standard") and
unlimited. I'd also say that under ease of authoring we should consider a
mythical taxonomist who has data stored in a spreadsheet - can they easily
fudge that data into the format (this is important in deciding between e.g.
tagged formats like XML for which the fudge would be difficult and matrix
formats for which it would be easier).
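As a rough illustration of what the spreadsheet "fudge" involves, here is a Python sketch that turns a taxon-by-character matrix (exported as CSV) into a tagged form. The element and attribute names (dataset, item, score, character) and the sample data are invented for this example and are not part of any proposed standard.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical spreadsheet export: one row per taxon, one column per character.
SHEET = """taxon,flower colour,leaf length (mm)
Species A,red,12-20
Species B,white/pink,8-15
"""


def matrix_to_tagged(text):
    """Convert a CSV matrix into a simple tagged tree (tag names are assumptions)."""
    root = ET.Element("dataset")
    for row in csv.DictReader(io.StringIO(text)):
        item = ET.SubElement(root, "item", name=row["taxon"])
        for character, value in row.items():
            if character == "taxon" or not value:
                continue  # skip the label column and empty cells
            score = ET.SubElement(item, "score", character=character)
            score.text = value  # cell contents are copied uninterpreted
    return root


root = matrix_to_tagged(SHEET)
print(ET.tostring(root, encoding="unicode"))
```

The mechanical part is short once a script exists. Note, though, that a cell such as "white/pink" or "12-20" is copied across as an uninterpreted string, so the real effort lies in agreeing what such cells mean, and in putting a tool like this in the hands of users who don't want to write it themselves.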
3. WHAT ARE THE CORE DATA ELEMENTS?
Here's the beginning of a list, mostly a summation of LucID and DELTA. Note
that this assumes that we're not simply tagging up descriptions.
TYPES OF DATA NEEDED FOR A DESCRIPTIVE DATA STANDARD
rarely present (value?)
present by misinterpretation
not scored yet
.....do we need all these?
free-text comments attached to state scores
state-score links ("/" vs "&" in DELTA)
Ordered Cyclic Multistate
Real Numeric Data:
links to other data sets
- links to lower-level data
e.g. specimen-level data to be summed into taxon-level data
- links to higher-level data
e.g. passing of information up and down taxonomic hierarchies
links from both character and taxon elements to any form of multimedia
including blob text documents
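One way to picture how these elements might hang together is the following Python sketch. It is only a thinking aid: the class and field names are invented, the reading of "/" as "or" and "&" as "and" is my understanding of DELTA, and nothing here is a proposed schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple


class ScoreStatus(Enum):
    """Qualifiers on a state score, taken from the list above."""
    RARELY_PRESENT = "rarely present"
    PRESENT_BY_MISINTERPRETATION = "present by misinterpretation"
    NOT_SCORED_YET = "not scored yet"


class StateLink(Enum):
    """How several state scores combine (assumed: '/' = or, '&' = and)."""
    OR = "/"
    AND = "&"


@dataclass
class StateScore:
    state: str
    status: Optional[ScoreStatus] = None
    comment: str = ""  # free-text comment attached to the score


@dataclass
class CharacterScore:
    character: str
    states: List[StateScore] = field(default_factory=list)
    link: StateLink = StateLink.OR
    numeric_range: Optional[Tuple[float, float]] = None  # real numeric data
    media: List[str] = field(default_factory=list)       # links to multimedia


@dataclass
class Item:
    name: str
    scores: List[CharacterScore] = field(default_factory=list)
    lower_level: List[str] = field(default_factory=list)  # e.g. specimen records
    higher_level: Optional[str] = None                    # e.g. parent taxon


petals = CharacterScore(
    character="petal colour",
    states=[
        StateScore("red"),
        StateScore("white", ScoreStatus.RARELY_PRESENT, "seen once"),
    ],
)
taxon = Item(name="Species A", scores=[petals], higher_level="Genus X")
print(taxon)
```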
Finally, I think we need to bear in mind at all times the practical
difficulties in wanting one format that's optimal for all things. My
experience has been that a format that's good for storing atomised
natural-language descriptions is not good for storing interactive key data
and vice versa. The idea of a universal data store that can be used for many
different purposes is a seductive one, and hence needs to be subject to
frequent reality checks.
Cheers - k