By golly what a start! Great stuff!
Having read this morning the small flurry of contributions from Mauro, Leigh, Don, Jean-Marc and Noel, it's obvious that there are many ideas out there just waiting to burst forth. Perhaps we need a bit of structure now (or perhaps later).
I'd suggest that there are several issues that can be used to structure the discussion.
1. WHAT'S THE INITIAL LIMIT FOR WHAT WE'RE TRYING TO DO?
We could try to describe the universe that we're operating in by choosing which of the following is most true.
a) We want an interchange format for current (and future) descriptive data/identification programs.
These include LucID, DELTA's CONFOR, IntKey etc. (the programs), XID, Linnaeus, and others (we should perhaps fill in this list sometime). The idea of an interchange format would be simple - each program would continue to store data in its own native format, but would implement import/export functions to the common standard so users would not be stuck in one format. The format should be at least the sum of all the types of data stored by each program.
Note that under this model if a user stored data using one program's data editor (e.g. the DELTA Builder) then exported to another program (e.g. the LucID Player) there will be data loss (since not all data stored will be useable by the recipient) and data inadequacy (since the recipient may support data that can't be stored by the storer). But having an exchange standard will increase pressure from users on all developers to do some competitive bootstrapping, and perhaps also encourage someone to develop a direct builder for the exchange format.
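As a rough illustration of the loss and inadequacy described above, here is a minimal Python sketch. The feature names and the two "programs" are hypothetical placeholders, not the real capabilities of DELTA, LucID or any other package; the only assumption is that each program supports some subset of what the common format can hold.

```python
# Hypothetical feature sets: these names are illustrative only and are not
# taken from any real program's documentation.
COMMON_FORMAT = {"state_scores", "numeric_ranges", "text_comments", "score_weights"}
PROGRAM_A = {"state_scores", "numeric_ranges", "text_comments"}  # the storer
PROGRAM_B = {"state_scores", "score_weights"}                    # the recipient


def exchange_report(storer, recipient, common=COMMON_FORMAT):
    """Compare what survives a storer -> common format -> recipient trip."""
    exported = storer & common  # what the storer is able to write out
    return {
        # data both sides understand
        "transferred": sorted(exported & recipient),
        # data loss: exported, but not useable by the recipient
        "lost": sorted(exported - recipient),
        # data inadequacy: wanted by the recipient, but never stored
        "unfilled": sorted((recipient & common) - storer),
    }


report = exchange_report(PROGRAM_A, PROGRAM_B)
for outcome, features in report.items():
    print(outcome, features)
```

The point of the sketch is only that a "sum of all types" format removes neither problem by itself; it just makes both visible to users.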
b) We want a new data format for storing, parsing and manipulating descriptive data, without reference to existing programs.
This would perhaps not differ from a) above, except that it would start more from scratch. It could perhaps subsume a) if such a format could also be used as an interchange format. But given the suggestion from Jean-Marc:
- The main existing formats, if we count by the number of species covered, are not DELTA, or other Computer Taxonomists' inventions, they are the semi-formal formats of the existing floras in plain text, which contain the vast majority of existing species;
- the priority is to put this existing material in XML (this can be done using standard parsing techniques);
it would perhaps not serve the function of a) and those of us with existing programs would not advance beyond square 1.
If we go this route, a whole new suite of software tools will be needed to handle it: at the very least, turnkey systems that put tools in the hands of ordinary users who don't know, and don't want to know, the intricacies of XML.
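To give a feel for what "standard parsing techniques" might mean in the simplest case, here is a hedged Python sketch. The sample description is invented in the style of a printed flora, and the element names (`description`, `organ`) are made up for the example; real flora text is far less regular than this.

```python
import re
import xml.etree.ElementTree as ET

# Invented sample in the style of a printed flora; not taken from a real work.
DESCRIPTION = "Leaves: ovate, 3-6 cm long. Flowers: white, 5-petalled. Fruit: a capsule."


def description_to_xml(taxon_name, text):
    """Wrap each 'Organ: free text.' clause of a description in an XML element."""
    root = ET.Element("description", taxon=taxon_name)
    # Assumes every clause has the form "Organ: details." with no full stop
    # inside the details (so a decimal such as "3.5" would break this pattern).
    for match in re.finditer(r"(\w+):\s*([^.]+)\.", text):
        organ, detail = match.groups()
        element = ET.SubElement(root, "organ", name=organ.lower())
        element.text = detail.strip()
    return root


root = description_to_xml("Species A", DESCRIPTION)
print(ET.tostring(root, encoding="unicode"))
```

Note that this only tags up the text; it does not atomise "ovate" or "3-6 cm" into scored characters, which is where the real work would lie.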
c) We want to go for the whole Universe and subsume all existing descriptive data formats into one superformat.
This seems to be behind the comments about subsuming NEXUS. I think we should tread warily here. My impression is that users of NEXUS are more than happy with their format and there is no pressure for change coming from there. We could merrily try to subsume NEXUS, only to find that the NEXUS community ignores our efforts ("thank you very much, we're quite all right"). Are the Maddisons/Swoffords/Donoghues of the world listening? What do you think?
d) any other statements?
2. WHAT ARE THE BASELINE REQUIREMENTS FOR THE SYSTEM?
I agree completely that step 1, whatever our goal, must be a formal requirements analysis. Perhaps this would best proceed by the group contributing to an evolving list (or is there a better way?). We could use Leigh's list as a starting point:
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
REQUIREMENTS FOR A DESCRIPTIVE DATA FORMAT
1. - ease of use (i.e. authoring)
2. - ease of processing (parsing, validating, reading, converting)
3. - ease of sharing (i.e. distribution)
4. - open-ness (i.e. proprietary/non-proprietary)
5. - ease of extensibility (i.e. ability to add more information cleanly at a later date)
6. - internationalization
7. - unambiguity of data representation
8. - unlimited size of data sets? (i.e. any limitation on character names, lengths, item names, numbers, etc)
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
I'd vote that it should be non-proprietary (i.e. belonging to no one developer - that's been one limitation on DELTA as a "standard") and unlimited. I'd also say that under ease of authoring we should consider a mythical taxonomist who has data stored in a spreadsheet: can they easily fudge that data into the format? This is important in deciding between, e.g., matrix formats and tagged formats like XML, for which the fudge would be difficult.
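To make the spreadsheet case concrete, here is a minimal Python sketch of what the "fudge" into a tagged format could involve. The spreadsheet contents are invented, and the element and attribute names (`dataset`, `taxon`, `score`) are hypothetical rather than part of any agreed standard.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical spreadsheet export: one row per taxon, one column per character.
MATRIX_CSV = """taxon,leaf shape,petal number
Species A,ovate,5
Species B,linear,4
"""


def matrix_to_tagged(csv_text):
    """Turn a taxon-by-character matrix into a tagged (XML) document.

    The element and attribute names are invented for this sketch.
    """
    rows = csv.DictReader(io.StringIO(csv_text))
    root = ET.Element("dataset")
    for row in rows:
        taxon = ET.SubElement(root, "taxon", name=row["taxon"])
        for character, value in row.items():
            if character == "taxon" or not value:
                continue  # skip the label column and unscored cells
            score = ET.SubElement(taxon, "score", character=character)
            score.text = value
    return root


root = matrix_to_tagged(MATRIX_CSV)
print(ET.tostring(root, encoding="unicode"))
```

A matrix format could take the spreadsheet more or less as it stands, whereas the tagged route needs a tool like this between the taxonomist and the format, which is exactly the authoring cost to weigh.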
3. WHAT ARE THE CORE DATA ELEMENTS?
Here's the beginning of a list, mostly a summation of LucID and DELTA. Note that this assumes that we're not simply tagging up descriptions.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
TYPES OF DATA NEEDED FOR A DESCRIPTIVE DATA STANDARD

State-level scores:
    absent
    normally present
    rarely present (value?)
    uncertain
    present by misinterpretation

Character-level scores:
    not scored yet
    unknown
    inapplicable
    uncertain
    .....do we need all these?
    .....more?

Extra data:
    free-text comments attached to state scores
    state-score links ("/" vs "&" in DELTA)

Character types:
    Ordered Multistate
    Unordered Multistate
    Real Numeric
    Integer Numeric?
    Ordered Cyclic Multistate
    Free Text

Real Numeric Data:
    Minimum
    Normal Low
    Normal High
    Maximum
    Mean?
    Standard Deviation?
    N?

Set-level data:
    links to other data sets
    - links to lower-level data
        e.g. specimen-level data to be summed into taxon-level data
        e.g. subkeys
    - links to higher-level data
        e.g. passing of information up and down taxonomic hierarchies
        e.g. superkeys

Multimedia data:
    links from both character and taxon elements to any form of multimedia including blob text documents
++++++++++++++++++++++++++++++++++++++++++++++++
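For discussion only, part of the list above could be sketched as a small data model. This is one possible reading in Python, not a proposal for the format itself: the class and field names are my own invention, the items marked with "?" in the list (integer numeric aside) are left out, and set-level and multimedia links are omitted to keep it short.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional


class CharacterType(Enum):
    ORDERED_MULTISTATE = "ordered multistate"
    UNORDERED_MULTISTATE = "unordered multistate"
    REAL_NUMERIC = "real numeric"
    INTEGER_NUMERIC = "integer numeric"
    ORDERED_CYCLIC_MULTISTATE = "ordered cyclic multistate"
    FREE_TEXT = "free text"


class StateScore(Enum):
    ABSENT = "absent"
    NORMALLY_PRESENT = "normally present"
    RARELY_PRESENT = "rarely present"
    UNCERTAIN = "uncertain"
    PRESENT_BY_MISINTERPRETATION = "present by misinterpretation"


class CharacterStatus(Enum):
    NOT_SCORED_YET = "not scored yet"
    UNKNOWN = "unknown"
    INAPPLICABLE = "inapplicable"
    UNCERTAIN = "uncertain"


@dataclass
class NumericRange:
    """Minimum / normal low / normal high / maximum for a numeric character."""
    minimum: Optional[float] = None
    normal_low: Optional[float] = None
    normal_high: Optional[float] = None
    maximum: Optional[float] = None

    def is_ordered(self) -> bool:
        """True if the values that are present do not decrease left to right."""
        present = [v for v in (self.minimum, self.normal_low,
                               self.normal_high, self.maximum) if v is not None]
        return present == sorted(present)


@dataclass
class CharacterScore:
    """One taxon's score for one character (hypothetical structure)."""
    character: str
    char_type: CharacterType
    status: Optional[CharacterStatus] = None  # None means scored normally
    states: Dict[str, StateScore] = field(default_factory=dict)
    numeric: Optional[NumericRange] = None
    comment: str = ""  # free-text comment attached to the score


leaf = CharacterScore("leaf length (cm)", CharacterType.REAL_NUMERIC,
                      numeric=NumericRange(2.0, 3.5, 6.0, 8.0))
flower = CharacterScore("flower colour", CharacterType.UNORDERED_MULTISTATE,
                        states={"white": StateScore.NORMALLY_PRESENT,
                                "red": StateScore.RARELY_PRESENT},
                        comment="red only seen in one population")
```

Even a toy model like this raises the questions in the list straight away, e.g. whether "uncertain" belongs at the state level, the character level, or both.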
Finally, I think we need to bear in mind at all times the practical difficulties in wanting one format that's optimal for all things. My experience has been that a format that's good for storing atomised natural-language descriptions is not good for storing interactive key data and vice versa. The idea of a universal data store that can be used for many different purposes is a seductive one, and hence needs to be subject to frequent reality checks.
Cheers - k