Whoa!

24 Nov 1999

      By golly what a start! Great stuff!

Having read this morning the small flurry of contributions from Mauro,
Leigh, Don, Jean-Marc and Noel, it's obvious that there are many ideas out
there just waiting to burst forth. Perhaps we need a bit of structure now
(or perhaps later).

I'd suggest that there are several issues that can be used to structure the
discussion.

1. WHAT'S THE INITIAL LIMIT FOR WHAT WE'RE TRYING TO DO?

We could try to describe the universe that we're operating in by choosing
which of the following is most true.

a) We want an interchange format for current (and future) descriptive
data/identification programs.

These include LucID, DELTA's CONFOR, IntKey etc (the programs), XID,
Linnaeus, and others (we should perhaps fill in this list sometime). The
idea of an interchange format would be simple - each program would continue
to store data in its own native format, but would implement import/export
functions to the common standard so users would not be stuck in one format.
The format should cover at least the union of all the types of data stored
by each program.

Note that under this model if a user stored data using one program's data
editor (e.g. the DELTA Builder) then exported to another program (e.g. the
LucID Player) there will be data loss (since not all data stored will be
useable by the recipient) and data inadequacy (since the recipient may
support data that can't be stored by the storer). But having an exchange
standard will increase pressure from users on all developers to do some
competitive bootstrapping, and perhaps also encourage someone to develop a
direct builder for the exchange format.
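
To make the data-loss point concrete, here is a minimal sketch in Python. Everything in it is hypothetical: "program A" and "program B" stand in for any two packages, and the ExchangeItem fields are invented for illustration, not a proposal for the standard. It only shows that a round trip through a common format can report what the recipient has to drop.

```python
from dataclasses import dataclass, field


@dataclass
class ExchangeItem:
    """Hypothetical common-format record: the union of both programs' fields."""
    taxon: str
    scores: dict = field(default_factory=dict)    # character name -> state
    comments: dict = field(default_factory=dict)  # character name -> free text


def export_from_a(native: dict) -> ExchangeItem:
    """Hypothetical program A stores scores and comments, so nothing is lost."""
    return ExchangeItem(native["taxon"], dict(native["scores"]),
                        dict(native.get("comments", {})))


def import_into_b(item: ExchangeItem):
    """Hypothetical program B has no comment field; report what it drops."""
    native = {"taxon": item.taxon, "scores": dict(item.scores)}
    dropped = sorted(item.comments)  # characters whose comments were lost
    return native, dropped


record_a = {
    "taxon": "Taxon A",
    "scores": {"leaf shape": "ovate"},
    "comments": {"leaf shape": "scored from a single specimen"},
}
record_b, lost = import_into_b(export_from_a(record_a))
print(record_b)
print("comments dropped for:", lost)
```

An importer that lists what it discarded at least lets the user see the loss rather than discover it later.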

b) We want a new data format for storing, parsing and manipulating
descriptive data, without reference to existing programs.

This would perhaps not differ from a) above, except that it would start more
from scratch. It could perhaps subsume a) if such a format could also be
used as an interchange format. But given the suggestion from Jean-Marc:
...
- The main existing formats, if we count by the number of species covered,
are not DELTA, or other Computer Taxonomists's inventions, they are the
semi-formal formats of the existing floras in plain text, which contain the
vast majority of existing species;
- the priority is to put this existing material in XML (this can be done
using standard parsing techniques);
it would perhaps not serve the function of a) and those of us with existing
programs would not advance beyond square 1.

If we go this route a whole new suite of software tools will be needed for
handling this, turnkey systems at the very least to put tools in the hands
of ordinary users who don't know and don't want to know the intricacies of XML.

c) We want to go for the whole Universe and subsume all existing descriptive
data formats into one superformat.

This seems to be behind the comments about subsuming NEXUS. I think we
should tread warily here. My impression is that users of NEXUS are more than
happy with their format, there is no pressure for change coming from there,
and we could merrily try to subsume NEXUS but find that the NEXUS community
would ignore our efforts ("thank you very much, I'm quite all right"). Are the
Maddisons/Swoffords/Donoghues of the world listening? What do you think?

d) any other statements?

2. WHAT ARE THE BASELINE REQUIREMENTS FOR THE SYSTEM?

I agree completely that step 1, whatever our goal, must be a formal
requirements analysis. Perhaps this would best proceed by the group
contributing to an evolving list (or is there a better way?). We could use
Leigh's as a starting point:

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
REQUIREMENTS FOR A DESCRIPTIVE DATA FORMAT
1. - ease of use (i.e. authoring)
2. - ease of processing (parsing, validating, reading, converting)
3. - ease of sharing (i.e. distribution)
4. - openness (i.e. proprietary/non-proprietary)
5. - ease of extensibility (i.e. ability to add more information cleanly at a
later date)
6. - internationalization
7. - unambiguity of data representation
8. - unlimited size of data sets? (i.e. any limitation on character names,
lengths, item names, numbers, etc)
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

I'd vote that it should be non-proprietary (i.e. belonging to no one
developer - that's been one limitation on DELTA as a "standard") and
unlimited. I'd also say that under ease of authoring we should consider a
mythical taxonomist who has data stored in a spreadsheet - can they easily
fudge that data into the format? (This is important in deciding between e.g.
tagged formats like XML, for which the fudge would be difficult, and matrix
formats.)
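
As a way of gauging how hard the spreadsheet fudge might be, here is a rough Python sketch that turns a taxon-by-character matrix (saved as CSV) into a tagged tree. The element and attribute names (dataset, item, score, character) are invented for the example, and it assumes the simplest possible sheet: taxon names in the first column, character names in the header row, one value per cell. Real sheets with ranges, multiple states or comments crammed into a cell are where the difficulty would lie.

```python
import csv
import io
import xml.etree.ElementTree as ET


def matrix_to_xml(csv_text: str) -> ET.Element:
    """Convert a taxon-by-character matrix into a tagged tree.

    Assumes column 1 holds taxon names and row 1 holds character names.
    All element and attribute names here are hypothetical.
    """
    rows = csv.reader(io.StringIO(csv_text))
    header = next(rows)
    characters = header[1:]
    root = ET.Element("dataset")
    for row in rows:
        if not row:
            continue  # skip blank lines in the sheet
        item = ET.SubElement(root, "item", name=row[0])
        for character, value in zip(characters, row[1:]):
            if value == "":
                continue  # empty cell: leave the character unscored
            score = ET.SubElement(item, "score", character=character)
            score.text = value
    return root


SHEET = "taxon,leaf shape,petal number\nTaxon A,ovate,5\nTaxon B,,4\n"
root = matrix_to_xml(SHEET)
print(ET.tostring(root, encoding="unicode"))
```

Note that the sketch has to guess what an empty cell means (not scored yet? unknown? inapplicable?), which is exactly the sort of ambiguity a matrix hides and a tagged format forces into the open.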

3. WHAT ARE THE CORE DATA ELEMENTS?

Here's the beginning of a list, mostly a summation of LucID and DELTA. Note
that this assumes that we're not simply tagging up descriptions.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
TYPES OF DATA NEEDED FOR A DESCRIPTIVE DATA STANDARD

State-level scores:
 absent
 normally present
 rarely present (value?)
 uncertain
 present by misinterpretation

Character-level scores:
 not scored yet
 unknown
 inapplicable
 uncertain
.....do we need all these?
.....more?

Extra data:
 free-text comments attached to state scores
 state-score links ("/" vs "&" in DELTA)

Character types:
 Ordered Multistate
 Unordered Multistate
 Real Numeric
 Integer Numeric?
 Ordered Cyclic Multistate
 Free Text

Real Numeric Data:
 Minimum
 Normal Low
 Normal High
 Maximum
 Mean?
 Standard Deviation?
 N?

Set-level data:
 links to other data sets
  - links to lower-level data
     e.g. specimen-level data to be summed into taxon-level data
     e.g. subkeys

  - links to higher-level data
     e.g. passing of information up and down taxonomic hierarchies
     e.g. superkeys

Multimedia data:
 links from both character and taxon elements to any form of multimedia,
 including blob text documents
++++++++++++++++++++++++++++++++++++++++++++++++
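
To see how part of this list might hang together as a data structure, here is a tentative Python sketch covering the state-level scores, character-level scores and real numeric data. The class and field names are my own inventions, the items marked with "?" above are simply made optional, and character types, set-level links and multimedia are left out. Treat it as a thinking aid under those assumptions, not a proposed schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional


class StateScore(Enum):
    ABSENT = "absent"
    NORMALLY_PRESENT = "normally present"
    RARELY_PRESENT = "rarely present"
    UNCERTAIN = "uncertain"
    MISINTERPRETED = "present by misinterpretation"


class CharacterStatus(Enum):
    NOT_SCORED_YET = "not scored yet"
    UNKNOWN = "unknown"
    INAPPLICABLE = "inapplicable"
    UNCERTAIN = "uncertain"


@dataclass
class NumericScore:
    """Real numeric data; every field is optional in this sketch."""
    minimum: Optional[float] = None
    normal_low: Optional[float] = None
    normal_high: Optional[float] = None
    maximum: Optional[float] = None
    mean: Optional[float] = None
    standard_deviation: Optional[float] = None
    n: Optional[int] = None

    def is_ordered(self) -> bool:
        """True if minimum <= normal low <= normal high <= maximum (gaps ignored)."""
        bounds = [v for v in (self.minimum, self.normal_low,
                              self.normal_high, self.maximum) if v is not None]
        return all(a <= b for a, b in zip(bounds, bounds[1:]))


@dataclass
class MultistateScore:
    """State-level scores plus free-text comments, keyed by state name."""
    states: Dict[str, StateScore] = field(default_factory=dict)
    comments: Dict[str, str] = field(default_factory=dict)
    status: Optional[CharacterStatus] = None  # assumption: None means "scored"


petals = NumericScore(minimum=4, normal_low=5, normal_high=5, maximum=6)
colour = MultistateScore(states={"red": StateScore.NORMALLY_PRESENT,
                                 "white": StateScore.RARELY_PRESENT})
print(petals.is_ordered(), colour.states["white"].value)
```

Even at this size the sketch raises the questions flagged in the list: "uncertain" turns up at both the state and character level, and it is not obvious whether both are needed.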

Finally, I think we need to bear in mind at all times the practical
difficulties in wanting one format that's optimal for all things. My
experience has been that a format that's good for storing atomised
natural-language descriptions is not good for storing interactive key data
and vice versa. The idea of a universal data store that can be used for many
different purposes is a seductive one, and hence needs to be subject to
frequent reality checks.

Cheers - k

Kevin Thiele