Firstly, can I just say that the level of interest and debate on this topic already is impressive - there are clearly a lot of us around the world banging our heads against this one!
I guess it will take a little while to settle down to a rhythm of debate, and finding the level and especially scope of the task at hand will necessarily require some brain-storming in the early stages.
But I would like to make the general comment that we should be very clear in our minds about keeping quite separate and distinct the task of defining a new standard for descriptive data exchange, from the specific nature of the data that the standard will be communicating.
The standard is required for capturing and exchanging data across all kingdoms (and occasionally winds up being used for abiotic entities like archaeological pottery shards - see Louhivuori, DELTA Newsletter 12, April 1996)!
Examples drawn from particular disciplines are extremely useful for illustrating the requirements or functionality of elements within the existing or proposed data standard, but have the potential to draw us into another debate entirely - that of data sharing across projects.
As far as I can see, this is the summary of this sub-group's task, as presented in the welcome message on joining this list:
" the subgroup should attempt to analyze the requirements for a new standard for descriptive data. The metaformat for the standard will probably be based on XML or related formats. It is hoped that this standard will reach universal recognition to become at some point a successor to existing standards like DELTA, NEXUS, or XDF."
The general mechanism for exchanging descriptive data between software applications and databases will have to be kept distinct from the definition and meaning of specific instances of descriptive data content, which I think should (currently) be outside the scope of this list.
It's pretty clear from Don, Noel and now Leigh's comments that combining data meaningfully across projects is a whole 'nother ballgame:
For example, how could retrieval of descriptive information across a department or institution be facilitated? Is it because of the sheer flexibility in how the characters can be defined using DELTA, that querying across projects is difficult to say the least, unless the character set is global that is (and therefore with a lot of redundancy)?
I agree. XML only solves the syntax problem. The semantics problem can't be solved as easily with technology, but would rely either on community agreement, or on extensive mappings between various ontologies. I think.
Definitely, and this is an important point to make. XML is an enabler, it's not a magic bullet. What types of problems have people met in attempting to share characters between research efforts?
At a basic level, linking within and between documents and data is easy. It's defining the semantics of this, and ensuring that the data retrieved is meaningful, that is the hard part.
Just to answer Leigh's question from the previous paragraph, here in PERTH we have made a serious attempt over the last seven years to capture descriptive data in DELTA format in a number of projects with an underlying aim of sharing characters across all projects.
Managing nested sets of characters, and keeping to the agreed character definitions across projects over time and at various levels of the taxonomic hierarchy, has certainly been arduous at times, but the benefits of a 'corporate' approach to data capture are now becoming clear as the data sees the light of day (see our FloraBase web site for some readily available samples: http://www.calm.wa.gov.au/science/florabase.html ).
And we have made some useful progress in developing database applications to aid in the process of amalgamating data reliably across projects (for an early discussion of some of the issues in managing institutional descriptive datasets, see Chapman and Choo, DELTA Newsletter 12, April 1996).
(NB. PDF versions of recent DELTA Newsletters are available at: http://www.calm.wa.gov.au/science/delta/news/index.html )
So, I'm not saying it's not a riveting subject - just that one headache at a time is enough!
Cheers,
Alex
____
Alex R. Chapman
Research Scientist
WA Herbarium - Department of Conservation and Land Management
Locked Bag 104 Bentley Delivery Centre
Western Australia 6983
Email: alexc@calm.wa.gov.au
Voice/Fax: +61 8 9334 0506 / 0515

---------- Original Text ----------
From: "Leigh Dodds" ldodds@ingenta.com, on 24/11/99 18:52:
As far as Delta goes, I think that the data are fairly close to what taxonomists have been encoding into traditional taxonomic descriptions. 'Success' is pretty good.
Nexus can include character/state data but also other things such as ways to describe phylogenetic trees in the so-called "New Hampshire" format:
((raccoon:19.19959,bear:6.80041):0.84600,((sea_lion:11.99700, seal:12.00300):7.52973,((monkey:100.85930,cat:47.14069):20.59201, weasel:18.87953):2.09460):3.87382,dog:25.46154);
If the basic format contains the relevant data, then a phylogenetic tree can be constructed from that data - so I'd have seen the above as something that should be derived from your data set rather than stored within it.
I assume "New Hampshire" = "Newick" format? If so, here's an XML equivalent.
<!ELEMENT label (#PCDATA)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT branch (label?, branch*)>
<!ATTLIST branch length CDATA #IMPLIED>
<!ELEMENT newick (branch?)>
Just something I was toying with before.
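As a rough illustration of how a Newick string might be mapped onto the DTD above, here is a minimal Python sketch. It assumes unquoted labels and no bracketed comments, and the function name newick_to_xml is made up for this example; it is not part of Nexus, DELTA or any existing tool.

```python
import xml.etree.ElementTree as ET

DELIMS = "(),:;"  # structural characters in a Newick string


def newick_to_xml(text):
    """Convert a simple Newick string into a <newick> element tree.

    Assumption: labels are unquoted and there are no [comments].
    """
    s = "".join(text.split())  # drop whitespace (safe only for unquoted labels)
    pos = 0

    def peek():
        return s[pos] if pos < len(s) else ""

    def read_token():
        # Read characters up to the next structural delimiter.
        nonlocal pos
        start = pos
        while pos < len(s) and s[pos] not in DELIMS:
            pos += 1
        return s[start:pos]

    def parse_branch():
        nonlocal pos
        children = []
        if peek() == "(":
            pos += 1  # consume "("
            children.append(parse_branch())
            while peek() == ",":
                pos += 1  # consume ","
                children.append(parse_branch())
            if peek() != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # consume ")"
        branch = ET.Element("branch")
        label = read_token()
        if label:
            # The DTD puts the optional label before any child branches.
            ET.SubElement(branch, "label").text = label
        if peek() == ":":
            pos += 1  # consume ":"
            branch.set("length", read_token())
        branch.extend(children)
        return branch

    root = ET.Element("newick")
    root.append(parse_branch())
    if peek() != ";":
        raise ValueError(f"expected ';' at position {pos}")
    return root


tree = newick_to_xml("((raccoon:19.19959,bear:6.80041):0.84600,dog:25.46154);")
print(ET.tostring(tree, encoding="unicode"))
```

Branch lengths are kept as strings, since the DTD declares the length attribute as CDATA; a consumer would convert them to numbers as needed.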
I sense that application developers who use Nexus are pretty happy with the format since it was a collaborative effort to begin with, and that the Nexus community would not see a great benefit in adopting a new data format. Anybody else have this impression?
Given this, and other similar comments on the list, I'd suggest that an additional requirement of any new format (if that's the way things go) is that it provides backwards compatibility, as far as possible, with other formats.
Granted this might well be lossy, but given the numbers of users of Nexus, and the amount of software currently available, making efforts to provide for data conversion between formats means that we don't have to start completely from scratch.
The first thing I did with XDELTA was provide a stylesheet to produce DELTA files. The same could be achieved for other formats.
For example, how could retrieval of descriptive information across a department or institution be facilitated? Is it because of the sheer flexibility in how the characters can be defined using DELTA, that querying across projects is difficult to say the least, unless the character set is global that is (and therefore with a lot of redundancy)?
I agree. XML only solves the syntax problem. The semantics problem can't be solved as easily with technology, but would rely either on community agreement, or on extensive mappings between various ontologies. I think.
Definitely, and this is an important point to make. XML is an enabler, it's not a magic bullet. What types of problems have people met in attempting to share characters between research efforts?
At a basic level, linking within and between documents and data is easy. It's defining the semantics of this, and ensuring that the data retrieved is meaningful, that is the hard part.
This on the face of it would seem to map onto an XML element/attribute/value schema pretty well. Would that help define more closely how we construct characters and maybe even prove universally applicable for all character types?
Great question. Leigh? Anyone? :-)
Sure, element/attribute/value, but also element/element/content,
i.e. <leaf shape="obovate" /> or
<leaf> <shape>obovate</shape> </leaf>
Or did I miss something?
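To show that the two encodings carry the same information, here is a small Python sketch that reads a character state from either form. The function name character_state is hypothetical, and the element and attribute names simply follow the leaf/shape example above; nothing here is a proposed standard.

```python
import xml.etree.ElementTree as ET


def character_state(xml_text, character):
    """Return the state recorded for `character`, whichever encoding is used."""
    elem = ET.fromstring(xml_text)
    # element/attribute/value form: <leaf shape="..."/>
    if character in elem.attrib:
        return elem.attrib[character]
    # element/element/content form: <leaf><shape>...</shape></leaf>
    child = elem.find(character)
    if child is not None and child.text:
        return child.text.strip()
    return None


as_attribute = '<leaf shape="obovate" />'
as_element = "<leaf> <shape>obovate</shape> </leaf>"
print(character_state(as_attribute, "shape"))
print(character_state(as_element, "shape"))
```

One practical difference is that an attribute can occur only once per element and holds plain text, whereas a child element can repeat or carry further structure, which may matter for characters with several states.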
Could one constrain further by expressing within the schema the hierarchical relationships between the elements (e.g. 'blade' and 'petiole' as child elements of 'leaf'), or would the introduction of terminology into the 'standard' be a step too far?
This would be The Lexicon, no? It does seem to go beyond the task of making a file format for data transfer. But it gets my vote nevertheless.
I'd suggest that something like this (expressing meta-data relationships amongst elements) could be layered on top of a basic data description format. RDF might provide the facility to do this effectively. I need to revisit the spec.
cheers,
L.