so far...

Wed Nov 24 19:05:07 CET 1999

Firstly, can I just say that the level of interest and debate on this topic
already is impressive - there are clearly a lot of us around the world banging
our heads against this one!

I guess it will take a little while to settle down to a rhythm of debate, and
finding the level and especially scope of the task at hand will necessarily
require some brain-storming in the early stages.

But I would like to make the general comment that we should be very clear
in our minds about keeping quite separate and distinct the task of defining
a new standard for descriptive data exchange, from the specific nature of the
data that the standard will be communicating.

The standard is required for capturing and exchanging data across all
kingdoms (and occasionally winds up being used for abiotic entities like
archaeological pottery shards - see Louhivuori, DELTA Newsletter 12, April 1996)!

Examples drawn from particular disciplines are extremely useful for
illustrating the requirements or functionality of elements within the existing
or proposed data standard, but have the potential to draw us into another
debate entirely - that of data sharing across projects.

As far as I can see this is the summary of this sub-group's task, as presented
in the welcome message on joining this list:

" the subgroup should attempt to analyze the requirements for a new standard
for descriptive data. The metaformat for the standard will probably be
based on XML or related formats. It is hoped that this standard
will reach universal recognition to become at some point a successor
to existing standards like DELTA, NEXUS, or XDF."

The general mechanism for exchanging descriptive data between software
applications and databases will have to be kept distinct from the definition
and meaning of specific instances of descriptive data content, which I think
should (currently) be outside the scope of this list.

It's pretty clear from Don, Noel and now Leigh's comments that combining data
meaningfully across projects is a whole 'nother ballgame:

>> > For example, how could retrieval of descriptive information across a
>> > department or institution be facilitated? Is it because of the sheer
>> > flexibility in how the characters can be defined using DELTA, that querying
>> > across projects is difficult to say the least, unless the character set is
>> > global that is (and therefore with a lot of redundancy)?
>>
>> I agree.  XML only solves the syntax problem.  The semantics problem can't
>> be solved as easily with technology, but would rely either on community
>> agreement, or on extensive mappings between various ontologies.  I think.

> Definitely, and this is an important point to make. XML is an enabler,
> it's not a magic bullet. What types of problems have people met in attempting
> to share characters between research efforts?

> At a basic level, linking within and between documents and data is
> easy. It's defining the semantics of this, and ensuring that the data
> retrieved is meaningful, that is the hard part.

Just to answer Leigh's question from the previous paragraph, here in PERTH
we have made a serious attempt over the last seven years to capture descriptive
data in DELTA format in a number of projects with an underlying aim of sharing
characters across all projects.

Management of nested sets of characters and continued adherence to
character definition across projects over time and at various levels of the
taxonomic hierarchy has certainly been arduous at times, but the benefits of a
'corporate' approach to data capture are now becoming clear as the data is
seeing the light of day (see our FloraBase web site for some readily available
samples:   http://www.calm.wa.gov.au/science/florabase.html )

And we have made some useful progress in developing database applications
to aid in the process of amalgamating data reliably across projects.  For an early
discussion of some of the issues in managing institutional descriptive datasets
see Chapman and Choo, DELTA Newsletter 12, April 1996.

(NB. PDF versions of recent DELTA Newsletters are available at:
         http://www.calm.wa.gov.au/science/delta/news/index.html )

So, I'm not saying it's not a riveting subject - just that one headache at a time
is enough!

Cheers,
Alex
____
Alex R. Chapman                   Email: alexc at calm.wa.gov.au
Research Scientist          Voice/Fax: +61 8 9334 0506 / 0515
WA Herbarium - Department of Conservation and Land Management
Locked Bag 104 Bentley Delivery Centre Western Australia 6983
---------- Original Text ----------

From: "Leigh Dodds" <ldodds at ingenta.com>, on 24/11/99 18:52:

> As far as Delta goes, I think that the data are fairly close to what
> taxonomists have been encoding into traditional taxonomic descriptions.
> 'Success' is pretty good.
>
> Nexus can include character/state data but also other things such as ways
> to describe phylogenetic trees in the so-called "New Hampshire" format:
>
>   ((raccoon:19.19959,bear:6.80041):0.84600,((sea_lion:11.99700,
>   seal:12.00300):7.52973,((monkey:100.85930,cat:47.14069):20.59201,
>   weasel:18.87953):2.09460):3.87382,dog:25.46154);

If the basic format contains the relevant data, then a phylogenetic tree
can be constructed from that data - so I'd have seen the above as
something that should be derived from your data set rather than stored
within it.

I assume "New Hampshire" = "Newick" format? If so, here's an XML equivalent.

<!ELEMENT label   (#PCDATA)>
<!ELEMENT title   (#PCDATA)>
<!ELEMENT branch (label?, branch*)>
<!ATTLIST branch length  CDATA    #IMPLIED>
<!ELEMENT newick  (branch?)>

Just something I was toying with before.
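
To make the mapping concrete, here is a minimal Python sketch of how a
Newick string could be turned into a tree matching the DTD above. The
function name newick_to_xml is hypothetical, and the sketch assumes plain
unquoted labels with no Newick comments; it illustrates the idea and is
not a complete Newick parser.

```python
import xml.etree.ElementTree as ET


def newick_to_xml(text):
    """Convert a Newick string into a <newick> element (illustrative only)."""
    s = "".join(text.split())  # drop whitespace and line breaks
    pos = 0

    def peek():
        # Current character, or "" once the input is exhausted.
        return s[pos] if pos < len(s) else ""

    def read_until(stops):
        # Consume characters up to (not including) the next stop character.
        nonlocal pos
        start = pos
        while pos < len(s) and s[pos] not in stops:
            pos += 1
        return s[start:pos]

    def parse_branch():
        nonlocal pos
        branch = ET.Element("branch")
        children = []
        if peek() == "(":
            while True:
                pos += 1  # consume "(" or ","
                children.append(parse_branch())
                if peek() != ",":
                    break
            if peek() != ")":
                raise ValueError("expected ')' at position %d" % pos)
            pos += 1  # consume ")"
        label = read_until("():,;")
        if label:
            ET.SubElement(branch, "label").text = label
        branch.extend(children)  # DTD order: label first, then child branches
        if peek() == ":":
            pos += 1  # consume ":"
            branch.set("length", read_until("(),;"))
        return branch

    root = ET.Element("newick")
    if peek() not in ("", ";"):
        root.append(parse_branch())
    return root


if __name__ == "__main__":
    tree = newick_to_xml("(raccoon:19.19959,bear:6.80041):0.84600;")
    print(ET.tostring(tree, encoding="unicode"))
```

Going the other way (XML back to Newick) would be a similar recursive walk,
which is the sort of conversion a stylesheet could also perform.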

> I sense that application developers who use Nexus are pretty happy with
> the format since it was a collaborative effort to begin with, and that the
> Nexus community would not see a great benefit in adopting a new data
> format.  Anybody else have this impression?

Given this, and other similar comments on the list, I'd suggest that an
additional requirement of any new format (if that's the way things go)
is that it provides backwards compatibility, as far as possible, with
other formats.

Granted this might well be lossy, but given the numbers of users of Nexus,
and the amount of software currently available, making efforts to
provide for data conversion between formats means that we don't
have to start completely from scratch.

The first thing I did with XDELTA was provide a stylesheet which produces
DELTA files. The same could be achieved for other formats.

> > For example, how could retrieval of descriptive information across a
> > department or institution be facilitated? Is it because of the sheer
> > flexibility in how the characters can be defined using DELTA, that querying
> > across projects is difficult to say the least, unless the character set is
> > global that is (and therefore with a lot of redundancy)?
>
> I agree.  XML only solves the syntax problem.  The semantics problem can't
> be solved as easily with technology, but would rely either on community
> agreement, or on extensive mappings between various ontologies.  I think.

Definitely, and this is an important point to make. XML is an enabler,
it's not a magic bullet. What types of problems have people met in attempting
to share characters between research efforts?

At a basic level, linking within and between documents and data is
easy. It's defining the semantics of this, and ensuring that the data
retrieved is meaningful, that is the hard part.

> > This on the face of it would seem to map onto an XML element/attribute/value
> > schema pretty well. Would that help define more closely how we construct
> > characters and maybe even prove universally applicable for all character
> > types?
>
> Great question. Leigh? Anyone? :-)

Sure element/attribute/value, but also element/element/content

i.e. <leaf shape="obovate" />
or

<leaf>
  <shape>obovate</shape>
</leaf>

Or did I miss something?
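
As a rough illustration that the two encodings carry the same information,
here is a small Python sketch that reads the shape from either form. The
helper name leaf_shape is made up for this example, and it assumes the
simple one-leaf documents shown above rather than any agreed schema.

```python
import xml.etree.ElementTree as ET


def leaf_shape(xml_text):
    """Return the leaf shape from either the attribute or the element form."""
    leaf = ET.fromstring(xml_text)
    # element/attribute/value form: <leaf shape="..." />
    if "shape" in leaf.attrib:
        return leaf.get("shape")
    # element/element/content form: <leaf><shape>...</shape></leaf>
    child = leaf.find("shape")
    if child is not None and child.text:
        return child.text.strip()
    return None


if __name__ == "__main__":
    print(leaf_shape('<leaf shape="obovate" />'))
    print(leaf_shape("<leaf><shape>obovate</shape></leaf>"))
```

Either way the application ends up with the same value, so the choice is
mostly about what the schema needs to constrain (attributes cannot hold
nested structure, child elements can).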

> > Could one constrain further by expressing within the schema the hierarchical
> > relationships between the elements (eg 'blade' and 'petiole' as child
> > elements of 'leaf') or would the introduction of terminology into the
> > 'standard' be a step too far?
>
> This would be The Lexicon, no?  It does seem to go beyond the task of
> making a file format for data transfer.  But it gets my vote
> nevertheless.

I'd suggest that something like this (expressing meta-data relationships
amongst elements) could be layered on top of a basic data description
format. RDF might provide the facility to do this effectively. I need
to revisit the spec.

cheers,

L.