SEEK Project and TDWG-SDD

Thu Apr 15 14:30:29 CEST 2004

Jim Beach wrote:

> On Thu, 15 Apr 2004 10:54:36 -0500, Julian H <humphries at MAIL.UTEXAS.EDU>
> wrote:
>
>
>>At 10:44 AM 4/15/2004, you wrote:
>>
>>>single discussion, but it struck me that TDWG-SDD has an opportunity to
>>>have much broader acceptance and support if your schema was not designed
>>>as a single data object--to contain both the metadata about the package
>>>(or work or whatever you refer to it as) *and* the descriptive data that
>>>describe the individual concepts.
>>
>>Another novice (pre-novice) here, are you specifically referring to
>>separating out taxonomic concept information (metadata) from the
>>descriptive data?
>
>
> No, I was thinking of separating the metadata about the package ("This is a
> data set of Magnolias from FNA, it was assembled by, organized by, dates,
> etc.") from the data describing the character states of the individual
> taxa.  So a good question is what do you do with the character
> definitions! It seems the character state values without the character
> definitions would not be of much use for any system to interpret the
> meaning of the states.  Two options: de-normalize the character
> definitions and put them in each concept schema, or have a separate
> server that holds the character definitions, and an external reference
> to it in the data schema. Not sure how that choice would play out.
>

First, it is important to understand that there is NO schema for any
particular taxon, group of taxa, specimen, or any other thing or class
of things that an SDD document can describe, such as diseases,
restaurants of Lawrence, Kansas, "Halcyon House Bed and Breakfast", or
avian pests of Boston, MA. Rather, the SDD schema constrains how you make
descriptions, how you represent properties and values (i.e. characters
and states), how those things can be related to one another, and how you
can make decision trees based on them, usually for identification purposes.

Bear in mind that SDD is at rev 0.9, a proposal not yet adopted by
TDWG. In its "Terminology" section, SDD provides for both
"global" and "local" states, the former applicable to any character
that wishes to use them. Examples might be various colors, shape terms
like "ovate", etc. Consuming applications are invited to use a semantics
in which global states are identical no matter where used. (For local
states this is not a meaningful question in SDD.) Characters and
states in an SDD document do not get a GUID (though it sounds worth
considering), so documents that do not use the same Terminology section
can't compare characters or states in any defined way. Also, the only
mechanism on the table for sharing Terminology across instance documents
is XML inclusion, and this is not really a persistent mechanism
suitable for integration. In the present version, it is generally
expected that all the Terminology is defined if not in the document,
then in something that accompanies it, thus finessing the issue. Put
another way, SDD's first design target was data exchange, not data
integration, with an informal attempt to keep in view the issues that may
arise in migrating to support integrating applications. Put yet another
way, reliably using an SDD 0.9 instance that refers to shared external
Terminology requires a contract among all such instances that the
external Terminology is the same. SDD 0.9 itself provides no mechanisms
for representing or enforcing such a contract, nor a mechanism for
expressing a fallback position if an application can represent and
detect such a violation. (Though at most of the expected places, SDD
provides for application-specific data to be inserted, of which
applications can make whatever meaning they wish.) I don't doubt that it
is easy to replace inclusion with a reference to a shared Terminology
acquired through a registry and accessed by a GUID or DOI. I'm certain that the
only issue would be what to do if the Terminology is in fact
unacquirable, and that issue is orthogonal to all the structure issues
that SDD is meant to address.
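
To make the last point concrete, here is a rough Python sketch of what
replacing inclusion with a registry reference might look like. Nothing in
it is part of SDD 0.9: the Terminology, Document, and Registry classes, the
"urn:example:" identifiers, and the comparable_states function are all
hypothetical names chosen for illustration. It only shows the contract
described above: states are comparable when two documents name the same
Terminology, and an unacquirable Terminology is a separate case whose
handling is left to the application.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Terminology:
    guid: str                 # hypothetical identifier; SDD 0.9 defines none
    global_states: frozenset  # e.g. color or shape terms usable by any character


@dataclass
class Document:
    name: str
    terminology_guid: str     # assumed reference that replaces XML inclusion


class Registry:
    """Toy in-memory registry mapping GUIDs to Terminology objects."""

    def __init__(self) -> None:
        self._store: Dict[str, Terminology] = {}

    def register(self, terminology: Terminology) -> None:
        self._store[terminology.guid] = terminology

    def resolve(self, guid: str) -> Optional[Terminology]:
        # None models the "Terminology is unacquirable" case.
        return self._store.get(guid)


def comparable_states(a: Document, b: Document,
                      registry: Registry) -> Optional[frozenset]:
    """Return the global states two documents may compare, or None."""
    if a.terminology_guid != b.terminology_guid:
        return None           # different Terminology: no defined comparison
    terminology = registry.resolve(a.terminology_guid)
    if terminology is None:
        return None           # fallback policy is left to the application
    return terminology.global_states
```

A real design would presumably distinguish the two None cases, since
"different Terminology" is a structural fact while "unacquirable" is an
operational failure; the sketch merges them only for brevity.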

The SDD committee welcomes both lurkers and contributors to the SDD
discussion Wiki at http://efgblade.cs.umb.edu/twiki/bin/view/SDD/WebHome
There you can see what questions we wrestled with, contribute
questions we missed, and comment on answers that are less than helpful
in contexts we did not consider.

We expect there will be an SDD workshop before TDWG in Christchurch and
hope that well before then there will be several implementations to study.

>
>
>>>If the taxa/concepts had their own schemas and were linked to the
>>>package metadata with a GUID, maybe a DOI or some other globally unique
>>>identifier, then the XML concept data sets could be used for other
>>>systems like concept based classification or database management
>>>systems.
>>
>>Could you write this sentence with a few more words?  I'm want to be sure
>
> I
>
>>get the concept.
>
>
> How about an ASCII graphic?  I'm on thin ice, but if the metadata for the
> package is this:
>
> MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
>
> and the individual taxon character state sets are this:
>
> taxon1       taxon2      taxon3
> char-val-1   char-val-1  etc.
> char-val-2   etc.
> char-val-3
>
> if the taxon data sets (and maybe also their character definitions) were in
> separate XML documents, then we could use them as fodder for other concept
> systems.
>
>
>
>>The overhead for the traditional diagnostic identification software
>>
>>>makers would be that the XML parts would need to be assembled for the
>>>various applications that use the data and there would be the potential
>>>risk that SDD data sets would be incomplete, if there were some careless
>>>file management.
>>
>>What parts could get lost? the taxonomic parts?
>
>
> yes, if you had multiple xml docs for the same 'diagnostic package' they
> would be managed as distinct files.
>
>
>>
>>> But presumably you guys are thinking about a registry
>>>or distributed federation of these data sets anyway, where they would be
>>>archived and served intact from a trusted source.
>>
>>Um, now I am really lost, amplify please? What does this have to do with
>>incomplete SDD data sets?  More on dataset archives in the next email.
>
>
> People serving SDD data sets through the web would presumably be aware of
> data set integrity issues and make sure their SDD packages were complete.
>
>

Yes, as above, that is the present assumption of SDD.

>>
>>>I also understand that data sets of diagnostic identification
>>>information are far from complete descriptions of concepts in either a
>>>taxonomic or phylogenetic sense, but if the SDD concept schema could
>>>accommodate additional characters, then the opportunity would be there
>>>for other people to use SDD for other kinds of systems.  The UI of
>>>diagnostic key programs would likely not need to use or display DNA
>>>sequences for interactive identification, but no harm done, they could
>>>just ignore fields of no use to the program at hand.
>>
>>
>>Ok, now we are getting to something I know about.  See the next email for
>>some comments on this...
>>
>>Julian
>>
>>
>>
>>Julian Humphries
>>DigiMorph.Org
>>Geological Sciences
>>University of Texas at Austin
>>Austin, TX 78712
>>512-471-3275

--
Robert A. Morris, Professor of Computer Science
University of Massachusetts at Boston
100 Morrissey Blvd; Boston, MA 02125
http://www.cs.umb.edu/~ram http://www.cs.umb.edu/efg
phone: (+1)617-287-6466 fax:   (+1)617-287-6433