Things we still need

Wed Jul 19 18:07:36 CEST 2000

Kevin and colleagues,

As per our discussions at the US-Australia Workshop, I would like to reiterate a few
general observations with respect to the list and express my agreement with specific
comments made by you, Bryan Heidorn, and Stan Blum.  However, with your indulgence,
I would also like to provide what may be a somewhat different perspective.

The focus on "requirements" for a descriptive data standard for taxonomy, as you
and Stan emphasize, is a critical one, even though, as Bryan points out, there remain
a number of issues that may not be fully accounted for in the draft standard you
have kindly provided.  I would agree that we need a mechanism (structure?) for
subsequent discussion on the list that permits general, theoretical issues to be
addressed while also breaking down the practical complexity of specific issues.
That complexity at times dictates a useful "digression" into jargon-laden specifics
that are relevant to particular implementation issues and require vigorous discussion.
I'm not sure at this stage whether it is possible, at least in my own mind, to
distinguish structure from content, since dealing with existing structure and
content may be necessary to define what we perceive are requirements.  My own sense
of the previous discussion is that there are a variety of perspectives as to what
constitutes "descriptive data" and "requirements" in this context, as well as what
are the specific priorities (aspects of "standards") that are necessary for
specific applications (e.g., DELTA and LUCID) to intercommunicate in an
application-neutral manner.  However, mixing them into a common thread proved a bit
overwhelming.

My own bias is for a better understanding of how we can construct such a "draft
standard" so that it is open to considerable extension for the incorporation of
meta-language descriptors for more esoteric data structures, while maintaining a
flexible general framework needed to associate existing "character" data, while
also addressing the practical necessity of managing various "annotations" of
qualitative characters.  I believe this is important primarily because we ultimately
want machines to do most of the translating among formats, with minimal loss of
information or human intervention.  I believe it is also important for the more
difficult task that lies ahead of encoding means for machines to "feature
extract" across a multiplicity of representations of character data.

As a taxonomist/morphologist I am constantly confronted with new data formats and
widely different data sources.  Virtually all are created in specific contexts and
do not generally have a "web-wide" mechanism for associating their content.  For
example, it is difficult for me to determine if there exist data sets that
encompass different "encodings" of information pertaining to specific structures
for specific taxa.  I need a dynamic mechanism that will permit me to become aware
of data sets pertaining to, say, the pectoral fins of a particular scorpionfish,
without having to know in advance that such data may exist in the form of 1) a
collections record of a skeleton in a particular collection, 2) a published data
set characterizing the measurements taken from a particular study, 3) a CAT scan of
such a critter, 4) an archive containing the representation of specific character
states used in a phylogenetic analysis, 5) numerous gif/jpg files of radiographs of
specimens, 6) a text based description of the pectoral fins in a fossil, 7) the
title of a paper describing the sensory innervation of the fin, or 8) a database of
specific HOX genes involved in fin formation.
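
A minimal sketch of the kind of "meta-data wrapper" that such a search would need, written in Python purely for illustration.  Every name here (ResourceRecord, find_resources, the field names, the "example:" identifiers) is hypothetical and is not drawn from the draft document, and the genus Scorpaena stands in for "a particular scorpionfish".  The point is only that records are matched on taxon and structure, never on how the underlying data are encoded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceRecord:
    """Hypothetical wrapper describing one data set, whatever its encoding."""
    taxon: str        # taxon the data pertain to
    structure: str    # anatomical structure described
    kind: str         # e.g. "collection record", "radiograph"
    media_type: str   # encoding of the underlying data
    location: str     # placeholder identifier, not a real address


def find_resources(records, taxon, structure):
    """Return every record for a taxon and structure, independent of encoding."""
    return [
        r for r in records
        if r.taxon.lower() == taxon.lower()
        and r.structure.lower() == structure.lower()
    ]


# Illustrative catalogue; the entries are invented for this sketch.
CATALOG = [
    ResourceRecord("Scorpaena", "pectoral fin", "collection record",
                   "text/plain", "example:skeleton-1"),
    ResourceRecord("Scorpaena", "pectoral fin", "radiograph",
                   "image/jpeg", "example:radiograph-7"),
    ResourceRecord("Scorpaena", "pectoral fin", "character matrix",
                   "text/plain", "example:matrix-3"),
    ResourceRecord("Scorpaena", "lateral line", "text description",
                   "text/plain", "example:description-2"),
]

hits = find_resources(CATALOG, "scorpaena", "pectoral fin")
kinds = sorted(r.kind for r in hits)
```

Here the text record, the image, and the character matrix are all returned by a single query, because the wrapper describes them in the same terms even though their contents are textual, pixel encoded, or numeric.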

Certainly, the Rich Attribution component of your document is a critical element for
this, but I do not yet see how I can use this document to establish the "meta-data
wrapper" needed to compile such a list, much less establish to what extent I can
use such a text based "wrapper" to associate these disparate kinds of taxonomic
data.  How do I deal with data that are largely numeric in content or purely
graphic (pixel encoded)?  Nonetheless, I would agree that there is a need for a
series of "collation rules" to establish scope at different hierarchical levels or
for specific context-oriented activity.  I would, for example, add several lower
levels still in this context (including parts of specimens as described at the
organ, tissue, cellular, subcellular, and molecular levels).  Of course, the
difficulty here is that resolution and context may create data structures that are
not entirely hierarchical, particularly for objects of composite origin or study.
For the nervous system in chordates one can break down the system into units with
respect to various elements that could in one sense be hierarchical (perhaps brain,
spinal cord, ramus lateralis accessorius, neuron, motor unit, motor endplate,
etc.).  However, with respect to a physiological classification dealing with action
at the level of specific neurons this classification scheme would not work since
the nerve is composite and composed of both sensory and motor elements.  Likewise,
it would be difficult to place neuroendocrine components, specific neuropeptides,
or developmental anlagen, such as placodes, however important, into a parallel
hierarchy.  Likewise, usefully descriptive properties could not be easily
restricted to specific components.  I found the discussion at the workshop
regarding the use of directed acyclic graphs as a fundamental data structure
most interesting, but I'm not sure that morphological descriptors, perhaps unlike
gene products, are necessarily acyclic.  For example, a specific bone such as the
mandible can be classified as an element of the visceral skeleton as well as a
composite element containing both endochondral as well as dermal bone.  If one
looks early enough in development, one can't even recognize these anatomical
distinctions, although they may exist at a molecular level.  How should structures
that change with development or function be tagged and associated?  Would this not
depend upon context?  Nonetheless, following from your document, it might be a
useful exercise to consider to what extent certain classes of morphological
descriptors can be considered in such a graph-theoretical framework from which we
might be able to establish certain constructs as useful in associating otherwise
disparate, yet specific data (glossaries?).  Trees certainly are a useful data
structure for description of many morphological features, but not the only ones.
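
As a minimal sketch of how such a question could be tested on a concrete glossary, the Python below checks a set of classification links for two separate properties: whether any term has more than one parent (which rules out a strict tree) and whether the links contain a cycle (which rules out an acyclic graph).  The function names and the ANATOMY links are hypothetical and deliberately simplified.  Under these assumed links the mandible has several parents and yet no cycle arises, so whether real morphological descriptors are acyclic would depend on which context-dependent links are admitted.

```python
def has_cycle(graph):
    """Depth-first search for a cycle in a directed graph {parent: [children]}."""
    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on the current path, finished
    nodes = set(graph) | {c for children in graph.values() for c in children}
    colour = {n: WHITE for n in nodes}

    def visit(node):
        colour[node] = GREY
        for child in graph.get(node, []):
            if colour[child] == GREY:          # back edge: a cycle exists
                return True
            if colour[child] == WHITE and visit(child):
                return True
        colour[node] = BLACK
        return False

    return any(colour[n] == WHITE and visit(n) for n in nodes)


def multi_parent_nodes(graph):
    """Return the terms that appear under more than one parent."""
    counts = {}
    for children in graph.values():
        for child in children:
            counts[child] = counts.get(child, 0) + 1
    return sorted(n for n, k in counts.items() if k > 1)


# Hypothetical classification links, parent -> children.
ANATOMY = {
    "visceral skeleton": ["mandible"],
    "endochondral bone": ["mandible"],
    "dermal bone": ["mandible"],
}

# A contrived pair of links that does form a cycle, for comparison.
CYCLIC = {"term A": ["term B"], "term B": ["term A"]}

anatomy_is_cyclic = has_cycle(ANATOMY)
anatomy_shared = multi_parent_nodes(ANATOMY)
contrived_is_cyclic = has_cycle(CYCLIC)
```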

Consequently, it might be useful to break up the discussion into sub-discussions or
threads for which specific requirements can be more readily circumscribed and for
which the makings of a "meta-language" needed to search and assimilate alternate
representations might be more quickly forthcoming.  This is important because the
universe of potentially different data structures for encoding character data is
very large.  There is no need for those interested primarily in DELTA - LUCID
translations, or LUCID - PHYLIP, etc. transformations to be held up by more
specific requirements concerning translations/annotations of more arcane data
structures, even though some, like Bryan and I, may feel that transformations
between "other kinds" of data structures must also be incorporated in a way that
allows their potential richness to be exploited.  However, achieving such
extensibility will require the "standard descriptors" to be quite general (but
not ambiguous) in construction.

Such an approach might permit us to generalize across a number of possibly highly
specific topics and requirements that are not universally applicable and with which
many of us are differentially fluent.  This approach would be especially useful
should we at a later point begin to use them to construct XML schemas or to
outline what might be necessary using XSLT to transform them from one XML format to
another.  Since XML is promising as a data neutral specification language, we might
want to maintain a separate "XML thread", and perhaps even various XML
(alternative)-implementation subthreads (Java XML APIs vs MS XML APIs vs "others?",
or DTDs vs Schemas, "elements" vs "attributes", etc.) that will influence how such
a "standard" could currently be implemented.  Although certainly I would agree that
we do not want implementations to drive the standards, it is important to have an
understanding of how potential implementation might affect the utility of the
standards.  It might be useful here to draw an analogy to the presentation made at
the workshop by Sue Rhee in her discussion of the need for an "ontological"
database for common annotation of gene function across molecular databases.
Likewise, we need a generalized means of characterizing the "language" used to
describe the various entries in different "glossaries" used to describe character
data.  The need for such "cross molecular" databases would not arise, except for
specific implementation issues that are not presently adequately addressed.
Likewise, your "External lexica" might be usefully encompassed in the concept of
XML name spaces.  Although I can't think of specific examples off the top of my
head, some anatomical terms are used differently in different contexts ("veins" in
animals and plants might be a simple example).  We need to be able to distinguish
the contexts.  Perhaps this is what you mean by global versus local characters?
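
A minimal sketch of how XML namespaces could keep such contexts apart, using Python's standard xml.etree.ElementTree module.  The two namespace URIs, the "zoo" and "bot" prefixes, and the element names are assumptions made up for illustration and do not come from any existing lexicon or from the draft document.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs, one per external lexicon.
ANIMAL = "http://example.org/lexicon/animal-anatomy"
PLANT = "http://example.org/lexicon/plant-anatomy"

# The same local name, "vein", is qualified by two different namespaces.
DOC = f"""
<description xmlns:zoo="{ANIMAL}" xmlns:bot="{PLANT}">
  <zoo:vein>vessel returning blood toward the heart</zoo:vein>
  <bot:vein>vascular bundle in a leaf</bot:vein>
</description>
"""

root = ET.fromstring(DOC)

# ElementTree addresses a qualified name as "{namespace-uri}localname".
animal_veins = [e.text for e in root.iter(f"{{{ANIMAL}}}vein")]
plant_veins = [e.text for e in root.iter(f"{{{PLANT}}}vein")]
```

Each query returns only the term from its own lexicon, so software can tell the two uses of "vein" apart without any change to the word itself.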

Hence, from my perspective it might be useful for the dialog to move forward along
several separate, yet not entirely distinct threads, where folks with specific
interests could provide input as they see fit, ignoring that which seems
irrelevant.  A few may even want to keep their thumbs in all the pies.  In glancing
over what has come before, we might consider as possible threads: 1) general
theoretical perspectives on "taxonomic data", 2) one or more application specific
threads (e.g., DELTA, LUCID, "phylogeny packages", NEXUS, others?), 3) issues
pertaining to description and characterization of qualitative data, 4) issues
pertaining to description and characterization of quantitative characters, 5) issues
pertaining to text based description (semi-structured data), 6) issues pertaining
to structured data (i.e., relational or object-modeled data structures), and
7) meta-language requirements (headers, tagging architecture, XML etc.).  No doubt
you or others might be able to amend these or to add a few others from within which
we might eventually reach consensus on assembly of a few key requirements that are
general to all and from which interoperable implementation could proceed so as to
be able to assess the usefulness of our work.  Perhaps some threads could rule out
discussion of "content" and others "structure".  In any event, it might be useful
to let natural selection act to allow the most productive threads to survive and
"establish focus", while the others die out, without letting the whole wither
because of the complexity and interconnection of the fundamental issues.

No doubt some of these threads may morph into monsters not anticipated.
Consequently, to keep it all coordinated, there must be some general
agreement/understanding to focus on common requirements (GOALS) that we are trying
to achieve.  However, at this early stage, these might be largely implicit so as not
to lock ourselves into unnecessarily narrow perspectives.  For this to work,
perhaps one or two "ring masters" or "virtual ushers (bouncers?)" are needed to
keep the various performers and audience on cue, to summarize progress from time to
time, and to remove, add, or combine threads at key moments (i.e., oversee and exert
some "administrative" control over the various threads).  This is important so
that a specific set of useful general requirements is forthcoming in a timely
fashion.  I nominate you and Bryan (actually you guys nominated yourselves in
Boston, or was it by unanimous proclamation?).  Subsequent to such general and
specific discussion, I believe we would then be in a better position to respond to
specific requests for comments on documents, such as that you have put forth
outlining draft standards.

Stuart

Kevin Thiele wrote:

> Dear Colleagues,
>
> you will all be aware that the SDD list fell over several months ago. My
> interpretation of this is that many of the taxonomists on the list were
> left behind, perhaps early on, by the energetic discussions over issues of
> data structuring (XML, schemas, RDF etc). Most of this was certainly way
> over my head. Things got too top-heavy, and attempts to structure the
> discussion using message tags didn't seem to provide much focus.
>
> Recent discussions (at a meeting on US-Australian cooperation in
> bioinformatics in Washington, July 2000, attended by several SDD
> contributors) have again highlighted the great need for an SDD standard and
> shown that the lack of a new, inclusive standard is holding back progress
> on descriptive databasing and software design.
>
> We need to restart the list with better focus. I'd like to suggest that the
> way forward is to entirely set aside (for the time being) any discussion of
> data structure and focus entirely on content (the requirements analysis)
> for a while. We should agree on an outline of the data that we need to
> capture, then pass this on to the computerheads to provide a best-practice
> structure for storing and managing this captured data.
>
> The attached document was put up to the list shortly before it fell over.
> It's attached again here, slightly edited. ANY TAXONOMISTS STILL OUT THERE
> - please look at this. What data that you need to capture aren't handled
> here? Will this work? Is this the way to proceed?
>
> I think that the document subsumes the data requirements of the DELTA and
> Lucid programs, plus a bit more particularly in the areas of data
> attribution and hierarchical nesting of treatments. The intention is that
> the elements in this list should provide a way of storing any data needed
> to describe the morphology or anatomy of any organism or taxon.
>
> Note that this should be read merely as a list of data elements - the
> structure of the list does not imply a structure for the data file (XML or
> otherwise) used to store the data.
>
> It may be the case that this document can be jointly modified to produce a
> final document, or we may need to start from scratch with another. Any ideas?
>
> Cheers - k
>
> [Attachment: DDST Specifications.doc, Microsoft Word Document (application/msword),
> base64 encoded]