HyperSpace shuttles and hyperbikes

Mon Jul 24 10:52:37 CEST 2000

Kevin,

I think the approach that Stan outlines in looking at specific relationships among
entities should make much of the complication seem a bit less of an obstacle.  We
need to better understand the properties of our "data objects" and the various
ways they are associated.  If we define our vision too narrowly, and here is
perhaps where I would disagree with you and Stan, we will end up with a result
that may not be as general and flexible as we might need later.  We run the risk
that it will be difficult to adapt to more complex relations.  However, I do
believe Stan is correct in that we need to first focus on the simplest relations
among entities.  This is probably a good idea for two reasons: 1) if we have
started with the right notion of "fundamental objects" at least some of their
interactions will be simple (also read general, ie less constrained), if we have
evaluated them correctly, and 2) simple relations will likely be more efficiently
implemented and at the core of our "hyperspace shuttle" (hyperbike?) we will
require efficiency, since the number of data sets and transformations is already
large.

If I understand the rationale for engaging in this exercise it is to permit users
to search, select and "interoperate" with a variety of different data types and
structures so that data can be mined and new insights gained.  I believe we agree
that we need more than a translator, but how much more, I believe, will remain
unclear until we have a better handle on exactly what properties these various
entities (features and data structures used to represent them) actually have.
This is not to say that studying carefully how specific translators may work (ie
Lucid to DELTA or reverse or Lucid to NEXUS, etc.) will not be informative for
elucidating the generalities that are required.  On the contrary, the approach
you suggest is quite a workable one and may well suffice for more limited ends.
My earlier comments were directed to permit the structure of the specifications to
extend to more general data types than we might currently be able to envision.
I reiterate that the general structure of the "standard descriptor language" should
not restrict progress along the lines you suggest, only that it would be useful to
adopt a general approach that could subsequently incorporate such extended
functionality.  I guess it stems from my need (hope) for mechanisms that will help
me integrate loosely structured text (in a web-based environment) and more
quantitative (operational) means of structuring data.

At the conceptual level, the notion of "character" is central to the discussion of
what we mean by taxonomic information.  I believe we need to have a recommendation
for a specification that taxonomists/systematists can adopt to include the
"universe" of such characters, not that all are equal or equally important, or
utilized by a given implementation.  The recommendation probably also needs to be
simple enough that most can understand and use it.   Nonetheless, there appear to
be many different kinds of characters and here is where things start to get hard.
Some such as cladistic characters will have a basis of comparison that maps the
characters into specific disjoint character states.  Others will be mensural or
countable in nature, whose basis of comparison defines an axis of variation and
may or may not exhibit the property of disjunction (disjunction may be a property
of sampling).  In both cases they could be represented by a vector.  However, we
must understand the rules of accessing each element and whether or not specific
relations can be drawn from manipulation across vector types.  In the case of the
former they can be represented by text or by tags (ie state A vs. B, or 0 and
1).   The text used may be circumscribed via controlled vocabularies to permit
analogy, if not direct comparison, with quantified characters (ie long vs short,
round vs square, etc).  Admittedly, this will be difficult because most biologists
have used "local" reference frames when using free text in this way, rather
than any sort of "universally" applicable frame that we might be speaking about
for the first time here.  Consequently, "round" or "large" can have very different
meanings.  However, even here we can probably safely conclude that the
dimensionality of primary interest is 3D, at least when attempting to characterize
the size and shapes of features.
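
As a rough illustration of the two kinds of vector and of one explicit rule for
moving between them, here is a minimal Python sketch.  The class names, the
example taxa, and the 20 mm threshold are assumptions made purely for
illustration; they are not part of DELTA, Lucid, or any existing format.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class StateCharacter:
    """Character with disjoint states (e.g. a cladistic character)."""
    name: str
    states: List[str]          # the allowed states, e.g. ["short", "long"]
    values: Dict[str, str]     # taxon -> one of the allowed states


@dataclass
class MeasuredCharacter:
    """Mensural character: values lie along a single axis of variation."""
    name: str
    unit: str                  # e.g. "mm"
    values: Dict[str, float]   # taxon -> measurement


def discretize(ch: MeasuredCharacter,
               bins: List[Tuple[float, str]],
               top_label: str) -> StateCharacter:
    """Map measurements onto named states using explicit upper bounds.

    bins is a list of (upper_bound, label) pairs in ascending order; any value
    at or above the last bound receives top_label.  Stating the bounds is what
    turns a "local" reading of "long" or "short" into a shared one.
    """
    def label(x: float) -> str:
        for upper, name in bins:
            if x < upper:
                return name
        return top_label

    states = [name for _, name in bins] + [top_label]
    return StateCharacter(ch.name, states,
                          {taxon: label(v) for taxon, v in ch.values.items()})


# Hypothetical data: two taxa, one measured character.
fin_length = MeasuredCharacter("pectoral fin length", "mm",
                               {"taxon A": 12.0, "taxon B": 31.5})
coded = discretize(fin_length, [(20.0, "short")], "long")
print(coded.values)
```

The point of the sketch is only that the mapping from a quantified axis to a
controlled vocabulary has to be declared somewhere, rather than left implicit
in each investigator's usage.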

While the number of permutations may be extremely large, it may well be possible
to circumscribe characters with relatively few "tags", if we can define common
rules for the use of these "tags".   Perhaps the tags might define a
classification of characters (ie <qualitative character>, <quantitative
character>).  In the former one may have <cladistic characters> and <general
qualitative>, and among these <cladistic characters> might be further subdivided
into <two state> or <multi-state>.  The latter may be tagged as <fully ordered> or
<partially ordered>, whereas the former as well as the latter may also be tagged
as <directed> or <undirected>.  Similarly, cladistic characters will have other
properties that could likewise be tagged <name>,  <study used>, <basis of
comparison>, <textual description>, <date first proposed>, <representation of states
(ie 0's and 1's or some other tagging scheme)> some of which might have multiple
entries, such as <name (perhaps different for two different investigators, but the
same "character" nonetheless)> or <study used>.   Not all tags might get used.
However, if they are used, a standard must specify how they are to be used.
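
To give a feel for how few tags might be needed, and for what "specifying how
they are to be used" could mean in practice, here is a small Python sketch
using the standard library's ElementTree.  The element and attribute names
(character, basisOfComparison, polarity, and so on) are hypothetical stand-ins
for the tags listed above, and the one rule enforced (ordering applies only to
multi-state characters) is simply the rule stated in this paragraph.

```python
import xml.etree.ElementTree as ET


def cladistic_character(name, basis, states, ordering=None,
                        directed=False, studies=()):
    """Build a tagged cladistic character as an XML element.

    states   : list of (code, description) pairs, e.g. [("0", "absent")].
    ordering : None, "fully-ordered" or "partially-ordered" (multi-state only).
    """
    if len(states) < 2:
        raise ValueError("a character needs at least two states")
    if ordering not in (None, "fully-ordered", "partially-ordered"):
        raise ValueError("unknown ordering: %r" % (ordering,))
    if ordering is not None and len(states) == 2:
        raise ValueError("ordering tags apply only to multi-state characters")

    ch = ET.Element("character",
                    {"class": "qualitative", "subclass": "cladistic"})
    ET.SubElement(ch, "name").text = name
    ET.SubElement(ch, "basisOfComparison").text = basis
    st = ET.SubElement(ch, "states", {
        "count": "two-state" if len(states) == 2 else "multi-state",
        "polarity": "directed" if directed else "undirected",
    })
    if ordering is not None:
        st.set("ordering", ordering)
    for code, description in states:
        ET.SubElement(st, "state", {"code": code}).text = description
    for study in studies:      # a tag that may carry multiple entries
        ET.SubElement(ch, "studyUsed").text = study
    return ch


# Hypothetical example: a two-state, directed character used in two studies.
ch = cladistic_character(
    "pelvic fin", "presence of the fin",
    [("0", "absent"), ("1", "present")],
    directed=True, studies=["study 1", "study 2"],
)
print(ET.tostring(ch, encoding="unicode"))
```

An implementation that never uses, say, studyUsed can ignore it, but any
implementation that does use it would find it in the same place.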

Characters also have other properties such as scale.  Although most informative in
a measured sense (Angstroms, microns, millimeters, meters, and nanograms, grams,
kilograms, etc.), we would likely have to establish relations among a variety of
different kinds of reference scales (ie molecular, subcellular, cellular, tissue,
organ, organismal, ecosystem), which are variously defined in a broad array of
contexts.  Where possible, it would be useful to attempt to at least loosely
interoperate across such reference frames, say for purposes of a broad search.
Characters also have scope in that they are mapped onto organisms.  Consequently,
conventions are also necessary to circumscribe what we mean by taxon and how we
deal with issues of rank (and perhaps newer philosophies of "rankless"
classification that seek to deprecate the concept of rank).  Although multiple
classifications and synonyms make both kinds of mapping complex, "standardized"
lists of names for molecules, structures, and taxa do exist that we can use to
simplify some of the complexity.  Characters also exist through time, but again
absolute scales are often wanting and we will likely need to deal with the period
of time in which a character exists (existed) in relation to relative reference
frames.  Obviously, in each case we require a means to specify the nature of the
standardization (much as XML namespaces attempt to deal with the potential
confounding of elements of the same "name").  Here it would be helpful to have a
"standard list of standard lists" so that we can begin to understand how ungainly
such an approach might actually be.  Some, such as the geological reference scale
and the ITIS names list, are obvious candidates, but it would be nice to begin to
assemble the URLs that we might have to visit to actually understand the details.
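
A minimal Python sketch of the namespace idea follows, borrowing the "veins"
example from my earlier message quoted below.  The registry is a toy "standard
list of standard lists"; the keys and the example.org URIs are placeholders
invented for illustration and do not point at any real vocabulary.

```python
import xml.etree.ElementTree as ET

# Hypothetical registry: one URI per standard list of names.
STANDARD_LISTS = {
    "plant": "http://example.org/lists/plant-anatomy",
    "animal": "http://example.org/lists/animal-anatomy",
}


def term(list_key, name):
    """Return an element whose tag is qualified by the URI of its list."""
    uri = STANDARD_LISTS[list_key]
    # ElementTree writes namespace-qualified tags as "{uri}localname".
    return ET.Element("{%s}%s" % (uri, name))


def same_term(a, b):
    """Two terms match only if both the list and the local name agree."""
    return a.tag == b.tag


leaf_vein = term("plant", "vein")
blood_vein = term("animal", "vein")
print(same_term(leaf_vein, blood_vein))
print(same_term(leaf_vein, term("plant", "vein")))
```

The local name "vein" is identical in both cases; it is only the reference to
the list it was drawn from that keeps the two meanings apart.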

While not all implementations would need to utilize all "implicit" tags, the
nestedness of a particular tagging scheme would require specification
(standardization).  Such an approach does lend itself nicely to parsing and to
object-oriented definition.  At the level of implementation, it would not be hard
to imagine, for example, Lucid or DELTA passing their own internal representation to
the parser that uses the specification inherent in the internal representation to
wrap characters with (XML) tags that could then be either translated to other data
formats, or processed according to previously specified search criteria or a set
of processing instructions that might perform another kind of association, data
mining, or analysis.   Similarly, it would be relatively easy to take range data
from a given CAT scan or structured-light range sensor, or graphical data (gif or
jpg files), and tag (annotate) it so that particular features of the data
could be given "equivalent" meaning for association with otherwise qualitative
data.  Admittedly, getting the machine to do the annotation remains, except in
limited circumstances, a dream for the heavens, and I too would just settle for a
"hyperbike" (with training wheels) at this stage in the development of "hyperspace
shuttle architecture".
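
For what it is worth, the "hyperbike" version of that pipeline can be sketched
in a few lines of Python.  This is only a sketch under stated assumptions: the
dictionary standing in for an application's internal representation, the
element names, and the plain matrix output are all invented for illustration
and are not the actual DELTA, Lucid, or NEXUS formats.

```python
import xml.etree.ElementTree as ET


def wrap(internal):
    """Wrap an internal representation in XML tags.

    internal maps taxon name -> {character name -> state code}.
    """
    root = ET.Element("dataset")
    for taxon, scores in internal.items():
        t = ET.SubElement(root, "taxon", {"name": taxon})
        for character, state in scores.items():
            ET.SubElement(t, "score", {"character": character, "state": state})
    return root


def to_matrix(root, characters):
    """Translate tagged data into taxon-by-character rows ('?' = not scored)."""
    rows = []
    for t in root.findall("taxon"):
        scored = {s.get("character"): s.get("state") for s in t.findall("score")}
        rows.append(t.get("name") + " " +
                    "".join(scored.get(c, "?") for c in characters))
    return rows


def search(root, character, state):
    """Return the taxa scored with the given state for the given character."""
    return [t.get("name") for t in root.findall("taxon")
            if any(s.get("character") == character and s.get("state") == state
                   for s in t.findall("score"))]


# Hypothetical internal data handed over by some application.
internal = {
    "taxon_A": {"fin_spines": "1", "scales": "0"},
    "taxon_B": {"fin_spines": "0"},
}
tagged = wrap(internal)
print(to_matrix(tagged, ["fin_spines", "scales"]))
print(search(tagged, "fin_spines", "1"))
```

The same tagged intermediate serves both the translation and the search, which
is the whole argument for wanting more than a pairwise translator.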

Stuart

Kevin Thiele wrote:

> Dear Stuart,
>
> I agree completely with everything you say, but it worries me all the same.
> You point out the complexity of descriptive data and the enormity of the
> task of completely capturing it. But we need to get something done, and I
> think we need some incremental stages.
>
> Your suggestion as to maintaining threads of discussion is not unlike the
> way the list was running before it fell over. Some of the threads did
> indeed morph into monsters, others got lost and I think many people with
> them. I'd really like to try for a while keeping the discussion focused on
> the document with the proposed list of elements, to glean suggestions from
> people as to whether it's completely inadequate or what. At the same time,
> of course, I don't want to constrain people to run with this or with my
> suggested way of doing things. I may well be way off the track of what's
> possible or achievable. Working up the document may provide us with an
> incremental advance, or it may be that such incremental advances are not
> worth achieving and your suggestion for a great leap forward is the way to
> go. My way of looking at it is that if DELTA is a bicycle, I'm proposing a
> motor bike, and you're sketching out plans for a space shuttle. Maybe I'm
> not being visionary enough?
>
> It seems to me that there's an old way of describing something, and many
> possible new ways. The old way is with a set of characters with values
> (states) applied to a set of taxa. This is the form of DELTA data, Lucid
> data, textual descriptions (in a way). Updating our standard for this way
> of describing is achievable now, I think.
>
> New ways of describing something, such as with 3-D tomographic imaging etc,
> may well be the way of the future. But I'm not sure that we can have one
> descriptive standard that encompasses both old and new ways under one roof.
> This is why we need extensibility - can we take an incremental step along
> the lines that I'm proposing while allowing for the future brave new world.
> Or can we have a set of linked standards - one for describing in the boring
> old characters/states/taxa way, and others for the more space-shuttle ways
> that can be linked in as they develop.
>
> Looking forward to responses
>
> Cheers - k
>
> At 06:07 PM 19/7/00 -0600, you wrote:
> >Kevin and colleagues,
> >
> >As per our discussions at the US-Australia Workshop, I would again
> >reiterate a few
> >general observations with respect to the list and express my agreement with
> >specific
> >comments made by you, Bryan Heidorn, and Stan Blum.  However, with your
> >indulgence,
> >I would also like to provide what may be a somewhat different perspective.
> >
> >The focus on "requirements" for a descriptive data standard for taxonomy,
> >as you
> >and Stan emphasize is a critical one, even though as Bryan points out
> >there remain
> >a number of issues that need to be dealt with that may not be fully
> >accounted for
> >in the draft standard you have kindly provided.  I would agree that we need a
> >mechanism (structure?) for subsequent discussion on the list to permit both
> >general, theoretical issues to be addressed, while simultaneously breaking
> >down the
> >practical realities of dealing with the complexity of specific issues
> >involved that
> >at times dictates useful "digression" into jargon-laden specifics that
> >might be
> >relevant for particular implementation issues that require vigorous
> >discussion.
> >I'm not sure at this stage whether it is possible, at least in my own mind, to
> >distinguish structure from content, since dealing with existing structure and
> >content may be necessary to define what we perceive are requirements.  My
> >own sense
> >of the previous discussion is that there are a variety of perspectives as
> >to what
> >constitutes "descriptive data" and "requirements" in this context, as well
> >as what
> >are the specific priorities (aspects of "standards") that are necessary for
> >specific applications (eg DELTA and LUCID) to intercommunicate in an
> >application-neutral manner.  However, mixing them into a common thread
> >proved a bit
> >overwhelming.
> >
> >My own bias is for a better understanding of how we can construct such a
> >"draft
> >standard" so that it is open to considerable extension for the
> >incorporation of
> >meta-language descriptors for more esoteric data structures, while
> >maintaining a
> >flexible general framework needed to associate existing "character" data,
> >while
> >also addressing the practical necessity of managing various "annotations" of
> >qualitative characters.   I believe this is important primarily because we
> >ultimately
> >want machines to do most of the translating among formats, with minimal
> >loss of
> >information or human intervention.  I believe it is also important for the
> >more
> >difficult task that lies ahead of encoding means for machines to "feature
> >extract" across a multiplicity of representations of character data.
> >
> >As a taxonomist/morphologist I am constantly confronted with new data
> >formats and
> >widely different data sources.  Virtually all are created in specific
> >contexts and
> >do not generally have a "web-wide" mechanism for associating their
> >content.  For
> >example, it is difficult for me to determine if there exist data sets that
> >encompass different "encodings" of information pertaining to specific
> >structures
> >for specific taxa.  I need a dynamic mechanism that will permit me to
> >become aware
> >of data sets pertaining to say the pectoral fins of a particular scorpionfish,
> >without having to know in advance that such data may exist in the form of 1) a
> >collections record of a skeleton in a particular collection, 2) a
> >published data
> >set characterizing the measurements taken from a particular study, 3) a
> >CAT scan of
> >such a critter, 4) an archive containing the representation of specific
> >character
> >states used in a phylogenetic analysis, 5) numerous gif/jpg files of
> >radiographs of
> >specimens, 6) a text based description of the pectoral fins in a fossil,
> >7) the
> >title of a paper describing the sensory innervation of the fin, or 8) a
> >database of
> >specific HOX genes involved in fin formation.
> >
> >Certainly, the Rich Attribution component of your document is a critical
> >element for
> >this, but I do not yet see how I can use this document to establish the
> >"meta-data
> >wrapper" needed to compile such a list, much less establish to what extent
> >I can
> >use such a text based "wrapper" to associate these disparate kinds of
> >taxonomic
> >data.  How do I deal with data that are largely numeric in content or purely
> >graphic (pixel encoded)?  Nonetheless, I would agree that there is a need
> >for a
> >series of "collation rules" to establish scope at different hierarchical
> >levels or
> >for specific context-oriented activity.  I would, for example,
> >add  several lower
> >levels still in this context (including parts of specimens as described at the
> >organ, tissue, and cellular, subcellular, and molecular level).  Of
> >course, the
> >difficulty here is that resolution and context may create data structures
> >that are
> >not entirely hierarchical, particularly for objects of composite origin or
> >study.
> >For the nervous system in chordates one can break down the system into
> >units with
> >respect to various elements that could in one sense be hierarchical
> >(perhaps brain,
> >spinal cord, ramus lateralis accessorius, neuron, motor unit, motor endplate,
> >etc.).  However, with respect to a physiological classification dealing
> >with action
> >at the level of specific neurons this classification scheme would not work
> >since
> >the nerve is composite and composed of both sensory and motor
> >elements.  Likewise,
> >it would be difficult to place neuroendocrine components, specific
> >neuropeptides,
> >or developmental anlagen, such as placodes, however important, into a parallel
> >hierarchy.  Likewise, usefully descriptive properties could not be easily
> >restricted to specific components.  I found the discussion at the workshop
> >regarding the use of directed acyclic graphs as a fundamental data
> >structure
> >most interesting, but I'm not sure that morphological descriptors, perhaps
> >unlike
> >gene products, are necessarily acyclic.  For example, a specific bone such
> >as the
> >mandible can be classified as an element of the visceral skeleton as well as a
> >composite element containing both endochondral as well as dermal bone.  If one
> >looks early enough in development, one can't even recognize these anatomical
> >distinctions, although they may exist at a molecular level.  How should
> >structures
> >that change with development or function be tagged and associated?  Would
> >this not
> >depend upon context?  Nonetheless, following from your document, it might be a
> >useful exercise to consider to what extent certain classes of morphological
> >descriptors can be considered in such a graph-theoretical framework from
> >which we
> >might be able to establish certain constructs as useful in associating
> >otherwise
> >disparate, yet specific data (glossaries?).  Trees certainly are a useful data
> >structure for description of many morphological features, but not the only
> >ones.
> >
> >Consequently, it might be useful to break up the discussion into
> >sub-discussions or
> >threads for which specific requirements can be more readily circumscribed
> >and for
> >which the makings of a "meta-language" needed to search and assimilate
> >alternate
> >representations might be more quickly forthcoming.  This is important
> >because the
> >universe of potentially different data structures for encoding character
> >data is
> >very large.  There is no need for those interested primarily in DELTA - LUCID
> >translations, or LUCID - PHYLIP, etc. transformations to be held up by more
> >specific requirements concerning translations/annotations of more arcane data
> >structures, even though some, like Bryan and I, may feel that transformations
> >between "other kinds" of data structures must also be incorporated in a
> >way that
> >allows their potential richness to be exploited.  However, achieving such
> >extensibility will require the "standard descriptors" to be quite
> >general (but
> >not ambiguous) in construction.
> >
> >Such an approach might permit us to generalize across a number of possibly
> >highly
> >specific topics and requirements that are not universally applicable and
> >with which
> >many of us are differentially fluent.  This approach would be especially
> >useful,
> >should we at a later point begin to use them to construct XML
> >schema or to
> >outline what might be necessary using XSLT to transform them from one XML
> >format to
> >another.  Since XML is promising as a data neutral specification language,
> >we might
> >want to maintain a separate "XML thread", and perhaps even various XML
> >(alternative)-implementation subthreads (Java XML APIs vs MS XML APIs, vs
> >"others?",
> >or DTDs vs Schemas, "elements" vs "attributes", etc.) that will influence
> >how such
> >a "standard" could currently be implemented.  Although certainly I would
> >agree that
> >we do not want implementations to drive the standards, it is important to
> >have an
> >understanding of how potential implementation might affect the utility of the
> >standards.  It might be useful here to draw an analogy to the presentation
> >made at
> >the workshop by Sue Rhee in her discussion of the need for an "ontological"
> >database for common annotation of gene function across molecular databases.
> >Likewise, we need a generalized means of characterizing the "language" used to
> >describe the various entries in different "glossaries" used to describe
> >character
> >data.  The need for such "cross molecular" databases would not arise,
> >except for
> >specific implementation issues that are not presently adequately addressed.
> >Likewise, your "External lexica" might be usefully encompassed in the
> >concept of
> >XML name spaces.  Although I can't think of specific examples off the top
> >of my
> >head, some anatomical terms are used differently in different contexts
> >("veins" in
> >animals and plants might be a simple example).  We need to be able to
> >distinguish
> >the contexts.  Perhaps this is what you mean by global versus local
> >characters?
> >
> >Hence, from my perspective it might be useful for the dialog to move
> >forward along
> >several separate, yet not entirely distinct threads, where folks with specific
> >interests could provide input as they see fit, ignoring that which seems
> >irrelevant.  A few may even want to keep their thumbs in all the pies.  In
> >glancing
> >over what has come before, we might consider as possible threads: 1) general
> >theoretical perspectives on "taxonomic data", 2) one or more application
> >specific
> >threads (ie DELTA, LUCID, "phylogeny packages", NEXUS, others?, etc.), 3)
> >issues
> >pertaining to description and characterization of qualitative data, 4) issues
> >pertaining to description and characterization of quantitative characters, 5)
> >issues
> >pertaining to text based description (semi-structured data), 6) issues
> >pertaining
> >to structured data (ie relational or object modeled data structures), and
> >7) meta-language requirements (headers, tagging architecture, XML
> >etc.).  No doubt
> >you or others might be able to amend these or to add a few others from
> >within which
> >we might eventually reach consensus on assembly of a few key requirements
> >that are
> >general to all and from which interoperable implementation could proceed
> >so as to
> >be able to assess the usefulness of our work.  Perhaps some threads could
> >rule out
> >discussion of "content" and others "structure".  In any event, it might be
> >useful
> >to let natural selection act to allow the most productive threads to
> >survive and
> >"establish focus", while the others die out, without letting the whole wither
> >because of the complexity and interconnection of the fundamental issues.
> >
> >No doubt some of these threads might morph into monsters not anticipated.
> >Consequently, to keep it all coordinated, there must be some general
> >agreement/understanding to focus on common requirements (GOALS) that we
> >are trying
> >to achieve.  However, at this early stage, these might be largely implicit
> >so as not
> >to lock ourselves into unnecessarily narrow perspectives.  For this to work,
> >perhaps one or two "ring masters" or "virtual ushers (bouncers?)" are
> >needed to
> >keep the various performers and audience on cue, to summarize progress
> >from time to
> >time, and to remove, add, or combine  threads at key moments  (ie oversee
> >and exert
> >some "administrative" control over the various threads).  This is
> >important  so
> >that a specific set of useful general requirements is forthcoming in a timely
> >fashion.  I nominate you and Bryan (actually you guys nominated yourselves in
> >Boston or was it unanimous proclamation?).   Subsequent to such general and
> >specific discussion, I believe we would then be in a better position to
> >respond to
> >specific requests for comments on documents, such as that you have put forth
> >outlining draft standards.
> >
> >Stuart
> >
> >
> >Kevin Thiele wrote:
> >
> > > Dear Colleagues,
> > >
> > > you will all be aware that the SDD list fell over several months ago. My
> > > interpretation of this is that many of the taxonomists on the list were
> > > left behind, perhaps early on, by the energetic discussions over issues of
> > > data structuring (XML, schemas, RDF etc). Most of this was certainly way
> > > over my head. Things got too top-heavy, and attempts to structure the
> > > discussion using message tags didn't seem to provide much focus.
> > >
> > > Recent discussions (at a meeting on US-Australian cooperation in
> > > bioinformatics in Washington, July 2000, attended by several SDD
> > > contributors) have again highlighted the great need for an SDD standard and
> > > shown that the lack of a new, inclusive standard is holding back progress
> > > on descriptive databasing and software design.
> > >
> > > We need to restart the list with better focus. I'd like to suggest that the
> > > way forward is to entirely set aside (for the time being) any discussion of
> > > data structure and focus entirely on content (the requirements analysis)
> > > for a while. We should agree on an outline of the data that we need to
> > > capture, then pass this on to the computerheads to provide a best-practice
> > > structure for storing and managing this captured data.
> > >
> > > The attached document was put up to the list shortly before it fell over.
> > > It's attached again here, slightly edited. ANY TAXONOMISTS STILL OUT THERE
> > > - please look at this. What data that you need to capture aren't handled
> > > here? Will this work? Is this the way to proceed?
> > >
> > > I think that the document subsumes the data requirements of the DELTA and
> > > Lucid programs, plus a bit more particularly in the areas of data
> > > attribution and hierarchical nesting of treatments. The intention is that
> > > the elements in this list should provide a way of storing any data needed
> > > to describe the morphology or anatomy of any organism or taxon.
> > >
> > > Note that this should be read merely as a list of data elements - the
> > > structure of the list does not imply a structure for the data file (XML or
> > > otherwise) used to store the data.
> > >
> > > It may be the case that this document can be jointly modified to produce a
> > > final document, or we may need to start from scratch with another. Any
> > > ideas?
> > >
> > > Cheers - k
> > >
> > >   ------------------------------------------------------------------------
> > >                               Name: DDST Specifications.doc
> > >    DDST Specifications.doc    Type: Microsoft Word Document
> > (application/msword)
> > >                           Encoding: base64