General reference schemes for classifying characters

Tue Jul 25 14:52:06 CEST 2000

> At 11:29 AM 7/25/00 +1000, Eric Zurcher wrote:
> >5) Difficulty in merging or comparing datasets - it is rather difficult to
> >combine datasets based on differing character lists, even when those
> >character lists are fairly similar. There is no mechanism for "mapping"
> >character states from one dataset onto those of another. (Disparate
> >character lists are another matter entirely. My personal view is that the
> >holy grail of a "universal" character list for, say, all of botany will
> >tend to remain tantalizingly just out of reach, and the efforts of this
> >group should not be distracted in that direction.)
>

The issue of constructing mechanisms for mapping characters into one another
is quite different from attempting to circumscribe a language necessary to
describe all possible definitions of "character" so that rules governing their
description can be unambiguously applied.  One doesn't need "universal"
characters, only a usefully large "universe" of potential character
descriptors.  Properties of some classes of characters will be wholly
irrelevant to the characterization of others.  I would agree it would be
useless to search for such mechanisms that could be "universally applied".

Although I entirely agree with Peter, so few cladists actually measure the
objects to which they assign states that one can safely conclude that
matching "characters" by their underlying measurement data would likely only
reference a very sparse matrix.  Quantitative characters are fundamentally
different from qualitative ones in large part because qualitative characters
are inherently more ambiguous and less constrained in terms of implicit
notions of "equivalence".

However, if one did want to associate features represented in widely different
formats, it would be helpful to have a data standard that could describe the
various kinds of associations that might be usefully employed and how they
might be related.  As Stan points out, the measurements actually taken are
often quite context sensitive and may even vary from one investigator to
another or from one method of measurement to another.  It would seem that one
would need to have a good description of exactly how the measurements were
taken, should either one want to repeat them or simply to understand what
measures exist and what they tell us.  Beyond that, however, it is "caveat
emptor".

Nonetheless, despite numerous problems of correspondence, it would be
extremely useful to be able to use metadata consistently applied to tag two
different kinds of data (characters), both taken on the same organisms and
perhaps on the very same objects, so that they could be reevaluated.  Whether
the possible "remappings" could be made automatically would largely depend on
how much ambiguity one could remove from the qualitative characterizations.
In my estimation, at present this could only be done for an extremely small
number of characters under special circumstances and consequently would not
now be of much interest to most.  However, newer data-rich acquisition
techniques continue to expand the number of features for which human-induced
ambiguity can be better excluded from the data.  Consequently, if our aim is
to develop a standard protocol able to describe and relate such data, as well
as to track and associate taxonomic data captured by more traditional
methods, it would be worthwhile to have a means to do this and not leave such
kinds of data out of the protocol merely because they do not readily conform
to currently accepted paradigms (e.g., characters in states always specified
by 0 or 1, my least favorite approach).
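
As a rough illustration of why automatic "remapping" depends on how much
ambiguity remains in the qualitative definitions, here is a minimal Python
sketch.  The state names, the numeric ranges and the function name remap are
all invented for illustration; they do not come from any existing dataset or
program.  A measurement is remapped only when exactly one state definition
claims it, and is otherwise reported as unresolvable.

```python
def remap(value, state_ranges):
    """Return the one state whose range contains value, else None.

    None covers both failure modes: no state matches (a gap in the
    definitions) or several states match (overlapping definitions).
    """
    matches = [state for state, (low, high) in state_ranges.items()
               if low <= value <= high]
    return matches[0] if len(matches) == 1 else None


# Hypothetical state definitions for leaf length in cm.  "short" and
# "medium" deliberately overlap, and there is a gap before "long".
LEAF_LENGTH_STATES = {
    "short": (0.0, 5.0),
    "medium": (4.0, 10.0),
    "long": (10.5, 30.0),
}

for measured in (2.0, 4.5, 10.2, 12.0):
    print(measured, "->", remap(measured, LEAF_LENGTH_STATES))
```

The point of returning None rather than guessing is that the ambiguity stays
visible to whoever reevaluates the data.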

It seems to me we need protocols for describing how multiple kinds of
characters can be referenced.   I think we need a reference model that permits
us to associate various kinds of metadata and not (necessarily) establish the
equivalence of specific character definitions.  For example, it would be
extremely helpful to be able to query the web oracle and ask "Please give me a
list of all the characters (names and descriptors) associated with venation
in taxa X, Y, and Z" or "What character systems have been used to establish
the propinquity of [or simply describe] members of taxon A?"  If our lexicon
and grammar are inadequate for classifying some of these "characters", then
the list will always be incomplete and the reference model will be of limited
value.
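
A minimal Python sketch of the kind of query described above might look as
follows.  Everything in it is an assumption made for illustration: the record
layout, the topic tags, the placeholder taxa X, Y and Z, and the function
name characters_for.  Note that the query associates characters through
shared metadata only; it never asserts that two character definitions are
equivalent, and it lists a qualitative and a quantitative character side by
side.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CharacterRecord:
    name: str             # tag identifying the character
    descriptor: str       # free-text description of states or measurement
    topics: frozenset     # metadata terms the character is filed under
    taxa: frozenset       # taxa in which the character has been recorded


# Made-up records, purely illustrative.
RECORDS = [
    CharacterRecord("vein_pattern", "pinnate / palmate / parallel",
                    frozenset({"venation", "leaf"}), frozenset({"X", "Y"})),
    CharacterRecord("vein_density", "veins per mm, measured at mid-leaf",
                    frozenset({"venation"}), frozenset({"Z"})),
    CharacterRecord("petal_count", "integer count",
                    frozenset({"flower"}), frozenset({"X", "Z"})),
]


def characters_for(topic, taxa, records=RECORDS):
    """List (name, descriptor) for characters tagged with topic in any taxon."""
    wanted = set(taxa)
    return [(r.name, r.descriptor) for r in records
            if topic in r.topics and r.taxa & wanted]


for name, descriptor in characters_for("venation", ["X", "Y", "Z"]):
    print(name, ":", descriptor)
```

A character that cannot be given any topic tag never appears in such a list,
which is exactly the incompleteness problem noted above.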

> But Eric's comment reminds me that there is a STRONG reason for moving a
> data set between the various applications that deal with descriptive
> data:  a single person might want to use DELTA, LucID, PAUP and MacClade in
> the same study.  It would be "nice" to have the capability to create and
> maintain a single data set that could store and "serve" data to each
> application.  If we could create the specification for that data set, I
> would judge this effort a success.

> -Stan

Obviously, it would be nice (probably essential) if we could then translate
among different representations, but if this list is really only about
translation among existing data formats, then perhaps conceptualization of a
more general reference model for representing taxonomic data can be done
elsewhere [ :( ].

Nonetheless, I really don't think that in many cases one would want to
actually store the reference data in a common format and then serve it.
Rather, because some forms of storage will be much more compact, and hence
more efficient for specific tasks, there would often be a strong incentive to
keep data in its "native" format.  However, these representations would
necessarily be more arcane and probably not human readable.  Instead, I think
it would more often be useful to consider how one could specify how the native
format could be reformulated dynamically into a common "standard?" reference
format and then retranslated into another native format.  XSLT is well suited
for this purpose.  If the "standard reference" involved "standard tagging"
then use of XML could provide a nice human readable format that might exist
"conceptually" or only briefly in memory during translation.  This would
require, however, understanding what kind of language (tags?) we  might
formulate to describe the common representations so that the translations are
correct when made.  This has three advantages that would be useful: 1) that
the "view" of the data can be separated from its inherent logic, 2) that
natives who like to use "native" languages need not become restless as a
result of our efforts, and 3) various "native" formats might die out simply
because other approaches are found more useful or new formats could emerge,
without impacting our ability to track and reference such data.  Of course, we
need to keep in mind that in taxonomy and systematics, generally when we refer
to character data, we are usually really referring to representations of data
rather than the data itself (i.e., values ultimately derived from some
repeatable measurement process).  For those who shun phenetics, where
characters have a quantitative meaning, the measurement process is largely
implicit in the resultant conceptualizations ("states").
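
The translate-through-a-common-form idea can be sketched in a few lines of
Python.  This is only an illustration under stated assumptions: the two
"native" formats here ("pairs" and "lines") are toy formats invented for the
example and are not DELTA, LucID, PAUP or MacClade syntax, and a plain
dictionary stands in for the XML reference form.  The common form exists only
briefly in memory, between the reader and the writer.

```python
def read_pairs(text):
    """Toy native format A: 'name=value;name=value'."""
    return dict(item.split("=", 1) for item in text.split(";") if item)


def write_pairs(record):
    return ";".join(f"{name}={value}" for name, value in record.items())


def read_lines(text):
    """Toy native format B: one 'name<TAB>value' per line."""
    return dict(line.split("\t", 1) for line in text.splitlines() if line)


def write_lines(record):
    return "\n".join(f"{name}\t{value}" for name, value in record.items())


# One reader and one writer per native format, all meeting at the common form.
READERS = {"pairs": read_pairs, "lines": read_lines}
WRITERS = {"pairs": write_pairs, "lines": write_lines}


def translate(text, source, target):
    """Native -> common reference form (in memory only) -> native."""
    common = READERS[source](text)
    return WRITERS[target](common)


print(translate("leaf_margin=serrate;petal_count=5", "pairs", "lines"))
```

With this arrangement, supporting a new native format means adding one reader
and one writer rather than a converter to and from every other format, and a
format that dies out can simply be dropped from the two tables.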

If translation is required, then one might at a suitable place in the
descriptor structure/lexicon establish a means for specifying <character
encoding="DELTA"> or <character encoding="LucID"> etc.  Then let the processor
handle the rules required by the transform, at least to the extent the
inherent level of ambiguity allows.  The question is then what information is
needed by the processor to perform the specific requested transformation.
Seems to me we require at a minimum: 1) tag(s) identifying the character, 2)
specification(s?) of the character/data class (what kind of character/data is
it?), and 3) a means of informing the processor what its various properties
(i.e., states, ordering, values, etc.) are and how they are encoded.
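
To make the three minimum requirements concrete, here is a small Python
sketch that reads one tagged character into a record a processor could work
from.  Apart from <character encoding="...">, which is suggested above, every
element name, attribute name and value in the sample is hypothetical and not
part of any agreed standard.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

# Hypothetical tagging: "id", "class", <state> and "code" are assumptions.
SAMPLE = """
<character id="leaf_margin" encoding="DELTA" class="unordered-multistate">
  <state code="1">entire</state>
  <state code="2">serrate</state>
  <state code="3">lobed</state>
</character>
"""


@dataclass
class Character:
    ident: str        # 1) tag identifying the character
    data_class: str   # 2) what kind of character/data it is
    encoding: str     # which native encoding the state codes follow
    states: dict = field(default_factory=dict)  # 3) properties and their codes


def parse_character(xml_text):
    """Build a Character record from one <character> element."""
    node = ET.fromstring(xml_text.strip())
    states = {s.get("code"): (s.text or "").strip()
              for s in node.findall("state")}
    return Character(node.get("id"), node.get("class"),
                     node.get("encoding"), states)


character = parse_character(SAMPLE)
print(character.ident, character.data_class, character.encoding)
print(character.states)
```

Ordering, value ranges and units would need further attributes or child
elements; which ones are needed is precisely the open question.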

Might detailed discussion of a few example characters permit us to better
understand exactly how these specific (target?) implementations handle
different kinds of characters and how they might be alternatively
represented?  This would give us a better idea of what kinds of "rules" we
require and in what circumstances one kind of representation might be more or
less ambiguous than another.

Stuart