Re: General reference schemes for classifying characters
At 11:29 AM 7/25/00 +1000, Eric Zurcher wrote:
- Difficulty in merging or comparing datasets - it is rather difficult to combine datasets based on differing character lists, even when those character lists are fairly similar. There is no mechanism for "mapping" character states from one dataset onto those of another. (Disparate character lists are another matter entirely. My personal view is that the holy grail of a "universal" character list for, say, all of botany will tend to remain tantalizingly just out of reach, and the efforts of this group should not be distracted in that direction.)
The issue of constructing mechanisms for mapping characters into one another is quite different from attempting to circumscribe a language necessary to describe all possible definitions of "character" so that rules governing their description can be unambiguously applied. One doesn't need "universal" characters, only a usefully large "universe" of potential character descriptors. Properties of some classes of characters will be wholly irrelevant to the characterization of others. I would agree it would be useless to search for mapping mechanisms that could be "universally applied".
Although I entirely agree with Peter, so few cladists actually measure the objects to which they assign states that one can safely conclude that matching "characters" by their underlying measurement data would likely only reference a very sparse matrix. Quantitative characters are fundamentally different from qualitative ones, in large part because qualitative characters are inherently more ambiguous and less constrained in terms of implicit notions of "equivalence".
However, if one did want to associate features represented in widely different formats, it would be helpful to have a data standard that could describe the various kinds of associations that might be usefully employed and how they might be related. As Stan points out, the measurements actually taken are often quite context sensitive and may even vary from one investigator to another or from one method of measurement to another. It would seem that one would need a good description of exactly how the measurements were taken, should one want either to repeat them or simply to understand what measures exist and what they tell us. Beyond that, however, it is "caveat emptor".
Nonetheless, despite numerous problems of correspondence, it would be extremely useful to be able to use consistently applied metadata to tag two different kinds of data (characters), both taken on the same organisms and perhaps on the very same objects, so that they could be reevaluated. Whether the possible "remappings" could be made automatically would largely depend on how much ambiguity one could remove from the qualitative characterizations. In my estimation, at present this could only be done for an extremely small number of characters under special circumstances and consequently would not now be of much interest to most. However, newer data-rich acquisition techniques continue to expand the number of features for which human-induced ambiguity can be better excluded from the data. Consequently, if our aim is to develop a standard protocol able to describe and relate such data, as well as to track and associate taxonomic data captured by more traditional methods, it would be worthwhile to have a means to do this and not leave such kinds of data out of the protocol merely because they do not readily conform to currently accepted paradigms (e.g. characters with states always specified by 0 or 1, my least favorite approach, etc.).
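As a minimal sketch of the kind of "remapping" that becomes automatic once the ambiguity is removed, suppose a qualitative character is defined explicitly in terms of a measurement. The character name, the state names, and the cutoff value below are all invented for illustration and are not taken from any real character list.

```python
# Hypothetical rule: the character, its states, and the 3.0 cutoff are
# assumptions made up for this example.
LEAF_SHAPE_RULE = {
    "character": "leaf shape",
    "measure": "length / width",
    # Each bin is (exclusive upper bound on the ratio, state name).
    "bins": [(3.0, "broad"), (float("inf"), "narrow")],
}

def state_from_measurement(length_mm, width_mm, rule=LEAF_SHAPE_RULE):
    """Assign a qualitative state from the underlying measurements."""
    ratio = length_mm / width_mm
    for upper_bound, state in rule["bins"]:
        if ratio < upper_bound:
            return state
    raise ValueError("ratio falls outside every bin")

print(state_from_measurement(40.0, 20.0))  # ratio 2.0
print(state_from_measurement(90.0, 15.0))  # ratio 6.0
```

Note that the mapping only runs one way: a recorded state of "broad" cannot be turned back into a measurement, which is one reason the measurement description itself is worth keeping in the metadata.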
It seems to me we need protocols for describing how multiple kinds of characters can be referenced. I think we need a reference model that permits us to associate various kinds of metadata and not (necessarily) establish the equivalence of specific character definitions. For example, it would be extremely helpful to be able to query the web oracle and ask "Please give me a list of all the characters (names and descriptors) associated with venation in taxa X, Y, and Z" or "What character systems have been used to establish the propinquity of [or simply describe] members of taxon A?" If our lexicon and grammar are inadequate for classifying some of these "characters", then the list will always be incomplete and the reference model will be of limited value.
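The first query above could be sketched roughly as follows, assuming each character carries metadata tags for the topics it concerns and the taxa it has been recorded for. The record layout, the function name, and the sample entries are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class CharacterRecord:
    """Hypothetical metadata record for one character."""
    name: str
    descriptor: str
    topics: set = field(default_factory=set)
    taxa: set = field(default_factory=set)

# Invented sample entries, for illustration only.
REGISTRY = [
    CharacterRecord("secondary vein pattern", "course of secondary veins",
                    {"venation"}, {"X", "Y"}),
    CharacterRecord("vein density", "vein length per unit leaf area",
                    {"venation"}, {"A"}),
    CharacterRecord("petal number", "count of petals per flower",
                    {"flower"}, {"X", "Z"}),
]

def characters_for(topic, taxa, registry=REGISTRY):
    """Return (name, descriptor) for characters tagged with the topic
    and recorded for at least one of the requested taxa."""
    wanted = set(taxa)
    return [(r.name, r.descriptor) for r in registry
            if topic in r.topics and r.taxa & wanted]

print(characters_for("venation", ["X", "Y", "Z"]))
```

The query never needs to decide whether two characters are equivalent; it only relies on the metadata being consistently applied, which is the point of the reference model.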
But Eric's comment reminds me that there is a STRONG reason for moving a data set between the various applications that deal with descriptive data: a single person might want to use DELTA, LucID, PAUP and MacClade in the same study. It would be "nice" to have the capability to create and maintain a single data set that could store and "serve" data to each application. If we could create the specification for that data set, I would judge this effort a success.
-Stan
Obviously, it would be nice (probably essential) if we can then translate from different representations, but if this list is really only about translation among different existing data formats, then perhaps conceptualization for a more general reference model for representation of taxonomic data can be done elsewhere [ :( ].
Nonetheless, I really don't think that in many cases one would want to actually store the reference data in a common format and then serve it. Rather, because some forms of storage will be much more compact, and hence more efficient for specific tasks, there would often be a strong incentive to keep data in its "native" format. However, these representations would necessarily be more arcane and probably not human readable.
Instead, I think it would more often be useful to consider how one could specify how the native format could be reformulated dynamically into a common "standard?" reference format and then retranslated into another native format. XSLT is well suited for this purpose. If the "standard reference" involved "standard tagging", then use of XML could provide a nice human readable format that might exist only "conceptually" or only briefly in memory during translation. This would require, however, understanding what kind of language (tags?) we might formulate to describe the common representations so that the translations are correct when made.
This approach has three useful advantages: 1) the "view" of the data can be separated from its inherent logic, 2) natives who like to use "native" languages need not become restless as a result of our efforts, and 3) various "native" formats might die out simply because other approaches are found more useful, or new formats could emerge, without impacting our ability to track and reference such data.
Of course, we need to keep in mind that in taxonomy and systematics, when we refer to character data, we are usually really referring to representations of data rather than the data itself (i.e. values ultimately derived from some repeatable measurement process). For those who shun phenetics, where characters have a quantitative meaning, the measurement process is largely implicit in the resultant conceptualizations ("states").
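The translation path proposed above (native format, to a transient common reference form, to another native format) might look something like the following sketch. The two "formats" here are invented stand-ins, not the real DELTA, LucID, or PAUP layouts, and the state codes are assumptions. The same idea could be expressed as an XSLT stylesheet over XML; Python is used only to keep the sketch short.

```python
# Hypothetical formats: "format A" stores state names, "format B" stores
# compact integer codes. Neither is a real application's file layout.
STATE_CODES = {"entire": 0, "serrate": 1, "lobed": 2}

def from_format_a(text):
    """Parse 'taxon = state' lines into the common reference form."""
    records = []
    for line in text.strip().splitlines():
        taxon, state = line.split("=")
        records.append({"taxon": taxon.strip(), "state": state.strip()})
    return records

def to_format_b(records):
    """Emit the compact coded form from the common reference form."""
    return " ".join(f"{r['taxon']}:{STATE_CODES[r['state']]}" for r in records)

native_a = """
taxon_X = lobed
taxon_Y = serrate
"""

common = from_format_a(native_a)   # exists only briefly in memory
native_b = to_format_b(common)
print(native_b)
```

Adding a third format then means writing one reader and one writer against the common form, rather than a converter for every pair of formats.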
If translation is required, then one might at a suitable place in the descriptor structure/lexicon establish a means for specifying <character encoding="DELTA"> or <character encoding="LucID"> etc. Then let the processor handle the rules required by the transform, at least to the extent the inherent level of ambiguity allows. The question is then what information is needed by the processor to perform the specific requested transformation. It seems to me we require at a minimum: 1) tag(s) identifying the character, 2) specification(s?) of the character/data class (what kind of character/data is it?), and 3) a means of informing the processor what its various properties (i.e. states, ordering, values, etc.) are and how they are encoded.
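To make the three minimum requirements concrete, here is a rough sketch of a descriptor and of a processor step that reads it. Only the <character encoding="..."> idea comes from the paragraph above; the id and class attributes, the <state> element, and the sample values are assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical descriptor. Everything except the encoding attribute on
# <character> is an invented placeholder, not a proposed standard.
DESCRIPTOR = """
<character id="leaf_margin" class="unordered-multistate" encoding="DELTA">
  <state code="1">entire</state>
  <state code="2">serrate</state>
  <state code="3">lobed</state>
</character>
"""

def read_descriptor(xml_text):
    """Collect the three minimum items: identity, class, and properties."""
    node = ET.fromstring(xml_text)
    return {
        "id": node.get("id"),                # 1) which character this is
        "class": node.get("class"),          # 2) what kind of character
        "encoding": node.get("encoding"),    # 3) how it is encoded ...
        "states": {s.get("code"): s.text     #    ... and its states
                   for s in node.findall("state")},
    }

descriptor = read_descriptor(DESCRIPTOR)
print(descriptor)
```

A processor could then select its transform rules from the encoding and class values, and refuse to translate when the class tells it the target format cannot represent the character without ambiguity.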
Might detailed discussion of a few example characters permit us to better understand exactly how these specific (target?) implementations handle different kinds of characters and how they might be alternatively represented? This would give us a better idea of what kinds of "rules" we require and in what circumstances one kind of representation might be more or less ambiguous than another.
Stuart
participants (1)
-
Stuart G. Poss