Gregor Hagedorn wrote:
---
> For the sake of completeness: If other states shall depend on certain
> values, this is not a problem, since one would define a character
> state mapping (DELTA: KEYSTATES) say:
> Char 1: Seed size, length (numeric) [µm]
> Char 2: Seed size, class
> state 1: Char1 < 2 mm
> state 2: Char1 >= 2 mm
> Some characters that can only be observed in larger seeds could then
> be made dependent on 2,2
This seems to characterize in a nutshell the problem discussed earlier,
contrasting qualitative and quantitative definitions when the underlying bases
of comparison of two characters are not independent, yet only loosely
associated. For both, one must understand the context of "size" and
"length" with respect to some reference frame. Further, the two use different
units. Also, one or the other may have associations, perhaps also loose, with
other characters (bases of comparison) that determine the dependency of
other characters.
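As a rough sketch of the mapping Gregor describes (not DELTA's actual KEYSTATES implementation), the numeric character can be converted to the threshold's units and then binned into a class. The function names are hypothetical, and I am assuming the ">= 2 mm" state is state 2, as the "2,2" dependency suggests:

```python
from typing import Optional

UM_PER_MM = 1000.0  # micrometres per millimetre


def seed_size_class(length_um: Optional[float],
                    threshold_mm: float = 2.0) -> Optional[int]:
    """Map Char 1 (length in micrometres) onto a Char 2 state.

    Returns 1 if the length is below the threshold, 2 if it is at or
    above it, and None if the measurement is unknown.
    """
    if length_um is None:
        return None  # no measurement, so the class stays unknown
    length_mm = length_um / UM_PER_MM  # reconcile the differing units
    return 1 if length_mm < threshold_mm else 2


def is_applicable(controlling_state: int, length_um: Optional[float]) -> bool:
    """A dependent character applies only when Char 2 is in the given state."""
    return seed_size_class(length_um) == controlling_state


# A character made dependent on "2,2" applies only to seeds of 2 mm or more.
print(is_applicable(2, 2500.0))
print(is_applicable(2, 500.0))
```

Note that the unit conversion has to happen somewhere, which is exactly the kind of information the character definition itself would need to carry.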
It might be useful to require a tagging language that permits one to wrap such
characters with a generalized, but controlled vocabulary that fully (to the
depth required) specifies the basis of comparison and its map onto the objects
for which the comparison is useful (character domain). Presumably a
generalized basis of comparison would have a name that may or may not be
identical to that used by the original author and that might give a
generalized definition to a particular class of properties (basis/es of
comparison). Tags that allow ancillary metadata to be associated with specific
instances, including first use, author, citation, etc., would also be
useful.
As Gregor's example shows, we also need to allow other "loosely associated"
characters (say "size" characters for objects of class "x", where here "x" is
"seeds") to also be associated (machine callable) independently of their
precise definitions/usage, which may actually differ subtly among workers.
In terms of adding the potential to broaden or narrow a search for other
related relevant datasets/characters, one might wish to be careful in stating
exactly what one might mean by seed (embryonic sporophyte) in its most general
context, applying categorical "adjectives" to both refine the intended meaning
and maximize the useful extent of its potential associations.
Use of such a "wrapper or tagging language" would require the specific
character be embedded into a controlled generalized character hierarchy or
definition space, much as Gregor has done, but in more general terms that lend
themselves to increasing precision through refinement. For some applications
one might not need to attach all the metadata, since much could potentially be
"inherited" from higher levels of generality, leaving most workers to include
"content/domain specific" wording as given in the example, and perhaps a few
key "lower level generalities" that could inherit higher-order properties
within a generalized system needed to assure the comparability of datasets and
character definitions across a wide variety of scales, taxa, and
context-specific feature sets.
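To illustrate what "inheriting" from higher levels of generality might look like, here is a minimal sketch. The hierarchy, its node names, and the resolve function are all hypothetical; the point is only that a worker supplies the domain-specific tags and the rest is filled in from more general ancestors:

```python
# Hypothetical hierarchy: each node carries its own tags and names its parent.
HIERARCHY = {
    "general_size": {
        "parent": None,
        "tags": {"property": "general_size"},
    },
    "landmark_length_measurement": {
        "parent": "general_size",
        "tags": {
            "property_subtype": "landmark_length_measurement",
            "character_type": "quantitative",
        },
    },
    "seed_length": {
        "parent": "landmark_length_measurement",
        "tags": {"name": "seed size", "object": "seed_bearing_plant"},
    },
}


def resolve(name, hierarchy):
    """Merge tags from the most general ancestor down to `name`.

    Tags set at a more specific level override those inherited from above.
    """
    chain = []
    while name is not None:
        node = hierarchy[name]
        chain.append(node["tags"])
        name = node["parent"]
    merged = {}
    for tags in reversed(chain):  # most general first, most specific last
        merged.update(tags)
    return merged


full = resolve("seed_length", HIERARCHY)
print(full["property"], full["character_type"], full["name"])
```

Here the worker defining "seed_length" writes only two tags, yet the resolved descriptor also carries the property, subtype, and character type from the levels above.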
For quantitative characters one would need to provide enough information to
specify exactly what one means by length (or details of the algorithm used, if
this is machine defined); units of measure would be needed. A mass property
would be much easier than a length or other general_size measure: it would
simply need to be specified in grams or with a suitable scale factor (kilograms,
picograms, etc.).
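The scale-factor idea could be sketched roughly as follows. The table and function names are assumptions for illustration only; a real system would presumably draw on a controlled list of units rather than this small hand-written one:

```python
# Hypothetical scale factors relative to a base unit (metre for length,
# gram for mass). Only a few units are listed for illustration.
SCALE_TO_BASE = {
    "length": {"µm": 1e-6, "mm": 1e-3, "m": 1.0},
    "mass": {"pg": 1e-12, "g": 1.0, "kg": 1e3},
}


def convert(value, from_unit, to_unit, kind):
    """Convert `value` between two units of the same kind of measure."""
    factors = SCALE_TO_BASE[kind]
    if from_unit not in factors or to_unit not in factors:
        raise ValueError(f"unknown {kind} unit: {from_unit!r} or {to_unit!r}")
    return value * factors[from_unit] / factors[to_unit]


# Express the 2 mm class boundary in the micrometres used for the raw lengths.
print(convert(2.0, "mm", "µm", "length"))
print(convert(1.0, "kg", "g", "mass"))
```

This covers only the units. What "length" means for a given object (which landmarks, which protocol) would still need its own specification.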
So for the two characters one might wrap them something like the following
(syntax possibly more suitably arranged for XML or DELTA, NEXUS, etc.
translation), with thought as to which are objects, which are attributes of
objects, and which are attributes of attributes of objects, likewise the same
for the character domains (taxa), etc., and perhaps the data itself (or
representations of data):
<@character name="seed size"; @property="general_size";
@character_type="quantitative";
@property_subtype="landmark_length_measurement"; @measurement_protocol
parameters=[[landmark definitions list], data_values="list_of_scalar",
storage_representation="float", units="µm"]; @object="seed_bearing_plant";
@object_taxon="Leguminacea?"; @object_rank="species"; @character_definition
author="Hagedorn", citation="TDWG-SDD(a)USOBI.ORG", date="Tue, 21 Dec 1999
17:18:24 +0100">
<@character name="seed size"; @property="general_size";
@character_type="qualitative";
@property_subtype="landmark_length_measurement_class"; @assignment_protocol
parameters=["disjoint_state_assignment", state_names_list="[state1 | state2]",
statevalues=["<2" | ">=2"], units="mm"]; @object="seed_bearing_plant";
@object_taxon="Leguminacea?"; @object_rank="species"; @character_definition
author="Hagedorn", citation="TDWG-SDD(a)USOBI.ORG", date="Tue, 21 Dec 1999
17:18:24 +0100">
Character data itself might likewise be appended as a subsequent block in
such a character descriptor stream, perhaps quite roughly as <@character_data
description="Hagedorn's character state data" encoding="DELTA: KEYSTATES"
format="matrix, no_characters, no_taxa"> <@start_datastream>ACTUAL
DATASTREAM<@end_datastream>
Perhaps some of this could be made even less verbose by judicious use of special
pointers, say <@unique_dataset_registry_number="123456";
@data_registry_archive="Harvard Herbaria Character Data MegaArchive"> that
could associate groups of characters yet be maintained separate from a fully
self-describing protocol so that calls to the data archives, say for subsets
of characters for a composite set of taxa across multiple datasets would
"know" how to read data and make the suitable translations (and infer
dependencies where possible, but sometimes using <UNDEFINED> or
<UNRESOLVED> as a default when required), without the need to send "fully
tagged" meta-data unless required.
Based on examples given so far, values for @property might include, as
a simple start, "general_size", "general_shape", "general_composition", and
"general_color".
With respect to @character_type we have so far discussed two, "quantitative" and
"qualitative", but one could envision at least two more, "mixed" and
"molecular", although potentially the last might be considered a subtype.
Left out of the foregoing would be a specification to deal with parts and
subparts of objects upon which characters are defined. This has been touched
on repeatedly in several threads. I'm not enough of a botanist to know
exactly what subpart of a plant a seed is, although I would assume that it
might be defined in multiple ways depending upon things like germ layer,
topographic position, life cycle. This may be more difficult, since
dependency may be determined on both how other characters are defined and the
specific taxa (character domain) in question.
If we are careful in circumscribing a generalized character markup language,
the user (or subsequent user) could then decide how powerful a
characterization would likely be useful, without having to specify every
conceivable domain-specific tag, while at the same time building a largely
self-refining system (assuming there is some permanence to the data and
metadata). Those most familiar with DELTA or NEXUS, etc. would have to bother
with only a minimal set of wrappers to characterize data in these formats,
depending upon need, whereas others may add characterization to existing
wrappers where it is deemed insufficient (e.g. range data that inherently
includes landmarks and positional relationships among landmarks in Euclidean
space). Interfaces could be built to assist in traversing domain specific
vocabulary, to either more fully encode or strip off tags as required, to
validate character data submitted for publication prior to archival, as well
as search for and compare specific characters and study their general
usefulness or properties.
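A minimal sketch of such an interface, assuming descriptors are held as simple tag/value dictionaries, might look like this. The required-tag set and both function names are hypothetical choices of mine, not part of DELTA, NEXUS, or any agreed vocabulary:

```python
# Hypothetical minimal tag set every character descriptor must carry.
REQUIRED_TAGS = {"name", "property", "character_type"}
# Hypothetical rule: these character types must also state their units.
UNITS_REQUIRED_FOR = {"quantitative"}


def validate(descriptor):
    """Return a list of problems; an empty list means the descriptor passes."""
    missing = sorted(REQUIRED_TAGS - set(descriptor))
    problems = [f"missing tag: {tag}" for tag in missing]
    needs_units = descriptor.get("character_type") in UNITS_REQUIRED_FOR
    if needs_units and "units" not in descriptor:
        problems.append("quantitative character lacks units")
    return problems


def strip_tags(descriptor, keep=REQUIRED_TAGS):
    """Reduce a fully tagged descriptor to the tags named in `keep`."""
    return {tag: value for tag, value in descriptor.items() if tag in keep}


seed_length = {
    "name": "seed size",
    "property": "general_size",
    "character_type": "quantitative",
    "units": "µm",
    "object": "seed_bearing_plant",
}
print(validate(seed_length))
print(sorted(strip_tags(seed_length)))
```

The same checks could be run before archival, while the stripped form is what one would send when "fully tagged" metadata is not required.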
The precise syntax needs to be flexible, yet be able to characterize
characters in general terms that would lend themselves to refinements of
meaning, as specific as required for a given context (domain). However we do
it, it is critical that we focus on ensuring that we are defining the basis of
comparison with sufficient precision to ensure that results are comparable
and with sufficient generality to ensure that our searches of "feature or
taxon" space have broad applicability. It would definitely be nice if the
system is general enough to permit morphological characters to be studied side
by side with molecular data, which through evolution seem to be universally
coded through an elegantly simple arrangement of base-pairs.
Stuart