Gregor Hagedorn wrote: ---
For the sake of completeness: If other states shall depend on certain values, this is not a problem, since one would define a character state mapping (DELTA: KEYSTATES), say:

  Char 1: Seed size, length (numeric) [µm]
  Char 2: Seed size, class
    state 1: Char1 < 2 mm
    state 2: Char1 >= 2 mm

Some characters that can only be observed in larger seeds could then be made dependent on 2,2.
This seems to characterize in a nutshell the problem discussed earlier, contrasting qualitative and quantitative definitions when the underlying bases of comparison of two characters are not independent, yet are only loosely associated. For both, one must understand the context of "size" and "length" with respect to some reference frame. Further, the two use different units. Also, one or the other may have associations, perhaps also loose, with other characters (bases of comparison), and these associations may determine the dependency of still other characters.
It might be useful to require a tagging language that permits one to wrap such characters with a generalized, but controlled, vocabulary that fully (to the depth required) specifies the basis of comparison and its map onto the objects for which the comparison is useful (character domain). Presumably a generalized basis of comparison would have a name that may or may not be identical to that used by the original author, and that might give a generalized definition to a particular class of properties (bases of comparison). Tags that allow ancillary metadata to be associated with specific instances, including first use, author, citation, etc., would also be useful.
As Gregor's example shows, we also need to allow other "loosely associated" characters (say "size" characters for objects of class "x", where here "x" is "seeds") to be associated (machine callable) independently of their precise definitions/usage, which may actually differ subtly among workers. To add the potential to broaden or narrow a search for other relevant datasets/characters, one might wish to be careful in stating exactly what one means by seed (embryonic sporophyte) in its most general context, applying categorical "adjectives" both to refine the intended meaning and to maximize the useful extent of its potential associations.
Use of such a "wrapper or tagging language" would require the specific character be embedded into a controlled generalized character hierarchy or definition space, much as Gregor has done, but in more general terms that lend themselves to increasing precision through refinement. For some applications one might not need to attach all the metadata, since much could potentially be "inherited" from higher levels of generality, leaving most workers to include "content/domain specific" wording as given in the example, and perhaps a few key "lower level generalities" that could inherit higher-order properties within a generalized system needed to assure the comparability of datasets and character definitions across a wide variety of scales, taxa, and context-specific feature sets.
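The idea of "inheriting" metadata from higher levels of generality can be sketched roughly as follows. This is only a minimal illustration in Python, not a proposed format: the class name CharacterNode, its methods, and the three-level hierarchy are all my own assumptions, and the tag names are simply borrowed from the example wrappers in this message.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CharacterNode:
    """One level in a hypothetical generalized character hierarchy."""
    name: str
    tags: dict = field(default_factory=dict)
    parent: Optional["CharacterNode"] = None

    def resolve(self, tag: str, default: str = "<UNDEFINED>") -> str:
        """Look up a tag locally, then walk up toward more general levels."""
        node = self
        while node is not None:
            if tag in node.tags:
                return node.tags[tag]
            node = node.parent
        return default

    def full_tags(self) -> dict:
        """Merge inherited and local tags; the more specific level wins."""
        inherited = self.parent.full_tags() if self.parent else {}
        return {**inherited, **self.tags}


# Most general level: the kind of property being compared.
general_size = CharacterNode("general_size", {"property": "general_size"})

# Intermediate level: how that property is measured.
landmark_length = CharacterNode(
    "landmark_length_measurement",
    {"character_type": "quantitative", "storage_representation": "float"},
    parent=general_size,
)

# Domain-specific level: the only part most workers would write themselves.
seed_length = CharacterNode(
    "seed size, length",
    {"object": "seed_bearing_plant", "units": "µm"},
    parent=landmark_length,
)

print(seed_length.resolve("property"))
print(seed_length.full_tags())
```

Here the worker supplies only the object and the units; the property and the character type are picked up from the more general levels, and a tag that nobody has supplied falls back to a default.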
For quantitative characters one would need to provide enough information to specify exactly what one means by length (or details of the algorithm used, if this is machine defined); units of measure would also be needed. A mass property would be much easier than a length or other general_size measure; it would simply need to be specified in grams or with a suitable scale factor (kilograms, picograms, etc.).
So for the two characters one might wrap them something like (syntax possibly more suitably arranged for XML or DELTA, NEXUS, etc. translation) and with thought as to which are objects, which are attributes of objects, and which are attributes of attributes of objects, likewise the same for the character domains (taxa), etc, and perhaps the data itself (or representations of data) :
<@character name="seed size";
  @property="general_size";
  @character_type="quantitative";
  @property_subtype="landmark_length_measurement";
  @measurement_protocol parameters=[[landmark definitions list], data_values="list_of_scalar", storage_representation="float", units="µm"];
  @object="seed_bearing_plant";
  @object_taxon="Leguminacea?";
  @object_rank="species";
  @character_definition author="Hagedorn", citation="TDWG-SDD@USOBI.ORG", date="Tue, 21 Dec 1999 17:18:24 +0100">

<@character name="seed size";
  @property="general_size";
  @character_type="qualitative";
  @property_subtype="landmark_length_measurement_class";
  @assignment_protocol parameters=["disjoint_state_assignment", state_names_list=["state1" | "state2"], statevalues=["<2" | ">=2"], units="mm"];
  @object="seed_bearing_plant";
  @object_taxon="Leguminacea?";
  @object_rank="species";
  @character_definition author="Hagedorn", citation="TDWG-SDD@USOBI.ORG", date="Tue, 21 Dec 1999 17:18:24 +0100">
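To make the relationship between the two wrapped characters concrete, here is a rough Python sketch of how the class character could be derived from the length character, including the µm to mm conversion that the differing units force on us. All names (QuantitativeCharacter, KeyState, DerivedClassCharacter, is_applicable) are hypothetical, and I am assuming that state 1 means "< 2 mm" and state 2 means ">= 2 mm".

```python
from dataclasses import dataclass

# How many of each unit make one millimetre (assumed, deliberately small table).
UNITS_PER_MM = {"µm": 1000.0, "mm": 1.0}


@dataclass(frozen=True)
class QuantitativeCharacter:
    name: str
    units: str


@dataclass(frozen=True)
class KeyState:
    name: str
    lower_mm: float  # inclusive lower bound
    upper_mm: float  # exclusive upper bound


@dataclass(frozen=True)
class DerivedClassCharacter:
    """A qualitative character whose states are ranges of a numeric one."""
    name: str
    source: QuantitativeCharacter
    states: tuple

    def classify(self, value: float) -> str:
        """Map a measurement in the source's units onto a state name."""
        mm = value / UNITS_PER_MM[self.source.units]
        for state in self.states:
            if state.lower_mm <= mm < state.upper_mm:
                return state.name
        return "<UNDEFINED>"


seed_length = QuantitativeCharacter("seed size, length", "µm")
seed_class = DerivedClassCharacter(
    "seed size, class",
    seed_length,
    (KeyState("state1", 0.0, 2.0), KeyState("state2", 2.0, float("inf"))),
)


def is_applicable(controlling_state: str, required_state: str = "state2") -> bool:
    """A dependent character is scorable only when the controlling
    character is in the required state (here: the larger seeds)."""
    return controlling_state == required_state


for length_um in (1500.0, 2000.0):
    state = seed_class.classify(length_um)
    print(length_um, state, is_applicable(state))
```

The point is that the dependency ("only observable in larger seeds") hangs off the derived class character, while the measurement itself stays in its original units.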
Character data itself might likewise be appended as a subsequent block in such a character descriptor stream, perhaps quite roughly as:

<@character_data description="Hagedorn's character state data" encoding="DELTA: KEYSTATES" format="matrix, no_characters, no_taxa">
<@start_datastream>ACTUAL DATASTREAM<@end_datastream>
Perhaps some of this could be made even less verbose by judicious use of special pointers, say <@unique_dataset_registry_number="123456"; @data_registry_archive="Harvard Herbaria Character Data MegaArchive">, that could associate groups of characters yet be maintained separately from a fully self-describing protocol. Calls to the data archives, say for subsets of characters for a composite set of taxa across multiple datasets, would then "know" how to read the data and make suitable translations (inferring dependencies where possible, but sometimes using <UNDEFINED> or <UNRESOLVED> as a default when required), without the need to send "fully tagged" metadata unless required.
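A pointer of this kind might be expanded on demand along the following lines. Again this is only a sketch under assumptions: the REGISTRY dictionary stands in for whatever an archive would really provide, the function name expand is made up, and the stored tag values are just those from the example wrappers.

```python
UNRESOLVED = "<UNRESOLVED>"

# Hypothetical in-memory stand-in for a character data archive,
# keyed by (archive name, registry number).
REGISTRY = {
    ("Harvard Herbaria Character Data MegaArchive", "123456"): {
        "property": "general_size",
        "character_type": "quantitative",
        "units": "µm",
        "object": "seed_bearing_plant",
    },
}


def expand(pointer: dict, wanted: list) -> dict:
    """Return the requested tags for a pointer.

    Tags written directly on the pointer take precedence over tags stored
    in the archive; anything found in neither place comes back as
    <UNRESOLVED> rather than raising an error.
    """
    key = (
        pointer.get("data_registry_archive"),
        pointer.get("unique_dataset_registry_number"),
    )
    stored = REGISTRY.get(key, {})
    return {tag: pointer.get(tag, stored.get(tag, UNRESOLVED)) for tag in wanted}


pointer = {
    "unique_dataset_registry_number": "123456",
    "data_registry_archive": "Harvard Herbaria Character Data MegaArchive",
}
print(expand(pointer, ["property", "units", "object_rank"]))
```

Only the short pointer travels with the data; the full tagging is recovered from the archive when someone actually asks for it.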
Based on the examples given so far, alternative values for @property might include, as a simple start, "general_size", "general_shape", "general_composition" and "general_color".
With respect to @character_type, we have so far discussed two, "quantitative" and "qualitative", but one could envision at least two more, "mixed" and "molecular", although potentially the last might be considered a subtype.
Left out of the foregoing is a specification to deal with the parts and subparts of objects upon which characters are defined. This has been touched on repeatedly in several threads. I'm not enough of a botanist to know exactly what subpart of a plant a seed is, although I would assume that it might be defined in multiple ways depending upon things like germ layer, topographic position, and life cycle. This may be more difficult, since dependency may be determined both by how other characters are defined and by the specific taxa (character domain) in question.
If we are careful in circumscribing a generalized character markup language, the user (or subsequent user) could then decide how powerful a characterization would likely be useful, without having to specify every conceivable domain-specific tag, while at the same time building a largely self-refining system (assuming there is some permanency to the data and metadata). Those most familiar with DELTA, NEXUS, etc. would have to bother with only a minimal set of wrappers to characterize data in these formats, depending upon need, whereas others may add characterization to existing wrappers where it is deemed insufficient (e.g. range data that inherently includes landmarks and positional relationships among landmarks in Euclidean space). Interfaces could be built to assist in traversing domain-specific vocabulary, to either more fully encode or strip off tags as required, to validate character data submitted for publication prior to archival, and to search for and compare specific characters and study their general usefulness or properties.
The precise syntax needs to be flexible, yet able to characterize characters in general terms that lend themselves to refinements of meaning, as specific as required for a given context (domain). However we do it, it is critical that we focus on defining the basis of comparison with sufficient precision to ensure that results are comparable, and with sufficient generality to ensure that our searches of "feature or taxon" space have broad applicability. It would definitely be nice if the system were general enough to permit morphological characters to be studied side by side with molecular data, which through evolution seem to be universally coded through an elegantly simple arrangement of base pairs.
Stuart