Re: Types of data
Kevin Thiele wrote:
Could one constrain further by expressing within the schema the hierarchical relationships between the elements (e.g. 'blade' and 'petiole' as child elements of 'leaf'), or would the introduction of terminology into the 'standard' be a step too far?
This worries me. Is the entity/property/value schema restricted to 3 levels? (Ditto for XML's element/attribute/value.) If so, it seems too much of a straitjacket, especially if we want to express a hierarchy of elements like this. What to do about leaf/lamina/abaxial_surface/vein_islands/indumentum/density? I agree that a hierarchically structured schema may be wonderful, but in a biological context I wouldn't like to see it constrained at 3 levels just because that's what existing computer formats do.
Cheers - k
I don't think there is any internal constraint within XML that would restrict one to evaluating hierarchical relationships, if these were the most suitable, to only 3 levels. Some parsers may actually be constrained to work with a finite depth of nesting, but I would suspect that any such limit would be much greater than 3. I would think that, as is the case for object or relational modeling, one could instantiate one-to-many or many-to-one relationships as required.
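As a rough illustration of the nesting point, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The element names are simply the path from Kevin's message; the value "sparse" and the empty petiole element are made-up placeholders, not part of any proposed schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: element names follow the path in Kevin's message,
# and "sparse" is an invented placeholder value.
doc = """
<leaf>
  <lamina>
    <abaxial_surface>
      <vein_islands>
        <indumentum>
          <density>sparse</density>
        </indumentum>
      </vein_islands>
    </abaxial_surface>
  </lamina>
  <petiole/>
</leaf>
"""

def depth(element):
    """Return the number of nested element levels at or below `element`."""
    return 1 + max((depth(child) for child in element), default=0)

root = ET.fromstring(doc)
max_depth = depth(root)

# A path expression reaches the innermost element directly.
density = root.find("lamina/abaxial_surface/vein_islands/indumentum/density").text

print(max_depth, density)
```

Here the document nests six element levels, so the 3-level worry does not seem to come from XML itself, only from how one chooses to map entity/property/value onto it.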
My original point here was only that it is often difficult to decide how to represent concepts that associate objects/entities/roles. In one context it would be appropriate to, say, regard a leaf as an entity, its shape as an attribute, and ovate as a value. However (and again this is probably not the best example), in another context a leaf of the same plant might be regarded as an attribute of yet another object/entity, perhaps a tree, and not considered an entity on its own. A particular implementation, such as one using XML, will often provide a variety of choices for representing the underlying ideas, but until we model/conceptualize how the various conceptual components interoperate/relate, a particular approach can often be difficult to evaluate or to extend.
Further, I didn't mean to imply that all conceptual associations among character data types would be hierarchical. It was probably a poor choice of words on my part, and I believe such a restriction would greatly limit the range of investigators who might be able to contribute and substantially limit the number of taxa or morphs that could be so characterized. There could be a large number of potential representations, including: trees, cyclic, acyclic, or directed graphs, lattices, semilattices, ordered sets, partially ordered sets, Galois connections, tensorial decompositions, etc. My point was that I believe we need to identify the critical conceptual relations and their context in a "standard" model for character data and then to formulate a language (implementation [in XML?]) for describing the most meaningful or useful among them.
This language would presumably be a superset of existing methods. To use the terminology of Ganter and Wille (1996. Formal Concept Analysis. Springer.) in its simplest form, it might be useful to speak, as closely as possible, in terms of formal contexts, such as K := (G, M, I), which consists of two sets, G (objects) and M (attributes), and I, the incidence relation between G and M of the context, and in terms of the formal concepts of the context (G, M, I), namely pairs (A, B), where A is the extent and B the intent, and where A ⊆ G, B ⊆ M, A' = B, and B' = A, with A' := {m ∈ M | gIm for all g ∈ A} and B' := {g ∈ G | gIm for all m ∈ B}. Fortunately, even for many-valued contexts, these can be represented as simple cross tables. Already, we have considered several, including 1) matrices of 0's and 1's, 2) taxon by measure matrices, 3) range matrices for a given specimen, 4) specimen x location matrices, 5) species by location matrices... No doubt others may think of lots of other possibilities.
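To make the definitions above concrete, here is a minimal sketch in Python of the two derivation operators and a brute-force enumeration of formal concepts. The taxa, attributes, and incidences are invented toy data, and closing every subset of G is only sensible for very small cross tables; it is not meant as an efficient algorithm.

```python
from itertools import combinations

# Toy cross table (made-up data): objects G, attributes M, incidence I.
G = ["taxon_a", "taxon_b", "taxon_c"]
M = ["leaf_ovate", "leaf_green", "petiole_present"]
I = {
    ("taxon_a", "leaf_ovate"), ("taxon_a", "leaf_green"),
    ("taxon_b", "leaf_green"), ("taxon_b", "petiole_present"),
    ("taxon_c", "leaf_ovate"), ("taxon_c", "leaf_green"),
    ("taxon_c", "petiole_present"),
}

def intent_of(A):
    """A' : the attributes shared by every object in A."""
    return frozenset(m for m in M if all((g, m) in I for g in A))

def extent_of(B):
    """B' : the objects that have every attribute in B."""
    return frozenset(g for g in G if all((g, m) in I for m in B))

def formal_concepts():
    """Collect all pairs (A, B) with A' = B and B' = A.

    Every subset S of G is closed: B = S' and A = B'. Exponential in
    the size of G, so suitable for toy contexts only.
    """
    concepts = set()
    for r in range(len(G) + 1):
        for subset in combinations(G, r):
            B = intent_of(subset)
            A = extent_of(B)
            concepts.add((A, B))
    return concepts

concepts = formal_concepts()
for A, B in sorted(concepts, key=lambda c: len(c[0])):
    print(sorted(A), sorted(B))
```

For this toy table the enumeration yields four concepts, for example the pair with extent {taxon_a, taxon_c} and intent {leaf_ovate, leaf_green}.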
A useful approach might then be to approximate the "quasi"-complete concept lattice B*(G, M, I) of a context derived from gluing (union of) concept lattices of a large number of relevant contexts and then identifying the most meaningful/useful/easily encoded congruent sublattices, factors, and implications between attributes, ignoring, at least initially, issues of scale, closure, and incomparability. Such an approach might thus permit working subgroups to focus on particular forms/types of character data and their most appropriate/useful/efficient representation. It might also permit us to collate useful lists of works/websites employing particular kinds of representations for an eventual website characterizing the emerging potential standard or recommendation.
Admittedly, such a more general view of what would be required to model "character data" creates some tangential considerations with respect to straightforward conversion among existing formats, certainly a worthy and perhaps more immediately necessary goal. Such a subdiscussion would be roughly equivalent to enumerating the rules by which existing approaches encode various contexts and concepts and then establishing a uniform syntax to describe the common rules among them. However, as Peter and Gregor (at the meeting) indicated, it would be desirable to permit extensions that allow observational data (i.e. things that are measured) to be recognized as distinct from, as well as related to, what we might loosely call conceptual data (representations that bear a greater degree of inference, such as 0's and 1's in phylogenetic data, leaf "shape" "ovate", or leaf color "green", etc.), so that one is in a better position to 1) verify the accuracy of the former, perhaps using alternate techniques, 2) study the relations between and properties of the two kinds of data, 3) test assumptions explicit or implicit in the latter, and 4) develop machine-assisted search/display/evaluation techniques for 1), 2), and 3) across a wide range of independent data sets in a variety of contexts.
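One way the observational/conceptual distinction might be kept explicit is sketched below in Python. Everything here is hypothetical: the record layouts, the names Observation, ConceptualState, code_binary, and is_consistent, the specimen label, and the 30 mm cutoff are all invented for illustration and are not drawn from any existing format.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Observation:
    """A measured quantity (hypothetical record layout)."""
    specimen: str
    character: str
    value: float
    unit: str

@dataclass
class ConceptualState:
    """An inferred state that keeps links to the measurements behind it."""
    specimen: str
    character: str
    state: str
    evidence: list = field(default_factory=list)

def code_binary(obs, cutoff):
    """Code a measurement as "1" (at or above cutoff) or "0" (below it).

    The cutoff is an assumption made by the person coding the data,
    which is exactly the kind of inference one may later want to test.
    """
    state = "1" if obs.value >= cutoff else "0"
    return ConceptualState(obs.specimen, obs.character, state, [obs])

def is_consistent(coded, cutoff):
    """Re-derive the state from the stored evidence and compare."""
    return all(code_binary(o, cutoff).state == coded.state
               for o in coded.evidence)

# Made-up example values.
measured = Observation("specimen_1", "leaf_length", 42.0, "mm")
coded = code_binary(measured, cutoff=30.0)
print(coded.state, is_consistent(coded, cutoff=30.0))
```

Because the coded state carries its evidence with it, changing the cutoff and re-running is_consistent shows which coded states depend on that assumption.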
If this is too radical a restructuring of the original intent of the discussion that may have been more narrowly focused (I didn't attend the initial setup meeting at Harvard that took place outside the main meeting room), I would be content to continue discussion of such general considerations among a smaller or alternate subgroup, assuming there is anyone else who feels likewise inclined. In any event, hopefully discussions of character data representations focused toward a potential emerging standard will be robust enough to include relatively novel formats, such as that involved in range data, and their conceptual relations to more traditional data types.
participants (1)
-
Stuart G. Poss