Types of data

Thu Nov 25 17:41:19 CET 1999

Kevin Thiele wrote:

>
> >Could one constrain further by expressing within the schema the hierarchical
> >relationships between the elements(eg 'blade' and 'petiole' as child
> >elements of leaf') or would the introduction of terminology into the
> >'standard' be a step too far?
>
> This worries me.  Is the entity/property/value schema restricted to 3
> levels? (ditto for XMLs element/attribute/value). If so, it seems too much
> of a straightjacket, especially if we want to express a hierarchy of
> elements like this. What to do about
> leaf/lamina/abaxial_surface/vein_islands/indumentum/density.
> I agree that a hierarchically structured schema may be wonderful, but in a
> biological context I wouldn't like to see it constrained at 3 levels just
> because that's what existing computer formats do.
>
> Cheers - k

I don't think there is any internal constraint within XML that would restrict
one to evaluating hierarchical relationships, if these were the most suitable, to
only 3 levels.   Some parsers may actually be constrained to work with a finite
level of nesting, but I would suspect that the limit would be much greater than 3
deep.  I would think that, as is the case for object or relational modeling, one
could instantiate one-to-many or many-to-one relationships as required.
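
As a rough illustration of the nesting point (a sketch only, not a proposed
schema): the element names below follow Kevin's path
leaf/lamina/abaxial_surface/vein_islands/indumentum/density, while the value
"sparse" and the helper depth() are invented for the example.  Python's
standard xml.etree.ElementTree parser handles the six levels without complaint.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: element names come from Kevin's example path;
# the value "sparse" is made up for illustration.
DOC = """
<leaf>
  <lamina>
    <abaxial_surface>
      <vein_islands>
        <indumentum>
          <density>sparse</density>
        </indumentum>
      </vein_islands>
    </abaxial_surface>
  </lamina>
</leaf>
"""

def depth(element):
    """Return the number of nested element levels at and below `element`."""
    return 1 + max((depth(child) for child in element), default=0)

root = ET.fromstring(DOC)
print(depth(root))  # 6

# Paths passed to find() are relative to the root element <leaf>.
value = root.find("lamina/abaxial_surface/vein_islands/indumentum/density").text
print(value)  # sparse
```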

My original point here was only that it is often difficult to decide how to
represent concepts that associate objects/entities/roles.  In one context it
would be appropriate to, say, regard a leaf as an entity, its shape as an
attribute, and ovate as a value.  However (and again this is probably not
the best example), in another context a leaf of the same plant might be regarded
as an attribute of yet another object/entity, perhaps a tree, and not considered
an entity on its own.  A particular implementation, such as one using XML, will
often provide a variety of choices for representing the underlying ideas, but until
we model/conceptualize how the various conceptual components
interoperate/relate, a particular approach can often be difficult to evaluate or
to extend.
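
To make the two views concrete, here is a minimal sketch (the function
triples() and the path notation "leaf/shape" are my own assumptions, not part
of any existing format).  The same fact, an ovate leaf, is recorded once with
the leaf as the entity and once with the leaf as an attribute of a tree.

```python
def triples(entity, attributes, path=()):
    """Flatten nested attributes into (entity, attribute-path, value) triples."""
    for name, value in attributes.items():
        if isinstance(value, dict):
            # A structured value: descend, extending the attribute path.
            yield from triples(entity, value, path + (name,))
        else:
            yield (entity, "/".join(path + (name,)), value)

# Context 1: the leaf is the entity, shape its attribute, ovate the value.
leaf_view = list(triples("leaf", {"shape": "ovate"}))

# Context 2: the tree is the entity and the leaf is one of its attributes.
tree_view = list(triples("tree", {"leaf": {"shape": "ovate"}}))

print(leaf_view)  # [('leaf', 'shape', 'ovate')]
print(tree_view)  # [('tree', 'leaf/shape', 'ovate')]
```

Neither view is wrong; which one is convenient depends on the model behind it,
which is the point above.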

Further, I didn't mean to imply that all conceptual associations among character
data types would be hierarchical.  It was probably a poor choice of words on my
part and I believe such a restriction would greatly limit the range of
investigators who might be able to contribute and substantially limit the number
of taxa or morphs that could be so characterized.  There could be a large number
of potential representations, including: trees, cyclic, acyclic, or directed
graphs, lattices, semilattices, ordered sets, partially ordered sets, Galois
connections, tensorial decompositions, etc.  My point was that I believe we need
to identify the critical conceptual relations and their context in a "standard"
model for character data and to then formulate a language (implementation [in
XML?]) for describing the most meaningful or useful among them.

This language would presumably be a superset of existing methods.  To use the
terminology of Ganter and Wille (1996.  Formal Concept Analysis.  Springer.) in
its simplest form, it might be useful to speak, as closely as possible, in terms
of formal contexts, such as K := (G, M, I), which consists of two sets G
(objects) and M (attributes) and the incidence relation I between G and M, AND
the formal concept of the context (G, M, I), namely a pair (A, B), where A is the
extent and B the intent, with A ⊆ G, B ⊆ M, A' = B and B' = A, where
A' := {m ∈ M | gIm for all g ∈ A} and B' := {g ∈ G | gIm for all m ∈ B}.
Fortunately, even for many-valued contexts, these can be represented as simple
cross tables.  Already, we have considered several, including 1) matrices of 0's
and 1's, 2) taxon by measure matrices, 3) range matrices for a given specimen,
4) specimen x location matrices, 5) species by location matrices...  No doubt
others may think of lots of other possibilities.
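
For anyone who prefers code to notation, the definitions above can be sketched
in a few lines.  The cross table is hypothetical (taxon and attribute names are
invented), and the brute-force enumeration is only meant to mirror the
definitions of A' and B', not to be an efficient algorithm.

```python
from itertools import combinations

# Hypothetical cross table: objects G (taxa), attributes M (character states),
# incidence relation I as a set of (object, attribute) pairs.
G = ["taxon_a", "taxon_b", "taxon_c"]
M = ["leaf_ovate", "leaf_hairy", "fruit_winged"]
I = {
    ("taxon_a", "leaf_ovate"), ("taxon_a", "leaf_hairy"),
    ("taxon_b", "leaf_ovate"),
    ("taxon_c", "leaf_hairy"), ("taxon_c", "fruit_winged"),
}

def intent(A):
    """A': the attributes shared by every object in A."""
    return frozenset(m for m in M if all((g, m) in I for g in A))

def extent(B):
    """B': the objects that have every attribute in B."""
    return frozenset(g for g in G if all((g, m) in I for m in B))

def concepts():
    """All formal concepts (A, B) with A' = B and B' = A, by brute force."""
    found = set()
    for r in range(len(G) + 1):
        for subset in combinations(G, r):
            B = intent(subset)          # close the object set ...
            found.add((extent(B), B))   # ... to get a concept (B', B)
    return found

print(len(concepts()))  # 6
```

For this small table the concepts include, for example, the pair
({taxon_a, taxon_b}, {leaf_ovate}): the two taxa are exactly the objects with
ovate leaves, and ovate leaves are the only attribute they share.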

A useful approach might then be to approximate the "quasi"-complete concept
lattice B*(G, M, I) of a context derived from gluing (union of) concept lattices
of a large number of relevant contexts and then identifying the most
meaningful/useful/easily encoded congruent sublattices, factors, and
implications between attributes, ignoring, at least initially, issues of scale,
closure, and incomparability.  Such an approach might thus permit working
subgroups to focus on particular forms/types of character data and their most
appropriate/useful/efficient representation.  It might also permit us to collate
useful lists of works/websites employing particular kinds of representations for
an eventual website characterizing the emerging potential standard or
recommendation.
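
A very crude sketch of the first steps, under loud assumptions: here "gluing"
is read simply as taking the union of the incidence relations of two contexts
over the same taxa (only one possible reading), and the taxa, attributes, and
function names merge() and holds() are all invented.  An implication between
attribute sets holds if every object having the premise also has the
conclusion.

```python
def merge(*contexts):
    """Union of several contexts, each a dict mapping object -> set of attributes."""
    merged = {}
    for ctx in contexts:
        for obj, attrs in ctx.items():
            merged.setdefault(obj, set()).update(attrs)
    return merged

def holds(premise, conclusion, context):
    """True if every object having all of `premise` also has all of `conclusion`."""
    return all(conclusion <= attrs
               for attrs in context.values() if premise <= attrs)

# Two hypothetical contexts over the same taxa.
morphology = {
    "taxon_a": {"leaf_ovate", "leaf_hairy"},
    "taxon_b": {"leaf_ovate"},
    "taxon_c": {"leaf_hairy"},
}
distribution = {
    "taxon_a": {"site_1"},
    "taxon_b": {"site_2"},
    "taxon_c": {"site_1", "site_2"},
}

combined = merge(morphology, distribution)
print(holds({"site_1"}, {"leaf_hairy"}, combined))  # True
print(holds({"leaf_ovate"}, {"site_1"}, combined))  # False
```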

Admittedly, such a more general view of what would be required to model
"character data" creates some tangential considerations with respect to
straightforward conversion among existing formats, certainly a worthy and
perhaps more immediately necessary goal.  Such a subdiscussion would be roughly
equivalent to enumerating the rules by which existing approaches encode various
contexts and concepts and then establishing a uniform syntax to describe the
common rules among them.  However, as Peter and Gregor (at the meeting)
indicated, it would be desirable to permit extensions that allow observational
data (i.e. things that are measured) to be recognized as distinct from, as well as
related to, what we might loosely call conceptual data (representations that
bear a greater degree of inference, such as 0's and 1's in phylogenetic data,
leaf "shape" "ovate", or leaf color "green" etc.), so that one is in a better
position to 1) verify the accuracy of the former, perhaps using alternate
techniques, 2) study the relations between and properties of the two kinds of
data, 3) test assumptions explicit or implicit in the latter, 4) develop
machine-assisted search/display/evaluation techniques for 1) 2) 3) across a wide
range of independent data sets in a variety of contexts.
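
One way to picture the distinction, purely as a sketch: keep measurements and
coded states as separate record types, and store the assumption behind each
coding explicitly so that it can be tested against the measurements.  The class
names, the 50 mm threshold, and the specimen values below are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Measurement:
    """Observational datum: something that was measured."""
    specimen: str
    character: str
    value: float
    unit: str

@dataclass(frozen=True)
class CodedState:
    """Conceptual datum: a 0/1 coding plus the assumption behind it."""
    specimen: str
    character: str
    state: int
    threshold: float  # assumed rule: state is 1 when the value exceeds this

def disagreements(measurements, codings):
    """Return (specimen, character) keys whose coding contradicts the measurement."""
    measured = {(m.specimen, m.character): m.value for m in measurements}
    out = []
    for c in codings:
        key = (c.specimen, c.character)
        if key in measured and int(measured[key] > c.threshold) != c.state:
            out.append(key)
    return out

# Invented example data.
measurements = [
    Measurement("s1", "leaf_length", 62.0, "mm"),
    Measurement("s2", "leaf_length", 41.5, "mm"),
]
codings = [
    CodedState("s1", "leaf_length", 1, 50.0),
    CodedState("s2", "leaf_length", 1, 50.0),
]
print(disagreements(measurements, codings))  # [('s2', 'leaf_length')]
```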

If this is too radical a restructuring of the original intent of the discussion
that may have been more narrowly focused (I didn't attend the initial setup
meeting at Harvard that took place outside the main meeting room), I would be
content to continue discussion of such general considerations among a smaller or
alternate subgroup, assuming there is anyone else who feels likewise inclined.
In any event, hopefully discussions of character data representations focused
toward a potential emerging standard will be robust enough to include relatively
novel formats, such as that involved in range data, and their conceptual
relations to more traditional data types.