(GEN) Lexicons

Thu Dec 9 09:58:44 CET 1999

There is a subtext running in this discussion - whether part of our scope is
the creation of lexicons or standard name-spaces - that to me is causing
confusion.

For instance:

>>>From Leigh:

>Restricting states to particular options depending on the property
>in question (e.g. leaf and/or wing shape) leads back to the
>prior discussion on accepted standards for character description.
>
>Defining, and agreeing upon these standard notations/descriptions are
>A FUNDAMENTAL PART (my caps) of specifying this new format, and one that isn't
>solved simply by deciding to use XML (for example). It's part of the
>fundamental design and modelling, and is therefore something
>that should be addressed early on.

But from Leigh again:

>I'd say that the DELTA approach - of avoiding domain (i.e. zoology,
>virology, etc) specific notations in the format has worked well.
>And I think this is the level that any initial work should be pitched at.
>i.e. the data format should encode taxonomic *data* - just as DELTA
>does. Any domain specific schema can be layered on top of this, or
>include it. Begin with capturing the relevant data just as DELTA does,
>and then progress from there.

So do we or don't we? Am I misreading these two passages, or do they say
opposite things?

>>>From Gregor:

>It is true: Morphological structures may have containment
>hierarchies, but I believe that these depend strongly on the
>viewpoint of the author or user.
>
>EXAMPLE 2: Stuff can be in-between: The inflorescence contains part
>of stem, part of leaves, and all flowers. Which leaves are part of
>inflorescence and thus called bracts, and which aren't is often a
>matter of taste, school, country...
>
>Thus: there are multiple concurrent or competing hierarchies, which
>may overlap.

The only problem with competing hierarchies is if we are trying to
standardise and resolve the conflicts. If every worker resolves for their
own project what to call bracts, this is not a problem for us.

>>>From Jean-Marc:

>We are designing XML vocabularies for the description of biological
>species.

Are we? I thought we were designing a format by which such a vocabulary can
be represented.

For the record, all the current systems (DELTA, LucID, NEXUS etc) enforce
nothing lexically; they merely enforce a particular way of representing
data. Two data sets for similar groups of plants may contain entirely
different characters, or the same characters worded in different ways, or
the same characters resolved into states in different ways, or
(occasionally) identical characters. Comparing and combining datasets
automatically is thus impossible. This seems such a shame, but is it perhaps
unavoidable?
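
A small sketch of why automatic combination fails, using two invented data
sets (the character names and states below are hypothetical wording, not taken
from any real DELTA, LucID or NEXUS file). Both follow the same structure, but
a merge keyed on the wording finds nothing in common:

```python
# Two hypothetical data sets for similar plant groups.
# Same structure (character -> list of states), independently worded.
dataset_a = {
    "Leaf shape": ["ovate", "lanceolate"],
    "Ovary position": ["superior", "inferior"],
}
dataset_b = {
    "Shape of leaves": ["egg-shaped", "lance-shaped", "linear"],
    "Ovary": ["above perianth", "below perianth"],
}


def shared_characters(a, b):
    """Return the character names that appear verbatim in both data sets."""
    return sorted(set(a) & set(b))


print(shared_characters(dataset_a, dataset_b))  # prints []
```

A human can see that "Leaf shape" and "Shape of leaves" are probably the same
character, but nothing in the representation says so, and the states are not
even resolved the same way.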

Thus, if we are designing vocabularies, we are going a long way beyond
what's been attempted before.

Personally, I think designing domain-specific vocabularies will never work,
unless the domain is the individual worker or group of collaborating
workers. The popularity of lexicons is the old seductive universalism again.
Great idea, but...

There are two problems. Firstly, there are (broadly) two types of characters
used in descriptions (and keys) - let's call them comparative and diagnostic
characters. Comparative characters are the fairly general ones - e.g. leaf
shape, ovary position - the sorts of characters that one would aim to
describe consistently for all taxa in a monograph. Diagnostic characters are
special characters that are useful for separating two or more taxa (of
course, sometimes fairly general characters are diagnostic, but not always).

A real example of a diagnostic character (from Synaphea: Proteaceae):

Ovary with an apical ring of translucent glands......S. bifurcata
Ovary without glands.................................S. oulopha

Clearly, no generalised lexicon or name-space will allow for capture of such
diagnostic characters.

BUT, perhaps we can have a standardised representation for the generalised
characters using a lexicon and then use extensibility to allow user-specific
diagnostic characters? To some extent, but perhaps not...:

I will (foolishly) raise a challenge here that any generalised morphological
character that anyone can come up with (in the plant domain) will be
entirely inadequate for capturing data for some groups. For example, the
most straightforward character I can think of is

Leaves
 present
 absent

But, a diagnostic difference between Discaria pubescens and Discaria nitida
(Rhamnaceae) is the degree to which the leaves persist - in both, leaves
tend to be absent in the adult plant, but in D. pubescens they are often
completely absent while in D. nitida there are usually scattered reduced
leaves on younger branchlets. And in Podostemaceae and Utricularia there's
no guarantee that a leaf-like part is a leaf, because the conventional
differentiation of vegetative parts into leaves/stems doesn't hold.

Mother Nature's a tricky old dame, and any character definition will be
inadequate to catch her. But do we put up with the inadequacy for the
advantages that universality brings? If it means we constrain our
ability to capture data, then I'd say no.

So, I'd like to suggest that we try to develop a standardised data
representation, but put no constraints on character definitions whatsoever.
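
To make the suggestion concrete, here is a minimal sketch of what I mean,
not a proposal for the actual format. The class and method names (Character,
Dataset, score, describe) are made up for illustration. The structure is
fixed and checked; the character and state wording is free text that each
worker supplies:

```python
from dataclasses import dataclass, field


@dataclass
class Character:
    name: str          # free text, chosen by the author of the data set
    states: list[str]  # free text, chosen by the author of the data set


@dataclass
class Dataset:
    characters: list[Character] = field(default_factory=list)
    # taxon name -> {character index -> set of state indices}
    scores: dict[str, dict[int, set[int]]] = field(default_factory=dict)

    def score(self, taxon, char_index, state_index):
        """Record a state for a taxon. Only the structure is validated."""
        if not 0 <= char_index < len(self.characters):
            raise IndexError("unknown character")
        if not 0 <= state_index < len(self.characters[char_index].states):
            raise IndexError("unknown state")
        taxon_scores = self.scores.setdefault(taxon, {})
        taxon_scores.setdefault(char_index, set()).add(state_index)

    def describe(self, taxon):
        """Return one 'character: state(s)' line per scored character."""
        lines = []
        for ci, states in sorted(self.scores.get(taxon, {}).items()):
            ch = self.characters[ci]
            wording = " or ".join(ch.states[si] for si in sorted(states))
            lines.append(f"{ch.name}: {wording}")
        return lines


# The diagnostic character from the key above, entered as-is.
ds = Dataset()
ds.characters.append(
    Character("Ovary", ["with an apical ring of translucent glands",
                        "without glands"])
)
ds.score("S. bifurcata", 0, 0)
ds.score("S. oulopha", 0, 1)
print(ds.describe("S. bifurcata"))
```

Nothing here knows what an ovary or a gland is. The representation only
insists that a score points at a character and a state that the data set
itself has defined.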

Cheers - k

Beware the Universe - it bites