Re: Minimalism AND functionalism

8 Sep 2000

I'm feeling a bit as though the whole world's agin' me here, but wotthehell
as Archie said to Mehitabel - toujours gai, toujours gai.

There seem to be some huge misapprehensions about what I'm suggesting (and
about where Bryan's coming from also, I think), summed up by:

from Mike:
...
...
...
producing comparative data is difficult. Nevertheless,
shouldn't this be one of the main objectives of taxonomy?
(to which I'd say "What, being difficult?")
...
My comments were in response to Kevin Thiele's opinion:
...
...
I think part of the basic problem is [trying] to force too much structure
and while this is a great promise it's been an impediment in practice.
...
'Structure' apparently means a character list, i.e. the basis for producing
comparative data. Although Kevin says elsewhere that both structured and
unstructured data should be allowed (and I agree - they are allowed in
DELTA), the above statement seems to suggest that the 'impediment' should be
avoided by encouraging the use of non-comparative data.

and from Joe Kirkbride:
...
I want rigor and compability in descriptions, keys, and data, not a retreat
to the chaos of the past.

I am NOT suggesting that comparability of data is not a good idea, and I am
NOT suggesting a retreat to chaos. In fact, it seems to me that what I am
suggesting is more trivial and innocuous than you would imagine from the
responses, and differs from current practice only in being a little better!

The issue to me comes down to two main options: either we create an
exclusive standard that enforces a great deal and leaves most descriptive
data out in the cold, or we create an inclusive one that enforces less but
allows more. It's an interesting issue and a philosophical one, so here goes
with a short ramble on the topic.

In option 1 we would create a standard (S1) similar to the existing Lucid
and DELTA file formats, just with some enhancements and a more modern (XML?)
structure (I suppose there's also an Option 0 in which we just go with Lucid
or DELTA pretty much as is).
S1 would *require* that a description have a header that includes (for
instance) a character list, a taxon list and some type of item scoring.
Simplest here would be to use coded scoring much like DELTA uses now -
there's little point really in having:

<DOCUMENT>
    <CHARACTER_LIST>
        <CHARACTER Character_ID = "1" Character_Name = "Leaves">
            <STATE State_ID = "1" State_Name = "present"/>
            <STATE State_ID = "2" State_Name = "absent"/>
        </CHARACTER>
    </CHARACTER_LIST>
    <TAXON_LIST>
        <TAXON Taxon_ID = "1" Taxon_Name = "Viola eminens"/>
    </TAXON_LIST>
    <DESCRIPTION Taxon_Name = "Viola eminens">
        <CHARACTER Character_Name = "Leaves">
            <STATE State_Name = "present"/>
        </CHARACTER>
    </DESCRIPTION>
</DOCUMENT>

..you'd be better off with..

<DOCUMENT>
    <CHARACTER_LIST>
        <CHARACTER Character_ID = "1" Character_Name = "Leaves">
            <STATE State_ID = "1" State_Name = "present"/>
            <STATE State_ID = "2" State_Name = "absent"/>
        </CHARACTER>
    </CHARACTER_LIST>
    <TAXON_LIST>
        <TAXON Taxon_ID = "1" Taxon_Name = "Viola eminens"/>
    </TAXON_LIST>
    <DESCRIPTION Taxon_ID = "1">
        <CHARACTER ID = "1"><VALUE ID = "1"/></CHARACTER>
    </DESCRIPTION>
</DOCUMENT>

S1 would enforce standardization (e.g. comparability) of descriptions
(within a document), would allow all sorts of validations, would be
moderately rigorous, and would probably be used by about the same number of
people as use Lucid and DELTA today.
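
To make "all sorts of validations" a bit more concrete, here is a rough Python sketch of one such check: every code used in a description must resolve against the character and taxon lists. It is only an illustration under assumptions. CHARACTER_LIST and TAXON_LIST are well-formed stand-ins for the tag names in the examples, and the function name validate_s1 is made up for the purpose.

```python
import xml.etree.ElementTree as ET

# Hypothetical S1-style document, adapted from the coded example above.
DOC = """
<DOCUMENT>
    <CHARACTER_LIST>
        <CHARACTER Character_ID="1" Character_Name="Leaves">
            <STATE State_ID="1" State_Name="present"/>
            <STATE State_ID="2" State_Name="absent"/>
        </CHARACTER>
    </CHARACTER_LIST>
    <TAXON_LIST>
        <TAXON Taxon_ID="1" Taxon_Name="Viola eminens"/>
    </TAXON_LIST>
    <DESCRIPTION Taxon_ID="1">
        <CHARACTER ID="1"><VALUE ID="1"/></CHARACTER>
    </DESCRIPTION>
</DOCUMENT>
"""


def validate_s1(xml_text):
    """Return a list of problems; an empty list means every code resolves."""
    root = ET.fromstring(xml_text)
    # Map each declared character ID to the set of its declared state IDs.
    states = {
        c.get("Character_ID"): {s.get("State_ID") for s in c.findall("STATE")}
        for c in root.findall("CHARACTER_LIST/CHARACTER")
    }
    taxa = {t.get("Taxon_ID") for t in root.findall("TAXON_LIST/TAXON")}
    problems = []
    for desc in root.findall("DESCRIPTION"):
        if desc.get("Taxon_ID") not in taxa:
            problems.append(f"unknown taxon {desc.get('Taxon_ID')}")
        for char in desc.findall("CHARACTER"):
            cid = char.get("ID")
            if cid not in states:
                problems.append(f"unknown character {cid}")
                continue
            for value in char.findall("VALUE"):
                if value.get("ID") not in states[cid]:
                    problems.append(
                        f"unknown state {value.get('ID')} for character {cid}"
                    )
    return problems


print(validate_s1(DOC))
```

Note that this only checks consistency within one document, which is exactly the limit of what S1 enforces.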

In option 2 we would create a standard (S2) similar in almost all respects
to S1, merely with the difference that it would allow but not enforce a
character list and taxon list etc. Both examples given above would be valid
under S2, but so also would

<DOCUMENT>
    <DESCRIPTION Taxon_Name = "Viola hederacea">
        <CHARACTER Character_Name = "Leaves">
            <STATE State_Name = "present"/>
        </CHARACTER>
    </DESCRIPTION>
</DOCUMENT>

As I said, this actually seems to me to be fairly innocuous. But it has
interesting implications.

First, to clear up another misapprehension, under S2 I am NOT suggesting
that every description should be a separate document, and I would expect
that this would rarely be the case. S2 just doesn't force the issue.

The main objection to S2 seems to be that you could have another document
similar to the one above that would not be fully comparable e.g.

<DOCUMENT>
    <DESCRIPTION Taxon_Name = "Viola banksii">
        <CHARACTER Character_Name = "Foliage">
            <STATE State_Name = "present"/>
        </CHARACTER>
    </DESCRIPTION>
</DOCUMENT>

This is true. But does S1 (or current practice) get around this? Not at
all - you could just as easily have two documents under S1 (or two
treatments under Lucid or DELTA) that are equally incomparable. The only way
to really get around the problem of comparability is to have a universal
lexicon and to force everyone to use the same characters and states. We've
discussed this before, and I think the consensus was that it's a neat idea,
but....

The interesting thing about both S1 and S2 is that neither precludes the
possibility of the development of a lexicon, universal or local. In the
draft standard I proposed that a document COULD have a character/state list,
and that this could be embedded in the document or could be an external
resource. An external character/state list would be a lexicon if several
documents refer to it, and it could even be a universal one if everyone
referred to it. Again, the standard just wouldn't enforce this.
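
As a rough illustration of an external character/state list acting as a lexicon, the Python sketch below decodes two separate documents against one shared list. Everything here is an assumption made for the example: the tag name CHARACTER_LIST, the function names, and the scores themselves. How a document would actually point at its external list is left open.

```python
import xml.etree.ElementTree as ET

# Hypothetical shared character/state list, kept outside any one document.
LEXICON = """
<CHARACTER_LIST>
    <CHARACTER Character_ID="1" Character_Name="Leaves">
        <STATE State_ID="1" State_Name="present"/>
        <STATE State_ID="2" State_Name="absent"/>
    </CHARACTER>
</CHARACTER_LIST>
"""

# Two independent documents that score against the same list (made-up scores).
DOC_A = """<DOCUMENT><DESCRIPTION Taxon_Name="Viola hederacea">
    <CHARACTER ID="1"><VALUE ID="1"/></CHARACTER>
</DESCRIPTION></DOCUMENT>"""
DOC_B = """<DOCUMENT><DESCRIPTION Taxon_Name="Viola banksii">
    <CHARACTER ID="1"><VALUE ID="1"/></CHARACTER>
</DESCRIPTION></DOCUMENT>"""


def load_lexicon(xml_text):
    """Map character ID -> (character name, {state ID: state name})."""
    root = ET.fromstring(xml_text)
    return {
        c.get("Character_ID"): (
            c.get("Character_Name"),
            {s.get("State_ID"): s.get("State_Name") for s in c.findall("STATE")},
        )
        for c in root.findall("CHARACTER")
    }


def decode(doc_text, lexicon):
    """Turn coded scores into {taxon name: {character name: [state names]}}."""
    decoded = {}
    for desc in ET.fromstring(doc_text).findall("DESCRIPTION"):
        scores = {}
        for char in desc.findall("CHARACTER"):
            name, state_names = lexicon[char.get("ID")]
            scores[name] = [state_names[v.get("ID")] for v in char.findall("VALUE")]
        decoded[desc.get("Taxon_Name")] = scores
    return decoded


lexicon = load_lexicon(LEXICON)
print(decode(DOC_A, lexicon))
print(decode(DOC_B, lexicon))
```

Because both documents resolve through the same list, their characters come out under the same names and can be compared directly.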

An interesting aside is that DELTA also allows lexica in much the same way.
We could have had such a successful CHARS file developed 30 years ago that
everyone's used the same file ever since. It just hasn't happened.

The critical difference between S1 and S2 is in the degree of allowance for
variation in practice. S1, following the Lucid and DELTA model, would allow
one data structure only to be regarded as valid under the standard - it
would enforce comparability in structure (although not in content as
discussed above). All other descriptions would be deemed not sufficiently
rigorous and discounted until their authors have done the extra work to
format them accordingly which, as Bryan points out for legacy data, will
probably never happen. S2, on the other hand, would be looser in the sense
that less highly structured documents would be allowed. Note here that these
documents would be less structured, not unstructured.

How does this compare with what we have now? Currently, a few descriptions
on the web are highly structured DELTA documents, and the vast majority are
completely unstructured blobs from which little or nothing can be recovered
except perhaps that they include one or more words somewhere within them.
That is, the vast majority of descriptions are left out in the cold. Now Les
Watson's sentiments expressing his frustrations about these documents 30
years ago are relevant and noble, and in the best of all possible worlds all
descriptions today or tomorrow would be fully structured in the way that Les
suggested. S2 doesn't change any of this, but it does provide a
stepping stone from unstructured to structured. S1 also doesn't change any
of this; it merely provides no stepping stone - it goes for broke. It seems
to me that without the stepping stone, most descriptions will stay in the
cold.

Jim Croft has assayed that what I'm suggesting is impossible:

| A definition/specification that can accommodate both approaches would be
| nice, but it is very unlikely that we will be able to fully resolve the
| internal tension between rigour/structure and freedom/flexibility.  They
| are incompatible and even if we can formulate a specification to handle
| both approaches, at the end of the day people have to apply the specs, and
| some will be control freaks, some will be anarchists and others will be
| schizophrenic - it is difficult to imagine a real conduit between the
| extremes.

But it seems to me that the difference between S1 and S2 in some ways is
trivial - both would include the same elements and properties, but S1 would
make more of these required while in S2 a minimal set would be required and
more would be optional. Is this a return to the chaos of the past?
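
To show how small the difference is, here is a rough Python sketch in which S1 and S2 share one vocabulary and differ only in a table of required elements. The REQUIRED table, the function name conforms and the underscored tag names are all assumptions for the sake of illustration, not part of any draft.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule tables: same elements, different sets made mandatory.
REQUIRED = {
    "S1": ("CHARACTER_LIST", "TAXON_LIST", "DESCRIPTION"),
    "S2": ("DESCRIPTION",),
}

# A document with character and taxon lists, as in the S1 examples.
FULL = """<DOCUMENT>
    <CHARACTER_LIST>
        <CHARACTER Character_ID="1" Character_Name="Leaves">
            <STATE State_ID="1" State_Name="present"/>
            <STATE State_ID="2" State_Name="absent"/>
        </CHARACTER>
    </CHARACTER_LIST>
    <TAXON_LIST>
        <TAXON Taxon_ID="1" Taxon_Name="Viola eminens"/>
    </TAXON_LIST>
    <DESCRIPTION Taxon_ID="1">
        <CHARACTER ID="1"><VALUE ID="1"/></CHARACTER>
    </DESCRIPTION>
</DOCUMENT>"""

# A document with a description only, as in the S2 example.
BARE = """<DOCUMENT>
    <DESCRIPTION Taxon_Name="Viola hederacea">
        <CHARACTER Character_Name="Leaves">
            <STATE State_Name="present"/>
        </CHARACTER>
    </DESCRIPTION>
</DOCUMENT>"""


def conforms(xml_text, standard):
    """True if every element the given standard requires is present."""
    root = ET.fromstring(xml_text)
    return all(root.find(tag) is not None for tag in REQUIRED[standard])


for label, doc in (("full", FULL), ("bare", BARE)):
    print(label, {std: conforms(doc, std) for std in REQUIRED})
```

The fully structured document passes under both; the bare description passes only under S2.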

Here are some examples.

Let's say that out there in webland there is a document

<DOCUMENT>
Viola eminens K. Thiele & Prober, sp. nov.
Perennial herb spreading by stolons; rootstock sometimes somewhat swollen
and bulbous at the stem bases. Stems contracted so that the leaves form
rosettes, never elongate with caulescent leaves. Leaves broad-reniform, the
largest (10-)12-15(-25) mm long, (20-)25-35(-45) mm wide, 1.5-3.2 times
wider than long, usually with a broad basal sinus; lamina with 9-20 +/-
prominent teeth, glabrous or with scattered unicellular hairs on the upper
surface, +/- concolorous bright green; petioles 2-8 cm long; stipules
narrowly triangular, usually with several small, glandular teeth on each
side. Flowers ... etc
</DOCUMENT>

Currently, the best we could tell of this document is that it contains
various words, amongst which are "Viola", "eminens" etc. If we were
searching the web for descriptive data for V. eminens, we would perhaps hit
upon this document, but we couldn't distinguish it from this one:

<DOCUMENT>
Hi Mum,
the garden's really growing well this spring, and that Viola eminens you
sent me is flowering beautifully
much love, Kevin
</DOCUMENT>

Let's say that the descriptive data standard S2 has in it only one absolute
requirement, which is that a description must be tagged and named. Our first
document becomes:

<DOCUMENT>
<DESCRIPTION Name = "Viola eminens">
Viola eminens K. Thiele & Prober, sp. nov.
Perennial herb spreading by stolons; rootstock sometimes somewhat swollen
and bulbous at the stem bases. Stems contracted so that the leaves form
rosettes, never elongate with caulescent leaves. Leaves broad-reniform, the
largest (10-)12-15(-25) mm long, (20-)25-35(-45) mm wide, 1.5-3.2 times
wider than long, usually with a broad basal sinus; lamina with 9-20 +/-
prominent teeth, glabrous or with scattered unicellular hairs on the upper
surface, +/- concolorous bright green; petioles 2-8 cm long; stipules
narrowly triangular, usually with several small, glandular teeth on each
side. Flowers ... etc
</DESCRIPTION>
</DOCUMENT>

This simple thing is already a huge advance, because we can now sift out
this description from amongst all other documents containing the key words.
Sure the stuff between <DESCRIPTION> and </DESCRIPTION> is blob text and a
computer can't do much with it. But we've still got somewhere.
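
As a rough sketch of what that sifting might look like (Python; the function names are made up, the description text is abbreviated, and the ampersand is escaped so the sample parses as XML):

```python
import xml.etree.ElementTree as ET

# The tagged description, abbreviated.
TAGGED = """<DOCUMENT>
<DESCRIPTION Name="Viola eminens">
Viola eminens K. Thiele &amp; Prober, sp. nov.
Perennial herb spreading by stolons; ... etc
</DESCRIPTION>
</DOCUMENT>"""

# The letter, which mentions the same name but describes nothing.
LETTER = """<DOCUMENT>
Hi Mum,
the garden's really growing well this spring, and that Viola eminens you
sent me is flowering beautifully
much love, Kevin
</DOCUMENT>"""


def keyword_hit(xml_text, phrase):
    """Plain text search: all that is possible without any tagging."""
    return phrase in "".join(ET.fromstring(xml_text).itertext())


def described_taxa(xml_text):
    """Names of the taxa a document explicitly claims to describe."""
    root = ET.fromstring(xml_text)
    return [d.get("Name") for d in root.iter("DESCRIPTION") if d.get("Name")]


for label, doc in (("tagged", TAGGED), ("letter", LETTER)):
    print(label, keyword_hit(doc, "Viola eminens"), described_taxa(doc))
```

Both documents match the keyword search, but only the tagged one reports a described taxon.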

Now the important thing about S2 is that it doesn't leave it there. It says,
in effect, "OK, if you want to provide more structure for these data, follow
these rules..." Our document could now become something like:

<DOCUMENT>
<DESCRIPTION Name = "Viola eminens">
Viola eminens K. Thiele & Prober, sp. nov.
<ELEMENT Name = "Longevity"><VALUE>Perennial </VALUE></ELEMENT>
<ELEMENT Name = "Habit"><VALUE>herb</VALUE></ELEMENT> spreading by stolons;
rootstock sometimes
somewhat swollen and bulbous at the stem bases. Stems contracted so that the
leaves form rosettes, never elongate with caulescent leaves. Leaves
broad-reniform, the largest (10-)12-15(-25) mm long, (20-)25-35(-45) mm
wide, 1.5-3.2 times wider than long, usually with a broad basal sinus;
lamina with 9-20 +/- prominent teeth, glabrous or with scattered unicellular
hairs on the upper surface, +/- concolorous bright green; petioles 2-8 cm
long; stipules narrowly triangular, usually with several small, glandular
teeth on each side. Flowers ... etc
</DESCRIPTION>
</DOCUMENT>

Another huge advance, because now we can parse the document and extract bits
of descriptive data (note also the entertaining point that we could ignore
the tags and render our document as the original natural language).
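
Both halves of that can be sketched in a few lines of Python. This is illustrative only: the function names are invented, the text is abbreviated, and the sample closes the Habit element straight after its value, which is an assumption about where that element should end.

```python
import xml.etree.ElementTree as ET

# Abbreviated marked-up description.
MARKED_UP = (
    '<DESCRIPTION Name="Viola eminens">'
    '<ELEMENT Name="Longevity"><VALUE>Perennial </VALUE></ELEMENT>'
    '<ELEMENT Name="Habit"><VALUE>herb</VALUE></ELEMENT>'
    ' spreading by stolons; ... etc'
    '</DESCRIPTION>'
)


def extract(xml_text):
    """Pull out the marked-up bits as {element name: [values]}."""
    root = ET.fromstring(xml_text)
    return {
        e.get("Name"): [(v.text or "").strip() for v in e.findall("VALUE")]
        for e in root.iter("ELEMENT")
    }


def render(xml_text):
    """Ignore the tags and return the natural-language text."""
    return "".join(ET.fromstring(xml_text).itertext())


print(extract(MARKED_UP))
print(render(MARKED_UP))
```

The same document serves both purposes: structured data for a program, and the original prose for a reader.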

But our document still hasn't reached the gold standard. So, S2 says "So
much for straight markup, if you now want to make the document more
standardised, follow these rules to include a character list...". And you
could do just that. Or, you could include a reference to an external
character list (a lexicon) and do the work to force your document through
Bryan's narrow pipe. At somewhere around this stage the document becomes
valid input for, say, Lucid or IntKey. Or, you could start from scratch and
create a document just like current Lucid and DELTA documents. Then S2 would
say "OK, you have a structured document, but what about annotations as to
where the bits of data came from - to include those, follow these rules..."
or "if your name's Peter Stevens and you want to link this document to
another that stores information for specimens and describe some rules for
converting leaf measurements to standardised shapes, follow these rules..."
and now S2 is way beyond current best practice. So much for it being weak!

Consider now S1. It says "Sorry mate, your description's bloody useless,
it's just not up to scratch, go away and don't come back until you've
completely broken it up into bits, with character and taxon lists, and then
we'll talk turkey."

This is what I mean by S1 leaving most descriptions out in the cold. S2
provides incremental steps for improving the structure of documents. S1 goes
for broke the first time. Does this make S2 weak, or somehow threaten the
basis of systematics? Under S2, if you're a Lucid freak or DELTAHead you can
create, publish and exchange your Lucid or DELTA documents just as you do
now. But if you're not (and most taxonomists aren't) it provides a way of
improving the structure of your data by following a standardised set of
rules. So Jim, it's not a standard that isn't a standard.

So tell me, does *anyone* out there agree with me?

Cheers - k