I'm feeling a bit as though the whole world's agin' me here, but wotthehell as Archie said to Mehitabel - toujours gai, toujours gai.
There seem to be some huge misapprehensions about what I'm suggesting (and about where Bryan's coming from also, I think), summed up by:
from Mike:
producing comparative data is difficult. Nevertheless, shouldn't this be one of the main objectives of taxonomy?
(to which I'd say "What, being difficult?")
My comments were in response to Kevin Thiele's opinion:
I think part of the basic problem is [trying] to force too much structure and while this is a great promise it's been an impediment in practice.
'Structure' apparently means a character list, i.e. the basis for producing comparative data. Although Kevin says elsewhere that both structured and unstructured data should be allowed (and I agree - they are allowed in DELTA), the above statement seems to suggest that the 'impediment' should be avoided by encouraging the use of non-comparative data.
and from Joe Kirkbride:
I want rigor and compability in descriptions, keys, and data, not a retreat to the chaos of the past.
I am NOT suggesting that comparability of data is not a good idea, and I am NOT suggesting a retreat to chaos. In fact, it seems to me that what I am suggesting is more trivial and innocuous than you would imagine from the responses, and differs from current practice only in being a little better!
The issue to me comes down to two main options: either we create an exclusive standard that enforces a great deal and leaves most descriptive data out in the cold, or we create an inclusive one that enforces less but allows more. It's an interesting issue and a philosophical one, so here goes with a short ramble on the topic.
In option 1 we would create a standard (S1) similar to the existing Lucid and DELTA file formats, just with some enhancements and a more modern (XML?) structure (I suppose there's also an Option 0 in which we just go with Lucid or DELTA pretty much as is). S1 would *require* that a description have a header that includes (for instance) a character list, a taxon list and some type of item scoring. Simplest here would be to use coded scoring much like DELTA uses now - there's little point really in having:
<DOCUMENT>
  <CHARACTER_LIST>
    <CHARACTER Character_ID = "1" Character_Name = "Leaves">
      <STATE State_ID = "1" State_Name = "present"/>
      <STATE State_ID = "2" State_Name = "absent"/>
    </CHARACTER>
  </CHARACTER_LIST>
  <TAXON_LIST>
    <TAXON Taxon_ID = "1" Taxon_Name = "Viola eminens"/>
  </TAXON_LIST>
  <DESCRIPTION Taxon_Name = "Viola eminens">
    <CHARACTER Character_Name = "Leaves">
      <STATE State_Name = "present"/>
    </CHARACTER>
  </DESCRIPTION>
</DOCUMENT>
..you'd be better off with..
<DOCUMENT>
  <CHARACTER_LIST>
    <CHARACTER Character_ID = "1" Character_Name = "Leaves">
      <STATE State_ID = "1" State_Name = "present"/>
      <STATE State_ID = "2" State_Name = "absent"/>
    </CHARACTER>
  </CHARACTER_LIST>
  <TAXON_LIST>
    <TAXON Taxon_ID = "1" Taxon_Name = "Viola eminens"/>
  </TAXON_LIST>
  <DESCRIPTION Taxon_ID = "1">
    <CHARACTER ID = "1"><VALUE ID = "1"/></CHARACTER>
  </DESCRIPTION>
</DOCUMENT>
S1 would enforce standardization (e.g. comparability) of descriptions (within a document), would allow all sorts of validations, would be moderately rigorous, and would probably be used by about the same number of people as use Lucid and DELTA today.
In option 2 we would create a standard (S2) similar in almost all respects to S1, merely with the difference that it would allow but not enforce a character list and taxon list etc. Both examples given above would be valid under S2, but so also would
<DOCUMENT>
  <DESCRIPTION Taxon_Name = "Viola hederacea">
    <CHARACTER Character_Name = "Leaves">
      <STATE State_Name = "present"/>
    </CHARACTER>
  </DESCRIPTION>
</DOCUMENT>
As I said, this actually seems to me to be fairly innocuous. But it has interesting implications.
First, to clear up another misapprehension, under S2 I am NOT suggesting that every description should be a separate document, and I would expect that this would rarely be the case. S2 just doesn't force the issue.
The main objection to S2 seems to be that you could have another document similar to the one above that would not be fully comparable e.g.
<DOCUMENT>
  <DESCRIPTION Taxon_Name = "Viola banksii">
    <CHARACTER Character_Name = "Foliage">
      <STATE State_Name = "present"/>
    </CHARACTER>
  </DESCRIPTION>
</DOCUMENT>
This is true. But does S1 (or current practice) get around this? Not at all - you could just as easily have two documents under S1 (or two treatments under Lucid or DELTA) that are equally incomparable. The only way to really get around the problem of comparability is to have a universal lexicon and to force everyone to use the same characters and states. We've discussed this before, and I think the consensus was that it's a neat idea, but....
The interesting thing about both S1 and S2 is that neither precludes the possibility of the development of a lexicon, universal or local. In the draft standard I proposed that a document COULD have a character/state list, and that this could be embedded in the document or could be an external resource. An external character/state list would be a lexicon if several documents refer to it, and it could even be a universal one if everyone referred to it. Again, the standard just wouldn't enforce this.
An interesting aside is that DELTA also allows lexica in much the same way. We could have had such a successful CHARS file developed 30 years ago that everyone's used the same file ever since. It just hasn't happened.
The critical difference between S1 and S2 is in the degree of allowance for variation in practice. S1, following the Lucid and DELTA model, would allow one data structure only to be regarded as valid under the standard - it would enforce comparability in structure (although not in content as discussed above). All other descriptions would be deemed not sufficiently rigorous and discounted until their authors have done the extra work to format them accordingly which, as Bryan points out for legacy data, will probably never happen. S2, on the other hand, would be looser in the sense that less highly structured documents would be allowed. Note here that these documents would be less structured, not unstructured.
How does this compare with what we have now? Currently, a few descriptions on the web are highly structured DELTA documents, and the vast majority are completely unstructured blobs from which little or nothing can be recovered except perhaps that they include one or more words somewhere within them. That is, the vast majority of descriptions are left out in the cold. Now Les Watson's sentiments expressing his frustrations about these documents 30 years ago are relevant and noble, and in the best of all possible worlds all descriptions today or tomorrow would be fully structured in the way that Les suggested. S2 doesn't change any of this, but it does provide a stepping stone from unstructured to structured. S1 also doesn't change any of this; it merely provides no stepping stone - it goes for broke. It seems to me that without the stepping stone, most descriptions will stay in the cold.
Jim Croft has assayed that what I'm suggesting is impossible:
| A definition/specification that can accommodate both approaches would be
| nice, but it is very unlikely that we will be able to fully resolve the
| internal tension between rigour/structure and freedom/flexibility. They
| are incompatible and even if we can formulate a specification to handle
| both approaches, at the end of the day people have to apply the specs, and
| some will be control freaks, some will be anarchists and others will be
| schizophrenic - it is difficult to imagine a real conduit between the extremes.
But it seems to me that the difference between S1 and S2 in some ways is trivial - both would include the same elements and properties, but S1 would make more of these required while in S2 a minimal set would be required and more would be optional. Is this a return to the chaos of the past?
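To show how small that difference is, here is a rough Python sketch of a checker for both options. It is an illustration only, not part of any draft standard: the function name `is_valid`, the underscored element names (CHARACTER_LIST, TAXON_LIST, used so the XML parses) and the exact set of required pieces are all assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical: the extra elements that S1 would require and S2 would only allow.
REQUIRED_UNDER_S1 = ("CHARACTER_LIST", "TAXON_LIST")


def is_valid(xml_text, standard="S2"):
    """Return True if the document passes the sketched S1 or S2 rules."""
    root = ET.fromstring(xml_text)
    if root.tag != "DOCUMENT":
        return False
    descriptions = root.findall("DESCRIPTION")
    # Both options: at least one description, each naming its taxon somehow.
    if not descriptions:
        return False
    for description in descriptions:
        if not (description.get("Taxon_Name") or description.get("Taxon_ID")):
            return False
    # Only S1 insists on the character list and taxon list.
    if standard == "S1":
        return all(root.find(tag) is not None for tag in REQUIRED_UNDER_S1)
    return True


FULL = (
    '<DOCUMENT>'
    '<CHARACTER_LIST><CHARACTER Character_ID="1" Character_Name="Leaves">'
    '<STATE State_ID="1" State_Name="present"/>'
    '<STATE State_ID="2" State_Name="absent"/>'
    '</CHARACTER></CHARACTER_LIST>'
    '<TAXON_LIST><TAXON Taxon_ID="1" Taxon_Name="Viola eminens"/></TAXON_LIST>'
    '<DESCRIPTION Taxon_ID="1"><CHARACTER ID="1"><VALUE ID="1"/></CHARACTER>'
    '</DESCRIPTION></DOCUMENT>'
)
BARE = (
    '<DOCUMENT><DESCRIPTION Taxon_Name="Viola hederacea">'
    '<CHARACTER Character_Name="Leaves"><STATE State_Name="present"/></CHARACTER>'
    '</DESCRIPTION></DOCUMENT>'
)

for name, doc in (("FULL", FULL), ("BARE", BARE)):
    print(name, "S1:", is_valid(doc, "S1"), "S2:", is_valid(doc, "S2"))
```

The same code path handles both options; the only thing that changes is how many elements are on the required list.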
Here are some examples.
Let's say that out there in webland there is a document
<DOCUMENT> Viola eminens K. Thiele & Prober, sp. nov. Perennial herb spreading by stolons; rootstock sometimes somewhat swollen and bulbous at the stem bases. Stems contracted so that the leaves form rosettes, never elongate with caulescent leaves. Leaves broad-reniform, the largest (10-)12-15(-25) mm long, (20-)25-35(-45) mm wide, 1.5-3.2 times wider than long, usually with a broad basal sinus; lamina with 9-20 +/- prominent teeth, glabrous or with scattered unicellular hairs on the upper surface, +/- concolorous bright green; petioles 2-8 cm long; stipules narrowly triangular, usually with several small, glandular teeth on each side. Flowers ... etc </DOCUMENT>
Currently, the best we could tell of this document is that it contains various words, amongst which are "Viola", "eminens" etc. If we were searching the web for descriptive data for V. eminens, we would perhaps hit upon this document, but we couldn't distinguish it from this one:
<DOCUMENT> Hi Mum, the garden's really growing well this spring, and that Viola eminens you sent me is flowering beautifully much love, Kevin </DOCUMENT>
Let's say that the descriptive data standard S2 has in it only one absolute requirement, which is that a description must be tagged and named. Our first document becomes:
<DOCUMENT> <DESCRIPTION Name = "Viola eminens"> Viola eminens K. Thiele & Prober, sp. nov. Perennial herb spreading by stolons; rootstock sometimes somewhat swollen and bulbous at the stem bases. Stems contracted so that the leaves form rosettes, never elongate with caulescent leaves. Leaves broad-reniform, the largest (10-)12-15(-25) mm long, (20-)25-35(-45) mm wide, 1.5-3.2 times wider than long, usually with a broad basal sinus; lamina with 9-20 +/- prominent teeth, glabrous or with scattered unicellular hairs on the upper surface, +/- concolorous bright green; petioles 2-8 cm long; stipules narrowly triangular, usually with several small, glandular teeth on each side. Flowers ... etc </DESCRIPTION> </DOCUMENT>
This simple thing is already a huge advance, because we can now sift out this description from amongst all other documents containing the key words. Sure, the stuff between <DESCRIPTION> and </DESCRIPTION> is blob text and a computer can't do much with it. But we've still got somewhere.
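As a rough illustration of that sifting, here is a minimal Python sketch. The function name `described_taxa` and the two shortened sample documents are made up for the example, and it assumes the documents are well-formed XML.

```python
import xml.etree.ElementTree as ET


def described_taxa(xml_text):
    """Return the names of all tagged descriptions in a document ([] if none)."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        # Not parseable as XML at all: treat it as containing no descriptions.
        return []
    return [d.get("Name") for d in root.iter("DESCRIPTION") if d.get("Name")]


# Shortened stand-ins for the two documents discussed above.
DESCRIPTION_DOC = (
    '<DOCUMENT><DESCRIPTION Name="Viola eminens">'
    'Perennial herb spreading by stolons ...'
    '</DESCRIPTION></DOCUMENT>'
)
LETTER_DOC = (
    "<DOCUMENT>Hi Mum, that Viola eminens you sent me is flowering "
    "beautifully</DOCUMENT>"
)

# Both documents contain the words, but only one is a tagged description.
hits = [
    doc for doc in (DESCRIPTION_DOC, LETTER_DOC)
    if "Viola eminens" in described_taxa(doc)
]
print(len(hits))
```

A plain keyword search would return both documents; the tag-aware search keeps only the description.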
Now the important thing about S2 is that it doesn't leave it there. It says, in effect, "OK, if you want to provide more structure for these data, follow these rules..." Our document could now become something like:
<DOCUMENT> <DESCRIPTION Name = "Viola eminens"> Viola eminens K. Thiele & Prober, sp. nov. <ELEMENT Name = "Longevity"><VALUE>Perennial </VALUE></ELEMENT> <ELEMENT Name = "Habit"><VALUE>herb</VALUE></ELEMENT> spreading by stolons; rootstock sometimes somewhat swollen and bulbous at the stem bases. Stems contracted so that the leaves form rosettes, never elongate with caulescent leaves. Leaves broad-reniform, the largest (10-)12-15(-25) mm long, (20-)25-35(-45) mm wide, 1.5-3.2 times wider than long, usually with a broad basal sinus; lamina with 9-20 +/- prominent teeth, glabrous or with scattered unicellular hairs on the upper surface, +/- concolorous bright green; petioles 2-8 cm long; stipules narrowly triangular, usually with several small, glandular teeth on each side. Flowers ... etc </DESCRIPTION> </DOCUMENT>
Another huge advance, because now we can parse the document and extract bits of descriptive data (note also the entertaining thing that we could also ignore the tags and render our document as the original natural language).
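Both halves of that claim can be sketched in a few lines of Python. This is illustrative only: the helper names are invented, the sample is a shortened version of the marked-up description, and it assumes every ELEMENT is properly closed around its VALUE.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed stand-in for the marked-up description.
MARKED_UP = (
    '<DOCUMENT><DESCRIPTION Name="Viola eminens">'
    '<ELEMENT Name="Longevity"><VALUE>Perennial </VALUE></ELEMENT>'
    '<ELEMENT Name="Habit"><VALUE>herb</VALUE></ELEMENT>'
    ' spreading by stolons; ...'
    '</DESCRIPTION></DOCUMENT>'
)


def extract_values(description):
    """Map each tagged ELEMENT name to the text of its VALUE."""
    return {
        element.get("Name"): element.findtext("VALUE", default="").strip()
        for element in description.iter("ELEMENT")
    }


def as_natural_language(description):
    """Drop the tags and return the running text in document order."""
    return "".join(description.itertext())


description = ET.fromstring(MARKED_UP).find("DESCRIPTION")
print(extract_values(description))
print(as_natural_language(description))
```

The untagged remainder of the description stays where it is as plain text, so marking up a few elements costs nothing in readability.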
But our document still hasn't reached the gold standard. So, S2 says "So much for straight markup, if you now want to make the document more standardised, follow these rules to include a character list...". And you could do just that. Or, you could include a reference to an external character list (a lexicon) and do the work to force your document through Bryan's narrow pipe. At somewhere around this stage the document becomes valid input for, say, Lucid or IntKey. Or, you could start from scratch and create a document just like current Lucid and DELTA documents. Then S2 would say "OK, you have a structured document, but what about annotations as to where the bits of data came from - to include those, follow these rules..." or "if your name's Peter Stevens and you want to link this document to another that stores information for specimens and describe some rules for converting leaf measurements to standardised shapes, follow these rules..." and now S2 is way beyond current best practice. So much for it being weak!
Consider now S1. It says "Sorry mate, your description's bloody useless, it's just not up to scratch, go away and don't come back until you've completely broken it up into bits, with character and taxon lists, and then we'll talk turkey"
This is what I mean by S1 leaving most descriptions out in the cold. S2 provides incremental steps for improving the structure of documents. S1 goes for broke the first time. Does this make S2 weak, or somehow threaten the basis of systematics? Under S2, if you're a Lucid freak or DELTAHead you can create, publish and exchange your Lucid or DELTA documents just as you do now. But if you're not (and most taxonomists aren't) it provides a way of improving the structure of your data by following a standardised set of rules. So Jim, it's not a standard that isn't a standard.
So tell me, does *anyone* out there agree with me?
Cheers - k