At 08:19 AM 9/3/00 +1000, Jim Croft wrote:
Kevin wrote:
I'm not sure about a requirement for strong validation in the spec. I agree that validation should be allowed, but not required.
But isn't that the main (only?) reason for bothering to encode descriptive data in the first place? To try and impose a degree of consistency, rigour, comparability and validity that is not possible in an unstructured, unformatted blob of text? If the specifications are not going to require this degree of standardization, are we going to let the transferability of data and effort that we seek slip by?
The value of the DELTA, Lucid and other character-based encodings of descriptive data is that they do impose a high degree of validation, whether it be by measurement ranges, allowable values or states, or whatever. Without validation in biological datasets you end up with a mess - just look at any herbarium or museum specimen dataset - who doesn't have terrestrial specimens that plot in the middle of the ocean?
I agree that there should be the capability to do strong validation, but as Jim stated below there are times when "free" text is appropriate in a taxonomic description. I'll go farther than Jim, however, and say free-text fields are not only interesting for discussion but necessary elements of the standard. If we do not allow at least some free text in the descriptions then the standard will not be applicable to many worthy endeavors where it would be of great use. The matter hinges on the types of taxonomic description that we want the standard to cover. We should discuss some of them.
Since there was a little confusion about what is meant by text, I'll distinguish between two types. Controlled text comes from a controlled vocabulary. It is prescribed, like the controlled vocabularies used in libraries for indexing. "Free text" is just words in the wild. It can be anything that the writer intends, including anchovies. For example, Kevin has argued that the qualifiers should come from a controlled, and therefore validatable, vocabulary.
Two Types of Taxonomic Description: (not even a little exhaustive but certainly exhausting)
Certainly the first type of taxonomic description is one that is used by computer identification systems like IntKey and Lucid. For this type of taxonomic description we need strong type checking and validation. It would need controlled vocabularies, no anchovies for qualifiers, numeric checking, range checking...
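To make this concrete, a strongly validated character might look something like this (the element and attribute names here are invented for illustration, not taken from the draft spec):

  <character id="leaf_arrangement" type="multistate">
    <!-- only these states are legal; a validator rejects anything else -->
    <allowable_values>
      <state>alternate</state>
      <state>opposite</state>
      <state>whorled</state>
    </allowable_values>
  </character>

  <!-- a score that can be checked mechanically against the list above -->
  <score character="leaf_arrangement" value="opposite" qualifier="rarely"/>

A program building an interactive key can trust every value in such a document; "anchovies" simply will not parse.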
The second type of description is an entry from a field guide to butterflies of the arctic. It is mostly intended for direct human consumption without much prior chewing by a computer. Even in this case there might be good reason to allow the field guide entry to follow the standard. For example, assuming that the field guide is on-line we might want to do some name validation and indexing in the Global Taxonomic Workbench. Also, even field guides have keys sometimes. We shouldn't just ignore interesting text like "The wood of this tree makes wonderful canoes." I do not mean that we should have a <Canoe-Quality> character. We'll never be able to capture everything in characters.
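A field-guide entry under the standard might carry only the lightest structure - say, the name tagged for validation and indexing, and everything else left as prose. A purely illustrative sketch:

  <description>
    <taxon_name>Thuja plicata</taxon_name>
    <!-- free text: searchable and indexable, but not validated
         against any character list -->
    <text>
      Large conifer of moist coastal forests. The wood of this
      tree makes wonderful canoes.
    </text>
  </description>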
There is certainly a substantial middle ground between these types. For example, the first document type, the key matrix data, might be run through a program to produce at least acceptable natural-language descriptions. This has proven useful for some projects using DELTA. Like DELTA, we'll need to allow "holes" in the validation so that we can include text to make things a little more readable for us humans.
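So a single description might mix coded scores with free-text "holes", something like this (illustrative names again):

  <description>
    <score character="petal_count" value="5"/>
    <!-- a "hole": free text included purely for human readers;
         validators and key generators skip over it -->
    <comment>petals often falling early, so count the buds</comment>
    <score character="flower_colour" value="white"/>
  </description>

A natural-language generator would render the scores as prose and pass the comment through untouched.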
- Validate qualifiers against a list of allowable qualifiers.
It would be nice to put a cap on the number/type of 'qualifiers' allowed, but you can bet that there will always be something on a list that people will want/need to use, but at this stage we probably should confine ourselves to the need for the list itself, not to the content of the list?
Perhaps the standard could have a suggested set or two, but require that if qualifiers are used, they come from a project-managed controlled vocabulary. This is a little like Z39.50 allowing different attribute sets but still being a standard. I did say a _little_ like Z39.50.
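The document would then declare which vocabulary it uses, and the validator would check qualifiers against that list rather than against anything hard-wired into the standard. A sketch, with invented names:

  <qualifier_vocabulary source="project">
    <!-- the standard requires *a* managed list;
         what goes on the list is up to the project -->
    <qualifier>rarely</qualifier>
    <qualifier>uncertain</qualifier>
    <qualifier>by misinterpretation</qualifier>
  </qualifier_vocabulary>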
Taking these in reverse order:
as one would naturally want to do.. :)
- As I see it the spec itself would define a set of allowable qualifiers (such as "rare" (= "rarely"?), "by misinterpretation", "uncertain" etc). I think we could probably agree on a limited set of qualifiers, and stick to that (with allowance for extension). If we do this, then "anchovies, please!" will be out for several reasons.
in theory yes, but in practice, if you open this can of worms, the discussion will probably go on for weeks...
Oh, I am sure that we'll find some other trivia to argue about for weeks if not this.
- Validation of allowable values is covered in the draft spec. One of the properties of an element (~"character") is a list of allowable values (~"states"). If such a property is filled, then validation can be done by the program that handles the data, just as in DELTA and Lucid.
It is not just 'allowed', is it? Isn't this what the draft specs are all about?
Usually, but we do need to make it clear in documents following the standards which pieces are validatable and how they are validated.
Two notes:
- I'd like to allow but not enforce that a valid document have the allowable_values property filled. By not enforcing it, a simply marked-up natural-language description could be a valid document. This would perhaps mean that the spec could meet the automated markup of legacy descriptions half-way, and I'm keen to do this. Of course, a document without this property specified could not be validated in the way you suggest and hence may not be very useful for some purposes, but this may be a price one is willing to pay for other benefits, and I think we need to keep this open.
It would be pretty difficult to validate a blob of free text... but I can see a case for wanting to include them in the exercise... so perhaps there should be a deprecated class of elements that are essentially unable to be validated but perhaps of interest in some instances?
We can validate existence if not content. I should be able to say that all of my descriptions will have <natural history> notes and maybe <medical uses> in regular old natural language. If I set up my project standard correctly under SDD, my validator should scream if an author tries to send in a manuscript without that section, even if the computer cannot read the text. Also, computer programs processing the text should be able to tell what section they are in. For example, people should be able to search just the <Medical uses> section for the word "eye". If you cannot do that you might end up with a list of all creatures with eyes and not medicines for your own eye.
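In DTD terms (just a sketch of the idea, not proposed SDD syntax) that kind of existence-only validation might look like:

  <!-- every description MUST have a natural_history section and
       MAY have a medical_uses section; their content is plain text
       that no validator inspects -->
  <!ELEMENT description     (coded_data, natural_history, medical_uses?)>
  <!ELEMENT coded_data      ANY>
  <!ELEMENT natural_history (#PCDATA)>
  <!ELEMENT medical_uses    (#PCDATA)>

The parser screams at the missing section, and a search confined to medical_uses finds medicines for eyes rather than creatures with them.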
Finally, there is free text in legacy data. It is reasonable to write computer programs that process this data and "mark up" interesting things, at least weakly following the standard. We might be able to have programs find the Latin name, common names, authority, bibliography and other useful things and tag them for future use. It would be impossible, however, to make programs that convert the entire contents of these free-text documents into the highly structured "interactive key" quality data sets.
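The output of such a mark-up program might be nothing more ambitious than this (the taxon, authority and citation are all made up):

  <legacy_description>
    <taxon_name>Gouania exemplaris</taxon_name> <authority>Anon.</authority>,
    the <common_name>example gouania</common_name>, is a climbing shrub of
    rainforest margins... <citation>Fl. Exempl. 2: 123 (1900)</citation>
  </legacy_description>

The sentences themselves are untouched; the program has only wrapped the pieces it could recognize.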
- I'm using "allowable values" rather than "states" as this seems to me to
be more general, and subsumes the requirement to distinguish between, for instance, multistate and numeric "characters". A numeric "character" of course, doesn't have "states", but it does have allowable values (integers in the case of an integer numeric, real numbers in the case of a real numeric).
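For numerics the allowable-values property might reduce to a range and a unit, e.g. (invented syntax):

  <!-- integer numeric: whole numbers from 3 to 7 inclusive -->
  <character id="petal_count" type="integer">
    <allowable_values min="3" max="7"/>
  </character>

  <!-- real numeric: any real value in the range -->
  <character id="leaf_length" type="real">
    <allowable_values min="1.5" max="12.0" units="cm"/>
  </character>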
I would like to see a "Free Text" character type in the standard, and warn authors that they cannot use it for characters that they want to process through interactive-key programs.
This would cover the use of text rather than various numeric values - there is nothing that says everything has to be represented numerically. It should be possible to use text *and* to have it validated in some way.
Right
- How strong is the requirement for this type of validation? Enforcing this seems to me to be like requiring that all Word documents carry in their header a dictionary of the English language to allow validation of spellings. It seems to me that providing tools that allow people to check these strings against a predefined list (defined either within the document or in an external resource) would be useful, but not absolutely necessary. A document that is not or cannot be validated in this way would not be useless, and would perhaps be more free.
Documents do not have to carry their validation with them - they could refer to an external source of validation (or dictionary in your above example - this is what is implied and expected in Word documents at the moment).
Right, systems work this way now. We could store the standard that a description is supposed to follow anywhere in the world. At the beginning of the document, place text saying where to find it: <SDD Version 1.10 http://global.taxonomy.org/SDD1.1/ Gouania >, assuming some group had established the character lists for Gouania and registered them with global.taxonomy.org, like W3 stores standards.
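The document header then needs nothing more than a pointer, perhaps along these lines (the attribute names are guesses, and the URL is just the hypothetical registry from the example above):

  <!-- the document carries no character lists itself; a validator
       fetches the registered definitions from the named location -->
  <sdd_document standard="SDD" version="1.1"
                definitions="http://global.taxonomy.org/SDD1.1/Gouania">
    ...
  </sdd_document>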
Note that the spec as I see it would allow (but again, not enforce as DELTA and Lucid do) the encoding of descriptions. Thus, a valid document may be d1 as below. This would preempt the need for typographic validation, and allow allowable-values validation. But for some reason I don't want to disallow d2 as a valid document also.
This may be a non-threatening approach to gradually introducing validity and rigour into descriptive data and is probably worth exploring some more...
Bravo
jim
Bryan