Validation

Sat Sep 2 21:36:15 CEST 2000

At 08:19 AM 9/3/00 +1000, Jim Croft wrote:
>Kevin wrote:
>>I'm not sure about a requirement for strong validation in the spec. I agree
>>that validation should be allowed, but not required.
>
>But isn't that the main (only?) reason for bothering to encode descriptive
>data in the first place?  To try and impose a degree of consistency,
>rigour, comparability  and validity that is not possible in an
>unstructured, unformatted blob of text?   If the specifications are not
>going to require this degree of standardization, are we going to be
>letting the transferability of data and effort that we seek slip by?
>
>The value of the DELTA, Lucid and other character-based encoding of
>descriptive data is that they do impose a high degree of validation,
>whether it be by measurement ranges, allowable values or states, or
>whatever.  Without validation in biological datasets you end up with a mess
>- just look at any herbarium or museum specimen dataset - who doesn't have
>terrestrial specimens that plot in the middle of the ocean?

I agree that there should be the capability to do strong validation, but as
Jim stated below there are times when "free" text is appropriate in a
taxonomic description. I'll go farther than Jim, however, and say free text
fields are not only interesting for discussion but necessary elements of the
standard. If we do not allow at least some free text in the descriptions then
the standard will not be applicable to many worthy endeavors where it would be
of great use. The matter hinges on the type of taxonomic descriptions that we
want the standard to cover. We should discuss some of them.

Since there was a little confusion about what is meant by text, I'll
distinguish between two types. "Controlled text" comes from a controlled
vocabulary. It is prescribed, like the controlled vocabularies used in
libraries for indexing. "Free text" is just words in the wild. It can be
anything that the writer intends, including anchovies. For example, Kevin has
argued that the qualifiers should come from a controlled and therefore
validatable vocabulary.

Two Types of Taxonomic Description: (not even a little exhaustive but
certainly exhausting)

Certainly the first type of taxonomic description is one that is used by
computer identification systems like IntKey and Lucid. For this type of
taxonomic description we need strong type checking and validation. It would
need controlled vocabularies, no anchovies for qualifiers, numeric checking,
range checking...
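
A rough sketch, in Python, of the kind of checking this first type of
description calls for. Everything named here (the two character classes,
problems_with, the example characters and their ranges) is made up for
illustration and is not taken from the draft spec, DELTA, or Lucid.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StateCharacter:
    """A character whose values come from a controlled list of states."""
    name: str
    allowable_values: frozenset


@dataclass(frozen=True)
class NumericCharacter:
    """A numeric character with an inclusive allowable range."""
    name: str
    minimum: float
    maximum: float
    integer_only: bool = False


def problems_with(character, value):
    """Return a list of validation problems; an empty list means the value passes."""
    if isinstance(character, StateCharacter):
        # Controlled vocabulary check.
        if value not in character.allowable_values:
            return [f"{character.name}: {value!r} is not an allowable value"]
        return []
    # Numeric checking: bool is excluded because Python treats it as an int.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return [f"{character.name}: {value!r} is not a number"]
    if character.integer_only and not float(value).is_integer():
        return [f"{character.name}: {value!r} is not an integer"]
    # Range checking.
    if not character.minimum <= value <= character.maximum:
        return [f"{character.name}: {value!r} is outside "
                f"{character.minimum}..{character.maximum}"]
    return []


# Hypothetical characters for one project.
petal_colour = StateCharacter("petal colour", frozenset({"white", "yellow", "red"}))
leaf_length_mm = NumericCharacter("leaf length (mm)", 1, 500)
petal_count = NumericCharacter("petal count", 0, 20, integer_only=True)

for character, value in [(petal_colour, "anchovies"), (leaf_length_mm, 9000)]:
    print(problems_with(character, value))
```

Returning a list of problems rather than raising on the first one lets a
validator report everything wrong with a description in a single pass.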

The second type of description is an entry from a field guide to butterflies
of the arctic. It is mostly intended for direct human consumption without much
prior chewing by a computer. Even in this case there might be good reason to
allow the field guide entry to follow the standard. For example, assuming that
the field guide is on-line, we might want to do some name validation and
indexing in the Global Taxonomic Workbench. Also, even field guides have keys
sometimes. We shouldn't just ignore interesting text like "The wood of this
tree makes wonderful canoes." I do not mean that we should have a
<Canoe-Quality> character. We'll never be able to capture everything in
characters.

There is certainly a substantial middle ground between these types. For
example, the first document type, the key matrix data, might be run through a
program to produce at least acceptable natural language descriptions. This has
proven useful for some projects using DELTA. Like DELTA, we'll need to allow
"holes" in the validation so that we can include text to make things a little
more readable for us humans.
>

>>3. Validate qualifiers against a list of allowable qualifiers.
>
>It would be nice to put a cap on the number/type of 'qualifiers' allowed,
>but you can bet that there will always be something on a list that people
>will want/need to use, but at this stage we probably should confine ourselves
>to the need for the list itself, not to the content of the list?
Perhaps the standard could have a suggested set or two but require that,
if qualifiers are used, they be from a project-managed controlled
vocabulary. This is a little like Z39.50 allowing different attribute sets but
still being a standard. I did say a _Little_ like Z39.50.
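
One way this could work, sketched in Python. The suggested set below holds
only the three qualifiers Kevin gave as examples; the function names, the
"when young" extension, and the fallback to the suggested set when a project
declares nothing are my assumptions, not anything in the draft.

```python
# Hypothetical suggested set shipped with the standard. The three entries
# are the example qualifiers mentioned in this thread.
SUGGESTED_QUALIFIERS = frozenset({"rare", "uncertain", "by misinterpretation"})


def make_qualifier_checker(project_vocabulary=None):
    """Build a checker bound to one project's controlled vocabulary.

    Assumption: a project that declares no vocabulary of its own falls
    back to the suggested set.
    """
    if project_vocabulary is None:
        vocabulary = SUGGESTED_QUALIFIERS
    else:
        vocabulary = frozenset(project_vocabulary)

    def unknown_qualifiers(qualifiers):
        """Return the qualifiers that are not in the project's vocabulary."""
        return [q for q in qualifiers if q not in vocabulary]

    return unknown_qualifiers


# A project extends the suggested set with one qualifier of its own.
gouania_check = make_qualifier_checker(SUGGESTED_QUALIFIERS | {"when young"})
default_check = make_qualifier_checker()

print(gouania_check(["rare", "when young"]))
print(gouania_check(["anchovies, please!"]))
print(default_check(["when young"]))
```

The standard fixes only the rule (qualifiers must come from a declared list);
each project decides what is on its list.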
>
>>Taking these in reverse order:
>
>as one would naturally want to do..  :)
>
>>3. As I see it the spec itself would define a set of allowable qualifiers
>>(such as "rare" (= "rarely"?), "by misinterpretation", "uncertain" etc). I
>>think we could probably agree on a limited set of qualifiers, and stick to
>>that (with allowance for extension). If we do this, then "anchovies,
>>please!" will be out for several reasons.
>
>in theory yes, but in practice, if you open this can of worms, the
>discussion will probably go on for weeks...

Oh, I am sure that we'll find some other trivia to argue about for weeks if
not this.
>
>>2. Validation of allowable values is covered in the draft spec. One of the
>>properties of an element (~"character") is a list of allowable values
>>(~"states"). If such a property is filled, then validation can be done by
>>the program that handles the data, just as in DELTA and Lucid.
>
>It is not just 'allowed', isn't it?  Isn't this what the draft specs are
>all about?
Usually, but we do need to make it clear in documents following the
standards which pieces are validatable and how they are validated.
>
>>Two notes:
>>* I'd like to allow but not enforce that a valid document have the
>>allowable_values property filled. By not enforcing it, a simply marked-up
>>natural-language description could be a valid document. This would perhaps
>>mean that the spec could meet half-way the automated markup of legacy
>>descriptions, and I'm keen to do this. Of course, a document without this
>>property specified would not be able to be validated in the way you suggest
>>and hence may not be very useful for some purposes, but this may be a price
>>one is willing to pay for other benefits, and I think we need to keep this
>>open.
>
>It would be pretty difficult to validate a blob of free text...  but I can
>see a case for wanting to include them in the exercise...  so perhaps there
>should be a deprecated class of elements that are essentially unable to be
>validated but perhaps of interest in some instances?

We can validate existence if not content. I should be able to say that all of
my descriptions will have <natural history> notes and maybe <medical uses> in
regular old natural language. If I set up my project standard correctly under
SDD, my validator should scream if an author tries to send in a manuscript
without that section, even if the computer cannot read the text. Also,
computer programs processing the text should be able to tell what section they
are in. For example, people should be able to search just the <Medical uses>
section for the word "eye". If you cannot do that, you might end up with a
list of all creatures with eyes and not medicines for your own eye.
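
A small Python sketch of both ideas, using the standard library's
xml.etree.ElementTree. It assumes descriptions are marked up as XML, and it
writes the section tags as natural_history and medical_uses because XML names
cannot contain spaces. The tag names, the taxon attribute, and the sample
text are all invented for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical project rule: every description must carry these sections.
REQUIRED_SECTIONS = ("natural_history", "medical_uses")

SAMPLE = """
<descriptions>
  <description taxon="Taxon A">
    <natural_history>Each eye is ringed with white.</natural_history>
    <medical_uses>None recorded.</medical_uses>
  </description>
  <description taxon="Taxon B">
    <natural_history>Grows along stream banks.</natural_history>
    <medical_uses>Leaf infusion used as an eye wash.</medical_uses>
  </description>
  <description taxon="Taxon C">
    <natural_history>Known from one locality.</natural_history>
  </description>
</descriptions>
"""


def missing_sections(description):
    """Check existence only: the free text inside a section is not inspected."""
    return [tag for tag in REQUIRED_SECTIONS if description.find(tag) is None]


def descriptions_mentioning(root, section, word):
    """Return the taxa whose given section contains the word as a whole word."""
    hits = []
    for description in root.iter("description"):
        node = description.find(section)
        text = "".join(node.itertext()) if node is not None else ""
        if word.lower() in re.findall(r"[a-z]+", text.lower()):
            hits.append(description.get("taxon"))
    return hits


root = ET.fromstring(SAMPLE)
for description in root.iter("description"):
    print(description.get("taxon"), missing_sections(description))
print(descriptions_mentioning(root, "medical_uses", "eye"))
```

Searching medical_uses for "eye" finds only the eye wash entry; the
description that merely has eyes turns up only if you search natural_history.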

Finally, there is free text in legacy data. It is reasonable to write computer
programs that process this data and "mark up" interesting things, at least
weakly following the standard. We might be able to have programs find the
Latin name, common names, authority, bibliography and other useful things and
tag them for future use. It would be impossible, however, to make programs
that convey the entire contents of these free text documents into the highly
structured "interactive key" quality data sets.
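
As a very rough illustration of such weak markup, here is a Python sketch
that tags Latin binomials in legacy text. It assumes a list of known genus
names is available to the program; the latin_name tag and the build_tagger
function are hypothetical. It only finds a genus followed by a lower-case
epithet and makes no attempt at authorities, common names, or bibliography.

```python
import re


def build_tagger(known_genera):
    """Return a function that wraps 'Genus epithet' pairs in a latin_name tag.

    Only genera in known_genera are tagged, which avoids marking ordinary
    sentence openings such as 'The wood' as names.
    """
    if not known_genera:
        # Nothing to look for, so leave the text untouched.
        return lambda text: text
    genus = "|".join(re.escape(g) for g in sorted(known_genera))
    pattern = re.compile(rf"\b(?:{genus}) [a-z]{{2,}}\b")

    def tag(text):
        return pattern.sub(
            lambda m: f"<latin_name>{m.group(0)}</latin_name>", text
        )

    return tag


tag = build_tagger({"Gouania"})
print(tag("Specimens of Gouania lupuloides were examined."))
print(tag("The wood of this tree makes wonderful canoes."))
```

The second sentence passes through unchanged, which is the point: the program
marks what it can recognise and leaves the rest as free text.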

>
>>* I'm using "allowable values" rather than "states" as this seems to me to
>>be more general, and subsumes the requirement to distinguish between, for
>>instance, multistate and numeric "characters". A numeric "character" of
>>course, doesn't have "states", but it does have allowable values (integers
>>in the case of an integer numeric, real numbers in the case of a real
>>numeric).

I would like to see a "Free Text" character type in the standard and warn
authors that they cannot use it for characters that they want to process
through interactive key programs.

>
>This would cover the use of text rather than various numeric values - there
>is nothing that should say everything has to be represented numerically.  It
>should be possible to use text *and* to have it validated in some way.
>
Right
>>1. How strong is the requirement for this type of validation? Enforcing this
>>seems to me to be like requiring that all word documents carry in their
>>header a dictionary of the english language to allow validation of
>>spellings. It seems to me that providing tools that allow people to check
>>these strings against a predefined list (defined either within the document
>>or in an external resource) would be useful, but not absolutely necessary. A
>>document that is not or cannot be validated in this way would not be
>>useless, and would perhaps be more free.
>
>Documents do not have to carry their validation with them - they could
>refer to an external source of validation (or dictionary in your above
>example - this is what is implied and expected in Word documents at the
>moment)
Right, systems work this way now. We could store the standard that a
description is supposed to follow anywhere in the world. At the beginning of
the document, place text saying where to find it.
<SDD Version 1.10 http://global.taxonomy.org/SDD1.1/ Gouania >
This assumes some group had established the character lists for Gouania and
registered them with global.taxonomy.org, the way W3 stores standards.
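
A program reading such a document could pick the reference out of the first
line before doing anything else. A minimal Python sketch, assuming the header
always has the layout of the single example above (version, location,
character list name); the StandardReference type and the function name are
made up, and nothing is fetched from the network.

```python
import re
from typing import NamedTuple, Optional


class StandardReference(NamedTuple):
    version: str
    location: str
    character_list: str


# Layout inferred from the one example header; a real spec would define it.
HEADER = re.compile(r"<SDD\s+Version\s+(\S+)\s+(\S+)\s+(\S+)\s*>")


def read_standard_reference(document: str) -> Optional[StandardReference]:
    """Return the reference declared on the first line, or None if absent."""
    stripped = document.strip()
    if not stripped:
        return None
    match = HEADER.match(stripped.splitlines()[0])
    return StandardReference(*match.groups()) if match else None


doc = (
    "<SDD Version 1.10 http://global.taxonomy.org/SDD1.1/ Gouania >\n"
    "Description text follows here.\n"
)
print(read_standard_reference(doc))
```

A validator would then look up the character lists at the declared location
and check the rest of the document against them.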
>
>>Note that the spec as I see it would allow (but again, not enforce as DELTA
>>and Lucid do) the encoding of descriptions. Thus, a valid document may be d1
>>as below. This would preempt the need for typographic validation, and allow
>>allowable-values validation. But for some reason I don't want to disallow d2
>>as a valid document also.
>
>This may be a non threatening approach to gradually introducing validity
>and rigour into descriptive data and is probably worth exploring some more...
Bravo
>
>jim

Bryan