Validation

Jim Croft jrc at ANBG.GOV.AU
Sun Sep 3 08:19:55 CEST 2000


Kevin wrote:
>I'm not sure about a requirement for strong validation in the spec. I agree
>that validation should be allowed, but not required.

But isn't that the main (only?) reason for bothering to encode descriptive
data in the first place?  To try and impose a degree of consistency,
rigour, comparability and validity that is not possible in an
unstructured, unformatted blob of text?  If the specifications are not
going to require this degree of standardization, are we letting the
transferability of data and effort that we seek slip by?

The value of DELTA, Lucid and other character-based encodings of
descriptive data is that they do impose a high degree of validation,
whether it be by measurement ranges, allowable values or states, or
whatever.  Without validation in biological datasets you end up with a
mess - just look at any herbarium or museum specimen dataset - who
doesn't have terrestrial specimens that plot in the middle of the ocean?
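
Even a couple of declared state lists and ranges catches that sort of thing.
A minimal sketch in Python (the field names are invented for illustration,
not taken from any spec):

    # Allowable states and numeric ranges declared once, checked for every
    # record - the kind of validation DELTA/Lucid-style encoding buys you.
    ALLOWABLE_HABIT = {"tree", "shrub", "herb", "climber"}

    def validate_record(record):
        """Return a list of validation problems for one specimen record."""
        errors = []
        # state validation: the value must come from the predeclared list
        if record.get("habit") not in ALLOWABLE_HABIT:
            errors.append("habit %r is not an allowable state" % record.get("habit"))
        # range validation: impossible coordinates get flagged instead of
        # quietly plotting specimens in the middle of the ocean
        lat = record.get("latitude")
        if lat is not None and not -90.0 <= lat <= 90.0:
            errors.append("latitude %s is outside -90..90" % lat)
        return errors

    print(validate_record({"habit": "treee", "latitude": 135.2}))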

>There are three types of validation you suggest here:
>1. Validate the element strings (and presumably also items, values and
>qualifiers) against predefined lists to guard against typographical and
>spelling errors;

and for consistency...  one can be right, but inconsistent, and although
humans can handle this, computers stuff it up...

>2. Validate values against a list of allowable values for the character;

This is the mechanism for addressing the above...

>3. Validate qualifiers against a list of allowable qualifiers.

It would be nice to put a cap on the number/type of 'qualifiers' allowed,
but you can bet that there will always be something not on the list that
people will want/need to use, so at this stage we probably should confine
ourselves to the need for the list itself, not to the content of the list?

>Taking these in reverse order:

as one would naturally want to do..  :)

>3. As I see it the spec itself would define a set of allowable qualifiers
>(such as "rare" (= "rarely"?), "by misinterpretation", "uncertain" etc). I
>think we could probably agree on a limited set of qualifiers, and stick to
>that (with allowance for extension). If we do this, then "anchovies,
>please!" will be out for several reasons.

in theory yes, but in practice, if you open this can of worms, the
discussion will probably go on for weeks...

>2. Validation of allowable values is covered in the draft spec. One of the
>properties of an element (~"character") is a list of allowable values
>(~"states"). If such a property is filled, then validation can be done by
>the program that handles the data, just as in DELTA and Lucid.

But it is not just 'allowed', is it?  Isn't this what the draft specs are
all about?

>Two notes:
>* I'd like to allow but not enforce that a valid document have the
>allowable_values property filled. By not enforcing it, a simply marked-up
>natural-language description could be a valid document. This would perhaps
>mean that the spec could meet half-way the automated markup of legacy
>descriptions, and I'm keen to do this. Of course, a document without this
>property specified would not be able to be validated in the way you suggest
>and hence may not be very useful for some purposes, but this may be a price
>one is willing to pay for other benefits, and I think we need to keep this
>open.

It would be pretty difficult to validate a blob of free text...  but I can
see a case for wanting to include such text in the exercise...  so perhaps
there should be a deprecated class of elements that are essentially unable
to be validated but perhaps of interest in some instances?
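
Something along these lines (my naming, just to make the idea concrete):

    # Elements flagged as free text are accepted but explicitly reported
    # as unvalidated, rather than pretending they were checked.
    def check(element, value):
        if element.get("free_text"):
            return "accepted (free text, not validated)"
        return "valid" if value in element["allowable_values"] else "INVALID"

    habit = {"allowable_values": {"tree", "shrub", "herb"}}
    notes = {"free_text": True}

    print(check(habit, "shrub"))
    print(check(notes, "smells faintly of aniseed when crushed"))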

>* I'm using "allowable values" rather than "states" as this seems to me to
>be more general, and subsumes the requirement to distinguish between, for
>instance, multistate and numeric "characters". A numeric "character" of
>course, doesn't have "states", but it does have allowable values (integers
>in the case of an integer numeric, real numbers in the case of a real
>numeric).

This would cover the use of text rather than various numeric values - there
is nothing that says everything has to be represented numerically.  It
should be possible to use text *and* to have it validated in some way.
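
The draft's allowable_values property (the names below are my guesses at
how it might look, not the spec itself) can carry text states for a
multistate element and a range for a numeric one, so both get validated
the same way:

    elements = {
        "leaf_shape":     {"type": "multistate", "allowable_values": {"ovate", "lanceolate", "linear"}},
        "stamen_count":   {"type": "integer",    "allowable_values": range(1, 51)},
        "leaf_length_mm": {"type": "real",       "allowable_values": (0.0, 500.0)},
    }

    def is_valid(name, value):
        spec = elements[name]
        allowed = spec["allowable_values"]
        if spec["type"] == "multistate":
            return value in allowed                        # text state
        if spec["type"] == "integer":
            return isinstance(value, int) and value in allowed
        if spec["type"] == "real":
            low, high = allowed
            return isinstance(value, (int, float)) and low <= value <= high
        return False

    print(is_valid("leaf_shape", "ovate"), is_valid("stamen_count", 120))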

>1. How strong is the requirement for this type of validation? Enforcing this
>seems to me to be like requiring that all word documents carry in their
>header a dictionary of the english language to allow validation of
>spellings. It seems to me that providing tools that allow people to check
>these strings against a predefined list (defined either within the document
>or in an external resource) would be useful, but not absolutely necessary. A
>document that is not or cannot be validated in this way would not be
>useless, and would perhaps be more free.

Documents do not have to carry their validation with them - they could
refer to an external source of validation (or a dictionary, in your example
above; this is what is implied and expected in Word documents at the moment).
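
In practice that just means the document carries a pointer to the list
rather than the list itself.  A sketch (the URL is a placeholder, not a
real vocabulary service):

    import json
    from urllib.request import urlopen

    document = {
        "element": "habit",
        "value": "shrub",
        # external "dictionary" the document claims to conform to
        "allowable_values_source": "https://example.org/vocab/habit.json",
    }

    def validate_against_external(doc):
        # fetch the allowable values from the referenced source, e.g.
        # ["tree", "shrub", "herb"], and check the document's value against it
        with urlopen(doc["allowable_values_source"]) as response:
            allowed = set(json.load(response))
        return doc["value"] in allowed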

>Note that the spec as I see it would allow (but again, not enforce as DELTA
>and Lucid do) the encoding of descriptions. Thus, a valid document may be d1
>as below. This would preempt the need for typographic validation, and allow
>allowable-values validation. But for some reason I don't want to disallow d2
>as a valid document also.

This may be a non-threatening approach to gradually introducing validity
and rigour into descriptive data, and is probably worth exploring some more...

jim



