Kevin wrote:
I'm not sure about a requirement for strong validation in the spec. I agree that validation should be allowed, but not required.
But isn't that the main (only?) reason for bothering to encode descriptive data in the first place? To try to impose a degree of consistency, rigour, comparability and validity that is not possible in an unstructured, unformatted blob of text? If the specifications are not going to require this degree of standardization, aren't we letting the transferability of data and effort that we seek slip away?
The value of DELTA, Lucid and other character-based encodings of descriptive data is that they do impose a high degree of validation, whether it be by measurement ranges, allowable values or states, or whatever. Without validation in biological datasets you end up with a mess - just look at any herbarium or museum specimen dataset - who doesn't have terrestrial specimens that plot in the middle of the ocean?
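For the sake of a concrete picture, here is a rough sketch (in Python, with the character names, ranges and specimen values all invented for illustration, not taken from any dataset or spec) of the kind of range check such an encoding makes possible:

# Rough sketch of range validation of the kind DELTA/Lucid-style encodings allow.
# The declared ranges and the specimen record are invented for illustration.

ALLOWED_RANGES = {
    "leaf_length_mm": (5.0, 120.0),   # hypothetical measurement range
    "latitude":       (-90.0, 90.0),
    "longitude":      (-180.0, 180.0),
}

def validate_measurement(name, value):
    """Return a list of problems for one measured value (empty if it passes)."""
    if name not in ALLOWED_RANGES:
        return [f"{name}: no range declared, cannot be validated"]
    lo, hi = ALLOWED_RANGES[name]
    if not (lo <= value <= hi):
        return [f"{name}: {value} outside allowed range {lo}..{hi}"]
    return []

# A specimen record with a deliberate error, in the spirit of the ocean-going specimen:
specimen = {"leaf_length_mm": 38.0, "latitude": 91.5, "longitude": 149.1}
problems = [p for k, v in specimen.items() for p in validate_measurement(k, v)]
print(problems)   # -> ['latitude: 91.5 outside allowed range -90.0..90.0']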
There are three types of validation you suggest here:
- Validate the element strings (and presumably also items, values and
qualifiers) against predefined lists to guard against typographical and spelling errors;
and for consistency... one can be right, but inconsistent, and although humans can handle this, computers stuff it up...
- Validate values against a list of allowable values for the character;
This is the mechanism for addressing the above...
- Validate qualifiers against a list of allowable qualifiers.
It would be nice to put a cap on the number/type of 'qualifiers' allowed, but you can bet that there will always be something not on the list that people will want/need to use, but at this stage we should probably confine ourselves to the need for the list itself, not to the content of the list?
Taking these in reverse order:
as one would naturally want to do.. :)
- As I see it the spec itself would define a set of allowable qualifiers
(such as "rare" (= "rarely"?), "by misinterpretation", "uncertain" etc). I think we could probably agree on a limited set of qualifiers, and stick to that (with allowance for extension). If we do this, then "anchovies, please!" will be out for several reasons.
in theory yes, but in practice, if you open this can of worms, the discussion will probably go on for weeks...
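Purely as an illustration of the mechanism, not of any particular list, a small sketch (Python; the qualifier terms are just the examples mentioned above, not a proposed standard set) of checking qualifiers against a core list with room for declared extensions:

# Sketch: validate qualifiers against a small controlled list, with room for extension.
# The qualifier terms are only the examples mentioned above, not a proposal.

CORE_QUALIFIERS = {"rare", "by misinterpretation", "uncertain"}

def validate_qualifier(qualifier, extensions=frozenset()):
    """True if the qualifier is in the core list or a declared extension."""
    return qualifier in CORE_QUALIFIERS or qualifier in extensions

print(validate_qualifier("uncertain"))            # True
print(validate_qualifier("anchovies, please!"))   # False - out, as noted above
print(validate_qualifier("fide Smith 1901", extensions={"fide Smith 1901"}))  # True, via extension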
- Validation of allowable values is covered in the draft spec. One of the
properties of an element (~"character") is a list of allowable values (~"states"). If such a property is filled, then validation can be done by the program that handles the data, just as in DELTA and Lucid.
It is not just 'allowed', is it? Isn't this what the draft specs are all about?
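As a sketch only - the field names are illustrative, not taken from the draft spec - the handling-program side of this might look something like:

# Sketch: an element ("character") carrying an optional list of allowable values
# ("states"), and a handling program validating a coded value against it.
# Field names are invented for illustration.

leaf_shape = {
    "name": "leaf shape",
    "allowable_values": ["linear", "lanceolate", "ovate", "orbicular"],
}

def validate_value(element, value):
    allowed = element.get("allowable_values")
    if allowed is None:
        return True   # nothing declared, nothing to check (see the first note below)
    return value in allowed

print(validate_value(leaf_shape, "ovate"))      # True
print(validate_value(leaf_shape, "anchovies"))  # False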
Two notes:
- I'd like to allow but not enforce that a valid document have the
allowable_values property filled. By not enforcing it, a simply marked-up natural-language description could be a valid document. This would perhaps mean that the spec could meet the automated markup of legacy descriptions half-way, and I'm keen to do this. Of course, a document without this property specified could not be validated in the way you suggest and hence may not be very useful for some purposes, but this may be a price one is willing to pay for other benefits, and I think we need to keep this open.
It would be pretty difficult to validate a blob of free text... but I can see a case for wanting to include such text in the exercise... so perhaps there should be a deprecated class of elements that are essentially unable to be validated but perhaps of interest in some instances?
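A rough sketch of that half-way position (Python; the document structure and field names are invented, not the draft spec's): elements carrying an allowable-values list get validated, while marked-up free text is merely flagged as unvalidatable rather than rejected:

# Sketch of the half-way position: a document mixing encoded elements (which can be
# validated) with marked-up free text (which cannot), the latter only flagged.

document = [
    {"element": "leaf shape", "value": "ovate",
     "allowable_values": ["linear", "lanceolate", "ovate", "orbicular"]},
    {"element": "habit", "value": "a sprawling, rather untidy shrub"},   # free text, no list
]

def check(doc):
    for entry in doc:
        allowed = entry.get("allowable_values")
        if allowed is None:
            yield (entry["element"], "not validatable (free text)")
        elif entry["value"] in allowed:
            yield (entry["element"], "ok")
        else:
            yield (entry["element"], f"invalid value {entry['value']!r}")

for element, status in check(document):
    print(element, "->", status)

The free-text element still travels with the data; it simply does not take part in the validation.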
- I'm using "allowable values" rather than "states" as this seems to me to
be more general, and subsumes the requirement to distinguish between, for instance, multistate and numeric "characters". A numeric "character", of course, doesn't have "states", but it does have allowable values (integers in the case of an integer numeric, real numbers in the case of a real numeric).
This would cover the use of text rather than various numeric values - there is nothing that says everything has to be represented numerically. It should be possible to use text *and* to have it validated in some way.
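Again only as a sketch with invented field names, one way "allowable values" can cover multistate, integer, real and plain-text cases under the one notion:

# Sketch: "allowable values" as a single, general notion covering multistate,
# numeric (integer or real) and plain-text cases. Field names are illustrative.

def is_allowable(element, value):
    kind = element.get("kind", "text")
    if kind == "multistate":
        return value in element["states"]
    if kind == "integer":
        lo, hi = element["range"]
        return isinstance(value, int) and lo <= value <= hi
    if kind == "real":
        lo, hi = element["range"]
        return isinstance(value, (int, float)) and lo <= value <= hi
    # plain text: validated, if at all, against a word list
    allowed_words = element.get("word_list")
    return allowed_words is None or all(w in allowed_words for w in value.split())

print(is_allowable({"kind": "multistate", "states": ["ovate", "linear"]}, "ovate"))  # True
print(is_allowable({"kind": "integer", "range": (1, 12)}, 5))                        # True
print(is_allowable({"kind": "real", "range": (0.0, 30.0)}, 45.2))                    # False
print(is_allowable({"kind": "text", "word_list": {"glabrous", "hairy"}}, "hairy"))   # True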
- How strong is the requirement for this type of validation? Enforcing this
seems to me to be like requiring that all Word documents carry in their header a dictionary of the English language to allow validation of spellings. It seems to me that providing tools that allow people to check these strings against a predefined list (defined either within the document or in an external resource) would be useful, but not absolutely necessary. A document that is not or cannot be validated in this way would not be useless, and would perhaps be freer.
Documents do not have to carry their validation with them - they could refer to an external source of validation (or a dictionary, in your example above - this is what is implied and expected in Word documents at the moment).
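A sketch of that external-reference idea (Python; the URL, the property name and the JSON format of the external list are all hypothetical placeholders):

# Sketch: a document that points to an external source of allowable values rather
# than embedding them, along the lines of Word referring to an external dictionary.

import json
import urllib.request

element = {
    "name": "leaf shape",
    "allowable_values_ref": "http://example.org/vocabularies/leaf_shape.json",  # hypothetical
}

def load_allowed(ref):
    """Fetch an external list of allowable values (here assumed to be a JSON array)."""
    with urllib.request.urlopen(ref) as response:
        return set(json.load(response))

def validate(element, value):
    allowed = load_allowed(element["allowable_values_ref"])
    return value in allowed

# validate(element, "ovate")  # would fetch the external list and check the value

The document itself stays small, and the vocabulary can be maintained in one place and shared.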
Note that the spec as I see it would allow (but again, not enforce as DELTA and Lucid do) the encoding of descriptions. Thus, a valid document may be d1 as below. This would pre-empt the need for typographic validation, and allow allowable-values validation. But for some reason I don't want to disallow d2 as a valid document either.
This may be a non-threatening approach to gradually introducing validity and rigour into descriptive data and is probably worth exploring some more...
jim