Validation

Joseph H. Kirkbride, Jr. jkirkbri at ASRR.ARSUSDA.GOV
Mon Sep 4 11:39:02 CEST 2000


The last few days of discussion have seemed to be drifting towards a
'free' structure which would not enforce comparability.  I totally agree
with Jim that one of the key elements in DELTA format is the absolute
enforcement of parallelism and compatibility.  The idea that each species
or specimen in a genus could be a separate document is anathema to me.  I
want rigor and comparability in descriptions, keys, and data, not a retreat
to the chaos of the past.

Joseph H. Kirkbride, Jr.
USDA, Agricultural Research Service
Systematic Botany and Mycology Laboratory
Room 304, Building 011A, BARC-West
Beltsville, Maryland 20705-2350 USA
Voice telephone: 301-504-9447
FAX: 301-504-5810
Internet: jkirkbri at asrr.arsusda.gov


On Sun, 3 Sep 2000, Jim Croft wrote:

> Kevin wrote:
> >I'm not sure about a requirement for strong validation in the spec. I agree
> >that validation should be allowed, but not required.
>
> But isn't that the main (only?) reason for bothering to encode descriptive
> data in the first place?  To try and impose a degree of consistency,
> rigour, comparability  and validity that is not possible in an
> unstructured, unformatted blob of text?   If the specifications are not
> going to require this degree of standardization, are we going to be
> letting the transferability of data and effort that we seek slip by?
>
> The value of the DELTA, Lucid and other character-based encoding of
> descriptive data is that they do impose a high degree of validation,
> whether it be by measurement ranges, allowable values or states, or
> whatever.  Without validation in biological datasets you end up with a mess
> - just look at any herbarium or museum specimen dataset - who doesn't have
> terrestrial specimens that plot in the middle of the ocean?
>
> >There are three types of validation you suggest here:
> >1. Validate the element strings (and presumably also items, values and
> >qualifiers) against predefined lists to guard against typographical and
> >spelling errors;
>
> and for consistency...  one can be right, but inconsistent, and although
> humans can handle this, computers stuff it up...
>
> >2. Validate values against a list of allowable values for the character;
>
> This is the mechanism for addressing the above...
>
> >3. Validate qualifiers against a list of allowable qualifiers.
>
> It would be nice to put a cap on the number/type of 'qualifiers' allowed,
> but you can bet that there will always be something on a list that people
> will want/need to use, but at this stage we probably should confine ourselves
> to the need for the list itself, not to the content of the list?
>
> >Taking these in reverse order:
>
> as one would naturally want to do..  :)
>
> >3. As I see it the spec itself would define a set of allowable qualifiers
> >(such as "rare" (= "rarely"?), "by misinterpretation", "uncertain" etc). I
> >think we could probably agree on a limited set of qualifiers, and stick to
> >that (with allowance for extension). If we do this, then "anchovies,
> >please!" will be out for several reasons.
>
> in theory yes, but in practice, if you open this can of worms, the
> discussion will probably go on for weeks...
>
> >2. Validation of allowable values is covered in the draft spec. One of the
> >properties of an element (~"character") is a list of allowable values
> >(~"states"). If such a property is filled, then validation can be done by
> >the program that handles the data, just as in DELTA and Lucid.
>
> It is not just 'allowed', is it?  Isn't this what the draft specs are
> all about?
>
> >Two notes:
> >* I'd like to allow but not enforce that a valid document have the
> >allowable_values property filled. By not enforcing it, a simply marked-up
> >natural-language description could be a valid document. This would perhaps
> >mean that the spec could meet half-way the automated markup of legacy
> >descriptions, and I'm keen to do this. Of course, a document without this
> >property specified would not be able to be validated in the way you suggest
> >and hence may not be very useful for some purposes, but this may be a price
> >one is willing to pay for other benefits, and I think we need to keep this
> >open.
>
> It would be pretty difficult to validate a blob of free text...  but I can
> see a case for wanting to include them in the exercise...  so perhaps there
> should be a deprecated class of elements that are essentially unable to be
> validated but perhaps of interest in some instances?
>
> >* I'm using "allowable values" rather than "states" as this seems to me to
> >be more general, and subsumes the requirement to distinguish between, for
> >instance, multistate and numeric "characters". A numeric "character" of
> >course, doesn't have "states", but it does have allowable values (integers
> >in the case of an integer numeric, real numbers in the case of a real
> >numeric).
>
> This would cover the use of text rather than various numeric values - there
> is nothing that should say everything has to be represented numerically.  It
> should be possible to use text *and* to have it validated in some way.
>
> >1. How strong is the requirement for this type of validation? Enforcing this
> >seems to me to be like requiring that all Word documents carry in their
> >header a dictionary of the English language to allow validation of
> >spellings. It seems to me that providing tools that allow people to check
> >these strings against a predefined list (defined either within the document
> >or in an external resource) would be useful, but not absolutely necessary. A
> >document that is not or cannot be validated in this way would not be
> >useless, and would perhaps be more free.
>
> Documents do not have to carry their validation with them - they could
> refer to an external source of validation (or dictionary in your above
> example - this is what is implied and expected in Word documents at the moment)
>
> >Note that the spec as I see it would allow (but again, not enforce as DELTA
> >and Lucid do) the encoding of descriptions. Thus, a valid document may be d1
> >as below. This would preempt the need for typographic validation, and allow
> >allowable-values validation. But for some reason I don't want to disallow d2
> >as a valid document also.
>
> This may be a non-threatening approach to gradually introducing validity
> and rigour into descriptive data and is probably worth exploring some more...
>
> jim
>
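
As a rough illustration of the three kinds of validation discussed in the quoted thread (element names, allowable values, qualifiers), here is a minimal Python sketch. Everything in it is an assumption made for illustration: the names Element, ELEMENTS, ALLOWED_QUALIFIERS and validate, the example characters, and the "integer"/"real" markers do not come from the draft spec, DELTA or Lucid. An element with no declared allowable values is accepted without a value check, which corresponds to the "allowed but not required" position in the thread.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional, Union

# Hypothetical qualifier list; the thread only mentions these as examples.
ALLOWED_QUALIFIERS = frozenset({"rare", "by misinterpretation", "uncertain"})


@dataclass(frozen=True)
class Element:
    """A described element (roughly a 'character'); hypothetical structure."""
    name: str
    # A set of permitted strings, the marker "integer" or "real",
    # or None when no allowable values are declared (value is not checked).
    allowable_values: Optional[Union[FrozenSet[str], str]] = None


# Hypothetical element list used for the examples below.
ELEMENTS: Dict[str, Element] = {
    "leaf shape": Element("leaf shape", frozenset({"ovate", "lanceolate"})),
    "petal number": Element("petal number", "integer"),
    "habit notes": Element("habit notes"),  # free text, cannot be validated
}


def validate(element_name, value, qualifier=None, elements=ELEMENTS) -> List[str]:
    """Return a list of problems; an empty list means no check failed."""
    # Type 1: the element string must appear in the predefined list.
    element = elements.get(element_name)
    if element is None:
        return [f"unknown element: {element_name!r}"]

    problems = []

    # Type 2: the value must be among the element's allowable values, if any.
    allowed = element.allowable_values
    if allowed is None:
        ok = True
    elif allowed == "integer":
        ok = isinstance(value, int) and not isinstance(value, bool)
    elif allowed == "real":
        ok = isinstance(value, (int, float)) and not isinstance(value, bool)
    else:
        ok = value in allowed
    if not ok:
        problems.append(f"value {value!r} not allowed for {element_name!r}")

    # Type 3: the qualifier, if present, must be on the qualifier list.
    if qualifier is not None and qualifier not in ALLOWED_QUALIFIERS:
        problems.append(f"unknown qualifier: {qualifier!r}")

    return problems
```

With these assumed lists, validate("leaf shape", "ovate") returns an empty list, a misspelt element name such as "leaf shpae" is reported as unknown, and validate("petal number", 5, "anchovies, please!") reports only the qualifier. The lists could equally be loaded from an external resource rather than carried in the document.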



