Kevin wrote:
I'm not sure about a requirement for strong validation in the spec. I agree that validation should be allowed, but not required.
But isn't that the main (only?) reason for bothering to encode descriptive data in the first place? To try to impose a degree of consistency, rigour, comparability and validity that is not possible in an unstructured, unformatted blob of text? If the specifications are not going to require this degree of standardization, aren't we letting the transferability of data and effort that we seek slip away?
The value of DELTA, Lucid and other character-based encodings of descriptive data is that they do impose a high degree of validation, whether it be by measurement ranges, allowable values or states, or whatever. Without validation in biological datasets you end up with a mess - just look at any herbarium or museum specimen dataset - who doesn't have terrestrial specimens that plot in the middle of the ocean?
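For the sake of a concrete picture, here is a rough sketch (in Python, with the character names, ranges and specimen values all invented for illustration, not taken from any dataset or spec) of the kind of range check such an encoding makes possible:

# Rough sketch of range validation of the kind DELTA/Lucid-style encodings allow.
# The declared ranges and the specimen record are invented for illustration.

ALLOWED_RANGES = {
    "leaf_length_mm": (5.0, 120.0),   # hypothetical measurement range
    "latitude":       (-90.0, 90.0),
    "longitude":      (-180.0, 180.0),
}

def validate_measurement(name, value):
    """Return a list of problems for one measured value (empty if it passes)."""
    if name not in ALLOWED_RANGES:
        return [f"{name}: no range declared, cannot be validated"]
    lo, hi = ALLOWED_RANGES[name]
    if not (lo <= value <= hi):
        return [f"{name}: {value} outside allowed range {lo}..{hi}"]
    return []

# A specimen record with a deliberate error, in the spirit of the ocean-going specimen:
specimen = {"leaf_length_mm": 38.0, "latitude": 91.5, "longitude": 149.1}
problems = [p for k, v in specimen.items() for p in validate_measurement(k, v)]
print(problems)   # -> ['latitude: 91.5 outside allowed range -90.0..90.0']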
There are three types of validation you suggest here:
- Validate the element strings (and presumably also items, values and
qualifiers) against predefined lists to guard against typographical and spelling errors;
and for consistency... one can be right, but inconsistent, and although humans can handle this, computers stuff it up...
- Validate values against a list of allowable values for the character;
This is the mechanism for addressing the above...
- Validate qualifiers against a list of allowable qualifiers.
It would be nice to put a cap on the number/type of 'qualifiers' allowed, but you can bet that there will always be something not on the list that people will want/need to use, but at this stage we should probably confine ourselves to the need for the list itself, not to the content of the list?
Taking these in reverse order:
as one would naturally want to do.. :)
- As I see it the spec itself would define a set of allowable qualifiers
(such as "rare" (= "rarely"?), "by misinterpretation", "uncertain" etc). I think we could probably agree on a limited set of qualifiers, and stick to that (with allowance for extension). If we do this, then "anchovies, please!" will be out for several reasons.
in theory yes, but in practice, if you open this can of worms, the discussion will probably go on for weeks...
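Purely as an illustration of the mechanism, not of any particular list, a small sketch (Python; the qualifier terms are just the examples mentioned above, not a proposed standard set) of checking qualifiers against a core list with room for declared extensions:

# Sketch: validate qualifiers against a small controlled list, with room for extension.
# The qualifier terms are only the examples mentioned above, not a proposal.

CORE_QUALIFIERS = {"rare", "by misinterpretation", "uncertain"}

def validate_qualifier(qualifier, extensions=frozenset()):
    """True if the qualifier is in the core list or a declared extension."""
    return qualifier in CORE_QUALIFIERS or qualifier in extensions

print(validate_qualifier("uncertain"))            # True
print(validate_qualifier("anchovies, please!"))   # False - out, as noted above
print(validate_qualifier("fide Smith 1901", extensions={"fide Smith 1901"}))  # True, via extension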
- Validation of allowable values is covered in the draft spec. One of the
properties of an element (~"character") is a list of allowable values (~"states"). If such a property is filled, then validation can be done by the program that handles the data, just as in DELTA and Lucid.
It is not just 'allowed', is it? Isn't this what the draft specs are all about?
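As a sketch only - the field names are illustrative, not taken from the draft spec - the handling-program side of this might look something like:

# Sketch: an element ("character") carrying an optional list of allowable values
# ("states"), and a handling program validating a coded value against it.
# Field names are invented for illustration.

leaf_shape = {
    "name": "leaf shape",
    "allowable_values": ["linear", "lanceolate", "ovate", "orbicular"],
}

def validate_value(element, value):
    allowed = element.get("allowable_values")
    if allowed is None:
        return True   # nothing declared, nothing to check (see the first note below)
    return value in allowed

print(validate_value(leaf_shape, "ovate"))      # True
print(validate_value(leaf_shape, "anchovies"))  # False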
Two notes:
- I'd like to allow but not enforce that a valid document have the
allowable_values property filled. By not enforcing it, a simply marked-up natural-language description could be a valid document. This would perhaps mean that the spec could meet the automated markup of legacy descriptions half-way, and I'm keen to do this. Of course, a document without this property specified could not be validated in the way you suggest and hence may not be very useful for some purposes, but this may be a price one is willing to pay for other benefits, and I think we need to keep this open.
It would be pretty difficult to validate a blob of free text... but I can see a case for wanting to include such text in the exercise... so perhaps there should be a deprecated class of elements that are essentially unable to be validated but perhaps of interest in some instances?
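A rough sketch of that half-way position (Python; the document structure and field names are invented, not the draft spec's): elements carrying an allowable-values list get validated, while marked-up free text is merely flagged as unvalidatable rather than rejected:

# Sketch of the half-way position: a document mixing encoded elements (which can be
# validated) with marked-up free text (which cannot), the latter only flagged.

document = [
    {"element": "leaf shape", "value": "ovate",
     "allowable_values": ["linear", "lanceolate", "ovate", "orbicular"]},
    {"element": "habit", "value": "a sprawling, rather untidy shrub"},   # free text, no list
]

def check(doc):
    for entry in doc:
        allowed = entry.get("allowable_values")
        if allowed is None:
            yield (entry["element"], "not validatable (free text)")
        elif entry["value"] in allowed:
            yield (entry["element"], "ok")
        else:
            yield (entry["element"], f"invalid value {entry['value']!r}")

for element, status in check(document):
    print(element, "->", status)

The free-text element still travels with the data; it simply does not take part in the validation.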
- I'm using "allowable values" rather than "states" as this seems to me to
be more general, and subsumes the requirement to distinguish between, for instance, multistate and numeric "characters". A numeric "character", of course, doesn't have "states", but it does have allowable values (integers in the case of an integer numeric, real numbers in the case of a real numeric).
This would cover the use of text rather than various numeric values - there is nothing that says everything has to be represented numerically. It should be possible to use text *and* to have it validated in some way.
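Again only as a sketch with invented field names, one way "allowable values" can cover multistate, integer, real and plain-text cases under the one notion:

# Sketch: "allowable values" as a single, general notion covering multistate,
# numeric (integer or real) and plain-text cases. Field names are illustrative.

def is_allowable(element, value):
    kind = element.get("kind", "text")
    if kind == "multistate":
        return value in element["states"]
    if kind == "integer":
        lo, hi = element["range"]
        return isinstance(value, int) and lo <= value <= hi
    if kind == "real":
        lo, hi = element["range"]
        return isinstance(value, (int, float)) and lo <= value <= hi
    # plain text: validated, if at all, against a word list
    allowed_words = element.get("word_list")
    return allowed_words is None or all(w in allowed_words for w in value.split())

print(is_allowable({"kind": "multistate", "states": ["ovate", "linear"]}, "ovate"))  # True
print(is_allowable({"kind": "integer", "range": (1, 12)}, 5))                        # True
print(is_allowable({"kind": "real", "range": (0.0, 30.0)}, 45.2))                    # False
print(is_allowable({"kind": "text", "word_list": {"glabrous", "hairy"}}, "hairy"))   # True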
- How strong is the requirement for this type of validation? Enforcing this
seems to me to be like requiring that all Word documents carry in their header a dictionary of the English language to allow validation of spellings. It seems to me that providing tools that allow people to check these strings against a predefined list (defined either within the document or in an external resource) would be useful, but not absolutely necessary. A document that is not or cannot be validated in this way would not be useless, and would perhaps be freer.
Documents do not have to carry their validation with them - they could refer to an external source of validation (or a dictionary, in your example above - this is what is implied and expected in Word documents at the moment).
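A sketch of that external-reference idea (Python; the URL, the property name and the JSON format of the external list are all hypothetical placeholders):

# Sketch: a document that points to an external source of allowable values rather
# than embedding them, along the lines of Word referring to an external dictionary.

import json
import urllib.request

element = {
    "name": "leaf shape",
    "allowable_values_ref": "http://example.org/vocabularies/leaf_shape.json",  # hypothetical
}

def load_allowed(ref):
    """Fetch an external list of allowable values (here assumed to be a JSON array)."""
    with urllib.request.urlopen(ref) as response:
        return set(json.load(response))

def validate(element, value):
    allowed = load_allowed(element["allowable_values_ref"])
    return value in allowed

# validate(element, "ovate")  # would fetch the external list and check the value

The document itself stays small, and the vocabulary can be maintained in one place and shared.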
Note that the spec as I see it would allow (but again, not enforce as DELTA and Lucid do) the encoding of descriptions. Thus, a valid document may be d1 as below. This would pre-empt the need for typographic validation, and allow allowable-values validation. But for some reason I don't want to disallow d2 as a valid document either.
This may be a non-threatening approach to gradually introducing validity and rigour into descriptive data and is probably worth exploring some more...
jim