At 09:01 8/09/2000 +1000, Kevin Thiele wrote:
So tell me, does *anyone* out there agree with me?
Well, I don't think we're really all that far apart. As Jim Croft nicely summarized it:
"Is this the consensus we are arriving at? That we strive for structure and comparability but that we accommodate the free text 'blob' because oftentimes it may be the best we can get? Yes? Great!"
So I'm largely in agreement with you, Kevin, with the disagreement perhaps being on where we should aim as our primary target on the structured/unstructured spectrum. (That is, I'd prefer to see highly structured information as the "default", with loosely structured "blobs" being optional, rather than the other way round.)
Yes, I think that this should be the primary focus as well -- both as a programmer and a botanist.
A similar sort of problem must occur commonly with specimen databases. For example, how do you capture a "location"? You might prefer to have lats, longs, and elevations nicely geocoded to the nearest meter, but the label on an older specimen might not say much more than "in a shady little billabong along Reedy Creek, back of Beyond". It's desirable to be able to store that information one way or another, even when it doesn't fit into your preferred structure, but to use structured data when you can get it.
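To make the "structured when you can get it, free text otherwise" idea concrete, here is a minimal Python sketch. The class name, field names, and the coordinates are all invented for illustration; nothing here comes from an actual specimen database.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Locality:
    """A collecting locality; every name here is hypothetical."""
    verbatim: str                        # label text, always kept as written
    latitude: Optional[float] = None     # decimal degrees, if geocoded
    longitude: Optional[float] = None    # decimal degrees, if geocoded
    elevation_m: Optional[float] = None  # metres, if recorded

    def is_geocoded(self) -> bool:
        # Treated as structured only when both coordinates are present.
        return self.latitude is not None and self.longitude is not None


# An old label with nothing but free text.
old_sheet = Locality("in a shady little billabong along Reedy Creek, back of Beyond")

# A modern record; the label text and the numbers are made up for the example.
new_sheet = Locality("example modern label", latitude=-35.0, longitude=149.0,
                     elevation_m=600.0)
```

The verbatim text is mandatory and the structured fields are optional, so the old label is never lost and code can ask `is_geocoded()` before trying to map or compare records.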
And this is something which I am currently *painfully* aware of. I'm a member of the ATBI plants group (the Biodiversity Inventory of the Great Smoky Mountains National Park ... qv. www.discoverlife.org or www.goldsword.com/sfarmer/ATBI for our taxon pages ... they're supposed to get crosslinked *eventually* ... ) and one of the things that we're doing at the moment is databasing all of our legacy data -- all the herbarium sheets that we have.
Unfortunately, for starters, that location data is all going into a big blob *initially*, so that we can analyse it and determine which fields we have and which ones we really want to use. We're also working with the folks who are designing the Final Database that we're going to use, because we have the most legacy data already in databases.
And this might be something to consider -- your legacy label says something like "adjacent to Reedy Creek." and that's all. So, you go back later, and *can* add additional information because that population is still there -- perhaps it's the only one along the creek. How do you add that information to the existing record and identify it as *added* data? It's valid -- some of it more so than the rest, perhaps. Maybe all you can add is the County -- or a gross locality name such as "Greenbrier Section" -- but you want to identify it as added data. Would that be something that this group would want to consider? I know that it goes back to the Validation issues ...
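One possible way to keep added data distinguishable from label data is to tag every value with its source. The Python sketch below is only an illustration under that assumption; the class names, the "label"/"added" source values, and the note text are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FieldValue:
    value: str
    source: str     # "label" for original data, "added" for later annotation
    note: str = ""  # free-text remark on who added it or why


@dataclass
class SpecimenLocality:
    fields: Dict[str, List[FieldValue]] = field(default_factory=dict)

    def record(self, name: str, value: str, source: str, note: str = "") -> None:
        # Values accumulate; nothing from the label is ever overwritten.
        self.fields.setdefault(name, []).append(FieldValue(value, source, note))

    def added(self) -> Dict[str, List[str]]:
        # Only the values that did not come from the original label.
        out: Dict[str, List[str]] = {}
        for name, values in self.fields.items():
            extra = [v.value for v in values if v.source == "added"]
            if extra:
                out[name] = extra
        return out


loc = SpecimenLocality()
loc.record("verbatim", "adjacent to Reedy Creek.", source="label")
loc.record("locality", "Greenbrier Section", source="added", note="site revisited")
```

Here `loc.added()` returns only the "Greenbrier Section" entry, so a validation step could treat label data and later annotations differently.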
As a programmer, I'd like to be able to know whether I'm looking at a rigorously defined object or something "fuzzier". I think one could fairly easily accommodate both. Modifying one of Kevin's examples only slightly, you could have something like:
    <DOCUMENT>
      <DESCRIPTION Taxon_Name="Viola odorata">
        <CHARACTER type="defined" Character_Name="Leaves">
          <STATE State_Name="present"/>
        </CHARACTER>
        <CHARACTER type="arbitrary" Character_Name="scent">
          a marvelous perfume on a perfect spring day
        </CHARACTER>
      </DESCRIPTION>
    </DOCUMENT>
where there is some small distinction made between rigorously defined characters (those which can be validated against some "character list", sensu lato, and for which cross-taxa comparisons are clearly meaningful), and characters defined "on the fly". (Note: I'm not so sure that using attributes is syntactically the best way to do this, but it illustrates the principle.)
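As a rough illustration of how a program might use that distinction, the Python sketch below validates "defined" characters against a character list and passes "arbitrary" ones through as text. The character list contents and the `check` function are assumptions made for the example, and the markup is written with STATE as an empty element so that it parses as well-formed XML.

```python
import xml.etree.ElementTree as ET

# Well-formed variant of the example markup (STATE written as an empty element).
DOC = """
<DOCUMENT>
  <DESCRIPTION Taxon_Name="Viola odorata">
    <CHARACTER type="defined" Character_Name="Leaves">
      <STATE State_Name="present"/>
    </CHARACTER>
    <CHARACTER type="arbitrary" Character_Name="scent">
      a marvelous perfume on a perfect spring day
    </CHARACTER>
  </DESCRIPTION>
</DOCUMENT>
"""

# Hypothetical character list: character name -> allowed state names.
CHARACTER_LIST = {"Leaves": {"present", "absent"}}


def check(xml_text, character_list):
    """Return (validated, free_text, errors) for every CHARACTER element."""
    validated, free_text, errors = [], [], []
    for desc in ET.fromstring(xml_text).iter("DESCRIPTION"):
        taxon = desc.get("Taxon_Name")
        for char in desc.iter("CHARACTER"):
            name = char.get("Character_Name")
            if char.get("type") == "defined":
                allowed = character_list.get(name)
                for state in char.iter("STATE"):
                    value = state.get("State_Name")
                    if allowed is None or value not in allowed:
                        errors.append((taxon, name, value))
                    else:
                        validated.append((taxon, name, value))
            else:
                # Arbitrary characters are kept as text, not validated.
                free_text.append((taxon, name, (char.text or "").strip()))
    return validated, free_text, errors


validated, free_text, errors = check(DOC, CHARACTER_LIST)
```

Only the entries in `validated` would be safe to compare across taxa; the `free_text` entries are carried along but make no such promise.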
Of course, as Mike would point out, such characters are really not terribly different from DELTA "text" characters...
I suspect that how all this eventually gets used will depend substantially on the sorts of editing and markup tools that are developed. I don't really anticipate that many taxonomists will want to go through their existing natural language descriptions and insert <ELEMENT></ELEMENT> tags manually. It's not only tedious, it's too darned easy to make mistakes. And perhaps the editing tools can be made clever enough that maintaining a character list isn't all that difficult. But it's still a good idea to have a system flexible enough to catch material that doesn't fit the mold.
So we do agree, kind of...
Eric Zurcher CSIRO Division of Entomology Canberra, Australia E-mail: ericz@ento.csiro.au
Susan Farmer -----
Susan Farmer sfarmer@goldsword.com Botany Department, University of Tennessee http://www.goldsword.com/sfarmer/Trillium