Gregor wrote:
Do you mean that those GLOPP organism-interaction data that have specimen voucher information cannot be published/referenced in GBIF until I figure out whether a collection has digitized them (most have never been digitized elsewhere!)?
Richard wrote:
Not necessarily. I don't think the issue is whether or not the collection has been digitized, but rather whether GUIDs have already been assigned to the vouchers you want to document in the GLOPP dataset. So, if your question is more along the lines of "do I need to check to see if GUIDs have already been issued to voucher specimens that I cite, before I issue new GUIDs", then my answer -- in the long run, at least -- would be, "well....yes!" That's sort of the fundamental point of the GUIDs, isn't it? But I don't see this as being necessarily burdensome. For example, if your GLOPP dataset included unambiguous pointers to specific voucher specimens (e.g., via InstitutionCode+CollectionCode+CatalogNumber), then it *should* be a relatively quick and straightforward process to find out if GUIDs have already been assigned (if it's not quick & easy, then the GUID service would be horribly inadequate!)
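The "check before you issue" step described above could look roughly like the following. This is only a minimal sketch: the in-memory `registry`, the function names, and the use of UUID-based identifiers are all assumptions made for illustration, not part of any existing GUID service.

```python
import uuid

# Hypothetical in-memory registry standing in for a real GUID lookup service:
# (InstitutionCode, CollectionCode, CatalogNumber) -> GUID
registry = {}

def normalize(institution_code, collection_code, catalog_number):
    """Normalize the triplet so trivial spelling differences do not cause duplicates."""
    return (
        institution_code.strip().upper(),
        collection_code.strip().upper(),
        catalog_number.strip(),
    )

def get_or_mint_guid(institution_code, collection_code, catalog_number):
    """Return (guid, newly_minted). A new GUID is minted only if none is registered."""
    key = normalize(institution_code, collection_code, catalog_number)
    if key in registry:
        return registry[key], False
    guid = "urn:uuid:" + str(uuid.uuid4())  # assumed identifier scheme
    registry[key] = guid
    return guid, True

# The second call finds the GUID issued by the first instead of minting a new one.
first, minted_first = get_or_mint_guid("BBA", "Fungi", "77123")
second, minted_second = get_or_mint_guid(" bba", "fungi ", "77123")
```

Note that this only works when the triplet is actually available for the specimen.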
I think this is the problem. No single record in our database has this information, and to my knowledge most if not all of the physical specimen sheets referred to do not yet have a unique catalogue number. Truly unique catalogue numbers (as opposed to often non-unique batch accession numbers) have often never been physically attached to specimens; this is now done either during digitization or, in recent years, sometimes during loan processing.
However, most of the printed literature cites the collection name (which may be historic, if collections have been merged), the taxon name, plus one of:
- simply the information that it is a type (often expressed by an exclamation mark after the collection acronym, indicating that the type has been studied)
- for non-types, 2-4 elements out of: collector, collection date, collector's field number, and location.
Having collector plus collector's field number is relatively good (although uniqueness is up to the collector; some assign batch numbers for collection events), but again in my experience it is relatively rarely cited. None of the GLOPP records taken from literature cited a field number. The other fields are normally sufficiently unique if you go into the collection and see what is there, but they are a terrible key for matching against a data service - at least the location is usually comparable by automatic string matching.
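To illustrate what "automatic string matching" on locations might mean in practice, here is a small sketch using Python's standard `difflib`. The normalization rules and the function name are assumptions chosen for the example; real locality strings would need far more care (abbreviations, historic names, languages).

```python
import difflib
import re

def location_similarity(a, b):
    """Return a similarity score in [0, 1] for two free-text locality strings."""
    def norm(text):
        # Lower-case, replace punctuation with spaces, collapse whitespace.
        text = re.sub(r"[^\w\s]", " ", text.lower())
        return " ".join(text.split())
    return difflib.SequenceMatcher(None, norm(a), norm(b)).ratio()

# Differences in case, punctuation and spacing are ignored by the normalization.
same = location_similarity("Berlin, Dahlem", "berlin  dahlem")
different = location_similarity("Berlin, Dahlem", "Munich")
```

A score like this can only rank candidate matches for a human to confirm; it does not make the citation a reliable key.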
If, on the other hand, the GLOPP dataset does not provide unambiguous pointers to specific voucher specimens, then the "vouchered" aspect of those specimen citations seems unsupported, in which case your GUIDs would need to be assigned to virtual/unvouchered "specimens" (analogous to observation records), and hence non-duplicate.
As said, if you go into the collection, it is easy to identify them. If you know that a fragment of the collection is completely digitized - as opposed to random digitization which only digitizes specimens recently loaned - you can manually identify it on the computer. I would guess every specimen identification takes at most 5 minutes and involves one or several queries and picking from the result list manually.
I believe most biologists will consider the specimen citations used in print until now to be vouchers - I do not agree with the blanket statement that they are something else. They are usually unambiguous - but not very good for machine processing. Contradictions?
when the collection starts to digitize them, they would have to create for those that have already been published in GLOPP a new version of the GLOPP LSID?
I would hope that if you assigned GUIDs to GLOPP-relevant voucher specimens that belong to a collection that is not-yet digitized, you would do the courtesy of providing the manager of that collection with a listing of the GUIDs you created for the specific relevant specimens. I would further hope that, when that collection is eventually digitized, the manager would have the wherewithal to assign new GUIDs only to those specimens that did not yet have them.
The problem is: what would the collection manager do with such a list? It would be the same problem in reverse: the list of GUIDs is not easy to connect to the physical specimens. Most collections where specimens can be handled attach unique numbers to the specimens during digitization (but perhaps even this is not true in some insect collections, where handling the specimen creates the danger of destroying it). The list delivered by GLOPP would contain information about specimens that have no such number (barcode etc.) yet.
accommodated in any GUID system that is developed. My main point is that such "redundant" GUID issuance should be minimized (i.e., never done intentionally), and quickly/easily identified as such whenever it is discovered.
Certainly not intentionally, but it should be clear that a museum should not start to prohibit the use of laptops when a Ph.D. candidate comes in and "digitizes" some specimens for a taxonomic revision. If the museum system supports it, it is wise to ask to use the museum system, but if the system is too complex and requires long training, it is better to have the monograph than nothing...
So....if/when the situation does come up that (for example) GLOPP assigns GUIDs to vouchers on behalf of a non-digitized collection, and that collection later (inadvertently) re-assigns redundant GUIDs to the same set of specimens, the eventual discovery of this duplication should be accommodated by a mechanism for "retiring" one of the IDs into "objective synonymy" of the other ID, and automated systems should be implemented in the resolver service that "auto-forward" the retired ID to the active ID.
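The "retire and auto-forward" mechanism could be sketched as follows. This is a hypothetical, in-memory illustration only: the class, method names, and example identifiers are invented here, and the response deliberately reports that a forward took place rather than forwarding silently.

```python
class Resolver:
    """Toy resolver with retirement of redundant GUIDs (illustrative only)."""

    def __init__(self):
        self.records = {}   # GUID -> data record
        self.forward = {}   # retired GUID -> GUID that replaces it

    def register(self, guid, record):
        self.records[guid] = record

    def canonical(self, guid):
        """Follow forwarding links until an active GUID is reached."""
        seen = set()
        while guid in self.forward:
            if guid in seen:
                raise ValueError("forwarding cycle at " + guid)
            seen.add(guid)
            guid = self.forward[guid]
        return guid

    def retire(self, retired, active):
        """Mark `retired` as an objective synonym of `active`."""
        if active not in self.records:
            raise KeyError(active)
        if self.canonical(active) == retired:
            raise ValueError("retiring would create a forwarding cycle")
        self.forward[retired] = active

    def resolve(self, guid):
        active = self.canonical(guid)
        return {
            "requested": guid,
            "resolved": active,
            "forwarded": active != guid,
            "record": self.records[active],
        }

resolver = Resolver()
resolver.register("guid:glopp:1", {"source": "GLOPP"})      # hypothetical IDs
resolver.register("guid:museum:9", {"source": "museum"})
resolver.retire("guid:glopp:1", "guid:museum:9")
answer = resolver.resolve("guid:glopp:1")
```

In this sketch the retired record is kept in `records`, so the data originally cited is not lost even though the default resolution forwards to the active ID.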
I think you could rather view this as an optional deduplication layer. Your specification explicitly contradicts at least the LSID specification's requirement that repeated retrieval returns exactly the same data. If an analysis of GLOPP is based on some data - e.g. a misidentified host plant - and cites this, it should be recoverable. Being silently forwarded to different information only causes confusion.
So I prefer a view where GUIDs refer to data objects. I still do not see how you propose to attach them to the physical objects for those researchers working in the collection itself. A secondary service can then know about relationships between multiple data objects referring to the same physical object. This service may be able to find cross-references in the data itself, have smart methods to estimate uniqueness based even on location strings, or may rely on manually created cross-reference tables.
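Such a secondary service could, for instance, keep the data-object GUIDs untouched and only record which of them are asserted to refer to the same physical object. The following is a minimal sketch using a union-find structure; the class and method names are assumptions for illustration, and how the equivalence assertions are obtained (data cross-references, string matching, manual tables) is left open.

```python
class CrossReferenceIndex:
    """Groups data-object GUIDs that are asserted to describe one physical object."""

    def __init__(self):
        self.parent = {}  # GUID -> parent GUID in the union-find forest

    def _find(self, guid):
        self.parent.setdefault(guid, guid)
        while self.parent[guid] != guid:
            # Path halving keeps the trees shallow.
            self.parent[guid] = self.parent[self.parent[guid]]
            guid = self.parent[guid]
        return guid

    def assert_same_object(self, a, b):
        """Record that GUIDs a and b refer to the same physical specimen."""
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

    def same_object(self, a, b):
        return self._find(a) == self._find(b)

    def equivalents(self, guid):
        """All known GUIDs for the same physical object, including guid itself."""
        root = self._find(guid)
        return sorted(g for g in list(self.parent) if self._find(g) == root)

index = CrossReferenceIndex()
index.assert_same_object("guid:glopp:1", "guid:museum:9")     # hypothetical IDs
index.assert_same_object("guid:museum:9", "guid:revision:4")
group = index.equivalents("guid:glopp:1")
```

Each GUID still resolves to exactly the data it always did; the grouping is an extra layer that a client may consult or ignore.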
An important point is that different "deduplication" scenarios exist. For example, in culture collections, many strains are cross-preserved in multiple collections. So "CBS 123.88" may be "equal" to "ATCC 1234132" or "BBA 77123". Ideally we may even know the history: "BBA 77123" > "CBS 123.88" > "ATCC 1234132". However, the chance that any of these strains (which are like "versions") has been mixed up (or that mutations have occurred) is always there. Thus, if I look for duplication of the collection event data, I want to deduplicate. If I want to check a confusing DNA sequence, I may want to know about other derived strains, but I absolutely need to know exactly which strain from which collection was sequenced.
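The two scenarios can be illustrated with a small sketch. It assumes the history above is read as "CBS 123.88 was derived from BBA 77123, and ATCC 1234132 from CBS 123.88"; the `derived_from` table and function names are invented for the example.

```python
# Assumed derivation history: derived strain -> strain it was obtained from.
derived_from = {
    "CBS 123.88": "BBA 77123",
    "ATCC 1234132": "CBS 123.88",
}

def origin(strain):
    """Follow the derivation chain back to the earliest known strain."""
    seen = set()
    while strain in derived_from and strain not in seen:
        seen.add(strain)
        strain = derived_from[strain]
    return strain

def related_strains(strain):
    """All known strains sharing the same origin, including the strain itself."""
    root = origin(strain)
    known = set(derived_from) | set(derived_from.values()) | {strain}
    return sorted(s for s in known if origin(s) == root)

# Scenario 1: deduplicating collection-event data, where all three collapse to one origin.
event_key = origin("ATCC 1234132")

# Scenario 2: checking a DNA sequence, where the exact strain is kept and the
# related strains are only listed alongside it.
sequenced = "CBS 123.88"
others = [s for s in related_strains(sequenced) if s != sequenced]
```

The point of keeping both functions is that the identity of the sequenced strain is never replaced by its origin; the relationship is additional information.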
For the most part, though -- I see these as "growing pains" of a GUID system during its first years of existence. I would predict that two decades from now, if one were to do an analysis of redundant GUIDs, one would find the bulk of those having been issued relatively early on.
I agree, but I probably think it is more relevant than you seem to think. I believe the "early days" will last the next 50 years - the time needed until collections are fully digitized *plus* the time it takes to make publication without citing GUIDs unacceptable.
Gregor
----------------------------------------------------------
Gregor Hagedorn (G.Hagedorn@bba.de)
Institute for Plant Virology, Microbiology, and Biosafety
Federal Research Center for Agriculture and Forestry (BBA)
Königin-Luise-Str. 19           Tel: +49-30-8304-2220
14195 Berlin, Germany           Fax: +49-30-8304-2203