Re: Globally Unique Identifier
Gregor wrote:
Richard wrote:
I think this is the problem. No single record in our database has this information, and to my knowledge, most if not all of the physical specimen sheets referred to do not yet have a unique catalogue number. Physically adding truly unique catalogue numbers to specimens (as opposed to often non-unique batch accession numbers) has often not been done; it is now done either during digitization, or, in recent years, it may also happen during loan processing.
However, most of the printed literature cites collection name (which may be historic, if collections are merged), taxon name, plus one of:
- simply the information that it is a type (often expressed by an exclamation mark after the collection acronym, indicating that a type has been studied)
- for non-types, 2-4 elements out of: collector, collection date, collector's field number and location.
Having collector plus collector's field number is relatively good (although uniqueness is up to the collector; some assign batch numbers for collection events), but again in my experience it is relatively rarely cited. None of the GLOPP records taken from literature cited a field number. The other fields are normally sufficiently unique if you go into the collection and see what is there, but they are a terrible key for any matching against a data service - at least the location is usually comparable by automatic string matching.
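To illustrate what "comparable by automatic string matching" could mean for location strings, here is a minimal sketch using Python's standard difflib module. The function names and the example localities are made up for illustration; they are not taken from GLOPP or any collection database, and a real service would need much more careful normalization.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lower-case, replace commas, periods and hyphens by spaces, collapse whitespace."""
    for char in ",.-":
        text = text.replace(char, " ")
    return " ".join(text.lower().split())


def location_similarity(a, b):
    """Return a similarity score between 0.0 and 1.0 for two location strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


# Hypothetical citations of the same locality, written differently,
# compared with an unrelated locality.
same = location_similarity("Berlin, Dahlem", "Berlin-Dahlem")
other = location_similarity("Berlin, Dahlem", "Kew, London")
print(same > other)
```

A score like this can only rank candidates for manual picking; it does not make the location a reliable key by itself.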
As said, if you go into the collection, it is easy to identify them. If you know that a fragment of the collection is completely digitized - as opposed to random digitization which only digitizes specimens recently loaned - you can manually identify it on the computer. I would guess every specimen identification takes at most 5 minutes and involves one or several queries and picking from the result list manually.
I believe most biologists will consider the specimen citations used in print until now a voucher - I do not agree with the blanket statement that this is something else. It is usually unambiguous - but not very good for machine processing. Contradictions?
The problem is: what would the collection manager do with such a list? It would be the problem in reverse: the list of GUIDs is not easy to connect to the physical specimens. Most collections where specimens can be handled attach unique numbers to the specimens during digitization (but perhaps even this is not true in some insect collections, where handling the specimen creates the danger of destroying it). The list delivered by GLOPP would contain information about specimens that have no such barcode/etc. number yet.
Certainly not intentionally, but it should be clear that a museum should not start to prohibit the use of laptops when a Ph.D. candidate comes in and "digitizes" some specimens for a taxonomic revision. If the museum system supports it, it is wise to ask them to use the museum system, but if the system is too complex and requires long training, rather have the monograph than nothing...
I think you could rather view this as an optional deduplication layer. Your specification explicitly contradicts at least the LSID specification, which requires that repeated retrieval return exactly the same data. If an analysis of GLOPP is based on some data - e.g. a misidentified host plant - and cites this, it should be recoverable. Being silently forwarded to different information only causes confusion.
So I prefer a view where GUIDs refer to data objects. I still do not see how you propose to attach them to the physical objects for those researchers working in the collection itself. A secondary service can then know about relationships of multiple data objects referring to the same physical object. This service may be able to find cross-references in the data itself, have smart methods to estimate uniqueness based even on location strings, or may have manually created cross-reference tables.
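As a rough sketch of such a secondary service: each GUID keeps pointing at its own data object, and a separate table only records which data objects are asserted to describe the same physical object. The class name, method names and GUID strings below are hypothetical, chosen for illustration only; the grouping uses a plain union-find structure.

```python
class CrossReferenceService:
    """Groups data-object GUIDs asserted to describe the same physical object.

    The GUIDs themselves are never merged or redirected; the service
    only answers "which other data objects are related to this one?".
    """

    def __init__(self):
        self._parent = {}  # GUID -> representative GUID of its group

    def _find(self, guid):
        self._parent.setdefault(guid, guid)
        root = guid
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every visited GUID directly at the root.
        while self._parent[guid] != root:
            self._parent[guid], guid = root, self._parent[guid]
        return root

    def assert_same_object(self, guid_a, guid_b):
        """Record that two data objects refer to the same physical object."""
        root_a, root_b = self._find(guid_a), self._find(guid_b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def related(self, guid):
        """Return all GUIDs in the same group, including the one asked for."""
        root = self._find(guid)
        return sorted(g for g in list(self._parent) if self._find(g) == root)


# Hypothetical GUIDs: a literature record, a museum record, a revision record.
service = CrossReferenceService()
service.assert_same_object("guid:glopp:1", "guid:museum:A")
service.assert_same_object("guid:museum:A", "guid:revision:x")
print(service.related("guid:glopp:1"))
```

The assertions could come from matches found in the data itself or from a manually maintained table; either way a cited GUID still resolves to exactly the data that was cited.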
An important point is that different "deduplication" scenarios exist. For example, in culture collections, many strains are cross-preserved in multiple collections. So "CBS 123.88" may be "equal" to "ATCC 1234132" or "BBA 77123". Ideally we may even know the history: "BBA 77123" > "CBS 123.88" > "ATCC 1234132". However, the chance that any of these strains (which are like "versions") has been mixed up (or that mutations have occurred) is always there. Thus, if I look for duplication of the collection event data, I want to deduplicate. If I want to check a confusing DNA sequence, I may want to know about other derived strains, but I absolutely need to know exactly which strain from which collection was sequenced.
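The two scenarios can be sketched with the strain history given above. The table and function names are illustrative assumptions, not an existing culture-collection API, and the sketch assumes the history contains no cycles.

```python
# Derivation history from the example: "BBA 77123" > "CBS 123.88" > "ATCC 1234132".
# Maps each strain to the strain it was derived from.
DERIVED_FROM = {
    "CBS 123.88": "BBA 77123",
    "ATCC 1234132": "CBS 123.88",
}


def original_strain(strain):
    """Follow the history back to the first deposit (assumes no cycles)."""
    while strain in DERIVED_FROM:
        strain = DERIVED_FROM[strain]
    return strain


def lineage_members(strain):
    """Return every known strain that shares the same original deposit."""
    root = original_strain(strain)
    candidates = {strain} | set(DERIVED_FROM) | set(DERIVED_FROM.values())
    return sorted(s for s in candidates if original_strain(s) == root)


# Scenario 1: deduplicating collection event data - all three collapse to one origin.
print(original_strain("ATCC 1234132"))

# Scenario 2: checking a confusing DNA sequence - keep the exact strain that was
# sequenced, and list the other derived strains only as additional context.
sequenced = "CBS 123.88"
print(sequenced, lineage_members(sequenced))
```

Note that the sequenced strain identifier is never replaced by its origin; the lineage is only looked up alongside it.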
I agree, but I probably think it is more relevant than you seem to think. I believe the "early days" will last the next 50 years - the time needed until collections are fully digitized *plus* the time it takes to make publishing without citing GUIDs unacceptable.
Gregor
----------------------------------------------------------
Gregor Hagedorn (G.Hagedorn@bba.de)
Institute for Plant Virology, Microbiology, and Biosafety
Federal Research Center for Agriculture and Forestry (BBA)
Königin-Luise-Str. 19           Tel: +49-30-8304-2220
14195 Berlin, Germany           Fax: +49-30-8304-2203