I think this is the problem. No single record in our database has this information, and to my knowledge, most if not all of the physical specimen sheets referred to do not yet have a unique catalogue number. Physically adding truly unique catalogue numbers to specimens (as opposed to often non-unique batch accession numbers) has often not been done in the past; it now happens either during digitization or, in recent years, during loan processing.
I guess my point was, if the GLOPP database does not contain enough information to uniquely identify a given voucher specimen (whether it is a catalog number, or some combination of other data), then the GLOPP data record can't really be considered "vouchered", can it? Maybe I am not thinking hard enough about this, but it seems to me that without the ability to re-locate a cited specimen with a fair degree of certainty, then the record would need to be considered "unvouchered". So, if enough information is available to pinpoint the specimen, then it *should* (and I emphasize this word only because I know there will undoubtedly be exceptions) be possible to associate the GUID with the voucher specimen at a later time. If there is not enough information to re-locate the specimen, then it seems to me that the connection with the specimen is broken, and the record becomes a stand-alone "unvouchered" biological instance.
However, most of the printed literature cites collection name (which may be historic, if collections are merged), taxon name, plus one of:
- simply the information that it is a type (often expressed by an exclamation mark after the collection acronym, indicating that a type has been studied)
- for non-types, 2-4 elements out of: collector, collection date, collector's field number and location. Having collector plus collector's field number is relatively good (although uniqueness is up to the collector; some assign batch numbers for collection events), but again in my experience it is relatively rarely cited. None of the GLOPP records taken from literature cited a field number. The other fields are normally sufficiently unique if you go into the collection and see what is there, but they are a terrible key for any matching against a data service - at least the location is usually comparable by automatic string matching.
O.K., I think I understand now, that a human could likely re-locate the specimen based on some assemblage of data, but that this assemblage of data would certainly require a human to establish the connection.
I guess my feeling is this: We should always do our best to avoid the assignment of duplicate GUIDs to a single "biological instance" (of the physical kind), but we should also acknowledge that inadvertent duplication will be inevitable (as it may well be in the scenario you describe), and therefore build in a system for accommodating "objective GUID synonymies".
As said, if you go into the collection, it is easy to identify them. If you know that a fragment of the collection is completely digitized - as opposed to random digitization, which only digitizes specimens recently loaned - you can manually identify it on the computer. I would guess every specimen identification takes at most 5 minutes and involves one or several queries and picking from the result list manually.
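As a rough illustration of the query-and-pick workflow described above, here is a minimal Python sketch that ranks digitized records against a printed citation and leaves the final choice to a person. The `SpecimenRecord` fields, the equal weighting of the four fields, and the `rank_candidates` helper are all assumptions made for illustration; nothing here reflects the actual GLOPP schema or any existing service.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class SpecimenRecord:
    catalogue_id: str  # hypothetical identifier in the digitized collection
    taxon: str
    collector: str
    date: str          # free text as transcribed from the label or citation
    locality: str


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]; a missing value counts as no evidence."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def rank_candidates(citation: SpecimenRecord,
                    records: list[SpecimenRecord],
                    top: int = 5) -> list[tuple[float, SpecimenRecord]]:
    """Return up to `top` (score, record) pairs, best first, for a human to review."""
    scored = []
    for rec in records:
        # Equal weights are an arbitrary assumption; a real service would tune them.
        score = (
            similarity(citation.taxon, rec.taxon)
            + similarity(citation.collector, rec.collector)
            + similarity(citation.date, rec.date)
            + similarity(citation.locality, rec.locality)
        ) / 4
        scored.append((score, rec))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top]
```

The function deliberately returns a short list rather than a single answer, since the citations discussed here are usually unambiguous to a person but not to a string comparison.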
I imagine it will depend, on a case-by-case basis, whether the cost of manual match-up of this sort at the time a GUID is to be assigned exceeds the cost of risking duplicate GUID assignment.
I believe most biologists will consider the specimen citations used in print until now a voucher - I do not agree with the blanket statement that this is something else. It is usually unambiguous - but not very good for machine processing. Contradictions?
No contradictions from me. But the concept of "voucher" to me becomes more ambiguous as time goes by. I have thought a lot about how one might pool "evidence" from Museum collections, published data, in-situ images (still-photos and video), unpublished sighting reports, etc., and there is no easy answer that I have found. My conceptual approach has been to reduce all of these to reported instances of a particular organism at a particular place and time. This is what I have meant by "biological instance". Where they differ is merely in how well-documented they are (unpublished word-of-mouth, published word-of-mouth, film or electronic image, tissue sample, preserved organism, etc.), and to what extent they can be re-examined by later researchers (in my mind, the distinction between "vouchered" and "unvouchered"). Some uncollected sight records (e.g., rare plants in Hawaii) have a high degree of re-examination potential, whereas some specimens collected and preserved in a Museum, but misplaced, lost, or deteriorated, have a very low degree of re-examination potential. In-situ images, though they may have limited documentation (external appearance only, from one angle only in the case of still photos, limited in resolution at which the image was captured), have a very high degree of re-examination potential.
The point (to me, at least) is that in all cases there was some physical "biological instance", and it is that entity to which I think the GUID should be assigned. If a published report cites a specimen preserved in a Museum, then the same GUID should be used for both the Museum specimen, and the published citation thereof. If a published report cites an organism that was never collected or preserved in a Museum, and the publication itself constitutes the only record of that biological instance, then a new GUID should be assigned to it. When a record of the latter sort is later discovered to be in reference to a specimen that exists in a museum that already has its own GUID, then those two GUIDs should be branded as "synonyms".
My main point is that such "redundant" GUID issuance should be minimized (i.e., never done intentionally), and quickly/easily identified as such whenever it is discovered.
Certainly not intentionally, but it should be clear that a museum should not start to prohibit the use of laptops when a Ph.D. candidate comes in and "digitizes" some specimens for a taxonomic revision. If the museum system supports it, it is wise to ask to use the museum system, but if the system is too complex and requires long training, rather have the monograph than nothing...
Agreed. But I would think the student should provide a listing of all GUIDs assigned to specimens within a collection, including as much information as is necessary to uniquely identify each GUID-assigned specimen. Whether the collection manager ever uses that information or not is a different question, but I think a "culture of respect" for avoiding duplicate GUID assignment should be integral to the whole GUID process.
So....if/when the situation does come up that (for example) GLOPP assigns GUIDs to vouchers on behalf of a non-digitized collection, and that collection later (inadvertently) re-assigns redundant GUIDs to the same set of specimens, the eventual discovery of this duplication should be accommodated by a mechanism for "retiring" one of the IDs into "objective synonymy" of the other ID, and automated systems should be implemented in the resolver service that "auto-forward" the retired ID to the active ID.
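To make the "retire and auto-forward" idea concrete, here is a minimal in-memory Python sketch. The `GuidResolver` class, its method names, and the example identifiers are hypothetical; this is not the LSID resolution protocol or any existing resolver, just one way the forwarding could behave.

```python
class GuidResolver:
    """Toy resolver: retired GUIDs forward to the GUID they were merged into."""

    def __init__(self):
        self._records = {}  # active GUID -> data for that biological instance
        self._forward = {}  # retired GUID -> GUID it was declared a synonym of

    def register(self, guid, data):
        if guid in self._records or guid in self._forward:
            raise ValueError(f"{guid} is already in use")
        self._records[guid] = data

    def canonical(self, guid):
        """Follow forwarding links until an active GUID is reached."""
        while guid in self._forward:
            guid = self._forward[guid]
        if guid not in self._records:
            raise KeyError(guid)
        return guid

    def retire(self, retired, active):
        """Declare `retired` an objective synonym of `active`."""
        active = self.canonical(active)
        retired = self.canonical(retired)
        if retired == active:
            return  # already resolve to the same instance
        # Simplification: the retired record's data is dropped, not merged.
        self._records.pop(retired)
        self._forward[retired] = active

    def resolve(self, guid):
        """Return (active GUID, data), forwarding transparently if needed."""
        active = self.canonical(guid)
        return active, self._records[active]
```

For example, after `retire("guid:glopp:1", "guid:museum:9")`, a call to `resolve("guid:glopp:1")` returns the museum GUID and its data, so citations that used the retired ID keep working. Because a retired ID only ever points at an ID that was active at the time, chains can form but cycles cannot.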
I think you could rather view this as an optional deduplication layer.
I'm not sure I understand exactly what you mean by "optional" (i.e., at whose option), but I think it should be a fundamental component to any resolution service.
Your specification explicitly contradicts at least the LSID specification's requirement that repeated retrieval return exactly the same data.
Yes, I know -- which is why I'm feeling less cozy about LSIDs. I think the crux of the issue centers on the question that Donald asked in one of his PowerPoint slides: Are these numbers assigned to the physical or "conceptual" (=non-electronic "virtual") objects, or are they assigned to the electronic/digital representations thereof? My feeling, from the point of view of a taxonomist who develops databases for natural history collections, is that the ultimate goal (i.e., seamless transmission and exchange of biodiversity-relevant data) will be better served if:
1) The IDs are assigned to the non-electronic (physical or conceptual) objects;
2) "Static" data associated with those objects be allowed to be changed (as errors are discovered and corrected) without altering the GUID (and that data history logs for these static data attributes be thought of as a secondary function of the data management, not affecting GUIDs);
3) "Dynamic" data associated with those objects (e.g., multiple taxonomic identifications of a specimen) should be handled "robustly" (i.e., not as "versions" of the complete set of data associated with a particular specimen).
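One way to picture the three points above is a record whose GUID never changes while its attributes do. The following Python sketch is only an assumption about how such a record might be shaped; the class names, fields, and methods are hypothetical and are not drawn from any existing collection database.

```python
from dataclasses import dataclass, field


@dataclass
class Identification:
    taxon: str
    identified_by: str
    year: int


@dataclass
class SpecimenObject:
    guid: str                                            # names the physical object
    static: dict = field(default_factory=dict)           # e.g. collector, locality
    history: list = field(default_factory=list)          # secondary log of corrections
    identifications: list = field(default_factory=list)  # "dynamic" data

    def correct(self, attribute, new_value):
        """Fix a static attribute in place; the GUID is untouched."""
        old_value = self.static.get(attribute)
        self.history.append((attribute, old_value, new_value))
        self.static[attribute] = new_value

    def add_identification(self, identification):
        """Append a new determination instead of versioning the whole record."""
        self.identifications.append(identification)
```

In this shape, correcting a misspelled locality or adding a second determination leaves `guid` as it was; the correction log and the list of identifications grow alongside it.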
So I prefer a view where GUIDs refer to data objects. I still do not see how you propose to attach them to the physical objects for those researchers working in the collection itself.
I'm not sure I understand the question. I guess I would answer with another question: How does a Social Security Number (SSN) for a U.S. Citizen (NINO in the UK, SIN in Canada, INSEE in France, TFN in Australia, etc. -- see http://encyclopedia.thefreedictionary.com/Social%20Security%20number) get attached to an individual person? I don't think anyone would think of a SSN as an identifier for a data object -- it is a unique identifier for the physical person. I believe that commonly used physical objects in biology (specimens, taxon names/concepts, references, agents, character definitions, etc.) should have equivalents of SSNs assigned to them, for the same reasons that U.S. Citizens have SSNs (i.e., to provide an unambiguously unique identifier useful for managing information associated with a physical person).
Obviously, SSNs are not the perfect model for BioGUIDs (for a number of reasons), but the point is that they represent an ID attached to a physical entity.
A secondary service can then know about relationships of multiple data objects referring to the same physical object. This service may be able to find cross-references in the data itself, have smart methods to estimate uniqueness based even on location strings, or may have manually created cross-reference tables.
This sounds to me like an unnecessary layer of complexity. But I remain open-minded on this issue.
An important point is that different "deduplication" scenarios exist. For example, in culture collections, many strains are cross-preserved in multiple collections. So "CBS 123.88" may be "equal" to "ATCC 1234132" or "BBA 77123". Ideally we may even know the history: "BBA 77123" > "CBS 123.88" > "ATCC 1234132". However, the chance that any of these strains (which are like "versions") has been mixed up (or mutations occurred) is always there. Thus, if I look for duplication of the collection event data, I want to deduplicate. If I want to check a confusing DNA sequence, I may want to know about other derived strains, but I absolutely need to know exactly which strain from which collection was sequenced.
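The two scenarios can be sketched in Python as two different equality tests over the same derivation history. The reading of ">" as "was passed on to", the `DERIVED_FROM` table, and the function names are assumptions for illustration only.

```python
# Strain -> the strain it was derived from (None marks the original deposit).
# Assumes "BBA 77123" > "CBS 123.88" > "ATCC 1234132" means each was passed on
# to the next collection.
DERIVED_FROM = {
    "BBA 77123": None,
    "CBS 123.88": "BBA 77123",
    "ATCC 1234132": "CBS 123.88",
}


def origin(strain):
    """Walk the history back to the original deposit."""
    while DERIVED_FROM[strain] is not None:
        strain = DERIVED_FROM[strain]
    return strain


def same_collection_event(a, b):
    """Deduplication view: both strains trace back to one isolate."""
    return origin(a) == origin(b)


def same_strain(a, b):
    """Sequencing view: only the exact accession counts as the same thing."""
    return a == b


def related_strains(strain):
    """Other accessions sharing this strain's origin, for cross-checking."""
    root = origin(strain)
    return sorted(s for s in DERIVED_FROM if s != strain and origin(s) == root)
```

Here `same_collection_event("CBS 123.88", "ATCC 1234132")` is true while `same_strain` on the same pair is false, which is the distinction between deduplicating collection event data and tracing a confusing DNA sequence to one specific accession.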
I'll have to think about this some more, but it comes back to the question of what "unit" a GUID is assigned to. This is not so much a problem for taxonomic objects; a little bit of a problem for Reference objects, and a potentially HUGE problem for specimen/"biological instance" objects. Again this raises the question: How important is it to use the same GUID scheme for all of these different classes of bio-objects?
For the most part, though -- I see these as "growing pains" of a GUID system during its first years of existence. I would predict that two decades from now, if one were to do an analysis of redundant GUIDs, one would find the bulk of those having been issued relatively early on.
I agree, but I probably think it is more relevant than you seem to think. I believe the "early days" will last the next 50 years - the time needed until collections are fully digitized *plus* the time it takes to make publishing without citing GUIDs unacceptable.
I imagine that the vast majority of "publications" in science 50 years from now will be electronic. But I see your point -- even if it's only 20 years as I suggested, that's still a lot of headaches to deal with.
Aloha, Rich
Richard L. Pyle, PhD
Natural Sciences Database Coordinator, Bishop Museum
1525 Bernice St., Honolulu, HI 96817
Ph: (808)848-4115, Fax: (808)847-8252
email: deepreef@bishopmuseum.org
http://www.bishopmuseum.org/bishop/HBS/pylerichard.html