In my view, we would assign only ONE GUID, which represents the actual, physical specimen. That this one specimen has multiple catalog numbers assigned to it is simply additional information associated with that one specimen (in the same way that many specimens may have more than one taxonomic name applied to them, by different investigators at different times).
I agree on the multiple catalogue numbers, but I believe multiple database records of specimens will still exist.
Yes, but I guess a central question, which Donald included in his PowerPoint file, is whether the GUID is assigned to the physical object, or to the electronic representation (data record). Most of my comments have been from the standpoint that the GUID applies to the physical specimen. If it is the electronic records that we wish to uniquely identify, then it seems to me that the <objectID> component of an LSID should apply to the physical specimen, and multiple database records should be uniquely identified using the <version> component.
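To make the <objectID>/<version> split concrete, here is a minimal Python sketch. It assumes the LSID layout urn:lsid:<authority>:<namespace>:<objectID>[:<version>]; the authority "example.org", the namespace "specimens", and the identifiers are invented for illustration, and parse_lsid / same_specimen are hypothetical helpers, not part of any existing library.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str           # would identify the physical specimen
    version: Optional[str]   # would identify one electronic record of it


def parse_lsid(text: str) -> LSID:
    """Split an LSID string into its components (assumed layout, see above)."""
    parts = text.split(":")
    if len(parts) not in (5, 6) or parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    version = parts[5] if len(parts) == 6 else None
    return LSID(parts[2], parts[3], parts[4], version)


def same_specimen(a: str, b: str) -> bool:
    """True when two LSIDs differ at most in their version component."""
    return parse_lsid(a)[:3] == parse_lsid(b)[:3]
```

Under this reading, urn:lsid:example.org:specimens:12345:recA and urn:lsid:example.org:specimens:12345:recB would be two database records of one physical specimen.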
I myself am not involved in collection curation, but in evaluating the information therein (specifically, we work on organism interactions). We have a database of now close to 200 000 fungal host-parasite records. Some express an opinion without further citation; others express an opinion backed up by a voucher specimen that contains all the information that would be found in collection databases. GBIF seems to have no place for such data so far, and it would be difficult to provide, since we usually have none of "[InstitutionCode]+[CollectionCode]+[CatalogNumber]" (which is different from the problem of having duplicate CatalogNumbers that you discuss).
This is part of the reason why I think that the "[InstitutionCode]+[CollectionCode]+[CatalogNumber]" solution is only a temporary one. This raises another question: would the GUIDs be limited to just vouchered specimens? Or, would they also be assigned to unvouchered specimens (e.g., field observations of specific individual organisms that were not vouchered in a museum collection)? Or, would unvouchered "biological instances" represent a different class of GUIDs? The simple answer is to make observations a different class of object. But in my data management world, I need to deal with everything from sight records with little more data than a taxonomic determination; to specific observations involving specific (uncollected) individual organisms (sometimes with as much associated data as any vouchered specimen); to collected organisms that were brought into a lab, examined by experts, but not added to a permanent collection; to stereotypical museum voucher specimens; to specimens that were added to the permanent voucher collection, but later lost or destroyed. In my mind, there is little fundamental difference between the two endpoints of this spectrum, and I have thus decided to treat all such entities as the same class of object ("Biological Instances") -- which also spans the [population-->multiple specimen-->single specimen-->specimen part] continuum. It's not a perfectly clean solution; but no solution is perfectly clean, and to my mind, this is the optimal solution from a data management perspective.
Still, what kind of data is that? What kind of data is created if a Ph.D. student digitizes the specimen records used for a taxonomic revision in a database that is specific to that revision?
I would say that the student would (ideally) reference existing specimen GUIDs in his/her specific database -- not create new GUIDs (unless referencing physical specimens -- vouchered or not -- that have not yet received GUIDs, in which case new GUIDs would be assigned using the appropriate procedure, whatever that ends up being).
Bottom line: The physical specimen does exist, but in the foreseeable future all GUIDs will be attached to data, not to the specimen. The only exception is where it is indeed possible to attach the GUID to the specimen; then this could be cited.
Good point! My concern, though, would be that we might end up in the same state of chaos that we are in now, where multiple electronic records of a particular physical specimen are not rigorously cross-linked, and thus run the risk of being counted as multiple/separate physical instances. Identifying the physical object with the <objectID> component of an LSID, and different electronic representations of the associated data as different <versions> of the same <objectID>, could represent a way to deal with both "realities".
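As a rough illustration of how this would guard against double counting, the Python sketch below groups electronic records by everything up to and including the <objectID>, so that several records of one specimen are counted once. The LSIDs, the authority "example.org", and the version labels are made up, and group_by_specimen is a hypothetical helper.

```python
from collections import defaultdict


def group_by_specimen(record_lsids):
    """Map each physical-specimen identifier to the versions (records) seen for it."""
    groups = defaultdict(list)
    for lsid in record_lsids:
        # Assumed layout: urn, lsid, authority, namespace, objectID[, version]
        parts = lsid.split(":")
        specimen_key = ":".join(parts[:5])
        groups[specimen_key].append(parts[5] if len(parts) > 5 else None)
    return dict(groups)


# Three electronic records (invented), two of which describe the same specimen.
records = [
    "urn:lsid:example.org:specimens:1001:museum-db",
    "urn:lsid:example.org:specimens:1001:revision-db",
    "urn:lsid:example.org:specimens:1002:museum-db",
]
groups = group_by_specimen(records)
print(len(records), "records,", len(groups), "physical specimens")
# prints: 3 records, 2 physical specimens
```

This only works, of course, if every record of a given specimen actually carries the same <objectID>, which is the cross-linking discipline being asked for.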
But then we have descriptions, and for description concepts (characters, structures, states, modifiers, etc.) we also need GUIDs to allow federating descriptions that use a common terminology. We have discussed this in SDD on and off (specifically we are proposing to prefer semantically neutral identifiers, and propose a simple optional mechanism called debugid/debugref to enrich data with calculated, semantically meaningful identifiers to facilitate debugging) - but at the moment SDD really waits for a more general and common solution.
Are you talking about GUIDs for character definitions, or GUIDs for instances of character definitions applied to individual specimens/taxa, or both? I see character definitions as analogous to taxon concepts; names used to represent those character definitions as analogous to taxon names; and the application of those character definitions to specific specimens (or taxa) as analogous to taxonomic determinations (i.e., the application of taxon names/concepts to specimens).
So this discussion is highly relevant to descriptions as well. My main point is: what we are really interested in at GBIF, in the end, is knowledge, not physical possession. If we limit our thinking of the GBIF system to the very special case of institutionalized collections (as both DwC and ABCD in my opinion currently do), or names governed by a nomenclatural code, I believe we may later have to rearchitect.
Agreed!! (Especially considering the forum on which this discussion is taking place.) This is part of the reason why I keep mixing taxon names/concepts examples among specimen examples. In the back of my mind, I was thinking SDD examples as well, though I think I failed to express that adequately.
BTW, partly because of these differences between institutional collection customs and knowledge publication customs, I vote against a strongly central system. LSID authority (lsid.gbif.net) and namespace (with no or low semantics) should be managed by GBIF, but not the ids/versions. GBIF may provide a service to generate them, but should accept any locally generated ID and trust the generator to manage uniqueness.
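A minimal sketch of what such local generation could look like, in Python. The authority lsid.gbif.net is the one named above; the namespace "ns1" and the function mint_lsid are hypothetical, and using a random UUID as the object ID is only one possible choice. Uniqueness here is probabilistic and rests entirely on the generator, which is exactly the trust being proposed.

```python
import uuid

AUTHORITY = "lsid.gbif.net"  # centrally managed authority, as named in the discussion
NAMESPACE = "ns1"            # hypothetical namespace with no semantics


def mint_lsid(version=None):
    """Create an LSID locally, with a semantically neutral random object ID."""
    object_id = uuid.uuid4().hex  # 32 hex characters, contains no colons
    lsid = f"urn:lsid:{AUTHORITY}:{NAMESPACE}:{object_id}"
    return f"{lsid}:{version}" if version else lsid
```

No central service is contacted; any data provider could run this and hand the resulting identifier to GBIF.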
I generally agree. My leanings towards centralization revolved around GUID generation only (not application of GUID to associated data -- which is what I would define as the "management" part), and *perhaps* GUID resolution (but more for "unowned" sorts of data like taxonomy, and only in the paradigm that each GUID could be resolved by one and only one domain server, and that the GUID would cease to have meaning if/when the branded domain server ceased to exist). Whether or not GUID generation happens centrally or in a distributed way seems to me to depend on what GUID scheme is ultimately adopted.
