Richard Pyle writes:
"a central question, which Donald included in his PowerPoint file, is whether the GUID is assigned to the physical object, or to the electronic representation (data record). Most of my comments have been from the standpoint that the GUID applies to the physical specimen. If it is the electronic records that we wish to uniquely identify, then it seems to me that the <objectID> component of an LSID should apply to the physical specimen, and multiple database records should be uniquely identified using the <version> component."
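To make the quoted proposal concrete, here is a minimal sketch of how an LSID splits into its components, assuming the usual layout urn:lsid:<authority>:<namespace>:<objectID>[:<version>]. The example.org identifiers and the helper names parse_lsid and same_physical_object are hypothetical and only illustrate the idea that two records differing only in <version> would denote the same physical specimen.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    version: Optional[str]  # optional trailing component


def parse_lsid(text: str) -> LSID:
    """Split an LSID string into its colon-separated components."""
    parts = text.split(":")
    if (len(parts) not in (5, 6)
            or parts[0].lower() != "urn"
            or parts[1].lower() != "lsid"):
        raise ValueError(f"not an LSID: {text!r}")
    authority, namespace, object_id = parts[2:5]
    version = parts[5] if len(parts) == 6 else None
    return LSID(authority, namespace, object_id, version)


def same_physical_object(a: str, b: str) -> bool:
    """Under the quoted proposal (an assumption, not part of any standard):
    two LSIDs name the same specimen if everything but <version> matches."""
    return parse_lsid(a)[:3] == parse_lsid(b)[:3]
```

Under this reading, urn:lsid:example.org:specimens:12345:1 and urn:lsid:example.org:specimens:12345:2 would be two data records for one specimen.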
I think this is not practical. Do you mean that GLOPP organism-interaction data carrying specimen voucher information cannot be published or referenced in GBIF until I find out whether a collection has digitized those specimens (most have never been digitized anywhere)? Or, if I find they have not been digitized, that once the collection starts to digitize them it would have to create a new version of the GLOPP LSID for those already published in GLOPP?
The same applies to taxonomic data - most revisions contain voucher data.
---
On a closely related point, Chuck Miller wrote:
"Duplicate specimens occur because the collector collected multiple samples of the same organism and sent them to other institutions. The duplicate specimens themselves probably have different CatalogNumbers in each institution. The specimen database records reflect the actual specimens. Therefore, the specimen database records when combined from multiple institutions have duplicates of the same organism. But, only by looking at either the Collector and Collector's number or date/location can the duplication be recognized."
I believe there are three situations:
1. A collector has created what is classically called a specimen duplicate, i.e. material has been collected as multiple samples, or a single collection has been split (including propagation of living material/cultures). These may be held by a single collection or by different collections. In most cases it will be very important to know about their close relationship, but also to identify them separately (especially in dead cryptogams or any living culture there is no guarantee that the collector's assumption of conspecificity is actually true!).
2. A single specimen has multiple conventional accession numbers in a single collection, because over time the collection used different numbering schemes (stamping, then barcoding, perhaps RFID in the future). This is what I thought was meant by "duplicate CatalogNumbers", and I think it is the only case where full identity exists, so a single GUID with a repeated AccessionNumber field is best.
3. A single specimen has been digitized multiple times, especially not only in a curatorial database but also in analytical datasets created for data evaluation (host-parasite studies, taxonomic revisions, etc.). Here the physical specimen is identical, but GUIDs need to be available to identify the distinct data.
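The three situations above can be sketched as a small data model. This is only an illustration under stated assumptions: the class names, the gathering key (standing in for collector plus collector's number), and all field names are hypothetical and do not come from any existing schema.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Specimen:
    guid: str        # one GUID per physical specimen
    gathering: str   # hypothetical key: collector + collector's number
    # Situation 2: one specimen, several numbers, still a single GUID.
    accession_numbers: List[str] = field(default_factory=list)


@dataclass
class DataRecord:
    guid: str           # Situation 3: every digitization gets its own GUID...
    specimen_guid: str  # ...but refers to the same physical specimen
    source: str         # e.g. curatorial database or analytical dataset


def duplicate_sets(specimens: List[Specimen]) -> Dict[str, List[str]]:
    """Situation 1: related but separately identified specimens,
    grouped by the gathering they share."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for s in specimens:
        groups[s.gathering].append(s.guid)
    return {g: ids for g, ids in groups.items() if len(ids) > 1}


def records_for(specimen_guid: str, records: List[DataRecord]) -> List[str]:
    """Situation 3: all distinct data records about one physical specimen."""
    return [r.guid for r in records if r.specimen_guid == specimen_guid]
```

The point of the sketch is that only situation 2 collapses into a single identifier; situations 1 and 3 keep separate GUIDs and express the relationship as data.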
---
Richard Pyle writes:
The simple answer is to make observations a different class of object. But in my data management world, I need to deal with everything from sight records with little more data than a taxonomic determination; to specific observations involving specific (uncollected) individual organisms (sometimes with as much associated data as any vouchered specimen); to collected organisms that were brought into a lab, examined by experts, but not added to a permanent collection; to stereotypical museum voucher specimens; to specimens that were added to the permanent voucher collection, but later lost or destroyed. In my mind, there is little fundamental difference between the two endpoints of this spectrum, and I have thus decided to treat all such entities as the same class of object ("Biological Instances") -- which also spans the [population-->multiple specimen-->single specimen-->specimen part] continuum. It's not a perfectly clean solution; but no solution is perfectly clean, and to my mind, this is the optimal solution from a data management perspective.
I think I agree. But regarding the current discussion, I think the most important point is NOT to burden LSIDs with any semantics that would allow one to infer which decision a data provider has taken. The namespace part of an LSID should NOT be interpreted as a standardized class of objects. The GLOPP data form a continuum: specimen data where at least I know the institution; cases where I know a physical voucher exists but have no clue where; cases where I don't know whether material has been preserved; and cases where I know it is a pure observation. So my decision would be similar to yours. However, some museums may want to act differently.
I would say that the student would (ideally) reference existing specimen GUIDs in his/her specific database -- not create new GUIDs (unless referencing physical specimens -- vouchered or not -- that have not yet received GUIDs, in which case new GUIDs would be assigned using the appropriate procedure, whatever that ends up being).
Since the latter case is the default (unless everything changes, the money will allow digitization of only a tiny fraction of collections in the coming decades): if that is done by the student, how would the collection, when it finally comes to digitize the specimen itself, learn about the "physical GUID" of that specimen?
Good point! My concern, though, would be that we might end up in the same state of chaos that we are now, where multiple electronic records of a particular physical specimen are not rigorously cross-linked, and thus run the risk of being counted as multiple/separate physical instances. Identifying the physical object with the <objectID> component of an LSID, and different electronic representations of the associated data as different <versions> of the same <objectID> could represent a way to deal with both "realities".
I think the solution is in data, not in GUIDs. If the collection has barcodes or RFIDs, include them in the data. Also make sure that collections (institutions) and subcollections have GUIDs (perhaps derived from Index Herbariorum). This solves the counting problem without imposing a system of requirements that in my opinion is unmanageable.
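As a rough illustration of solving the counting problem in data rather than in the identifier, the sketch below counts distinct physical specimens by the pair (institution GUID, barcode). The dictionary keys institution_guid and barcode are hypothetical field names; records lacking either value cannot be deduplicated this way and are reported separately.

```python
from typing import Dict, Iterable, Optional, Set, Tuple

Record = Dict[str, Optional[str]]


def count_physical_specimens(records: Iterable[Record]) -> Tuple[int, int]:
    """Return (distinct specimens, records that could not be resolved).

    Two records are taken to describe the same physical specimen when they
    share both the institution GUID and the barcode (an assumption of this
    sketch, not an agreed rule).
    """
    seen: Set[Tuple[str, str]] = set()
    unresolved = 0
    for r in records:
        inst, code = r.get("institution_guid"), r.get("barcode")
        if inst and code:
            seen.add((inst, code))
        else:
            unresolved += 1
    return len(seen), unresolved
```

Each record can still carry its own GUID; the cross-linking comes from the shared institution and barcode values.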
But then we have descriptions, and for description concepts (characters, structures, states, modifiers, etc.) we also need GUIDs to allow federating descriptions that use a common terminology. We have discussed this in SDD on and off (specifically we are proposing to prefer semantically neutral identifiers, and propose a simple optional mechanism called debugid/debugref to enrich data with calculated, semantically meaningful identifiers to facilitate debugging) - but at the moment SDD really waits for a more general and common solution.
Are you talking about GUIDs for character definitions, or GUIDs for instances of character definitions applied to individual specimens/taxa, or both?
(Not sure; I think they are the same: the definition is identified by a GUID, and the instance references that GUID together with the GUID of a class (taxon) or object (specimen); that pair is in itself a unique combination. Assigning a GUID to the cross of two GUIDs seems superfluous.)
I see character definitions as analogous to taxon concepts; names used to represent those character definitions as analogous to taxon names; and the application of those character definitions to specific specimens (or taxa) as analogous to taxonomic determinations (i.e., the application of taxon names/concepts to specimens).
I agree; a minor point of difference is that there is no equivalent to the nomenclatural codes and that the "taxon names" are expressible in any number of languages. This is why SDD uses IDs only for the character concepts, not for the names. The combination of character-concept ID, ISO language code, and audience code (pupil, student, expert, farmer, etc.) serves as an ID for the "character name".
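A minimal sketch of that composite key, assuming nothing about the actual SDD schema: the names CharacterNameKey and label_for, the concept ID c17, and the sample labels are all hypothetical. Only the concept carries an identifier; a label is looked up by the triple.

```python
from typing import Dict, NamedTuple


class CharacterNameKey(NamedTuple):
    concept_id: str  # identifier of the character concept (language-neutral)
    language: str    # ISO language code
    audience: str    # e.g. "pupil", "student", "expert", "farmer"


def label_for(labels: Dict[CharacterNameKey, str],
              concept_id: str, language: str, audience: str) -> str:
    """Look up the "character name" by its three-part key."""
    return labels[CharacterNameKey(concept_id, language, audience)]


# Hypothetical sample data: one concept, three names.
LABELS: Dict[CharacterNameKey, str] = {
    CharacterNameKey("c17", "en", "expert"): "lamina shape",
    CharacterNameKey("c17", "en", "pupil"): "leaf shape",
    CharacterNameKey("c17", "de", "expert"): "Spreitenform",
}
```

No separate GUID is minted for a name; the triple itself is unique.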
Gregor
----------------------------------------------------------
Gregor Hagedorn (G.Hagedorn@bba.de)
Institute for Plant Virology, Microbiology, and Biosafety
Federal Research Center for Agriculture and Forestry (BBA)
Koenigin-Luise-Str. 19           Tel: +49-30-8304-2220
14195 Berlin, Germany            Fax: +49-30-8304-2203
Often wrong but never in doubt!