Re: Globally Unique Identifier

1 Oct 2004

      Richard Pyle writes:
...
"a central question, which Donald included in his
PowerPoint file, is whether the GUID is assigned to the physical
object, or to the electronic representation (data record).  Most of my
comments have been from the standpoint that the GUID applies to the
physical specimen. If it is the electronic records that we wish to
uniquely identify, then it seems to me that the <objectID> component
of an LSID should apply to the physical specimen, and multiple
database records should be uniquely identified using the <version>
component."
I think this is not practical. Do you mean that GLOPP
organism-interaction data with specimen voucher information cannot be
published/referenced in GBIF until I figure out whether a collection
has already digitized the specimens (most have never been digitized
elsewhere!)? Or, if I find they have not been, that when the
collection starts to digitize them it would have to create, for those
already published in GLOPP, a new version of the GLOPP LSID?

The same applies to taxonomic data - most revisions contain voucher
data.

---

On a closely related point, Chuck Miller wrote:
...
"Duplicate specimens occur because the collector collected multiple
samples of the same organism and sent them to other institutions.
The duplicate specimens themselves probably have different
CatalogNumbers in each institution.  The specimen database records
reflect the actual specimens. Therefore, the specimen database records
when combined from multiple institutions have duplicates of the same
organism.  But, only by looking at either the Collector and
Collector's number or date/location can the duplication be recognized."
I believe there are three situations:

1. A collector has created what is classically called a specimen
duplicate, i.e. material has been collected in multiples, or a single
collection has been split (including propagation of living
material/cultures). The duplicates may be held by a single collection
or by different collections. In most cases it will be very important
to know about their close relationship, but also to identify them
separately (esp. in dead cryptogams or any living culture there is no
guarantee that the collector's assumption of conspecificity is
actually true!).

2. A single specimen has multiple conventional accession numbers in a
single collection, because over time the collection used different
numbering schemes (stamping, then barcoding, perhaps RFID in the
future). This is what I thought was meant by "duplicate
CatalogNumbers", and I think it is the only situation where full
identity exists, so a single GUID with a repeating AccessionNumber
field is best.

3. A single specimen has been digitized multiple times, not only in a
curatorial database but especially in analytical datasets created for
data evaluation (like host-parasite studies, taxonomic revisions,
etc.). Here the physical specimen is identical, but GUIDs need to be
available to identify the distinct data.
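
A minimal sketch of how the three situations could be modelled. The
class and field names (Specimen, DataRecord, duplicate_of, and so on)
and the sample accession numbers are assumptions made for
illustration; they are not taken from GLOPP, GBIF, or any existing
schema, and UUIDs merely stand in for whatever GUID scheme is chosen.

```python
from dataclasses import dataclass, field
from uuid import uuid4


def new_guid() -> str:
    """Stand-in GUID generator (assumption: any GUID scheme would do)."""
    return uuid4().urn


@dataclass
class Specimen:
    """One physical specimen with exactly one GUID."""
    guid: str
    # Situation 2: several numbering schemes, still one specimen.
    accession_numbers: list[str] = field(default_factory=list)
    # Situation 1: GUIDs of duplicates that stay separately identified.
    duplicate_of: set[str] = field(default_factory=set)


@dataclass
class DataRecord:
    """Situation 3: each digitization of a specimen has its own GUID."""
    guid: str
    specimen_guid: str
    source: str


# Situation 1: two duplicates, linked but with distinct GUIDs.
a = Specimen(new_guid(), ["B-1001"])
b = Specimen(new_guid(), ["M-77"])
a.duplicate_of.add(b.guid)
b.duplicate_of.add(a.guid)

# Situation 2: a later barcode is added to the same specimen.
a.accession_numbers.append("BARCODE-000123")

# Situation 3: two datasets describe the same physical specimen.
records = [
    DataRecord(new_guid(), a.guid, "curatorial database"),
    DataRecord(new_guid(), a.guid, "host-parasite study"),
]
```

Only situation 2 collapses into a single identifier; situations 1 and
3 keep separate GUIDs and express the relationship as data.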

---

Richard Pyle writes:
...
The simple answer is to make observations a different class of object.
But in my data management world, I need to deal with everything from
sight records with little more data than a taxonomic determination; to
specific observations involving specific (uncollected) individual
organisms (sometimes with as much associated data as any vouchered
specimen); to collected organisms that were brought into a lab,
examined by experts, but not added to a permanent collection; to
stereotypical museum voucher specimens; to specimens that were added
to the permanent voucher collection, but later lost or destroyed.  In
my mind, there is little fundamental difference between the two
endpoints of this spectrum, and I have thus decided to treat all such
entities as the same class of object ("Biological Instances") -- which
also spans the [population-->multiple specimen-->single
specimen-->specimen part] continuum.  It's not a perfectly clean
solution; but no solution is perfectly clean, and to my mind, this is
the optimal solution from a data management perspective.
I think I agree. But regarding the current discussion, I think the
most important point is NOT to burden LSIDs with any semantics that
would allow one to infer which kind of decision a data provider has
taken. The namespace part of an LSID should NOT be interpreted as a
standardized class-of-objects.
The GLOPP data span a continuum: records where I at least know the
institution; those where I know a physical voucher exists but have
no clue where; those where I don't know whether material has been
preserved; and those where I know it is a pure observation.
So my decision would be similar to yours. However, some museums may
want to act differently.
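
A small sketch of what treating the namespace as opaque could look
like in practice. It assumes the usual LSID layout
urn:lsid:<authority>:<namespace>:<object>[:<revision>] (the optional
last part is what is called <version> in this thread); the authority
and identifiers in the usage note are hypothetical.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str  # opaque provider label, NOT a class of objects
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text: str) -> LSID:
    """Split an LSID into its parts without interpreting any of them."""
    parts = text.split(":")
    well_formed = (
        len(parts) in (5, 6)
        and parts[0].lower() == "urn"
        and parts[1].lower() == "lsid"
        and all(parts[2:5])
    )
    if not well_formed:
        raise ValueError(f"not an LSID: {text!r}")
    return LSID(*parts[2:])
```

For example, parse_lsid("urn:lsid:example.org:glopp:12345") yields the
namespace "glopp", but nothing in the code derives "specimen" or
"observation" from it; that information would have to come from the
data record itself.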
...
I would say that the student would (ideally) reference existing
specimen GUIDs in his/her specific database -- not create new GUIDs
(unless referencing physical specimens -- vouchered or not -- that
have not yet received GUIDs, in which case new GUIDs would be assigned
using the appropriate procedure, whatever that ends up being).
The latter case is the default case (unless everything changes,
funding will allow digitization of only a tiny fraction of
collections in the coming decades). If the GUID is assigned by the
student, how would the collection, when it finally comes to digitize
the specimen itself, learn about the specimen's "physical GUID"?
...
Good point!  My concern, though, would be that we might end up in the
same state of chaos that we are now, where multiple electronic records
of a particular physical specimen are not rigorously cross-linked, and
thus run the risk of being counted as multiple/separate physical
instances. Identifying the physical object with the <objectID>
component of an LSID, and different electronic representations of the
associated data as different <versions> of the same <objectID> could
represent a way to deal with both "realities".
I think the solution is in data, not in GUIDs. If the collection has
barcodes or RFIDs, include them in the data. Also make sure that
collections (institutions) and subcollections have GUIDs (perhaps
derived from Index Herbariorum). This solves the counting problem
without imposing a system of requirements that in my opinion is
unmanageable.
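
A rough sketch of how the counting problem could be solved from the
data alone. The field names (record_guid, institution_guid, barcode)
are assumptions for illustration; any record without a barcode is
counted on its own, since nothing links it to another record.

```python
def count_physical_specimens(records):
    """Count distinct physical specimens among electronic records.

    Records sharing an institution GUID and a barcode are taken to
    describe the same physical specimen, whatever their record GUIDs.
    """
    seen = set()
    for rec in records:
        barcode = rec.get("barcode")
        if barcode:
            key = ("specimen", rec["institution_guid"], barcode)
        else:
            # No physical identifier in the data: cannot be merged.
            key = ("record", rec["record_guid"])
        seen.add(key)
    return len(seen)
```

Here the record GUIDs stay independent of each other; the
institution GUID and the barcode carried in the data do the
cross-linking.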
...
...
But then we have descriptions, and for description concepts
(characters, structures, states, modifiers, etc.) we also need GUIDs
to allow federating descriptions that use a common terminology. We
have discussed this in SDD on and off (specifically we are proposing
to prefer semantically neutral identifiers, and propose a simple
optional mechanism called debugid/debugref to enrich data with
calculated, semantically meaningful identifiers to facilitate
debugging) - but at the moment SDD really waits for a more general
and common solution.
Are you talking about GUIDs for character definitions, or GUIDs for
instances of character definitions applied to individual
specimens/taxa, or both?
(Not sure, I think they are the same: the definition is identified by
a GUID, and the instance references that GUID together with the GUID
of a class (taxon) or object (specimen); this pair is in itself a
unique combination. Assigning a GUID to the cross of two GUIDs seems
superfluous.)
...
I see character definitions as analogous to
taxon concepts; names used to represent those character definitions as
analogous to taxon names; and the application of those character
definitions to specific specimens (or taxa) as analogous to taxonomic
determinations (i.e., the application of taxon names/concepts to
specimens).
I agree; a minor point of difference being that there is no
equivalent to nomenclatural codes and that the "taxon names" are
expressible in any number of languages. Which is why SDD uses IDs
only for the character concepts, not for the names. The combination
of character-concept ID, language ISO code and audience code (pupil,
student, expert, farmer, etc.) serves as an ID for the "character
name".
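
The composite identifier described above can be sketched as a plain
tuple key. This is an illustration only, not SDD syntax: the type
name, the concept ID "c17" and the labels are all hypothetical.

```python
from typing import NamedTuple


class CharacterNameKey(NamedTuple):
    concept_id: str  # semantically neutral ID of the character concept
    language: str    # ISO language code
    audience: str    # e.g. "pupil", "student", "expert", "farmer"


# Hypothetical labels: only the concept has an ID of its own; each
# name is addressed by the combination of the three parts.
names = {
    CharacterNameKey("c17", "en", "expert"): "lamina shape",
    CharacterNameKey("c17", "en", "pupil"): "leaf shape",
    CharacterNameKey("c17", "de", "pupil"): "Blattform",
}


def character_name(concept_id: str, language: str, audience: str) -> str:
    """Look up the name for a concept in a given language and audience."""
    return names[CharacterNameKey(concept_id, language, audience)]
```

No separate GUID is minted for any of the names; the three-part key
is already unique.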

Gregor
----------------------------------------------------------
Gregor Hagedorn (G.Hagedorn@bba.de)
Institute for Plant Virology, Microbiology, and Biosafety
Federal Research Center for Agriculture and Forestry (BBA)
Koenigin-Luise-Str. 19          Tel: +49-30-8304-2220
14195 Berlin, Germany           Fax: +49-30-8304-2203

Often wrong but never in doubt!