Globally Unique Identifier

Gregor Hagedorn G.Hagedorn at BBA.DE
Fri Oct 1 10:35:34 CEST 2004


Richard Pyle writes:

> "a central question, which Donald included in his
> PowerPoint file, is whether the GUID is assigned to the physical
> object, or to the electronic representation (data record).  Most of my
> comments have been from the standpoint that the GUID applies to the
> physical specimen. If it is the electronic records that we wish to
> uniquely identify, then it seems to me that the <objectID> component
> of an LSID should apply to the physical specimen, and multiple
> database records should be uniquely identified using the <version>
> component."

I think this is not practical. Do you mean that those GLOPP
organism-interaction data that carry specimen voucher information
cannot be published/referenced in GBIF until I figure out whether a
collection has digitized them (most have never been digitized
elsewhere!)? Or, if I find they have not been, that when the collection
starts to digitize them it would have to create a new version of the
GLOPP LSID for those already published in GLOPP?

The same applies to taxonomic data - most revisions contain voucher
data.
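To make the objection concrete, here is a minimal sketch in Python of how an LSID splits into its parts (urn:lsid:authority:namespace:object, optionally followed by a revision, which the thread calls the version). The authorities, namespaces and object IDs below are invented for illustration and are not real GLOPP or museum identifiers. Under the proposal quoted above, two records of one specimen would share the object part and differ only in the version, which works only if every data provider already knows the identifier the other party has issued.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    version: Optional[str]  # the optional revision component


def parse_lsid(text: str) -> Lsid:
    """Split 'urn:lsid:authority:namespace:object[:revision]' into its parts."""
    parts = text.split(":")
    if (len(parts) not in (5, 6)
            or parts[0].lower() != "urn"
            or parts[1].lower() != "lsid"):
        raise ValueError(f"not an LSID: {text!r}")
    authority, namespace, object_id = parts[2:5]
    version = parts[5] if len(parts) == 6 else None
    return Lsid(authority, namespace, object_id, version)


def same_object(a: Lsid, b: Lsid) -> bool:
    """True if two LSIDs differ at most in their version component."""
    return a[:3] == b[:3]


# Hypothetical identifiers, invented for this example.
museum_v1 = parse_lsid("urn:lsid:museum.example:specimens:B-100:1")
museum_v2 = parse_lsid("urn:lsid:museum.example:specimens:B-100:2")
glopp_rec = parse_lsid("urn:lsid:glopp.example:interactions:12345")

print(same_object(museum_v1, museum_v2))  # same specimen, two versions
print(same_object(museum_v1, glopp_rec))  # issued independently, no match
```

The second comparison is the problematic case: a dataset that was published before the collection digitized the specimen has no way to carry the collection's object ID.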

---

On a closely related point, Chuck Miller wrote:

> "Duplicate specimens occur because the collector collected multiple
> samples of the same organism and sent them to other institutions. The
> duplicate specimens themselves probably have different CatalogNumbers
> in each institution.  The specimen database records reflect the actual
> specimens. Therefore, the specimen database records when combined from
> multiple institutions have duplicates of the same organism.  But, only
> by looking at either the Collector and Collector's number or
> date/location can the duplication be recognized."

I believe there are three situations:

1. A collector has created what is classically called a specimen
duplicate, i.e. material has been collected as multiple samples or a
single collection has been split (including propagation of living
material/cultures). These may be in the possession of a single
collection or of different collections. In most cases it will be very
important to know about their close relationship, but also to identify
them separately (esp. in dead cryptogams or any living culture there
is no guarantee that the collector's assumption of conspecificity is
actually true!).

2. A single specimen has multiple conventional accession numbers in a
single collection, because over time the collection used different
numbering schemes (stamping, then barcoding, perhaps RFID in the
future). This was what I thought was meant by "duplicate
CatalogNumbers", and I think it is the only case where full identity
exists, so a single GUID with a repeated AccessionNumber field is best.

3. A single specimen has been digitized multiple times - esp. not
only in a curatorial database, but in analytical datasets created for
data evaluation (like host-parasite studies, taxonomic revisions,
etc.). Here the physical specimen is identical, but GUIDs need to be
available to identify the distinct data.

---

Richard Pyle writes:

> The simple answer is to make observations a different class of object.
>  But in my data management world, I need to deal with everything from
> sight records with little more data than a taxonomic determination; to
> specific observations involving specific (uncollected) individual
> organisms (sometimes with as much associated data as any vouchered
> specimen); to collected organisms that were brought into a lab,
> examined by experts, but not added to a permanent collection; to
> stereotypical museum voucher specimens; to specimens that were added
> to the permanent voucher collection, but later lost or destroyed.  In
> my mind, there is little fundamental difference between the two
> endpoints of this spectrum, and I have thus decided to treat all such
> entities as the same class of object ("Biological Instances") -- which
> also spans the [population-->multiple specimen-->single
> specimen-->specimen part] continuum.  It's not a perfectly clean
> solution; but no solution is perfectly clean, and to my mind, this is
> the optimal solution from a data management perspective.

I think I agree. But regarding the current discussion, I think the
most important point is NOT to burden the LSID with any semantics that
would allow one to conclude which kind of decision a data provider has
taken. The namespace part of an LSID should NOT be interpreted as a
standardized class of objects.
The GLOPP data form a continuum: specimen data where at least I
know the institution, those where I know a physical voucher exists
but have no clue where, those where I don't know whether material
has been preserved, and those where I know it is a pure observation.
So my decision would be similar to yours. However, some museums may
want to act differently.

> I would say that the student would (ideally) reference existing
> specimen GUIDs in his/her specific database -- not create new GUIDs
> (unless referencing physical specimens -- vouchered or not -- that
> have not yet received GUIDs, in which case new GUIDs would be assigned
> using the appropriate procedure, whatever that ends up being).

Since the latter case is the default case (unless everything changes,
the money will allow digitization of only a tiny fraction of
collections in the coming decades): if that is done by the student,
how would the collection, when it finally comes to digitize the
specimen itself, learn about the "physical GUID" of that specimen?

> Good point!  My concern, though, would be that we might end up in the
> same state of chaos that we are now, where multiple electronic records
> of a particular physical specimen are not rigorously cross-linked, and
> thus run the risk of being counted as multiple/separate physical
> instances. Identifying the physical object with the <objectID>
> component of an LSID, and different electronic representations of the
> associated data as different <versions> of the same <objectID> could
> represent a way to deal with both "realities".

I think the solution is in data, not in GUIDs. If the collection has
barcodes or RFIDs, include them in the data. Also make sure that
collections (institutions) and subcollections have GUIDs (perhaps
derived from Index Herbariorum). This solves the counting problem
without imposing a system of requirements that in my opinion is
unmanageable.
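As a rough sketch of what "the solution is in data" could look like: if every record carries the GUID of its collection and the barcode of the specimen, distinct physical specimens can be counted from those two fields alone, whatever GUID each record has. The field names (record_guid, collection_guid, barcode) and all values are assumptions made for this example.

```python
from typing import Dict, List, Optional, Tuple

Record = Dict[str, Optional[str]]


def count_physical_specimens(records: List[Record]) -> Tuple[int, int]:
    """Return (distinct specimens, records that cannot be resolved).

    Two records count as the same physical specimen when they share both
    the collection GUID and the barcode. Records without a barcode are
    reported separately instead of being guessed at.
    """
    keyed = {(r["collection_guid"], r["barcode"])
             for r in records if r.get("barcode")}
    unresolved = sum(1 for r in records if not r.get("barcode"))
    return len(keyed), unresolved


# Hypothetical records from three independent datasets.
records = [
    {"record_guid": "rec-1", "collection_guid": "coll-B", "barcode": "B100012345"},
    {"record_guid": "rec-2", "collection_guid": "coll-B", "barcode": "B100012345"},
    {"record_guid": "rec-3", "collection_guid": "coll-K", "barcode": "K000777"},
    {"record_guid": "rec-4", "collection_guid": "coll-K", "barcode": None},
]

print(count_physical_specimens(records))  # (2, 1)
```

Records rec-1 and rec-2 describe the same sheet and are counted once; rec-4 has no barcode and is flagged rather than silently counted.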

> > But then we have descriptions, and for description concepts
> > (characters, structures, states, modifiers, etc.) we also need GUIDs
> > to allow federating descriptions that use a common terminology. We
> > have discussed this in SDD on and off (specifically we are proposing
> > to prefer semantically neutral identifiers, and propose a simple
> > optional mechanism called debugid/debugref to enrich data with
> > calculated, semantically meaningful identifiers to facilitate
> > debugging) - but at the moment SDD really waits for a more general
> > and common solution.
>
> Are you talking about GUIDs for character definitions, or GUIDs for
> instances of character definitions applied to individual
> specimens/taxa, or both?

(Not sure, I think they are the same: the definition is identified by
a GUID, and the instance references that GUID together with the GUID
of a class (taxon) or object (specimen); that pair is in itself a
unique combination. Assigning a GUID to the cross of two GUIDs seems
superfluous.)

> I see character definitions as analogous to
> taxon concepts; names used to represent those character definitions as
> analogous to taxon names; and the application of those character
> definitions to specific specimens (or taxa) as analogous to taxonomic
> determinations (i.e., the application of taxon names/concepts to
> specimens).

I agree; a minor point of difference being that there is no
equivalent to nomenclatural codes and that the "taxon names" are
expressible in any number of languages. Which is why SDD uses IDs
only for the character concepts, not for the names. The combination
of character-concept ID, language ISO code and audience code (pupil,
student, expert, farmer, etc.) is an ID for the "character name".
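A minimal sketch of that composite key, assuming hypothetical concept IDs, audience codes and labels (this is not SDD's actual schema): the name itself has no identifier of its own and is found through the three-part combination.

```python
from typing import Dict, NamedTuple


class CharacterNameKey(NamedTuple):
    concept_id: str  # ID of the character concept
    language: str    # ISO language code
    audience: str    # e.g. "pupil", "student", "expert", "farmer"


# Hypothetical labels for one character concept.
names: Dict[CharacterNameKey, str] = {
    CharacterNameKey("char-17", "en", "expert"): "lamina margin",
    CharacterNameKey("char-17", "en", "pupil"): "leaf edge",
    CharacterNameKey("char-17", "de", "expert"): "Blattrand",
}


def character_name(concept_id: str, language: str, audience: str) -> str:
    """Look up the label for one concept, language and audience."""
    return names[CharacterNameKey(concept_id, language, audience)]


print(character_name("char-17", "en", "pupil"))  # leaf edge
```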

Gregor
----------------------------------------------------------
Gregor Hagedorn (G.Hagedorn at bba.de)
Institute for Plant Virology, Microbiology, and Biosafety
Federal Research Center for Agriculture and Forestry (BBA)
Koenigin-Luise-Str. 19          Tel: +49-30-8304-2220
14195 Berlin, Germany           Fax: +49-30-8304-2203

Often wrong but never in doubt!



