Globally Unique Identifier

Thu Sep 30 06:01:32 CEST 2004

> > In my view, we would assign only ONE GUID, which represents the
> > actual, physical specimen.  That this one specimen has multiple
> > catalog numbers assigned to it is simply additional information
> > associated with that one specimen (in the same way that many specimens
> > may have more than one taxonomic name applied to them, by different
> > investigators at different times).
>
> I agree on the multiple catalogue numbers, but I still believe
> multiple database records of specimens will exist.

Yes, but I guess a central question, which Donald included in his PowerPoint
file, is whether the GUID is assigned to the physical object, or to the
electronic representation (data record).  Most of my comments have been from
the standpoint that the GUID applies to the physical specimen. If it is the
electronic records that we wish to uniquely identify, then it seems to me
that the <objectID> component of an LSID should apply to the physical
specimen, and multiple database records should be uniquely identified using
the <version> component.
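
To make that concrete, here is a minimal Python sketch, assuming the usual
LSID layout urn:lsid:<authority>:<namespace>:<objectID>[:<version>].  The
authority "example.org", the namespace "specimens", and the values "12345",
"recordA" and "recordB" are invented for illustration; none of this is an
agreed GBIF convention.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str          # would identify the physical specimen
    version: Optional[str]  # would identify one electronic record of it


def parse_lsid(text: str) -> Lsid:
    """Split an LSID string into its components (version is optional)."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    version = parts[5] if len(parts) == 6 else None
    return Lsid(parts[2], parts[3], parts[4], version)


def same_physical_specimen(a: Lsid, b: Lsid) -> bool:
    """Two records describe one specimen if everything but the version matches."""
    return a[:3] == b[:3]


# Hypothetical identifiers: two database records of one specimen.
a = parse_lsid("urn:lsid:example.org:specimens:12345:recordA")
b = parse_lsid("urn:lsid:example.org:specimens:12345:recordB")
print(same_physical_specimen(a, b))
```

Under this reading, the two records differ only in the <version> component,
so they resolve to the same physical specimen.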

> Since I myself am
> not involved in collection curation, but in evaluating the
> information therein (specifically we work on organism interactions)
> we have a database of now close to 200 000 fungal host parasite
> records. Some express opinion without further citation, others
> express opinion backed up by voucher specimen that contains all the
> information that would be found in collection databases. GBIF seems
> to have no place for such data so far - and it would be difficult to
> provide, since we usually have none of
> "[InstitutionCode]+[CollectionCode]+[CatalogNumber]" (which is
> different from the problem of having duplicate CatalogNumbers you
> discuss).

This is part of the reason why I think that the
"[InstitutionCode]+[CollectionCode]+[CatalogNumber]" solution is only a
temporary one.  This raises another question: would the GUIDs be limited to
just vouchered specimens? Or would they also be assigned to unvouchered
specimens (e.g., field observations of specific individual organisms that
were not vouchered in a museum collection)?  Or would unvouchered
"biological instances" represent a different class of GUIDs?

The simple answer is to make observations a different class of object.  But
in my data management world, I need to deal with everything from sight
records with little more data than a taxonomic determination; to specific
observations involving specific (uncollected) individual organisms
(sometimes with as much associated data as any vouchered specimen); to
collected organisms that were brought into a lab, examined by experts, but
not added to a permanent collection; to stereotypical museum voucher
specimens; to specimens that were added to the permanent voucher collection,
but later lost or destroyed.  In my mind, there is little fundamental
difference between the two endpoints of this spectrum, and I have thus
decided to treat all such entities as the same class of object ("Biological
Instances") -- which also spans the [population-->multiple specimen-->single
specimen-->specimen part] continuum.  It's not a perfectly clean solution;
but no solution is perfectly clean, and to my mind, this is the optimal
solution from a data management perspective.
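
As a rough illustration of what I mean by a single class, here is a small
Python sketch.  The class and field names (BiologicalInstance, VoucherStatus,
Scope) and the sample values are hypothetical, chosen only for this example;
they do not describe my actual database schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class VoucherStatus(Enum):
    SIGHT_RECORD = "sight record"
    OBSERVED_INDIVIDUAL = "observed, uncollected individual"
    EXAMINED_NOT_RETAINED = "collected and examined, not retained"
    VOUCHERED = "museum voucher specimen"
    LOST_OR_DESTROYED = "vouchered, later lost or destroyed"


class Scope(Enum):
    POPULATION = "population"
    MULTIPLE_SPECIMENS = "multiple specimens"
    SINGLE_SPECIMEN = "single specimen"
    SPECIMEN_PART = "specimen part"


@dataclass(frozen=True)
class BiologicalInstance:
    guid: str                     # one identifier per instance
    determination: str            # taxonomic determination, as text
    voucher_status: VoucherStatus
    scope: Scope
    catalog_numbers: Tuple[str, ...] = ()  # zero, one, or many


# Placeholder values only: both ends of the spectrum are the same class.
sighting = BiologicalInstance(
    "guid-0001", "Genus species",
    VoucherStatus.SIGHT_RECORD, Scope.SINGLE_SPECIMEN)
voucher = BiologicalInstance(
    "guid-0002", "Genus species",
    VoucherStatus.VOUCHERED, Scope.SINGLE_SPECIMEN,
    catalog_numbers=("CAT-1", "CAT-2"))
print(type(sighting) is type(voucher))
```

The point of the sketch is that voucher status and scope become attributes
of the record rather than reasons to split it into separate classes.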

> Still what kind of data is that? What kind of data is
> created if a PH.D. student digitizes the specimen records used for a
> taxonomic revision in a database that is specific to that revision?

I would say that the student would (ideally) reference existing specimen
GUIDs in his/her specific database -- not create new GUIDs (unless
referencing physical specimens -- vouchered or not -- that have not yet
received GUIDs, in which case new GUIDs would be assigned using the
appropriate procedure, whatever that ends up being).

> Bottom line: The physical specimen does exist, but in the foreseeable
> future all data GUIDs will be attached to data, not to the specimen.
> The exception is only where indeed it is possible to attach the GUID
> to the specimen, then this could be cited.

Good point!  My concern, though, would be that we might end up in the same
state of chaos that we are in now, where multiple electronic records of a
particular physical specimen are not rigorously cross-linked, and thus run
the risk of being counted as multiple/separate physical instances.
Identifying the physical object with the <objectID> component of an LSID,
and different electronic representations of the associated data as different
<versions> of the same <objectID> could represent a way to deal with both
"realities".
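
A minimal Python sketch of how that cross-linking could avoid double
counting, assuming LSIDs of the form
urn:lsid:<authority>:<namespace>:<objectID>[:<version>].  The identifiers
below (authority "example.org", versions "museumDb" and "revisionDb") are
invented for the example.

```python
from collections import defaultdict
from typing import Dict, List


def specimen_key(lsid: str) -> str:
    """Drop the optional version component, keeping everything up to objectID."""
    parts = lsid.split(":")
    if len(parts) not in (5, 6) or [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
        raise ValueError(f"not an LSID: {lsid!r}")
    return ":".join(parts[:5])


def group_by_specimen(lsids: List[str]) -> Dict[str, List[str]]:
    """Map each physical specimen to all electronic records that describe it."""
    groups = defaultdict(list)
    for lsid in lsids:
        groups[specimen_key(lsid)].append(lsid)
    return dict(groups)


# Hypothetical records: two databases hold data on specimen 12345.
records = [
    "urn:lsid:example.org:specimens:12345:museumDb",
    "urn:lsid:example.org:specimens:12345:revisionDb",
    "urn:lsid:example.org:specimens:67890:museumDb",
]
groups = group_by_specimen(records)
print(len(records), "records,", len(groups), "physical specimens")
# prints: 3 records, 2 physical specimens
```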

> But then we have descriptions, and for description concepts
> (characters, structures, states, modifiers, etc.) we also need GUIDs
> to allow federating descriptions that use a common terminology.
> We have discussed this in SDD on and off (specifically we are proposing
> to prefer semantically neutral identifiers, and propose a simple
> optional mechanism called debugid/debugref to enrich data with
> calculated, semantically meaningful identifiers to facilitate
> debugging) - but at the moment SDD really waits for a more general
> and common solution.

Are you talking about GUIDs for character definitions, or GUIDs for
instances of character definitions applied to individual specimens/taxa, or
both?  I see character definitions as analogous to taxon concepts; names
used to represent those character definitions as analogous to taxon names;
and the application of those character definitions to specific specimens (or
taxa) as analogous to taxonomic determinations (i.e., the application of
taxon names/concepts to specimens).

> So this discussion is highly relevant to descriptions as well. My
> main point is: what we are really interested in GBIF in the end is
> knowledge, not physical possession. If we limit our thinking of the
> GBIF system to the very special case of institutionalized collections
> (as both DwC and ABCD in my opinion currently do), or names governed
> by a nomenclatural code, I believe we may later have to rearchitect.

Agreed!! (Especially considering the forum on which this discussion is
taking place.) This is part of the reason why I keep mixing taxon
names/concepts examples among specimen examples. In the back of my mind, I
was thinking of SDD examples as well, though I think I failed to express that
adequately.

> BTW, partly for these differences between institutional collection-
> customs and knowledge publication customs, I vote against a strongly
> central system. LSID authority (lsid.gbif.net) and namespace (with no
> or low semantics) should be managed by GBIF, but not the
> ids/versions. GBIF may provide a service to generate them, but should
> accept any locally generated ID and trust the generator to manage
> uniqueness.

I generally agree.  My leanings towards centralization revolved around GUID
generation only (not application of GUIDs to associated data -- which is what
I would define as the "management" part), and *perhaps* GUID resolution (but
more for "unowned" sorts of data like taxonomy, and only in the paradigm
that each GUID could be resolved by one and only one domain server, and that
the GUID would cease to have meaning if/when the branded domain server
ceased to exist).

Whether GUID generation happens centrally or in a distributed way
seems to me to depend on what GUID scheme is ultimately adopted.
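
For example, if the object identifiers were random UUIDs, generation could
be fully distributed, because each generator can mint identifiers without
consulting a registry.  The Python sketch below assumes that scheme purely
for illustration; the function name mint_lsid and the namespace "specimen"
are hypothetical, and the authority "lsid.gbif.net" is simply the one
suggested in the quoted message above.

```python
import uuid


def mint_lsid(namespace: str, authority: str = "lsid.gbif.net") -> str:
    """Build an LSID whose object ID is a locally generated random UUID.

    Assumption: uniqueness rests on UUID version 4 randomness, so no
    central service is needed to hand out the object ID.
    """
    return f"urn:lsid:{authority}:{namespace}:{uuid.uuid4()}"


first = mint_lsid("specimen")
second = mint_lsid("specimen")
print(first)
print(first != second)
```

A scheme built on sequential numbers, by contrast, would need some central
or per-authority counter, which is why the choice of scheme matters here.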

Aloha,
Rich

Richard L. Pyle, PhD
Natural Sciences Database Coordinator, Bishop Museum
1525 Bernice St., Honolulu, HI 96817
Ph: (808)848-4115, Fax: (808)847-8252
email: deepreef at bishopmuseum.org
http://www.bishopmuseum.org/bishop/HBS/pylerichard.html