Globally Unique Identifier

Mon Oct 4 09:57:22 CEST 2004

> I think this is the problem. No single record in our database has
> this information, and to my knowledge, most if not all of the
> physical specimen sheets referred to do not yet have a unique
> catalogue number. Adding truly unique catalogue numbers physically to
> specimens (as opposed to often non-unique batch accession numbers)
> has often not been done; it now happens either during digitization
> or, in recent years, during loan processing.

I guess my point was, if the GLOPP database does not contain enough
information to uniquely identify a given voucher specimen (whether it is a
catalog number, or some combination of other data), then the GLOPP data
record can't really be considered "vouchered", can it?  Maybe I am not
thinking hard enough about this, but it seems to me that without the ability
to re-locate a cited specimen with a fair degree of certainty, then the
record would need to be considered "unvouchered".  So, if enough information
is available to pinpoint the specimen, then it *should* (and I emphasize
this word only because I know there will undoubtedly be exceptions) be
possible to associate the GUID with the voucher specimen at a later time.
If there is not enough information to re-locate the specimen, then it seems
to me that the connection with the specimen is broken, and the record
becomes a stand-alone "unvouchered" biological instance.

> However, most of the printed literature cites collection name (which
> may be historic, if collections are merged), taxon name, plus one of:
>  - simply the information that it is a type (often expressed by
> exclamation mark after collection acronym, indicating that a type has
> been studied)
>  - for non-types 2-4 elements out of: collector, collection date,
> collector's field number and location. Having collector plus
> collector's field number is relatively good (although uniqueness is up
> to the collector; some assign batch numbers for collection events),
> but again in my experience it is relatively rarely cited. None of the
> GLOPP records taken from literature cited a field number. The other
> fields are normally sufficiently unique if you go into the collection
> and see what is there, but they are a terrible key for any matching
> against a data service - at least the location is usually comparable
> by automatic string matching.

O.K., I think I understand now, that a human could likely re-locate the
specimen based on some assemblage of data, but that this assemblage of data
would certainly require a human to establish the connection.
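
To illustrate the point, here is a minimal sketch of how a machine could
shortlist candidate specimens from a printed citation while leaving the
final decision to a person. The record fields, the 0.6 threshold and the
function names are assumptions made for illustration only; they are not
taken from GLOPP or from any existing collection system.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class SpecimenRecord:
    collector: str
    collection_date: str  # may be empty when the citation omits it
    locality: str


def locality_similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two locality strings."""
    return SequenceMatcher(None, a.casefold().strip(), b.casefold().strip()).ratio()


def candidate_matches(citation, catalogue, threshold=0.6):
    """Return (score, record) pairs that a human should then review.

    Collector must agree; dates must agree when both are known;
    locality is compared by fuzzy string matching only.
    """
    candidates = []
    for record in catalogue:
        if record.collector.casefold() != citation.collector.casefold():
            continue
        if (citation.collection_date and record.collection_date
                and citation.collection_date != record.collection_date):
            continue
        score = locality_similarity(citation.locality, record.locality)
        if score >= threshold:
            candidates.append((score, record))
    # Best locality match first; the list is a shortlist, not an answer.
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return candidates
```

The function deliberately returns a ranked list rather than a single hit,
since the connection itself still has to be confirmed by someone who can
look at the specimen or its full record.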

I guess my feeling is this:  We should always do our best to avoid the
assignment of duplicate GUIDs to a single "biological instance" (of the
physical kind), but we should also acknowledge that inadvertent duplication
will be inevitable (as it may well be in the scenario you describe), and
therefore build in a system for accommodating "objective GUID synonymies".

> As said, if you go into the collection, it is easy to identify them.
> If you know that a fragment of the collection is completely digitized
> - as opposed to random digitization which only digitizes specimens
> recently loaned - you can manually identify it on the computer. I
> would guess every specimen identification takes at most 5 minutes and
> involves one or several queries and picking from the result list
> manually.

I imagine it will depend, on a case-by-case basis, whether the cost of
manual match-up of this sort at the time a GUID is to be assigned exceeds
the cost of risking duplicate GUID assignment.

> I believe most biologists will consider the specimen citations used
> in print until now a voucher - I do not agree with the blanket
> statement that this is something else. It is usually unambiguous -
> but not very good for machine processing. Contradictions?

No contradictions from me. But the concept of "voucher" to me becomes more
ambiguous as time goes by.  I have thought a lot about how one might pool
"evidence" from Museum collections, published data, in-situ images (still-ph
otos and video), unpublished sighting reports, etc., and there is no easy
answer that I have found.  My conceptual approach has been to reduce all of
these to reported instances of a particular organism at a particular place
and time. This is what I have meant by "biological instance". Where they
differ is merely in how well-documented they are (unpublished word-of-mouth,
published word-of-mouth, film or electronic image, tissue sample, preserved
organism, etc.), and to what extent they can be re-examined by later
researchers (in my mind, the distinction between "vouchered" and
"unvouchered").  Some uncollected sight records (e.g., rare plants in
Hawaii) have a high degree of re-examination potential, whereas some
specimens collected and preserved in a Museum, but misplaced, lost, or
deteriorated, have a very low degree of re-examination potential.  In-situ
images, though they may have limited documentation (external appearance
only, from one angle only in the case of still photos, limited in resolution
at which the image was captured), have a very high degree of re-examination
potential.

The point (to me, at least) is that in all cases there was some physical
"biological instance", and it is that entity to which I think the GUID
should be assigned.  If a published report cites a specimen preserved in a
Museum, then the same GUID should be used for both the Museum specimen, and
the published citation thereof. If a published report cites an organism that
was never collected or preserved in a Museum, and the publication itself
constitutes the only record of that biological instance, then a new GUID
should be assigned to it.  When a record of the latter sort is later
discovered to be in reference to a specimen that exists in a museum that
already has its own GUID, then those two GUIDs should be branded as
"synonyms".

> > My main point is
> > that such "redundant" GUID issuance should be minimized (i.e., never
> > done intentionally), and quickly/easily identified as such whenever it
> > is discovered.
>
> Certainly not intentionally, but it should be clear that a museum
> should not start to prohibit the use of laptops when a Ph.D.
> candidate comes in and "digitizes" some specimens for a taxonomic
> revision. If the museum system supports it, it is wise to ask to use
> the museum system, but if the system is too complex and requires long
> training, better to have the monograph than nothing...

Agreed. But I would think the student should provide a listing of all GUIDs
assigned to specimens within a collection, including as much information as
is necessary to uniquely identify each GUID-assigned specimen.  Whether the
collection manager ever uses that information or not is a different
question, but I think a "culture of respect" for avoiding duplicate GUID
assignment should be integral to the whole GUID process.

> > So....if/when the situation does come up that (for example) GLOPP
> > assigns GUIDs to vouchers on behalf of a non-digitized collection, and
> > that collection later (inadvertently) re-assigns redundant GUIDs to
> > the same set of specimens; that eventual discovery of this duplication
> > should be accommodated by a mechanism for "retiring" one of the IDs
> > into "objective synonymy" of the other ID, and automated systems
> > should be implemented in the resolver service that "auto-forward" the
> > retired ID to the active ID.
>
> I think you could rather view this as an optional deduplication
> layer.

I'm not sure I understand exactly what you mean by "optional" (i.e., at
whose option), but I think it should be a fundamental component to any
resolution service.
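
As a rough sketch of what I mean by "retiring" an ID and auto-forwarding
it, something along the following lines would do. The class and method
names are hypothetical, and the use of random UUIDs is only a stand-in
for whatever GUID scheme is eventually chosen.

```python
import uuid


class GuidResolver:
    """Toy resolver that forwards retired GUIDs to their active synonym."""

    def __init__(self):
        self._records = {}  # active GUID -> data about the biological instance
        self._forward = {}  # retired GUID -> GUID it was merged into

    def assign(self, data):
        """Issue a new GUID for a biological instance."""
        guid = str(uuid.uuid4())
        self._records[guid] = data
        return guid

    def resolve_id(self, guid):
        """Follow forwarding links until an active GUID is reached."""
        while guid in self._forward:
            guid = self._forward[guid]
        if guid not in self._records:
            raise KeyError(guid)
        return guid

    def retire(self, retired, active):
        """Declare `retired` an objective synonym of `active`."""
        retired = self.resolve_id(retired)
        active = self.resolve_id(active)
        if retired == active:
            return  # already known to be the same instance
        # The retired record's data is simply dropped here; a real
        # service would presumably merge or archive it instead.
        del self._records[retired]
        self._forward[retired] = active

    def resolve(self, guid):
        """Return (active GUID, data) for any GUID ever issued."""
        active = self.resolve_id(guid)
        return active, self._records[active]
```

With this in place, a GUID issued by GLOPP and later found to duplicate a
museum-issued GUID keeps working: any request for the retired ID returns
the active ID together with its data.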

> Your specification explicitly contradicts at least the LSID
> specification's requirement that repeated retrieval return exactly
> the same data.

Yes, I know -- which is why I'm feeling less cozy about LSIDs.  I think the
crux of the issue centers on the question that Donald asked in one of his
PowerPoint slides:  Are these numbers assigned to the physical or
"conceptual" (=non-electronic "virtual") objects, or are they assigned to
the electronic/digital representations thereof?  My feeling, from the point
of view of a taxonomist who develops databases for natural history
collections, is that the ultimate goal (i.e., seamless transmission and
exchange of biodiversity-relevant data) will be better served if:

1) The IDs are assigned to the non-electronic (physical or conceptual)
objects;

2) "Static" data associated with those objects are allowed to change (as
errors are discovered and corrected) without altering the GUID (and data
history logs for these static data attributes are thought of as a
secondary function of the data management, not affecting GUIDs);

3) "Dynamic" data associated with those objects (e.g., multiple taxonomic
identifications of a specimen) are handled "robustly" (i.e., not as
"versions" of the complete set of data associated with a particular
specimen).
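
The three points above could be sketched as a data structure roughly like
this. All names and fields are hypothetical, and the taxon names used in
any example are placeholders; the only claim is that the GUID stays fixed
while static data are corrected and identifications accumulate.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Identification:
    taxon_name: str
    identified_by: str
    year: int


@dataclass
class BiologicalInstance:
    guid: str  # names the physical/conceptual object, not a data version
    static: Dict[str, str] = field(default_factory=dict)
    # (attribute, old value, new value) entries; bookkeeping only
    static_log: List[Tuple[str, str, str]] = field(default_factory=list)
    identifications: List[Identification] = field(default_factory=list)

    def correct(self, attribute: str, new_value: str) -> None:
        """Fix a static attribute in place; the GUID is untouched."""
        old_value = self.static.get(attribute, "")
        self.static_log.append((attribute, old_value, new_value))
        self.static[attribute] = new_value

    def identify(self, taxon_name: str, identified_by: str, year: int) -> None:
        """Add a determination alongside earlier ones, not as a new version."""
        self.identifications.append(
            Identification(taxon_name, identified_by, year)
        )
```

The design choice here is that a correction or a new identification never
produces a new identifier; the history is kept as secondary information
hanging off the same GUID.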

> So I prefer a view where GUIDs refer to data objects. I still do not
> see, how you propose to attach them to the physical objects for those
> researchers working in the collection itself.

I'm not sure I understand the question.  I guess I would answer with another
question:  How does a Social Security Number (SSN) for a U.S. Citizen (NINO
in the UK, SIN in Canada, INSEE in France, TFN in Australia, etc. -- see
http://encyclopedia.thefreedictionary.com/Social%20Security%20number) get
attached to an individual person?  I don't think anyone would think of an SSN
as an identifier for a data object -- it is a unique identifier for the
physical person.  I believe that commonly used physical or conceptual
objects in biology (specimens, taxon names/concepts, references, agents,
character definitions, etc.) should have equivalents of SSNs assigned to
them, for the same reasons
that U.S. Citizens have SSNs (i.e., to provide an unambiguously unique
identifier useful for managing information associated with a physical
person).

Obviously, SSNs are not the perfect model for BioGUIDs (for a number of
reasons), but the point is that they represent an ID attached to a physical
entity.

> A secondary service can
> then know about relationships of multiple data objects referring to
> the same physical object. This service may be able to find cross-
> references in the data itself, have smart methods to estimate
> uniqueness based even on location strings, or may have manually
> created cross-reference tables.

This sounds to me like an unnecessary layer of complexity.  But I remain
open-minded on this issue.

> An important point is that different "deduplication" scenarios exist.
> For example, in culture collections, many strains are cross-preserved
> in multiple collections. So "CBS 123.88" may be "equal" to "ATCC
> 1234132" or "BBA 77123". Ideally we may even know the history: "BBA
> 77123" > "CBS 123.88" > "ATCC 1234132". However, the chance that any
> of these strains (which are like "versions") has been mixed up (or
> mutations occurred) is always there. Thus, if I look for duplication
> of the collection event data, I want to deduplicate. If I want to
> check a confusing DNA sequence, I may want to know about other
> derived strains, but I absolutely need to know exactly which strain
> from which collection was sequenced.

I'll have to think about this some more, but it comes back to the question
of what "unit" a GUID is assigned to.  This is not so much a problem for
taxonomic objects; a little bit of a problem for Reference objects, and a
potentially HUGE problem for specimen/"biological instance" objects.  Again
this raises the question: How important is it to use the same GUID scheme
for all of these different classes of bio-objects?
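
To make the two needs in the strain example concrete, here is a small
sketch that keeps a separate identifier per strain but records which
strain each was derived from. The strain designations are the ones quoted
above; the mapping and function names are assumptions for illustration.

```python
# derived strain -> strain it was obtained from (history as quoted above)
DERIVED_FROM = {
    "CBS 123.88": "BBA 77123",
    "ATCC 1234132": "CBS 123.88",
}


def original_strain(strain, derived_from):
    """Walk the deposit history back to the earliest known strain."""
    seen = {strain}
    while strain in derived_from:
        strain = derived_from[strain]
        if strain in seen:
            raise ValueError("cyclic lineage")
        seen.add(strain)
    return strain


def related_strains(strain, derived_from):
    """List every known strain that shares the same original strain."""
    root = original_strain(strain, derived_from)
    known = set(derived_from) | set(derived_from.values()) | {strain}
    return sorted(
        name for name in known if original_strain(name, derived_from) == root
    )
```

Grouping by `original_strain` serves the collection-event case, where the
three strains should collapse into one, while a DNA sequence would still
be attached to the exact strain designation that was sequenced.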

> > For the most part, though -- I see these as "growing pains" of a GUID
> > system during its first years of existence.  I would predict that two
> > decades from now, if one were to do an analysis of redundant GUIDs,
> > one would find the bulk of those having been issued relatively early
> > on.
>
> I agree, but I think it is probably more relevant than you seem to
> think. I believe the "early days" will last the next 50 years - the
> time needed until collections are fully digitized *plus* the time it
> takes to make publication without citing GUIDs unacceptable.

I imagine that the vast majority of "publications" in science 50 years from
now will be electronic.  But I see your point -- even if it's only 20 years
as I suggested, that's still a lot of headaches to deal with.

Aloha,
Rich

Richard L. Pyle, PhD
Natural Sciences Database Coordinator, Bishop Museum
1525 Bernice St., Honolulu, HI 96817
Ph: (808)848-4115, Fax: (808)847-8252
email: deepreef at bishopmuseum.org
http://www.bishopmuseum.org/bishop/HBS/pylerichard.html