Globally Unique Identifier

Thu Sep 23 15:39:29 CEST 2004

> Mahalo for your informative discussion, Rich.
> A few questions.  You're pretty active on this so maybe you can help me
> out.  What about duplicate specimens?  Although a specimen may be MO 1234,
> K 5678 and P AABB, they may in fact all be SMITH 10001 and duplicates of
> the exact same specimen, not different specimens. Is that one GUID or 3?

In my view, we would assign only ONE GUID, which represents the actual,
physical specimen.  That this one specimen has multiple catalog numbers
assigned to it is simply additional information associated with that one
specimen (in the same way that a specimen may have more than one
taxonomic name applied to it, by different investigators at different
times).  This is part of the problem with using the "soft" GUID surrogate of
[InstitutionCode]+[CollectionCode]+[CatalogNumber].  A simple solution would
be to select one of these catalog numbers (e.g., SMITH 10001) as the
"current" catalog number, and enter that in the appropriate DarwinCore (DwC)
fields (either [CollectionCode]+[CatalogNumber] or
[InstitutionCode]+[CatalogNumber], in this case). The MaNIS implementation
of DwC included an "OtherCatalogNumbers" element, which would store the other
numbers.

I imagine two main problems:

1) Data for the single specimen may be represented more than once in an
Aggregator, if different providers represent the "soft" GUID for the
specimen with two different catalog numbers.  For human-viewed search
results, it would probably soon be evident from the other data that the two
records are the same.  For statistical search results, the specimen would be
counted more than once, which could cause errors in the numeric results of
statistical queries.

2) If the record is only represented by one of its catalog numbers, then how
is someone supposed to locate it by one of the other catalog numbers?  One
way is to include support for an "OtherCatalogNumbers" element, in such a way
that it can be searched in addition to the "soft" GUID of
[InstitutionCode]+[CollectionCode]+[CatalogNumber]. But that's a bit
convoluted.
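
To make the two problems concrete, here is a minimal Python sketch.  The
record layout, the "Herb" collection code, and the sample values are
assumptions for illustration only; only the element names come from the
discussion above, and no real Aggregator is claimed to work this way.

```python
# Hypothetical provider records for ONE physical specimen.
# Element names follow the discussion above; all values are made up.
records = [
    {"InstitutionCode": "MO", "CollectionCode": "Herb", "CatalogNumber": "1234",
     "OtherCatalogNumbers": ["K 5678", "P AABB"]},
    {"InstitutionCode": "K", "CollectionCode": "Herb", "CatalogNumber": "5678",
     "OtherCatalogNumbers": ["MO 1234"]},
]


def soft_key(rec):
    """The "soft" GUID surrogate: institution + collection + catalog number."""
    return (rec["InstitutionCode"], rec["CollectionCode"], rec["CatalogNumber"])


def specimen_count(recs):
    """Problem 1: counting distinct soft keys counts this specimen twice."""
    return len({soft_key(r) for r in recs})


def find_by_number(recs, number):
    """Problem 2: a lookup must check the current number AND the others."""
    hits = []
    for r in recs:
        current = f'{r["InstitutionCode"]} {r["CatalogNumber"]}'
        if number == current or number in r["OtherCatalogNumbers"]:
            hits.append(r)
    return hits
```

Here specimen_count(records) reports two specimens where there is really
one, and find_by_number(records, "P AABB") only succeeds because the first
provider happened to list that number among its other catalog numbers.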

So, the real solution, in my mind, is to implement a "hard" GUID
("GlobalUniqueIdentifier":
http://darwincore.calacademy.org/Documentation/DarwinCore2DraftHTML).  That
way, the specimen could be represented in four different Provider records,
but easily combined into one by an Aggregator via the shared GUID.
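
A rough sketch of what that combination step might look like, in Python.
The "provider" key, the placeholder GUID string "guid-0001", and the sample
records are hypothetical; the draft does not prescribe any of this.

```python
from collections import defaultdict


def merge_by_guid(provider_records):
    """Group provider records that share a GlobalUniqueIdentifier."""
    merged = defaultdict(lambda: {"catalog_numbers": set(), "providers": set()})
    for rec in provider_records:
        entry = merged[rec["GlobalUniqueIdentifier"]]
        entry["catalog_numbers"].add(rec["CatalogNumber"])
        entry["providers"].add(rec["provider"])
    return dict(merged)


# Hypothetical: four provider records, one specimen, one shared GUID.
records = [
    {"GlobalUniqueIdentifier": "guid-0001", "provider": "A", "CatalogNumber": "MO 1234"},
    {"GlobalUniqueIdentifier": "guid-0001", "provider": "B", "CatalogNumber": "K 5678"},
    {"GlobalUniqueIdentifier": "guid-0001", "provider": "C", "CatalogNumber": "P AABB"},
    {"GlobalUniqueIdentifier": "guid-0001", "provider": "D", "CatalogNumber": "SMITH 10001"},
]

specimens = merge_by_guid(records)
```

The result holds a single entry keyed by the shared GUID, with all four
catalog numbers attached to it as additional information.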

> When attempting to use world-wide specimen records via GBIF for
> biodiversity counts and species analyses, these duplicates artificially
> inflate the counts significantly in some cases.

Yes -- that's what I meant by "statistical search results".  Presumably,
DiGIR Providers should only provide data on specimens that they currently
hold.  For instance, if BPBM 12345 was donated to Smithsonian, and now has
the new catalog number USNM 987654, then Bishop Museum should not include
the record in its DiGIR provider under its original catalog number (BPBM
12345).  Bishop could either represent it with the current catalog number
(USNM 987654), in which case an Aggregator could easily identify it as the
same specimen, or Bishop should exclude it from its DiGIR provider
altogether.

Of course, none of this is perfect -- there are likely to be all kinds of
errors of this sort when institutions wholesale dump their electronic
catalogs online in the form of DiGIR providers.  But the same is true of
"hard" GUIDs.  What's to stop Bishop Museum from assigning one GUID to its
record of BPBM 12345, and Smithsonian assigning another GUID to its record
of USNM 987654?  The correct answer is, "nothing, really" -- except to
whatever extent the people in charge of assigning these GUIDs to specimens
in their charge are careful to avoid making such duplications.  But nobody
is perfect -- which is why *any* GUID system is going to require some sort
of integrated "inadvertent duplication index", to keep a permanent index of
"objective" duplications (not to be confused with "subjective" record
equivalencies, such as this taxonomic concept is equivalent to that
taxonomic concept).
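
One possible shape for such an index, sketched in Python.  The class name,
method names, and GUID strings are all hypothetical; the only idea taken from
the text above is that a discovered "objective" duplicate is recorded
permanently and can be resolved to a single surviving GUID.

```python
class DuplicationIndex:
    """Permanent record of GUIDs found to denote the same physical object."""

    def __init__(self):
        self._superseded_by = {}  # retired GUID -> GUID that replaces it

    def resolve(self, guid):
        """Follow redirects until reaching a GUID that was never retired."""
        while guid in self._superseded_by:
            guid = self._superseded_by[guid]
        return guid

    def record_duplicate(self, retired, kept):
        """Register that `retired` and `kept` denote the same object."""
        retired, kept = self.resolve(retired), self.resolve(kept)
        if retired != kept:  # already merged: nothing to do, and no self-loop
            self._superseded_by[retired] = kept


# Hypothetical GUIDs assigned independently by two institutions.
index = DuplicationIndex()
index.record_duplicate("guid-bpbm-12345", "guid-usnm-987654")
```

After that call, index.resolve("guid-bpbm-12345") returns
"guid-usnm-987654", so old links keep working while counts use one GUID.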

> What about triplicate names?  IPNI is often given as the example for a set
> of name records.  But, IPNI can have three records for the same exact name
> and reference--one from IK, APNI and Grey Cards.  IPNI has no plans to ever
> deduplicate these records due to the nature of the creation of the IPNI
> collaboration.  So, do the three duplicate records get three GUIDs?

Not intentionally -- no (at least not in my view).  But I can very easily
see how they would inadvertently be assigned different GUIDs -- hence the
need to be able to seamlessly deal with objective duplicates when they are
discovered.

> Where are the GUIDs actually to be perpetually located after they are
> assigned?

That's the crux of the question posed in Donald's PowerPoint file.  My
inclination is to pick a more centralized organization that seems likely to
survive in the long run (GBIF seems to me to be a leading candidate;
although for taxonomic names, I would still favor the respective
nomenclatural Commissions).

> Are all the originating organizations supposed to modify their databases
> to add the GUID attribute and then build a mechanism to send out their
> records and then receive the GUID back from somewhere and finally update
> their records with it so the record+GUID can then in turn be published
> from their database onto the web?

I would like to think so, yes.  Certainly all organizations that set up a
DiGIR provider.  If you follow the link (above) to the DwC2 draft, you'll
see that the first element is "GlobalUniqueIdentifier", which is required in
the current draft.  A stop-gap solution is to concatenate a "soft" GUID in
the form of:

URN:catalog:[InstitutionCode]:[CollectionCode]:[CatalogNumber]

...but personally, I see this only as a temporary solution. I'd rather see
the bioinformatics community bite the bullet and commit to a "hard" GUID
system.
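
For reference, the stop-gap concatenation is trivial to produce.  A minimal
Python sketch follows; the function name and the "Fish" collection code are
hypothetical, and rejecting values that contain a colon is my own assumption
(to keep the separator unambiguous), not something the draft specifies.

```python
def soft_guid(institution_code, collection_code, catalog_number):
    """Build URN:catalog:[InstitutionCode]:[CollectionCode]:[CatalogNumber]."""
    parts = [institution_code, collection_code, catalog_number]
    for part in parts:
        # Assumption: a colon inside a part would make the URN ambiguous.
        if ":" in part:
            raise ValueError(f"colon not allowed in {part!r}")
    return "URN:catalog:" + ":".join(parts)


example = soft_guid("BPBM", "Fish", "12345")
```

With these inputs, example is "URN:catalog:BPBM:Fish:12345".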

> Couldn't agree more on the need for a single index/GUIDs to all references,
> but beyond that is needed the single database containing all the GUIDS plus
> the standard abbreviations and descriptions for them.  Nobody has this
> database.  There are subsets like BPH and TL2.  But no single, definitive
> list of all references, online, in one place with GUIDs.  This science
> needs that in the worst way.

I agree on all counts.  Which is why I think someone (GBIF?) needs to build
it.  It won't suddenly materialize out of nothing -- it will have to be
built over time.  If you want to assign a GUID to a Taxon Name, you must
first enter the citation details for its original description Reference in
the Reference GUID issuer.
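
That dependency can be sketched in a few lines of Python.  Everything here is
hypothetical: the class and method names, the use of random UUIDs as GUIDs
(no GUID scheme has been chosen), and the placeholder name and citation.

```python
import uuid


class GuidIssuer:
    """Toy issuer: a Taxon Name GUID requires an existing Reference GUID."""

    def __init__(self):
        self.references = {}  # Reference GUID -> citation details
        self.names = {}       # Name GUID -> (name, Reference GUID)

    def issue_reference(self, citation):
        guid = str(uuid.uuid4())  # assumption: any unique string would do
        self.references[guid] = citation
        return guid

    def issue_name(self, name, reference_guid):
        # The original description must be registered before the name.
        if reference_guid not in self.references:
            raise KeyError("register the original description Reference first")
        guid = str(uuid.uuid4())
        self.names[guid] = (name, reference_guid)
        return guid


issuer = GuidIssuer()
ref_guid = issuer.issue_reference("Author, Year. Placeholder citation.")
name_guid = issuer.issue_name("Aus bus", ref_guid)
```

The index of References then grows as a side effect of registering names,
rather than having to be compiled in full beforehand.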

> If a concept is Name+Reference, then don't IPNI and Tropicos contain
> millions of concept records?

It depends on what you mean by "Concept".  Note in my last email that I
explicitly identified "Concepts" as a *subset* of Name+Reference instances.
Who decides which Name+Reference instances are "concept-bearers" and which
are not? Tough question -- but one that is being thought about by the SEEK
folks. Similarly, who decides which Name+Reference instances are
"Name-bearers"?  That's easier to answer:  the respective Code of
Nomenclature.

Will there be millions of concept records? Well, given that there are
millions of names, I imagine there will probably be tens of millions of
Concepts to which those names have been, or will be, applied.  There will,
of course, be BILLIONS of Name+Reference instances.  I say this with
confidence because, in my view, every identification label of every specimen
in the world could potentially be considered as a "Reference", and there are
presumably billions of specimens out there.  But I'm not terribly concerned
about such large numbers.  As of this moment, there are 4,285,199,774 web
pages indexed by Google, yet it can find what I'm looking for with AMAZING
speed and efficiency -- and that's without any semantic context.  What we're
talking about here is highly structured data in a tightly controlled
semantic context.  Computers are exceedingly good at managing vast
quantities of data very quickly -- and they're getting better and faster all
the time.  By the time we (the Bioinformatics community) get around to
digitizing billions of specimens and Name+Reference instances, the hard
drive on my laptop will be measured in Terabytes.

Aloha,
Rich

Richard L. Pyle, PhD
Natural Sciences Database Coordinator, Bishop Museum
1525 Bernice St., Honolulu, HI 96817
Ph: (808)848-4115, Fax: (808)847-8252
email: deepreef at bishopmuseum.org
http://www.bishopmuseum.org/bishop/HBS/pylerichard.html