Re: Globally Unique Identifier
Mahalo for your informative discussion, Rich. A few questions. You're pretty active on this so maybe you can help me out.
What about duplicate specimens? Although a specimen may be MO 1234, K 5678 and P AABB, they may in fact all be SMITH 10001 and duplicates of the exact same specimen, not different specimens. Is that one GUID or 3?
In my view, we would assign only ONE GUID, which represents the actual, physical specimen. That this one specimen has multiple catalog numbers assigned to it is simply additional information associated with that one specimen (in the same way that a specimen may have more than one taxonomic name applied to it, by different investigators at different times). This is part of the problem with using the "soft" GUID surrogate of [InstitutionCode]+[CollectionCode]+[CatalogNumber]. A simple solution would be to select one of these catalog numbers (e.g., SMITH 10001) as the "current" catalog number, and enter that in the appropriate DarwinCore (DwC) fields (either [CollectionCode]+[CatalogNumber] or [InstitutionCode]+[CatalogNumber], in this case). The MaNIS implementation of DwC included an "OtherCatalogNumbers" element, which would store the other numbers.
I imagine two main problems:
1) Data for the single specimen may be represented more than once in an Aggregator, if different providers represent the "soft" GUID for the specimen with two different catalog numbers. For human-viewed search results, it would probably soon be evident from the other data that the two records are the same. For statistical search results, the specimen would be counted more than once, which could cause errors in the numeric results of statistical queries.
2) If the record is only represented by one of its catalog numbers, then how is someone supposed to locate it by one of the other catalog numbers? One way is to include support for an "OtherCatalogNumbers" element, in such a way that it can be searched in addition to the "soft" GUID of [InstitutionCode]+[CollectionCode]+[CatalogNumber]. But that's a bit convoluted.
So, the real solution, in my mind, is to implement a "hard" GUID ("GlobalUniqueIdentifier": http://darwincore.calacademy.org/Documentation/DarwinCore2DraftHTML). That way, the specimen could be represented in four different Provider records, but easily combined into one by an Aggregator via the shared GUID.
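As a rough sketch of the difference between the two approaches, the snippet below counts specimens in an Aggregator first by the "soft" surrogate and then by a shared "hard" GUID. The record layout, the "guid" key, the GUID value, and the "Botany" collection code are all invented for illustration; only the bracketed element names come from the discussion above.

```python
from collections import defaultdict

# Hypothetical Provider records that all describe one physical specimen.
# The "guid" value and the "Botany" collection code are made up.
records = [
    {"guid": "guid-0001", "InstitutionCode": "MO",
     "CollectionCode": "Botany", "CatalogNumber": "1234"},
    {"guid": "guid-0001", "InstitutionCode": "K",
     "CollectionCode": "Botany", "CatalogNumber": "5678"},
    {"guid": "guid-0001", "InstitutionCode": "P",
     "CollectionCode": "Botany", "CatalogNumber": "AABB"},
]


def soft_key(rec):
    """The "soft" surrogate: the three catalog elements taken together."""
    return (rec["InstitutionCode"], rec["CollectionCode"], rec["CatalogNumber"])


def hard_key(rec):
    """The "hard" GUID shared by every record of the same specimen."""
    return rec["guid"]


def count_specimens(recs, key):
    """Count distinct specimens under a given notion of identity."""
    groups = defaultdict(list)
    for rec in recs:
        groups[key(rec)].append(rec)
    return len(groups)


soft_count = count_specimens(records, soft_key)  # each catalog number counted separately
hard_count = count_specimens(records, hard_key)  # records collapse onto one GUID
print(soft_count, hard_count)
```

Under these assumptions the soft key inflates the count, while the hard key lets the Aggregator treat the three records as one specimen.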
When attempting to use world-wide specimen records via GBIF for biodiversity counts and species analyses, these duplicates artificially inflate the counts significantly in some cases.
Yes -- that's what I meant by "statistical search results". Presumably, DiGIR Providers should only provide data on specimens that they currently hold. For instance, if BPBM 12345 was donated to Smithsonian, and now has the new catalog number USNM 987654, then Bishop Museum should not include the record in its DiGIR provider under its original catalog number (BPBM 12345). Bishop should either represent it with the current catalog number (USNM 987654), in which case an Aggregator could easily identify it as the same specimen, or exclude it from its DiGIR provider altogether.
Of course, none of this is perfect -- there are likely to be all kinds of errors of this sort when institutions wholesale dump their electronic catalogs online in the form of DiGIR providers. But the same is true of "hard" GUIDs. What's to stop Bishop Museum from assigning one GUID to its record of BPBM 12345, and Smithsonian assigning another GUID to its record of USNM 987654? The correct answer is, "nothing, really" -- except to whatever extent the people in charge of assigning these GUIDs to specimens in their charge are careful to avoid making such duplications. But nobody is perfect -- which is why *any* GUID system is going to require some sort of integrated "inadvertent duplication index", to keep a permanent index of "objective" duplications (not to be confused with "subjective" record equivalencies, such as this taxonomic concept is equivalent to that taxonomic concept).
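One possible shape for such an "inadvertent duplication index" is a permanent redirect table from a retired GUID to the GUID that is kept. The sketch below is only an illustration under that assumption: the class name, method names, and GUID strings are hypothetical, and nothing here reflects an existing GBIF or DiGIR service.

```python
class DuplicationIndex:
    """Permanent record of GUIDs found to denote the same physical object."""

    def __init__(self):
        # Maps a retired GUID to the GUID it was found to duplicate.
        self._redirect = {}

    def resolve(self, guid):
        """Follow redirects until reaching a GUID that is still current."""
        while guid in self._redirect:
            guid = self._redirect[guid]
        return guid

    def mark_duplicate(self, retired, kept):
        """Record that `retired` is an objective duplicate of `kept`."""
        retired, kept = self.resolve(retired), self.resolve(kept)
        if retired != kept:  # already merged: adding a redirect would create a cycle
            self._redirect[retired] = kept


# Invented GUID strings for the BPBM 12345 / USNM 987654 example.
index = DuplicationIndex()
index.mark_duplicate("guid-bpbm-12345", "guid-usnm-987654")
print(index.resolve("guid-bpbm-12345"))
print(index.resolve("guid-usnm-987654"))
```

Because retired GUIDs are never deleted, anyone holding the old identifier can still be led to the current one.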
What about triplicate names? IPNI is often given as the example for a set of name records. But, IPNI can have three records for the same exact name and reference--one from IK, APNI and Grey Cards. IPNI has no plans to ever deduplicate these records due to the nature of the creation of the IPNI collaboration. So, do the three duplicate records get three GUIDs?
Not intentionally -- no (at least not in my view). But I can very easily see how they would inadvertently be assigned different GUIDs -- hence the need to be able to seamlessly deal with objective duplicates when they are discovered.
Where are the GUIDs actually to be perpetually located after they are assigned?
That's the crux of the question posed in Donald's PowerPoint file. My inclination is to pick a more centralized organization that seems likely to survive in the long run (GBIF seems to me to be a leading candidate; although for taxonomic names, I would still favor the respective nomenclatural Commissions).
Are all the originating organizations supposed to modify their databases to add the GUID attribute and then build a mechanism to send out their records and then receive the GUID back from somewhere and finally update their records with it so the record+GUID can then in turn be published from their database onto the web?
I would like to think so, yes. Certainly all organizations that set up a DiGIR provider. If you follow the link (above) to the DwC2 draft, you'll see that the first element is "GlobalUniqueIdentifier", which is required in the current draft. A stop-gap solution is to concatenate a "soft" GUID in the form of:
URN:catalog:[InstitutionCode]:[CollectionCode]:[CatalogNumber]
...but personally, I see this only as a temporary solution. I'd rather see the bioinformatics community bite the bullet and commit to a "hard" GUID system.
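A minimal sketch of that stop-gap concatenation follows. The function name and the "Fish" collection code are invented, and the two validation checks (no empty elements, no embedded colons) are added assumptions rather than anything stated in the DwC2 draft.

```python
def soft_guid(institution_code, collection_code, catalog_number):
    """Concatenate the three DwC elements into the stop-gap URN form."""
    parts = [institution_code, collection_code, catalog_number]
    cleaned = [p.strip() for p in parts]
    if not all(cleaned):
        raise ValueError("all three elements are required")
    # Assumption: a colon inside an element would make the URN ambiguous.
    if any(":" in p for p in cleaned):
        raise ValueError("elements must not contain ':'")
    return "URN:catalog:" + ":".join(cleaned)


# "Fish" is a hypothetical collection code used only for this example.
print(soft_guid("BPBM", "Fish", "12345"))
```

The result is only as stable as the catalog number itself, so it changes whenever a specimen is recataloged.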
Couldn't agree more on the need for a single index/GUIDs to all references, but beyond that is needed the single database containing all the GUIDs plus the standard abbreviations and descriptions for them. Nobody has this database. There are subsets like BPH and TL2. But no single, definitive list of all references, online, in one place with GUIDs. This science needs that in the worst way.
I agree on all counts. Which is why I think someone (GBIF?) needs to build it. It won't suddenly materialize out of nothing -- it will have to be built over time. If you want to assign a GUID to a Taxon Name, you must first enter the citation details for its original description Reference in the Reference GUID issuer.
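The reference-first rule could be enforced roughly as sketched below. Everything here is hypothetical: the two issuer classes, the use of random UUIDs as GUIDs, the whitespace-and-case normalization of citations, and the placeholder citation and name strings.

```python
import uuid


class ReferenceIssuer:
    """Issues one GUID per distinct citation (hypothetical design)."""

    def __init__(self):
        self._by_citation = {}

    def register(self, citation):
        """Return the GUID for a citation, issuing one only the first time."""
        key = " ".join(citation.split()).lower()  # crude normalization
        if key not in self._by_citation:
            self._by_citation[key] = str(uuid.uuid4())
        return self._by_citation[key]

    def is_known(self, guid):
        return guid in self._by_citation.values()


class NameIssuer:
    """Issues Taxon Name GUIDs only against an already-registered Reference."""

    def __init__(self, references):
        self._references = references
        self._names = {}

    def register(self, name, reference_guid):
        if not self._references.is_known(reference_guid):
            raise LookupError("register the original description Reference first")
        key = (name, reference_guid)
        if key not in self._names:
            self._names[key] = str(uuid.uuid4())
        return self._names[key]


# Placeholder citation and name, not real records.
refs = ReferenceIssuer()
names = NameIssuer(refs)
ref_guid = refs.register("Smith 1900. Example citation")
name_guid = names.register("Aus bus", ref_guid)
print(ref_guid, name_guid)
```

Registering the same citation or the same Name+Reference pair twice returns the existing GUID, which avoids the simplest kind of inadvertent duplication.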
If a concept is Name+Reference, then don't IPNI and Tropicos contain millions of concept records?
It depends on what you mean by "Concept". Note in my last email that I explicitly identified "Concepts" as a *subset* of Name+Reference instances. Who decides which Name+Reference instances are "concept-bearers" and which are not? Tough question -- but one that is being thought about by the SEEK folks. Similarly, who decides which Name+Reference instances are "Name-bearers"? That's easier to answer: the respective Code of Nomenclature.
Will there be millions of concept records? Well, given that there are millions of names, I imagine there will probably be tens of millions of Concepts to which those names have been, or will be, applied. There will, of course, be BILLIONS of Name+Reference instances. I say this with confidence because, in my view, every identification label of every specimen in the world could potentially be considered as a "Reference", and there are presumably billions of specimens out there. But I'm not terribly concerned about such large numbers. As of this moment, there are 4,285,199,774 web pages indexed by Google, yet it can find what I'm looking for with AMAZING speed and efficiency -- and that's without any semantic context. What we're talking about here is highly structured data in a tightly controlled semantic context. Computers are exceedingly good at managing vast quantities of data very quickly -- and they're getting better and faster all the time. By the time we (the Bioinformatics community) get around to digitizing billions of specimens and Name+Reference instances, the hard drive on my laptop will be measured in Terabytes.
Aloha, Rich
Richard L. Pyle, PhD
Natural Sciences Database Coordinator, Bishop Museum
1525 Bernice St., Honolulu, HI 96817
Ph: (808)848-4115, Fax: (808)847-8252
email: deepreef@bishopmuseum.org
http://www.bishopmuseum.org/bishop/HBS/pylerichard.html