[tdwg-tag] Specimen identifiers [SEC=UNCLASSIFIED]

Mon Feb 27 18:45:59 CET 2012

Dear Dusty,

On 27 Feb 2012, at 16:44, DLMcDonald wrote:

> GBIF URIs are cheap - why not get a couple?!
> 
> Along with museums randomly publishing their specimens a few times, others often (re)publish specimen data. http://data.gbif.org/occurrences/146112485/ and http://data.gbif.org/occurrences/210282234/ are both "copies" of http://arctos.database.museum/guid/UAM:Mamm:29830, which is the primary, current specimen data, for example. There's no obvious link between the three records. The NatureServe record was (apparently) taken from the literature, the other was provided by the specimen owner but is cached at GBIF and very much abridged to fit DWC. I think there are probably many more "reported sightings" (from literature, checklists, "known" specimens at other museums, etc.) than specimen-backed data records in GBIF, especially for things like insects, and I think most GBIF data are probably severely out of sync with the primary data, if such a thing exists. I have no idea what any identifier could do about that - inspire false confidence, perhaps.
> 
> I can't really think of a worse "authority" than GBIF, at least from a museum/specimen perspective.

It's clear that GBIF has issues, especially with duplicate records. Tim Robertson has given a detailed explanation of the reasons GBIF struggles with this: <http://iphylo.blogspot.com/2012/02/how-many-specimens-does-gbif-really.html#comment-449811856>. From my experience, problems include (a) museums changing the metadata for specimens (e.g., changing the collection code from 'Bird' to 'Birds') and (b) lack of internal identifiers that are invariant when the metadata changes. It's also clear that the same records are being aggregated several times, some via "primary" sources (e.g., the museum's DiGiR provider) and some via "secondary" sources. Much of the specimen-level digital infrastructure we have has no notion of identifiers, hence no easy way to avoid duplicates.
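
To make problems (a) and (b) concrete, here is a rough Python sketch. The records, the field names and the "internal_id" value are all made up for illustration; this is not how GBIF actually stores or matches anything.

```python
# Two harvests of the same (made-up) specimen; the museum renamed its
# collection code from 'Bird' to 'Birds' between them.
harvests = [
    {"internal_id": "rec-0001", "institution": "XYZ", "collection": "Bird", "catalog": "12345"},
    {"internal_id": "rec-0001", "institution": "XYZ", "collection": "Birds", "catalog": "12345"},
]

def count_distinct(records, key):
    """Count how many distinct specimens the records appear to describe."""
    return len({key(record) for record in records})

def metadata_key(record):
    # Identity inferred from mutable metadata (hypothetical field names).
    return (record["institution"], record["collection"], record["catalog"])

def internal_key(record):
    # Identity taken from an identifier that survives metadata edits.
    return record["internal_id"]

by_metadata = count_distinct(harvests, metadata_key)
by_internal_id = count_distinct(harvests, internal_key)
print(by_metadata, by_internal_id)
```

Keyed on the metadata, the one specimen is counted twice; keyed on an invariant identifier, it is counted once.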

Yes, GBIF is problematic, but its one big advantage from my point of view is that it gives me one place to go for information on specimens. I don't want to have to discover where to get this information from and then figure out how to retrieve it. If I'm matching hundreds of thousands of records, I want one place to do this for me.

The analogy I often use is CrossRef, which has metadata for millions of scientific articles. If I want to locate the DOI for an article, I don't have to figure out which publisher published it and how to talk to their database; I simply ask CrossRef what the DOI is using a simple API. This is how the linked reference lists at the end of papers are generated. I want something similar for specimens.
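
As an illustration of the "one place to ask" idea, here is a small Python sketch that builds a lookup URL for CrossRef. It assumes the present-day public REST API (api.crossref.org with its query.bibliographic parameter), which is not necessarily the interface available when this was written; the citation string is a placeholder, and the sketch only builds the URL rather than fetching it.

```python
from urllib.parse import urlencode

# Assumption: CrossRef's current public REST endpoint for works.
CROSSREF_WORKS = "https://api.crossref.org/works"

def crossref_query_url(citation, rows=1):
    """Build a URL asking CrossRef to match a free-text citation."""
    params = {"query.bibliographic": citation, "rows": rows}
    return CROSSREF_WORKS + "?" + urlencode(params)

# Placeholder citation text; any reference string could go here.
print(crossref_query_url("some free-text citation"))
```

The point is that the caller never needs to know which publisher holds the article; the JSON response lists candidate matches along with their DOIs.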

> 
> A small community of us have implemented a very successful URL "guid" model. If anyone has another specimen identifier that gets more inter-system use, I don't know about it. It's also pretty handy, if far from perfect, for finding things on the internet. (http://goo.gl/WRe6p should find the specimen listed above, for example.) We've worked closely with GenBank in developing this system, and even in the case of things like authors not bothering to tell us about publications we can use the GUID to automagically find specimens/sequences in each other's systems. I think it's about as good as URIs can get. And it's not very good. We're always a bad administrative decision or two from doing something we'll regret, and we've learned most of whatever we know by getting it wrong the first time(s). I cringe every time I see http://arctos.database.museum/SomeSupposedlyForeverURL in a static medium (http://goo.gl/8szEv).
> 
> tl;dr: There is a strong need for something beyond URLs, but there are scary social problems to address.
> 
> -Dusty

I guess this model is one reason David Schindel likes Darwin Core triplets. Being able to construct a URL from metadata is nice, but the history of GBIF data shows that lots of museums like to change this metadata, which makes the URLs fragile (or at least potentially breaks attempts to infer the URL from the metadata). Then there's the issue of global uniqueness of these codes.
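
A minimal Python sketch of that fragility, using the Arctos URL pattern quoted above. The renamed collection code "Mammals" is hypothetical, and the helper names are mine.

```python
def triplet(institution_code, collection_code, catalog_number):
    """Join the three Darwin Core fields into a colon-separated triplet."""
    return f"{institution_code}:{collection_code}:{catalog_number}"

def guid_url(triplet_value):
    """Construct a URL from the triplet, following the pattern in this thread."""
    return "http://arctos.database.museum/guid/" + triplet_value

original = triplet("UAM", "Mamm", "29830")
# Hypothetical: the museum later renames its collection code.
renamed = triplet("UAM", "Mammals", "29830")

print(guid_url(original))
print(guid_url(renamed))
print(guid_url(original) == guid_url(renamed))
```

The same specimen now yields two different URLs, so anyone inferring the URL from the current metadata no longer reaches the one that was published earlier.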

The thing I find somewhat bemusing in all of this is that we are far from the first people to face these problems. The publishing industry has the same issues, and they ended up having centrally-managed identifiers (DOIs) that use redirection to hide the underlying URLs, so the individual publishers can muck about with the platforms they use to serve the data without breaking things for users. It seems to me that any system that links using URLs has to have a strategy for handling URLs changing (because it's pretty clear museums have virtually no ability to keep URLs from changing). If you accept that, then I think the way forward becomes pretty clear. GBIF is our CrossRef. 
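
The redirection idea can be sketched in a few lines of Python. This is a toy in-memory resolver, not how DOIs or GBIF are actually implemented; the identifier "specimen/0001" and the example.org URL are made up.

```python
class Resolver:
    """Maps stable identifiers to whatever URL currently serves the record."""

    def __init__(self):
        self._targets = {}

    def register(self, identifier, url):
        # An identifier is minted once and never reassigned.
        if identifier in self._targets:
            raise ValueError(f"{identifier} is already registered")
        self._targets[identifier] = url

    def update(self, identifier, new_url):
        # The data provider moves platforms; only the target changes.
        if identifier not in self._targets:
            raise KeyError(identifier)
        self._targets[identifier] = new_url

    def resolve(self, identifier):
        return self._targets[identifier]

resolver = Resolver()
resolver.register("specimen/0001", "http://arctos.database.museum/guid/UAM:Mamm:29830")
before = resolver.resolve("specimen/0001")

# Hypothetical new location after the museum changes its platform.
resolver.update("specimen/0001", "http://example.org/new-platform/UAM:Mamm:29830")
after = resolver.resolve("specimen/0001")

print(before)
print(after)
```

Anything that cited "specimen/0001" keeps working after the move, because only the resolver's table had to change.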

Regards

Rod

> 
> On Sun, Feb 26, 2012 at 8:41 PM, Paul Murray <pmurray at anbg.gov.au> wrote:
> 
> The question is - who has the job of declaring what the "original URI" is for existing specimens that already have a history? And what should that URI be? Perhaps this is where GBIF-issued ids become important. Or perhaps we could ditch the idea of "original URI", and just track the "GBIF URI". It's the responsibility of anyone with a specimen that does not already have a GBIF URI to get one for it.
> 
> 

---------------------------------------------------------
Roderic Page
Professor of Taxonomy
Institute of Biodiversity, Animal Health and Comparative Medicine
College of Medical, Veterinary and Life Sciences
Graham Kerr Building
University of Glasgow
Glasgow G12 8QQ, UK

Email: r.page at bio.gla.ac.uk
Tel: +44 141 330 4778
Fax: +44 141 330 2792
AIM: rodpage1962 at aim.com
Facebook: http://www.facebook.com/profile.php?id=1112517192
Twitter: http://twitter.com/rdmpage
Blog: http://iphylo.blogspot.com
Home page: http://taxonomy.zoology.gla.ac.uk/rod/rod.html
