[tdwg-content] delimiter characters for concatenated IDs

Tue May 6 15:39:41 CEST 2014

This illustrates the problems of constructing identifiers from metadata.

The “Darwin Core Triple” comes AFAIK from US vertebrate collections, where specimens are typically identified by acronym + catalogue number. This is often not unique (the same acronym + catalogue number combination such as FMNH 266214 may be used for a frog, a bird, a mollusc, or a spider within the same museum). Hence we add a “collection code” to make them distinct. Unfortunately, these are rarely used outside museum databases and GBIF (hardly any papers that cite specimens use the three-part codes). If you’re a zoologist seeing, say, FMNH 266214, you “know” which specimen is being referred to by the taxonomic context.
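
To make the ambiguity concrete, here is a minimal Python sketch. The collection codes ("Amphibians", "Birds") and the ":" delimiter are assumptions for illustration only, not values taken from any real FMNH dataset or from a Darwin Core recommendation.

```python
from collections import defaultdict

# Hypothetical records: (institution_code, collection_code, catalogue_number).
# The collection codes are invented for illustration.
records = [
    ("FMNH", "Amphibians", "266214"),
    ("FMNH", "Birds", "266214"),
]


def two_part_code(record):
    """Acronym + catalogue number, as typically cited in papers."""
    institution, _collection, number = record
    return f"{institution} {number}"


def triple(record, delimiter=":"):
    """Three-part code; the delimiter is an assumed convention."""
    return delimiter.join(record)


# Group the three-part codes by the two-part code that papers would cite.
by_short_code = defaultdict(list)
for record in records:
    by_short_code[two_part_code(record)].append(triple(record))

# Two-part codes that point at more than one specimen.
ambiguous = {code: triples for code, triples in by_short_code.items() if len(triples) > 1}
```

Here `ambiguous` maps the cited code "FMNH 266214" to both three-part codes, which is the situation the collection code is meant to resolve.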

My understanding of herbaria is that there are (like zoological collections) long-standing abbreviations (see https://en.wikipedia.org/wiki/List_of_herbaria#Europe ) so if you’re a botanist then B100094759 tells you that the herbarium specimen comes from Berlin. Hence B100094759 is enough (and is not, as such, an “odd acronym”). Roger Hyam’s proposal http://stories.rbge.org.uk/archives/1284 to make “cool URIs” out of these (e.g., http://data.rbge.org.uk/herb/E00421509 ) is based on this idea. Within that collection the catalogue number is unique (zoologists are rarely so lucky).

So, do we now recreate these URIs using Darwin Core Triples? Does B or BGBM identify a specimen from Berlin? If the prefix “B” is enough to identify a plant specimen as coming from Berlin, why do we then add “BGBM”? Would botanists used to citing B100094759 in their papers and floras recognise something like BGBM: B100094759 (which is somewhat redundant)?

In short, yuck.

Regards

Rod

---------------------------------------------------------
Roderic Page
Professor of Taxonomy
Institute of Biodiversity, Animal Health and Comparative Medicine
College of Medical, Veterinary and Life Sciences
Graham Kerr Building
University of Glasgow
Glasgow G12 8QQ, UK

Email:  r.page at bio.gla.ac.uk
Tel:  +44 141 330 4778
Fax:  +44 141 330 2792
Skype:  rdmpage
Facebook:  http://www.facebook.com/rdmpage
LinkedIn:  http://uk.linkedin.com/in/rdmpage
Twitter:  http://twitter.com/rdmpage
Blog:   http://en.wikipedia.org/wiki/Roderic_D._M._Page
Citations:  http://scholar.google.co.uk/citations?hl=en&user=4Z5WABAAAAAJ
ORCID:  http://orcid.org/0000-0002-7101-9767

On 6 May 2014, at 14:16, Hilmar Lapp <hlapp at nescent.org> wrote:

Hi Gabi,

That's indeed an odd acronym for BGBM. Have you tried to edit the record so it uses BGBM instead? It seems editable (though presumably edits go through approval), and GRBio expressly solicits the community to help with curating the accuracy of their records.

  -hilmar

On Tue, May 6, 2014 at 5:11 AM, "Dröge, Gabriele" <g.droege at bgbm.org> wrote:
Hi Rod,

Those are excellent use cases that GGBN also aims to support.
I just have a comment on the first one. NCBI and CBOL defined a mechanism to add voucher ids to a GenBank record, which is great. But unfortunately they use different values than GBIF does. Their institution code is based on GRBio (http://grbio.org/). So e.g. my institution is listed there with “B”, but we use “BGBM” for GBIF records and so do many institutions. The same happens for Catalogue Number. Therefore the idea, good in principle, does not work for dynamic linkage to GBIF records.
One of the next steps on my to-do list is to talk to CBOL and GenBank about this issue, and additionally to establish dynamic linkage between the sequence record and the corresponding DNA and tissue samples at GGBN.
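
A minimal Python sketch of the mismatch, assuming a hand-maintained alias table. Only the B/BGBM pair comes from this message; the table name and functions are hypothetical and do not correspond to any GRBio, GenBank, or GBIF API.

```python
# Hypothetical alias table: maps a GRBio-style institution code to the code
# the same institution uses in its GBIF records. Only the B/BGBM pair is
# taken from the message; a real table would need community curation.
GRBIO_TO_GBIF = {"B": "BGBM"}


def to_gbif_institution_code(code: str) -> str:
    """Return the GBIF-style code for a GRBio-style code, if an alias is known."""
    code = code.strip()
    return GRBIO_TO_GBIF.get(code, code)


def same_institution(genbank_code: str, gbif_code: str) -> bool:
    """Compare institution codes after normalising the GenBank side."""
    return to_gbif_institution_code(genbank_code) == gbif_code.strip()
```

Without such a normalisation step, a plain string comparison of "B" and "BGBM" fails even though both refer to the same institution.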

For my current problem I guess the short-term solution is to define GGBN conventions on delimiters and switch to a better solution once one is available. We would be happy to be part of those discussions and to help.

Best,
Gabi

From: tdwg-content-bounces at lists.tdwg.org [mailto:tdwg-content-bounces at lists.tdwg.org] On behalf of Roderic Page
Sent: Monday, 5 May 2014 22:10
To: Markus Döring
Cc: tdwg-content at lists.tdwg.org; Miller, Chuck; tomc at cs.uoregon.edu; John Deck
Subject: Re: [tdwg-content] delimiter characters for concatenated IDs

Hi Markus,

I have three use cases:

1. Linking sequences in GenBank to voucher specimens. Lots of voucher specimens are listed in GenBank but not linked to digital records for those specimens. These links are useful in two directions, one is to link GBIF to genomic data, the second is to enhance data in both databases, see http://iphylo.blogspot.co.uk/2012/02/linking-gbif-and-genbank.html (e.g., by adding missing georeferencing that is available in one database but not the other).

2. Linking to specimens cited in the literature. I’ve done some work on this in BioStor, see http://iphylo.blogspot.co.uk/2012/02/linking-gbif-and-biodiversity-heritage.html  One immediate benefit of this is that GBIF could display the scientific literature associated with a specimen, so we get access to the evidence supporting identification, georeferencing, etc.

3. Citation metrics for collections, see http://iphylo.blogspot.co.uk/2013/05/the-impact-of-museum-collections-one.html and http://iphylo.blogspot.co.uk/2012/02/gbif-specimens-in-biostor-who-are-top.html  Based on citations of specimens in the literature, and in databases such as GenBank (i.e., basically combining 1 + 2 above) we can demonstrate the value of a collection.

All of these use cases depend on GBIF occurrenceIds remaining stable. I have often ranted on iPhylo when this doesn’t happen: http://iphylo.blogspot.co.uk/2012/07/dear-gbif-please-stop-changing.html

Regards

Rod

On 5 May 2014, at 20:51, Markus Döring <mdoering at gbif.org> wrote:

Hi Rod,

I agree GBIF has trouble keeping identifiers stable for *some* records, but in general we do a much better job than the original publishers in the first place. We try hard to keep GBIF ids stable even if publishers change collection codes, register datasets twice or do other things to break a simple automated way of mapping source records to existing GBIF ids. Also, the stable identifier in GBIF has never been the URL; it is the local GBIF integer alone. The GBIF services that consume those ids have changed over the years, but it's pretty trivial to adjust if you use the GBIF ids instead of the URLs. If there is a clear need to have stable URLs instead I am sure we can get that working easily.

The two real issues for GBIF are a) duplicates and b) records with varying local identifiers of any sort (triplet, occurrenceID or whatever else).

When it comes to the varying source identifiers I always liked the idea of flagging those records and datasets as unstable, so it is obvious to users. This is not 100% safe, but most terrible datasets change all of their ids and that is easily detectable.
Also with a service like that it would become more obvious to publishers how important stable source ids are.
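
One way such a flag could be computed, sketched in Python. This is not how GBIF does it; the function names and the 0.5 threshold are assumptions chosen only to illustrate the idea of comparing source ids between two versions of a dataset.

```python
def retained_fraction(previous_ids, current_ids):
    """Fraction of the previous source ids still present in the current version."""
    previous = set(previous_ids)
    if not previous:
        # Nothing to lose, so treat an empty previous version as fully retained.
        return 1.0
    return len(previous & set(current_ids)) / len(previous)


def is_unstable(previous_ids, current_ids, threshold=0.5):
    """Flag a dataset when fewer than `threshold` of its ids survived (assumed cutoff)."""
    return retained_fraction(previous_ids, current_ids) < threshold
```

A dataset that replaces all of its ids between versions has a retained fraction of 0.0 and is flagged, while one that changes a single id out of many is not.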

Before jumping on DOIs as the next big thing I would really like to understand what needs the community has around specimen ids.
Gabi clearly has a very real use case, are there others we know about?

Markus

On 05 May 2014, at 21:05, Roderic Page <r.page at bio.gla.ac.uk> wrote:

Hi Hilmar,

I’m not arguing that we shouldn’t build a resolver (I have one that I use, Rich has mentioned he’s got one, Markus has one at GBIF, etc.).

Nor do I think we should wait for institutional and social commitment (because then we’d never get anything done).

But I do think it would be useful to think it through. For example, it’s easy to create a URL for a specimen. Easy peasy. OK, how do I discover that URL? How do I discover these for all specimens? Sounds like I need a centralised discovery service like you’ve described.

How do I handle changes in those URLs? I built a specimen code to GBIF resolver for BioStor so that I could link to specimens, GBIF changed lots of those URLs, all my work was undone, boy does GBIF suck sometimes. For example, if I map codes to URLs, I need to handle cases when they change.

If URLs can change, is there a way to defend against that? (This is one reason for DOIs, or other methods of indirection, such as PURLs.)

If providers change, will the URLs change? Is there a way to defend against that? (Again, DOIs handle this nicely by virtue of (a) indirection, and (b) lack of branding.)
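
The indirection idea can be sketched in a few lines of Python. This is a toy in-memory model, not how DOIs or PURLs are actually implemented; the class, the persistent id format, and the example.org URLs are all hypothetical.

```python
class Resolver:
    """Toy resolver: specimen code -> persistent id -> current URL."""

    def __init__(self):
        self._id_for_code = {}  # citations and links depend on this mapping
        self._url_for_id = {}   # only this mapping changes when a provider moves

    def register(self, code, persistent_id, url):
        self._id_for_code[code] = persistent_id
        self._url_for_id[persistent_id] = url

    def move(self, persistent_id, new_url):
        """Record a new location without touching any code-to-id mapping."""
        if persistent_id not in self._url_for_id:
            raise KeyError(persistent_id)
        self._url_for_id[persistent_id] = new_url

    def resolve(self, code):
        return self._url_for_id[self._id_for_code[code]]


resolver = Resolver()
resolver.register("E00421509", "spec-0001", "http://old.example.org/herb/E00421509")
resolver.move("spec-0001", "http://new.example.org/specimen/E00421509")
current_url = resolver.resolve("E00421509")
```

Anything that stored the code or the persistent id keeps working after the move; only links that stored the old URL directly break, which is the failure described above.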

How can I encourage people to use the specimen service? What can I do to make them think it will persist? Can I convince academic publishers to trust it enough to link to it in articles? What’s the pitch to Pensoft, to Magnolia Press, to Springer and Elsevier?

Is there some way to make the service itself become trusted? For example if I look at a journal and see that it has DOIs issued by CrossRef, I take that journal more seriously than if it’s just got simple URLs. I know that papers in that journal will be linked into the citation network, I also know that there is a backup plan if the journal goes under (because you need that to have DOIs in CrossRef). Likewise, I think Figshare got a big boost when it started minting DOIs (wow, a DOI, I know DOIs, you mean I can now cite stuff I’ve uploaded there?).

How can museums and herbaria be persuaded to keep their identifiers stable? What incentives can we provide (e.g., citation metrics for collections)? What system would enable us to do this? What about tracing funding (e.g., the NSF paid for these n papers, and they cite these y specimens, from these z collections, so science paid for by the NSF requires these collections to exist).

I guess I’m arguing that we should think all this through, because a specimen code to specimen URL is a small piece of the puzzle. Now, I’m desperately trying not to simply say what I think is blindingly obvious here (put DOIs on specimens, add metadata to specimen and specimen citation services, and we are done), but I think if we sit back and look at where we want to be, this is exactly what we need (or something functionally equivalent). Until we see the bigger picture, we will be stuck in amateur hour.

Take a look at:

http://search.crossref.org
http://www.crossref.org/fundref/
http://support.crossref.org/
https://prospect.crossref.org/splash/

Isn’t this the kind of stuff we’d like to do? If so, let’s work out what’s needed and make it happen.

In short, I think we constantly solve an immediate problem in the quickest way we know how, without thinking it through. I’d argue that if we think about the bigger picture (what do we want to be able to do, what are the questions we want to be able to ask) then things become clearer. This is independent of getting everyone’s agreement (but it would help if we made their agreement seem a no-brainer by providing solutions to things that cause them pain).

Regards

Rod

On 5 May 2014, at 19:14, Hilmar Lapp <hlapp at nescent.org> wrote:

On Mon, May 5, 2014 at 1:29 PM, Roderic Page <r.page at bio.gla.ac.uk> wrote:
Contrary to Hilmar, there is more to this than simply a quick hackathon. Yes, a service that takes metadata and returns one or more identifiers is a good idea and easy to create (there will often be more than one because museum codes are not unique). But who maintains this service? Who maintains the identifiers? Who do I complain to if they break? How do we ensure that they persist when, say, a museum closes down, moves its collection, changes its web technology? Who provides the tools that add value to the identifiers? (There’s no point having them if they are not useful.)

Jonathan Rees pointed this out to me too off-list. Just for the record, this isn't contrary but fully in line with what I was saying (or trying to say). Yes, I didn't elaborate that part, assuming, perhaps rather erroneously, that all this goes without saying, but I did mention that one part of this becoming a real solution has to be an institution with an in-scope cyberinfrastructure mandate that going in would make a commitment to sustain the resolver, including working with partners on the above slew of questions. The institution I gave was iDigBio; perhaps for some reason that would not be a good choice, but whether they are or not wasn't my point.

I will add one point to this, though. It seems to me that by continuing to argue that we can't go ahead with building a resolver that works (as far as technical requirements are concerned) before we have fully addressed the institutional and social long-term sustainability commitment problem, we are and have been making this one big hairy problem that we can't make any practical, pragmatic headway on, rather than breaking it down into parts, some of which (namely the primarily technical ones) are actually fairly straightforward to solve. As a result, to this day we don't have a solution that, even though it's not very sustainable yet, at least proves to everyone how critical it is, and that the community can rally behind. Perhaps that's naïve, but I do think that once there's a solution the community rallies behind, ways to sustain it will be found.

  -hilmar
--
Hilmar Lapp -:- informatics.nescent.org/wiki -:- lappland.io