[tdwg-content] delimiter characters for concatenated IDs

Richard Pyle deepreef at bishopmuseum.org
Tue May 6 19:31:42 CEST 2014

Hi Tim,


What you outline below (a service that issues community-wide actionable and
persistent identifiers) is what I have been advocating since before the
first TDWG/GBIF GUID workshops.  I still believe it would be a very useful
service; however, it seems we are really talking about two different things.


The first, which is what you outline, is a service that mints new
identifiers for objects that do not already have good identifiers.


The other service, which is what I was addressing earlier, is a centralized
mechanism for cross-linking *existing* identifiers to each other.  Several
organizations already have an internal system for doing this.  I already
mentioned the GNUB version (which allows the little icons of related records
to show up on ZooBank pages).  EoL also has one (e.g.:
http://eol.org/pages/992573/resources/partner_links).  So does NCBI
(http://www.ncbi.nlm.nih.gov/projects/linkout/).  What I think we REALLY
need is a single, centralized system that manages cross-links among
identifiers and (separately) identifier dereferencing services.  The reality
is that we already have MANY identifiers minted for the same object (e.g.:
http://zoobank.org/2C6327E1-5560-4DB4-B9CA-76A0FA03D975) – and, sadly, there
will no doubt be more redundant identifiers minted in the future.
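
A minimal sketch of what such a cross-linking registry might look like,
assuming that "same object" links are symmetric and transitive so that
identifiers can be grouped with a union-find structure. The class name
CrossLinkRegistry and the two "urn:example:" identifiers are hypothetical;
only the ZooBank URL comes from this message.

```python
class CrossLinkRegistry:
    """Groups identifiers asserted to denote the same object (union-find)."""

    def __init__(self):
        self._parent = {}  # identifier -> parent identifier in its group

    def _find(self, ident):
        # Unknown identifiers start out as their own one-member group.
        self._parent.setdefault(ident, ident)
        # Walk up to the group root, shortening the path as we go.
        while self._parent[ident] != ident:
            self._parent[ident] = self._parent[self._parent[ident]]
            ident = self._parent[ident]
        return ident

    def link(self, a, b):
        """Record that identifiers a and b refer to the same object."""
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def equivalents(self, ident):
        """Return every identifier known to denote the same object."""
        root = self._find(ident)
        return sorted(i for i in list(self._parent) if self._find(i) == root)


registry = CrossLinkRegistry()
zoobank = "http://zoobank.org/2C6327E1-5560-4DB4-B9CA-76A0FA03D975"
registry.link(zoobank, "urn:example:eol:record-1")      # hypothetical ID
registry.link("urn:example:eol:record-1", "urn:example:ncbi:record-2")
print(registry.equivalents("urn:example:ncbi:record-2"))
```

Note that the registry only stores links among identifiers; it says
nothing about how any one of them is dereferenced, which keeps the two
roles separate.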


While I think it would be GREAT if GBIF did offer a service to mint
proper/actionable identifiers for the community, ultimately this may end up
representing “yet another identifier”.  It kind of reminds me of that
standards joke: http://www.howtogeek.com/geekers/up/sshot50509a6b8cb11.jpg
Just replace the word “standards” with “identifiers”, and we’re in the same


And I’ll make this point one more time:  Almost all of our problems with
identifiers revolve around the fact that we have drunk the TBL Kool-Aid and
assumed/insisted that we conflate the role of object identification with the
role of metadata dereferencing (i.e., “actionability”).  Sigh






Richard L. Pyle, PhD
Database Coordinator for Natural Sciences
Associate Zoologist in Ichthyology
Dive Safety Officer
Department of Natural Sciences, Bishop Museum
1525 Bernice St., Honolulu, HI 96817
Ph: (808)848-4115, Fax: (808)847-8252
email: deepreef at bishopmuseum.org





From: tdwg-content-bounces at lists.tdwg.org
[mailto:tdwg-content-bounces at lists.tdwg.org] On Behalf Of Tim Robertson
Sent: Tuesday, May 06, 2014 4:11 AM
To: Hilmar Lapp; Roderic Page; tdwg-content at lists.tdwg.org; "Markus Döring
(GBIF)"; tomc at cs.uoregon.edu
Subject: Re: [tdwg-content] delimiter characters for concatenated IDs


Hi all,


Supposing GBIF or some other body were interested in offering such a central
service as proposed in this thread.  Can we articulate what we envisage
would be the process?  


  a) client has a specimen record they wish to stamp with an identifier

  b) client requests DOI (or other format) from the issuing service and
provides the minimum metadata in a DwC-esque profile, potentially with a
preferred suffix  

  c) service provides identifier, and client stores this along with their
digital record

  d) from this point on, the DOI identifies the record
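
Steps a) to d) above can be sketched roughly as follows. This is only an
illustration under stated assumptions: the prefix "10.99999", the list of
required terms, and the names MintingService, mint and resolve are all
hypothetical, and the sketch assumes the client supplies a redirect target
URL at minting time.

```python
import uuid

# Assumed minimum metadata profile (Darwin Core term names, chosen for
# illustration only).
REQUIRED_TERMS = ("institutionCode", "catalogNumber", "scientificName")
PREFIX = "10.99999"  # hypothetical DOI prefix


class MintingService:
    """In-memory stand-in for a central identifier-issuing service."""

    def __init__(self):
        self._records = {}  # identifier -> (metadata, target URL)

    def mint(self, metadata, target_url, preferred_suffix=None):
        # Step b: reject requests that lack the minimum metadata.
        missing = [t for t in REQUIRED_TERMS if not metadata.get(t)]
        if missing:
            raise ValueError(f"missing required terms: {missing}")
        # Use the client's preferred suffix, or generate a random one.
        suffix = preferred_suffix or uuid.uuid4().hex[:10]
        identifier = f"{PREFIX}/{suffix}"
        if identifier in self._records:
            raise ValueError(f"suffix already taken: {suffix}")
        # Step c: the service keeps the metadata and returns the identifier.
        self._records[identifier] = (dict(metadata), target_url)
        return identifier

    def resolve(self, identifier):
        """Return the redirect target registered for an identifier."""
        return self._records[identifier][1]


service = MintingService()
record = {
    "institutionCode": "UAM",
    "catalogNumber": "230092",
    "scientificName": "Grylloblatta campodeiformis",
}
doi = service.mint(record, "http://example.org/specimen/230092", "abc123")
print(doi, "->", service.resolve(doi))
```

Nothing in this sketch prevents the same record from being minted twice
under different suffixes.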


If so, what would happen on resolution?  Does the client provide the target
URL during minting which will be the redirection target on resolution?  Does
the service have to monitor the availability and return the cached copy on


In such a model, effectively we would have a central specimen registration
service where data owners push individual specimen records.  Is that
something we envisage the community would accept?  Presumably the minimum
metadata would include things like dwc:scientificName - would someone
register a DOI pushing that for specimens of a new species before they have
published on the name?  


This model will not in itself stop duplicate IDs.  The scientists assembling
datasets of specimens referenced in a paper might submit those referenced
specimens for DOIs, while the original specimen curators might also submit
the same records - thus the specimen is identified twice.  Which piece of
the infrastructure would capture that relationship?


What seems most important to me when I think this through is that the
identifier needs to be minted as early on as possible in the record life -
before it is shared with others.  Which leads us back to the question of
whether we envisage people adopting a model where they effectively submit
their record data in order to get an identifier.  If not, at least if we got
stable IDs on records in whatever form, we can manage the resolvability bit
later, and identify duplicates.


It would be interesting to hear how others imagine such a service operating.









On 06 May 2014, at 15:34, Hilmar Lapp <hlapp at nescent.org> wrote:

Every registration agency has its own set of standard metadata which members
register for every DOI, but the content-negotiation strategy does allow for
a richer metadata response. By default it is the registration agency's
resolver that responds with RDF (and thus only with the metadata it knows
of), but members (the entities registering DOIs) can register their own
content-negotiation resolver, which would allow them to return richer
metadata. We have, for example, considered doing this for Dryad
(http://datadryad.org), but it hasn't risen to
high-enough priority yet.


Hence, if GBIF were to register DOIs for specimens through DataCite (rather
than being its own RA), then GBIF could still operate its own resolver for
returning DwC metadata for RDF queries.


That doesn't mean there couldn't still be good arguments for GBIF serving as
a RA.



On Tue, May 6, 2014 at 5:53 AM, Roderic Page <Roderic.Page at glasgow.ac.uk>
wrote:

Hi Steve,


My understanding is that the non-HTML content is decided at the level of
registration agency. For a bibliographic DOI registered with CrossRef, the
HTML redirect goes to whatever the publisher provides CrossRef (e.g., the
article landing page), other content (including RDF) is served by CrossRef
based on the metadata they hold for each article. Likewise, DataCite will
serve metadata based on what they have. Hence, metadata from CrossRef and
DataCite look rather different.


So, this is something that would need to be worked out at the level of
registration agency (see [link missing] and [link missing] for
background).


Hence, if GBIF were to be a DOI registration agency they could serve Darwin
Core RDF (and JSON and whatever else they want). This is a strong argument
for GBIF doing this, rather than using DataCite (which serves very generic






On 6 May 2014, at 01:42, Steve Baskauf <steve.baskauf at vanderbilt.edu> wrote:

I'm a big fan of not reinventing the wheel, and as such find the idea of
using DOIs appealing.  I think they pretty much follow all of the "rules"
set out in the TDWG GUID Applicability Standard.   They also play nicely in
the Linked Data universe in their HTTP URI form, i.e. they redirect to HTML
or RDF depending on the request header.  

But I have a question for someone who understands how DOIs work better than
I do.  The HTML representation seems to arise by redirection to whatever is
the current web page  for the resource.  You can see this by pasting this
DOI for a specimen into a browser: http://dx.doi.org/10.7299/X7VQ32SJ which
redirects to http://arctos.database.museum/guid/UAM:Ento:230092 when HTML is
requested by a client.  However, when the client requests RDF, one gets
redirected to a DataCite metadata page:
http://data.datacite.org/10.7299/X7VQ32SJ .  Can the creator of the DOI
redirect to any desired URI for the RDF?  
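
The two requests described above differ only in their Accept header. A
minimal sketch using Python's standard library, which builds the requests
without sending them; the helper name build_request and the choice of
"text/turtle" as the RDF media type are assumptions made for illustration.

```python
from urllib.request import Request

DOI_RESOLVER = "http://dx.doi.org/"  # resolver URL as quoted above


def build_request(doi, media_type):
    """Return an unsent request asking the resolver for one representation."""
    return Request(DOI_RESOLVER + doi, headers={"Accept": media_type})


# Same identifier, two different representations requested.
html_request = build_request("10.7299/X7VQ32SJ", "text/html")
rdf_request = build_request("10.7299/X7VQ32SJ", "text/turtle")

print(html_request.full_url, html_request.get_header("Accept"))
print(rdf_request.full_url, rdf_request.get_header("Accept"))
```

Passing either request to urllib.request.urlopen would follow the
redirects, and the response's url attribute would show where each one
ends up. Where they land today has not been checked here.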

The resulting RDF metadata doesn't have any of the kind of useful information
about the specimen that you get on the web page but rather looks like what
you would expect for a publication (creator, publisher, date, etc.):

[predicate URIs lost in archiving; only the literal values survive]
"Derek S. Sikes" ;
"2004" ;
"10.7299/X7VQ32SJ" ;
"University of Alaska Museum" ;
"UAM:Ento:230092 - Grylloblatta campodeiformis" ;
"info:doi/10.7299/X7VQ32SJ" , "doi:10.7299/X7VQ32SJ" .

Can one control what kinds of metadata are provided in "DataCite's
metadata"? Assuming that we get our act together and adopt an RDF guide for
Darwin Core, it would be nice for the RDF metadata to look more like the
description of a specimen and less like the description of a book.  But
maybe that's just a function of where the data provider chooses to redirect
RDF requests.


John Deck wrote: 

 +1 on DOIs, and on ARKs (see: https://wiki.ucop.edu/display/Curation/ARK),
and also I'll mention IGSNs (see http://www.geosamples.org/). IGSN is
rapidly gaining traction for geo-samples.  I don't know of anyone using them
for bio-samples but they offer many features that we've been asking for as
well.  What our community considers a sample (or observation) is diverse
enough that multiple ID systems are probably inevitable and perhaps even


Whatever the ID system, the data providers (museums, field researchers,
labs, etc..) must adopt that identifier and use it whenever linking to
downstream sequence, image, and sub-sampling repository agencies. This is
great to say in theory but difficult to do in reality because the
decision to adopt long term and stable identifiers is often an institutional
one, and the technology is still new and argued about, in particular, on
this fine list.  Further, those agencies that receive data associated with a
GUID must honor that source GUID when passing to consumers and other
aggregators, who must also have some level of confidence in the source GUIDs
as well.   Thus, a primary issue that we're confronted with here is trust.


Having Hilmar's hackathon support several possible GUID schemes (each with
their own long term persistence strategy), and sponsored by a well known
global institution affiliated with biodiversity informatics that could offer
technical guidance to data providers, good name branding, and the nuts and
bolts expertise to demonstrate good shepherding of source GUIDs through a
data aggregation chain would be ideal.  I nominate GBIF :)


John Deck


On Mon, May 5, 2014 at 1:09 PM, Roderic Page <r.page at bio.gla.ac.uk> wrote:

Hi Markus, 


I have three  use cases that


1. Linking sequences in GenBank to voucher specimens. Lots of voucher
specimens are listed in GenBank but not linked to digital records for those
specimens. These links are useful in two directions, one is to link GBIF to
genomic data, the second is to enhance data in both databases, see
http://iphylo.blogspot.co.uk/2012/02/linking-gbif-and-genbank.html (e.g., by
adding missing georeferencing that is available in one database but not the


2. Linking to specimens cited in the literature. I’ve done some work on this
in BioStor, see [link missing].  One immediate benefit of this is that GBIF
could display the
scientific literature associated with a specimen, so we get access to the
evidence supporting identification, georeferencing, etc. 


3. Citation metrics for collections, see [link missing] and [link missing].
Based on citations of specimens in the literature, and in databases such
as GenBank (i.e., basically combining 1 + 2 above) we can demonstrate the
value of a collection.


All of these use cases depend on GBIF occurrenceIds remaining stable; I have
often ranted on iPhylo when this doesn’t happen:








On 5 May 2014, at 20:51, Markus Döring <mdoering at gbif.org> wrote:

Hi Rod, 


I agree GBIF has trouble keeping identifiers stable for *some* records, but
in general we do a much better job than the original publishers in the first
place. We try hard to keep GBIF ids stable even if publishers change
collection codes, register datasets twice or do other things to break a
simple automated way of mapping source records to existing GBIF ids. Also
the stable identifier in GBIF has never been the URL, but the local
GBIF integer alone. The GBIF services that consume those ids have changed
over the years, but it's pretty trivial to adjust if you use the GBIF ids
instead of the URLs. If there is a clear need to have stable URLs instead I
am sure we can get that working easily.


The two real issues for GBIF are a) duplicates and b) records with varying
local identifiers of any sort (triplet, occurrenceID or whatever else).


When it comes to the varying source identifiers I always liked the idea of
flagging those records and datasets as unstable, so it is obvious to users.
This is not 100% safe, but most terrible datasets change all of their ids
and that is easily detectable.

Also with a service like that it would become more obvious to publishers how
important stable source ids are.
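
One way the flagging idea might be sketched: compare the source
identifiers seen in two successive snapshots of a dataset and flag the
dataset when too few of the earlier ones survive. The function names and
the 0.9 threshold are assumptions for illustration, not anything GBIF is
known to use.

```python
def id_retention(previous_ids, current_ids):
    """Fraction of the previous snapshot's ids still present in the new one."""
    previous = set(previous_ids)
    if not previous:
        return 1.0  # nothing to lose, so treat as fully retained
    return len(previous & set(current_ids)) / len(previous)


def is_unstable(previous_ids, current_ids, threshold=0.9):
    """Flag a dataset whose id retention falls below the threshold."""
    return id_retention(previous_ids, current_ids) < threshold


old_snapshot = ["occ-1", "occ-2", "occ-3", "occ-4"]
new_snapshot = ["occ-1", "occ-2", "rec-A", "rec-B"]
print(id_retention(old_snapshot, new_snapshot))
print(is_unstable(old_snapshot, new_snapshot))
```

Newly added records do not count against a dataset here; only the
disappearance of previously published ids does.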


Before jumping on DOIs as the next big thing I would really like to
understand what needs the community has around specimen ids.

Gabi clearly has a very real use case, are there others we know about?








On 05 May 2014, at 21:05, Roderic Page <r.page at bio.gla.ac.uk> wrote:

Hi Hilmar, 


I’m not arguing that we shouldn’t build a resolver (I have one that I use,
Rich has mentioned he’s got one, Markus has one at GBIF, etc.).


Nor do I think we should wait for institutional and social commitment
(because then we’d never get anything done).


But I do think it would be useful to think it through. For example, it’s
easy to create a URL for a specimen. Easy peasy. OK, how do I discover that
URL? How do I discover these for all specimens? Sounds like I need a
centralised discovery service like you’ve described.


How do I handle changes in those URLs? I built a specimen code to GBIF
resolver for BioStor so that I could link to specimens, GBIF changed lots of
those URLs, all my work was undone, boy does GBIF suck sometimes. For
example, if I map codes to URLs, I need to handle cases when they change. 


If URLs can change, is there a way to defend against that (this is one
reason for DOIs, or other methods of indirection, such as PURLs)?


If providers change, will the URLs change? Is there a way to defend against
that (again, DOIs handle this nicely by virtue of (a) indirection, and (b)
lack of branding)?
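
The indirection point can be illustrated with a very small sketch: links
cite a stable identifier, and only the table behind it is edited when a
provider's URL changes. The class name Redirector and the "example.org"
URL are hypothetical; the DOI and the Arctos URL are the ones quoted
elsewhere in this thread.

```python
class Redirector:
    """Maps stable identifiers to whatever URL currently serves the record."""

    def __init__(self):
        self._targets = {}  # stable identifier -> current URL

    def register(self, stable_id, url):
        if stable_id in self._targets:
            raise ValueError(f"already registered: {stable_id}")
        self._targets[stable_id] = url

    def retarget(self, stable_id, new_url):
        """Point an existing identifier at a new location."""
        if stable_id not in self._targets:
            raise KeyError(stable_id)
        self._targets[stable_id] = new_url

    def resolve(self, stable_id):
        return self._targets[stable_id]


redirector = Redirector()
redirector.register(
    "10.7299/X7VQ32SJ",
    "http://arctos.database.museum/guid/UAM:Ento:230092",
)
# The provider moves its pages (hypothetical new URL); citations of the
# stable identifier do not need to change.
redirector.retarget("10.7299/X7VQ32SJ", "http://example.org/UAM:Ento:230092")
print(redirector.resolve("10.7299/X7VQ32SJ"))
```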


How can I encourage people to use the specimen service? What can I do to
make them think it will persist? Can I convince academic publishers to trust
it enough to link to it in articles? What’s the pitch to Pensoft, to
Magnolia Press, to Springer and Elsevier?


Is there some way to make the service itself become trusted? For example if
I look at a journal and see that it has DOIs issued by CrossRef, I take that
journal more seriously than if it’s just got simple URLs. I know that papers
in that journal will be linked into the citation network, I also know that
there is a backup plan if the journal goes under (because you need that to
have DOIs in CrossRef). Likewise, I think Figshare got a big boost when it
started minting DOIs (wow, a DOI, I know DOIs, you mean I can now cite stuff
I’ve uploaded there?). 


How can museums and herbaria be persuaded to keep their identifiers stable?
What incentives can we provide (e.g., citation metrics for collections)?
What system would enable us to do this? What about tracing funding (e.g.,
the NSF paid for these n papers, and they cite these y specimens, from these
z collections, so science paid for by the NSF requires these collections to


I guess I’m arguing that we should think all this through, because a
specimen code to specimen URL is a small piece of the puzzle. Now, I’m
desperately trying not to simply say what I think is blindingly obvious here
(put DOIs on specimens, add metadata to specimen and specimen citation
services, and we are done), but I think if we sit back and look at where we
want to be, this is exactly what we need (or something functionally
equivalent). Until we see the bigger picture, we will be stuck in amateur


Take  a look at:


http://search.crossref.org





Isn’t this the kind of stuff we’d like to do? If so, let’s work out what’s
needed and make it happen.


In short, I think we constantly solve an immediate problem in the quickest
way we know how, without thinking it through. I’d argue that if we think
about the bigger picture (what do we want to be able to do, what are the
questions we want to be able to ask) then things become clearer. This is
independent of getting everyone’s agreement (but it would help if we made
their agreement seem a no-brainer by providing solutions to things that
cause them pain).







On 5 May 2014, at 19:14, Hilmar Lapp <hlapp at nescent.org> wrote:


On Mon, May 5, 2014 at 1:29 PM, Roderic Page <r.page at bio.gla.ac.uk> wrote:

Contrary to Hilmar, there is more to this than simply a quick hackathon.
Yes, a service that takes metadata and returns one or more identifiers is a
good idea and easy to create (there will often be more than one because
museum codes are not unique). But who maintains this service? Who maintains
the identifiers? Who do I complain to if they break? How do we ensure that
they persist when, say, a museum closes down, moves its collection, changes
its web technology? Who provides the tools that add value to the
identifiers? (there’s no point having them if they are not useful)


Jonathan Rees pointed this out to me too off-list. Just for the record, this
isn't contrary but fully in line with what I was saying (or trying to say).
Yes, I didn't elaborate that part, assuming, perhaps rather erroneously,
that all this goes without saying, but I did mention that one part of this
becoming a real solution has to be an institution with an in-scope
cyberinfrastructure mandate that going in would make a commitment to sustain
the resolver, including working with partners on the above slew of
questions. The institution I gave was iDigBio; perhaps for some reason that
would not be a good choice, but whether they are or not wasn't my point.


I will add one point to this, though. It seems to me that by continuing to
argue that we can't go ahead with building a resolver that works (as far as
technical requirements are concerned) before we have first fully
addressed the institutional and social long-term sustainability commitment
problem, we are and have been making this one big hairy problem that we
can't make any practical pragmatic headway about, rather than breaking it
down into parts, some of which (namely the primarily technical ones) are
actually fairly straightforward to solve. As a result, to this day we don't
have some solution that even though it's not very sustainable yet, at least
proves to everyone how critical it is, and that the community can rally
behind. Perhaps that's naïve, but I do think that once there's a solution
the community rallies behind, ways to sustain it will be found. 




Hilmar Lapp -:- informatics.nescent.org/wiki -:- lappland.io





tdwg-content mailing list
tdwg-content at lists.tdwg.org


John Deck
(541) 321-0689

Steven J. Baskauf, Ph.D., Senior Lecturer
Vanderbilt University Dept. of Biological Sciences
postal mail address:
PMB 351634
Nashville, TN  37235-1634,  U.S.A.
delivery address:
2125 Stevenson Center
1161 21st Ave., S.
Nashville, TN 37235
office: 2128 Stevenson Center
phone: (615) 343-4582,  fax: (615) 322-4942
If you fax, please phone or email so that I will know to look for it.
http://bioimages.vanderbilt.edu
