[tdwg-content] delimiter characters for concatenated IDs

Steve Baskauf steve.baskauf at vanderbilt.edu
Tue May 6 16:21:03 CEST 2014


I like the idea of having layers of fallback with DOIs.  If GBIF can 
provide richer metadata, great.  If not, somebody else provides minimal 
metadata.  If Linked Data and RDF implode, we still get HTML in a 
browser.  If HTTP becomes defunct, we still have a globally unique string. 
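
A rough Python sketch of those layers, for what it's worth. It only builds the requests and sends nothing over the network. The function names are made up for illustration; the resolver address, the DOI forms, and the text/turtle media type are the ones that appear elsewhere in this thread.

```python
from urllib.request import Request

# Resolver address as used in the examples in this thread.
RESOLVER = "http://dx.doi.org/"

# Prefixes under which the same DOI shows up in this thread.
KNOWN_PREFIXES = ("http://dx.doi.org/", "info:doi/", "doi:")


def bare_doi(identifier):
    """Strip a known prefix and return the bare, globally unique DOI string."""
    for prefix in KNOWN_PREFIXES:
        if identifier.startswith(prefix):
            return identifier[len(prefix):]
    return identifier


def negotiated_request(identifier, media_type):
    """Build (but do not send) a request asking the resolver for media_type."""
    return Request(RESOLVER + bare_doi(identifier),
                   headers={"Accept": media_type})


def fallback_requests(identifier):
    """Requests in order of preference: RDF (Turtle) first, then HTML."""
    return [negotiated_request(identifier, media_type)
            for media_type in ("text/turtle", "text/html")]


requests_to_try = fallback_requests("doi:10.7299/X7VQ32SJ")
```

If neither request gets a useful answer, bare_doi() still hands back the plain string, which is the last layer of fallback.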

The other thing I like about DOIs is that they must be paid for.  
Whenever somebody has money in the game, they tend to be more serious 
about their responsibilities.  However, I have never heard what the 
"cost" is per DOI?  $1?  $.10? $.01? $.00001??

With regards to John Deck's comments about assigning DOIs to other 
things like loci, I don't think we are required to solve every problem 
at once.  If we could only solve the problem of identifiers for 
specimens now, that would still be huge and facilitate progress on other 
fronts, like linking literature and specimens, or taxa and specimens.  
It would also be a lift from the gloom and doom attitude that the 
identifier issue is hopeless.  Because DOIs are team players in the 
Linked Data world, one could use different kinds of identifiers with 
other types of resources and they would still "link" perfectly fine in 
terms of both RDF and HTML.
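
As a small illustration of that kind of linking, here is a sketch that writes one N-Triples statement connecting a specimen DOI to a resource with a different kind of identifier. The sequence URI is a hypothetical placeholder, and the choice of dcterms:relation as the predicate is only an assumption made for the example, not a recommendation.

```python
def ntriple(subject, predicate, obj):
    """Format three URIs as a single N-Triples statement."""
    return "<{}> <{}> <{}> .".format(subject, predicate, obj)


# Specimen DOI in its HTTP URI form, as quoted in this thread.
SPECIMEN = "http://dx.doi.org/10.7299/X7VQ32SJ"
# Generic Dublin Core linking predicate (an assumption for this sketch).
RELATION = "http://purl.org/dc/terms/relation"
# Hypothetical non-DOI identifier for some other kind of resource.
SEQUENCE = "http://example.org/sequence/ABC123"

link = ntriple(SPECIMEN, RELATION, SEQUENCE)
```

The point is only that the subject and object need not come from the same identifier system for the statement to be valid RDF.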

Steve

Hilmar Lapp wrote:
> Every registration agency has its own set of standard metadata which 
> members register for every DOI, but the content-negotiation strategy 
> does allow for a richer metadata response. By default it is the 
> registration agency's resolver that responds with RDF (and thus only 
> with the metadata it knows of), but members (the entities registering 
> DOIs) can register their own content-negotiation resolver, which would 
> allow them to return richer metadata. We have, for example, considered 
> doing this for Dryad (http://datadryad.org), but it hasn't risen to 
> high-enough priority yet.
>
> Hence, if GBIF were to register DOIs for specimens through DataCite 
> (rather than being its own RA), then GBIF could still operate its own 
> resolver for returning DwC metadata for RDF queries.
>
> That doesn't mean there couldn't still be good arguments for GBIF 
> serving as a RA.
>
>   -hilmar
>
> On Tue, May 6, 2014 at 5:53 AM, Roderic Page 
> <Roderic.Page at glasgow.ac.uk> wrote:
>
>     Hi Steve,
>
>     My understanding is that the non-HTML content is decided at the
>     level of registration agency. For a bibliographic DOI registered
>     with CrossRef, the HTML redirect goes to whatever the publisher
>     provides CrossRef (e.g., the article landing page), other content
>     (including RDF) is served by CrossRef based on the metadata they
>     hold for each article. Likewise, DataCite will serve metadata
>     based on what they have. Hence, metadata from CrossRef and
>     DataCite look rather different.
>
>     So, this is something that would need to be worked out at the
>     level of registration agency
>     (see http://www.crossref.org/CrossTech/2010/03/dois_and_linked_data_some_conc.html
>     and http://crosstech.crossref.org/2011/04/content_negotiation_for_crossr.html
>     for background).
>
>     Hence, if GBIF were to be a DOI registration agency they could
>     serve Darwin Core RDF (and JSON and whatever else they want). This
>     is a strong argument for GBIF doing this, rather than using
>     DataCite (which serves very generic metadata).
>
>     Regards
>
>     Rod
>
>     On 6 May 2014, at 01:42, Steve Baskauf
>     <steve.baskauf at vanderbilt.edu> wrote:
>
>>     I'm a big fan of not reinventing the wheel, and as such find the
>>     idea of using DOIs appealing.  I think they pretty much follow
>>     all of the "rules" set out in the TDWG GUID Applicability
>>     Standard.   They also play nicely in the Linked Data universe in
>>     their HTTP URI form, i.e. they redirect to HTML or RDF depending
>>     on the request header. 
>>
>>     But I have a question for someone who understands how DOIs work
>>     better than I do.  The HTML representation seems to arise by
>>     redirection to whatever is the current web page  for the
>>     resource.  You can see this by pasting this DOI for a specimen
>>     into a browser: http://dx.doi.org/10.7299/X7VQ32SJ which
>>     redirects to http://arctos.database.museum/guid/UAM:Ento:230092
>>     when HTML is requested by a client.  However, when the client
>>     requests RDF, one gets redirected to a DataCite metadata page:
>>     http://data.datacite.org/10.7299/X7VQ32SJ .  Can the creator of
>>     the DOI redirect to any desired URI for the RDF? 
>>
>>     The resulting RDF metadata doesn't have any of the kind of useful
>>     information about the specimen that you get on the web page but
>>     rather looks like what you would expect for a publication
>>     (creator, publisher, date, etc.):
>>
>>     <http://dx.doi.org/10.7299/X7VQ32SJ>
>>         <http://purl.org/dc/terms/creator> "Derek S. Sikes" ;
>>         <http://purl.org/dc/terms/date> "2004" ;
>>         <http://purl.org/dc/terms/identifier> "10.7299/X7VQ32SJ" ;
>>         <http://purl.org/dc/terms/publisher> "University of Alaska Museum" ;
>>         <http://purl.org/dc/terms/title> "UAM:Ento:230092 - Grylloblatta campodeiformis" ;
>>         <http://www.w3.org/2002/07/owl#sameAs> "info:doi/10.7299/X7VQ32SJ" , "doi:10.7299/X7VQ32SJ" .
>>
>>     Can one control what kinds of metadata are provided in
>>     "DataCite's metadata"? Assuming that we get our act together and
>>     adopt an RDF guide for Darwin Core, it would be nice for the RDF
>>     metadata to look more like the description of a specimen and less
>>     like the description of a book.  But maybe that's just a function
>>     of where the data provider chooses to redirect RDF requests.
>>
>>     Steve
>>
>>     John Deck wrote:
>>>      +1 on DOIs, and on ARKs
>>>      (see: https://wiki.ucop.edu/display/Curation/ARK ), and I'll
>>>     also mention IGSNs (see http://www.geosamples.org/). IGSN is
>>>     rapidly gaining traction for geo-samples.  I don't know of
>>>     anyone using them for bio-samples but they offer many features
>>>     that we've been asking for as well.  What our community
>>>     considers a sample (or observation) is diverse enough that
>>>     multiple ID systems are probably inevitable and perhaps even
>>>     warranted.  
>>>
>>>     Whatever the ID system, the data providers (museums, field
>>>     researchers, labs, etc..) must adopt that identifier and use it
>>>     whenever linking to downstream sequence, image, and sub-sampling
>>>     repository agencies. This is great to say in theory but
>>>     difficult to do in reality because the decision to adopt long
>>>     term and stable identifiers is often an institutional one, and
>>>     the technology is still new and argued about, in particular, on
>>>     this fine list.  Further, those agencies that receive data
>>>     associated with a GUID must honor that source GUID when passing
>>>     to consumers and other aggregators, who must also have some
>>>     level of confidence in the source GUIDs as well.   Thus, a
>>>     primary issue that we're confronted with here is trust.
>>>
>>>     Having Hilmar's hackathon support several possible GUID schemes
>>>     (each with their own long term persistence strategy), and
>>>     sponsored by a well known global institution affiliated with
>>>     biodiversity informatics that could offer technical guidance to
>>>     data providers, good name branding, and the nuts and bolts
>>>     expertise to demonstrate good shepherding of source GUIDs
>>>     through a data aggregation chain would be ideal.  I nominate GBIF :)
>>>
>>>     John Deck
>>>
>>>
>>>     On Mon, May 5, 2014 at 1:09 PM, Roderic Page
>>>     <r.page at bio.gla.ac.uk> wrote:
>>>
>>>         Hi Markus,
>>>
>>>         I have three use cases:
>>>
>>>         1. Linking sequences in GenBank to voucher specimens. Lots
>>>         of voucher specimens are listed in GenBank but not linked to
>>>         digital records for those specimens. These links are useful
>>>         in two directions, one is to link GBIF to genomic data, the
>>>         second is to enhance data in both databases,
>>>         see http://iphylo.blogspot.co.uk/2012/02/linking-gbif-and-genbank.html
>>>         (e.g., by adding missing georeferencing that is available in
>>>         one database but not the other).
>>>
>>>         2. Linking to specimens cited in the literature. I’ve done
>>>         some work on this in BioStor,
>>>         see http://iphylo.blogspot.co.uk/2012/02/linking-gbif-and-biodiversity-heritage.html
>>>          One immediate benefit of this is that GBIF could display
>>>         the scientific literature associated with a specimen, so we
>>>         get access to the evidence supporting identification,
>>>         georeferencing, etc. 
>>>
>>>         3. Citation metrics for collections,
>>>         see http://iphylo.blogspot.co.uk/2013/05/the-impact-of-museum-collections-one.html
>>>         and http://iphylo.blogspot.co.uk/2012/02/gbif-specimens-in-biostor-who-are-top.html
>>>         Based on citations of specimens in the literature, and in
>>>         databases such as GenBank (i.e., basically combining 1 + 2
>>>         above) we can demonstrate the value of a collection.
>>>
>>>         All of these use cases depend on GBIF occurrenceIds remaining
>>>         stable; I have often ranted on iPhylo when this doesn’t
>>>         happen: http://iphylo.blogspot.co.uk/2012/07/dear-gbif-please-stop-changing.html
>>>
>>>         Regards
>>>
>>>         Rod
>>>
>>>
>>>
>>>         On 5 May 2014, at 20:51, Markus Döring <mdoering at gbif.org> wrote:
>>>
>>>>         Hi Rod,
>>>>
>>>>         I agree GBIF has trouble keeping identifiers stable for
>>>>         *some* records, but in general we do a much better job than
>>>>         the original publishers in the first place. We try hard to
>>>>         keep GBIF ids stable even if publishers change collection
>>>>         codes, register datasets twice or do other things to
>>>>         break a simple automated way of mapping source records to
>>>>         existing GBIF ids. Also the stable identifier in GBIF never
>>>>         has been the URL, but it is the local GBIF integer alone.
>>>>         The GBIF services that consume those ids have changed over
>>>>         the years, but it's pretty trivial to adjust if you use the
>>>>         GBIF ids instead of the URLs. If there is a clear need to
>>>>         have stable URLs instead I am sure we can get that working
>>>>         easily.
>>>>
>>>>         The two real issues for GBIF are a) duplicates and b)
>>>>         records with varying local identifiers of any sort
>>>>         (triplet, occurrenceID or whatever else).
>>>>
>>>>         When it comes to the varying source identifiers I always
>>>>         liked the idea of flagging those records and datasets as
>>>>         unstable, so it is obvious to users. This is not 100%
>>>>         safe, but most terrible datasets change all of their ids
>>>>         and that is easily detectable.
>>>>         Also with a service like that it would become more obvious
>>>>         to publishers how important stable source ids are.
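
A minimal sketch of how that kind of flagging might work, assuming we have the set of source ids from two successive harvests of the same dataset. The function names and the 0.5 threshold are hypothetical choices for illustration; nothing here describes how GBIF actually does it.

```python
def retained_fraction(old_ids, new_ids):
    """Share of previously seen source ids still present in the new harvest."""
    old = set(old_ids)
    if not old:
        # Nothing was seen before, so there is nothing that could have changed.
        return 1.0
    return len(old & set(new_ids)) / len(old)


def is_unstable(old_ids, new_ids, threshold=0.5):
    """Flag a dataset when fewer than `threshold` of its old ids survive."""
    return retained_fraction(old_ids, new_ids) < threshold


# A dataset that replaced every id between harvests gets flagged.
flagged = is_unstable(["a1", "a2", "a3"], ["b1", "b2", "b3"])
```

As noted above this is not 100% safe: a dataset that legitimately drops most of its records would be flagged too, so the flag is a hint for users rather than a verdict.
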
>>>>
>>>>         Before jumping on DOIs as the next big thing I would really
>>>>         like to understand what needs the community has around
>>>>         specimen ids.
>>>>         Gabi clearly has a very real use case, are there others we
>>>>         know about?
>>>>
>>>>
>>>>         Markus
>>>>
>>>>
>>>>
>>>>
>>>>         On 05 May 2014, at 21:05, Roderic Page
>>>>         <r.page at bio.gla.ac.uk> wrote:
>>>>
>>>>>         Hi Hilmar,
>>>>>
>>>>>         I’m not arguing that we shouldn’t build a resolver (I have
>>>>>         one that I use, Rich has mentioned he’s got one, Markus
>>>>>         has one at GBIF, etc.).
>>>>>
>>>>>         Nor do I think we should wait for institutional and social
>>>>>         commitment (because then we’d never get anything done).
>>>>>
>>>>>         But I do think it would be useful to think it through. For
>>>>>         example, it’s easy to create a URL for a specimen. Easy
>>>>>         peasy. OK, how do I discover that URL? How do I discover
>>>>>         these for all specimens? Sounds like I need a centralised
>>>>>         discovery service like you’ve described.
>>>>>
>>>>>         How do I handle changes in those URLs? I built a specimen
>>>>>         code to GBIF resolver for BioStor so that I could link to
>>>>>         specimens, GBIF changed lots of those URLs, all my work
>>>>>         was undone, boy does GBIF suck sometimes. For example, if
>>>>>         I map codes to URLs, I need to handle cases when they change. 
>>>>>
>>>>>         If URLs can change, is there a way to defend against that
>>>>>         (this is one reason for DOIs, or other methods of
>>>>>         indirection, such as PURLs)?
>>>>>
>>>>>         If providers change, will the URLs change? Is there a way
>>>>>         to defend against that (again, DOIs handle this nicely by
>>>>>         virtue of (a) indirection, and (b) lack of branding)?
>>>>>
>>>>>         How can I encourage people to use the specimen service?
>>>>>         What can I do to make them think it will persist? Can I
>>>>>         convince academic publishers to trust it enough to link to
>>>>>         it in articles? What’s the pitch to Pensoft, to Magnolia
>>>>>         Press, to Springer and Elsevier?
>>>>>
>>>>>         Is there some way to make the service itself become
>>>>>         trusted? For example if I look at a journal and see that
>>>>>         it has DOIs issued by CrossRef, I take that journal more
>>>>>         seriously than if it’s just got simple URLs. I know that
>>>>>         papers in that journal will be linked into the citation
>>>>>         network, I also know that there is a backup plan if the
>>>>>         journal goes under (because you need that to have DOIs in
>>>>>         CrossRef). Likewise, I think Figshare got a big boost when
>>>>>         it started minting DOIs (wow, a DOI, I know DOIs, you mean
>>>>>         I can now cite stuff I’ve uploaded there?). 
>>>>>
>>>>>         How can museums and herbaria be persuaded to keep their
>>>>>         identifiers stable? What incentives can we provide (e.g.,
>>>>>         citation metrics for collections)? What system would
>>>>>         enable us to do this? What about tracing funding (e.g.,
>>>>>         the NSF paid for these n papers, and they cite these y
>>>>>         specimens, from these z collections, so science paid for
>>>>>         by the NSF requires these collections to exist)?
>>>>>
>>>>>         I guess I’m arguing that we should think all this through,
>>>>>         because a specimen code to specimen URL is a small piece
>>>>>         of the puzzle. Now, I’m desperately trying not to simply
>>>>>         say what I think is blindingly obvious here (put DOIs on
>>>>>         specimens, add metadata to specimen and specimen citation
>>>>>         services, and we are done), but I think if we sit back and
>>>>>         look at where we want to be, this is exactly what we need
>>>>>         (or something functionally equivalent). Until we see the
>>>>>         bigger picture, we will be stuck in amateur hour.
>>>>>
>>>>>         Take  a look at:
>>>>>
>>>>>         http://search.crossref.org
>>>>>         http://www.crossref.org/fundref/
>>>>>         http://support.crossref.org/
>>>>>         https://prospect.crossref.org/splash/
>>>>>
>>>>>         Isn’t this the kind of stuff we’d like to do? If so, let’s
>>>>>         work out what’s needed and make it happen.
>>>>>
>>>>>         In short, I think we constantly solve an immediate problem
>>>>>         in the quickest way we know how, without thinking it
>>>>>         through. I’d argue that if we think about the bigger
>>>>>         picture (what do we want to be able to do, what are the
>>>>>         questions we want to be able to ask) then things become
>>>>>         clearer. This is independent of getting everyone’s
>>>>>         agreement (but it would help if we made their agreement
>>>>>         seem a no brainer by providing solutions to things that
>>>>>         cause them pain).
>>>>>
>>>>>
>>>>>         Regards
>>>>>
>>>>>         Rod
>>>>>
>>>>>         On 5 May 2014, at 19:14, Hilmar Lapp <hlapp at nescent.org> wrote:
>>>>>
>>>>>>
>>>>>>         On Mon, May 5, 2014 at 1:29 PM, Roderic Page
>>>>>>         <r.page at bio.gla.ac.uk> wrote:
>>>>>>
>>>>>>             Contrary to Hilmar, there is more to this than simply
>>>>>>             a quick hackathon. Yes, a service that takes metadata
>>>>>>             and returns one or more identifiers is a good idea
>>>>>>             and easy to create (there will often be more than one
>>>>>>             because museum codes are not unique). But who
>>>>>>             maintains this service? Who maintains the
>>>>>>             identifiers? Who do I complain to if they break? How
>>>>>>             do we ensure that they persist when, say, a museum
>>>>>>             closes down, moves its collection, changes its web
>>>>>>             technology? Who provides the tools that add value to
>>>>>>             the identifiers? (there’s no point having them if
>>>>>>             they are not useful)
>>>>>>
>>>>>>
>>>>>>         Jonathan Rees pointed this out to me too off-list. Just
>>>>>>         for the record, this isn't contrary but fully in line
>>>>>>         with what I was saying (or trying to say). Yes, I didn't
>>>>>>         elaborate that part, assuming, perhaps rather
>>>>>>         erroneously, that all this goes without saying, but I did
>>>>>>         mention that one part of this becoming a real solution
>>>>>>         has to be an institution with an in-scope
>>>>>>         cyberinfrastructure mandate that going in would make a
>>>>>>         commitment to sustain the resolver, including working
>>>>>>         with partners on the above slew of questions. The
>>>>>>         institution I gave was iDigBio; perhaps for some reason
>>>>>>         that would not be a good choice, but whether they are or
>>>>>>         not wasn't my point.
>>>>>>
>>>>>>         I will add one point to this, though. It seems to me that
>>>>>>         by continuing to argue that we can't go ahead with
>>>>>>         building a resolver that works (as far as technical
>>>>>>         requirements are concerned) before we have first fully
>>>>>>         addressed the institutional and social long-term
>>>>>>         sustainability commitment problem, we are and have been
>>>>>>         making this one big hairy problem that we can't make any
>>>>>>         practical pragmatic headway about, rather than breaking
>>>>>>         it down into parts, some of which (namely the primarily
>>>>>>         technical ones) are actually fairly straightforward to
>>>>>>         solve. As a result, to this day we don't have some
>>>>>>         solution that even though it's not very sustainable yet,
>>>>>>         at least proves to everyone how critical it is, and that
>>>>>>         the community can rally behind. Perhaps that's naïve, but
>>>>>>         I do think that once there's a solution the community
>>>>>>         rallies behind, ways to sustain it will be found. 
>>>>>>
>>>>>>           -hilmar
>>>>>>         -- 
>>>>>>         Hilmar Lapp -:- informatics.nescent.org/wiki -:- lappland.io
>>>>>>
>>>>>
>>>>
>>>
>>>
>>>         _______________________________________________
>>>         tdwg-content mailing list
>>>         tdwg-content at lists.tdwg.org
>>>         http://lists.tdwg.org/mailman/listinfo/tdwg-content
>>>
>>>
>>>
>>>
>>>     -- 
>>>     John Deck
>>>     (541) 321-0689
>>
>>     -- 
>>     Steven J. Baskauf, Ph.D., Senior Lecturer
>>     Vanderbilt University Dept. of Biological Sciences
>>
>>     postal mail address:
>>     PMB 351634
>>     Nashville, TN  37235-1634,  U.S.A.
>>
>>     delivery address:
>>     2125 Stevenson Center
>>     1161 21st Ave., S.
>>     Nashville, TN 37235
>>
>>     office: 2128 Stevenson Center
>>     phone: (615) 343-4582,  fax: (615) 322-4942
>>     If you fax, please phone or email so that I will know to look for it.
>>     http://bioimages.vanderbilt.edu
>>     http://vanderbilt.edu/trees
>>
>>           
>
>
>
>
>
>
> -- 
> Hilmar Lapp -:- informatics.nescent.org/wiki -:- lappland.io
>

-- 
Steven J. Baskauf, Ph.D., Senior Lecturer
Vanderbilt University Dept. of Biological Sciences

postal mail address:
PMB 351634
Nashville, TN  37235-1634,  U.S.A.

delivery address:
2125 Stevenson Center
1161 21st Ave., S.
Nashville, TN 37235

office: 2128 Stevenson Center
phone: (615) 343-4582,  fax: (615) 322-4942
If you fax, please phone or email so that I will know to look for it.
http://bioimages.vanderbilt.edu
http://vanderbilt.edu/trees

