delimiter characters for concatenated IDs
Hi everyone,
I suspect there have been discussions about suitable delimiter characters in the past that I have missed.
In several projects, first of all in GGBN (Global Genome Biodiversity Network, http://www.ggbn.org), a decision is needed now. We need to cross-reference records across different databases, and within Darwin Core we want to use relatedResourceID to do so.
During our GGBN workshop at TDWG last year we agreed to concatenate the traditional triplet ID (catalogue number, collection code, institution code) and to append further parameters where required (e.g. GUID, access point). We have checked the values of those parameters and definitely cannot use a single character as a delimiter.
So my question to you is whether there are already suggestions for using two characters together as a delimiter. It would be great if we could find a solution that more than one community could agree on.
Otherwise I would like to open the discussion and suggest "\", "||", "|", "§|", "§§", or "\§".
Best wishes,
Gabi
-----------------------------------------------------------------
Gabriele Droege
Coordinator - DNA Bank Network
Global Genome Biodiversity Network (GGBN)
Berlin-Dahlem DNA Bank
Women's Officer ZE BGBM
Botanic Garden and Botanical Museum Berlin-Dahlem
Freie Universität Berlin
Koenigin-Luise-Str. 6-8
14195 Berlin
Germany
+49 30 838 50 139
www.dnabank-network.org
www.ggbn.org
www.bgbm.org
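[To make the problem Gabi raises concrete, a minimal sketch in Python; the delimiter choice, escape rule, and field values here are hypothetical, not anything GGBN has agreed on. A bare single-character delimiter is ambiguous as soon as a field can contain that character; a two-character delimiter plus an escape rule keeps the concatenation reversible.]

    DELIM = "||"    # candidate two-character delimiter
    ESC = "\\|"     # hypothetical escape for a literal "|" inside a field

    def join_id(parts):
        # Escape embedded pipes, then join.
        # (Naive about fields that already contain backslashes.)
        return DELIM.join(p.replace("|", ESC) for p in parts)

    def split_id(s):
        # Invert join_id: split on the delimiter, then unescape.
        return [p.replace(ESC, "|") for p in s.split(DELIM)]

    triplet = ["BGBM", "Herbarium B", "B 10 0456789"]  # hypothetical values
    assert split_id(join_id(triplet)) == triplet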
I don't know what might be commonly used. However, among the choices you list, I would argue that anything involving "§", a character not readily visible on the (English-language) keyboard and thus more likely to be misinterpreted by person or machine, would not be my choice.
I regularly use "\" and "||" as delimiters for search and replace. "\" has the advantage of not needing a shift key to type, but the backslash "\" is (or was) used in local file paths in the PC world. I have seen a single "|" used as a common delimiter in web page title text, but I don't think "||" is used that way. "||" also has the distinction (a disadvantage? or less likely to be hit accidentally) of requiring a shift on the traditional English-language keyboard.
I would be happy with either "\" or "||" until more information is provided.
Cheers! Gail
Gail E. Kampmeier
Illinois Natural History Survey
Prairie Research Institute, University of Illinois
1816 So. Oak St.
Champaign, IL 61820
www.inhs.illinois.edu/~gkamp
Hi Gabi, can you explain a little more what you are trying to do, perhaps with an example?
It appears to me you are creating (globally) unique identifiers on the basis of various existing fields, which is fine. But when you use an identifier to create resource relations, it should be considered opaque: you should not need to parse out the underlying pieces again. In that scenario the character used to concatenate the triplet does not really matter to the end user, as long as the result is unique and points to some existing resource, indicated by occurrenceID in the case of occurrences or materialSampleID for samples.
Best, Markus
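[A minimal sketch of the opaque-identifier pattern Markus describes, with hypothetical values; the point is that only the producer ever touches the delimiter, so consumers never depend on its choice.]

    def make_occurrence_id(institution, collection, catalog, delim="||"):
        # Producer side: the only place the delimiter choice matters.
        return delim.join([institution, collection, catalog])

    occurrence_id = make_occurrence_id("BGBM", "DNA", "12345")  # hypothetical triplet

    # Consumer side: the identifier is compared and dereferenced whole,
    # never parsed back into its underlying pieces.
    record = {"relatedResourceID": occurrence_id}
    assert record["relatedResourceID"] == occurrence_id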
Markus,

Didn't we reach a general consensus within the last couple of years that the vertical pipe (|) was the preferred concatenation symbol?
Chuck
We've been examining the use (and mis-use) of the DwC triplet, and how it propagates out of local portals and platforms into other ones. The end message from this work (and I am happy to share the manuscript and all the datasets we have compiled and examined) is that it is a _terrible_ choice for a globally unique identifier.
There are so many better choices that don't rely on delimiters, or on what is ultimately a non-globally-unique, non-persistent, non-resolvable choice for a (permanent, resolvable, globally unique) identifier. Rather than having this conversation, I wonder why we aren't having one about ALL the other, more rational choices...
Best, Rob
I completely agree with Rob, and I am a tad discouraged that we are still channeling our discussions into the wrong places.
Nico
<><><><><><><><><><><><><><><><><><><><><><><><>
Nico Cellinese, Ph.D.
Associate Curator, Botany & Informatics
Joint Associate Professor, Department of Biology
Florida Museum of Natural History
University of Florida
354 Dickinson Hall, PO Box 117800
Gainesville, FL 32611-7800, U.S.A.
Tel. 352-273-1979, Fax 352-846-1861
http://cellinese.blogspot.com/
I couldn't agree more.
I would also ask why there still isn't a global resolver, offered as a web service, that takes specimen metadata as input (such as the DwC triplet) and returns globally unique, resolvable identifiers, minting them if necessary. If the technologically savvy people of this community came together, this could be built at least as a prototype in a couple of days. As I've suggested to iDigBio before, they could hold a hackathon on this, commit to hosting and further developing the outcome, and the problem would be solved once and for all. It would arguably be fully within their mandate.
If, instead of holding the many workshops we have held to talk about the problem, we as a community would finally will ourselves to actually solve it, that part really isn't so difficult.
-hilmar
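[A minimal sketch of the discover-or-mint behavior Hilmar outlines, assuming an in-memory store; the function name, triplet key, and URN scheme are illustrative only, not an existing service.]

    import uuid

    _registry = {}  # (institutionCode, collectionCode, catalogNumber) -> GUID

    def resolve_or_mint(institution_code, collection_code, catalog_number):
        # Return the existing identifier for this triplet; mint one if absent.
        key = (institution_code, collection_code, catalog_number)
        if key not in _registry:
            # Here a URN UUID; a production service might mint an ARK,
            # DOI, or plain HTTP URI instead.
            _registry[key] = "urn:uuid:" + str(uuid.uuid4())
        return _registry[key]

    guid = resolve_or_mint("MO", "Herbarium", "123456")          # hypothetical triplet
    assert resolve_or_mint("MO", "Herbarium", "123456") == guid  # stable on re-query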
Hilmar,

A “global” resolver that manages globally unique, resolvable identifiers for every specimen record in the world (billions?) as a web service should be operated by a hosting facility with a global charter and globally funded resources. That is the definition of GBIF, to my understanding. What other specimen/observation repository has greater critical mass to “mint” and maintain GUIDs for the whole world?
Chuck
Friends, I agree with Hilmar and refuse to be drawn further into this debate at this point. Before I let the email chain grow on without me, I will just make two points:

1) I think there is an opportunity for further face-to-face discussion on this topic right before or during TDWG. We CAN solve this problem and we ARE making progress.

2) In the meantime, I will simply note that there are IMMEDIATE solutions to the issue of identifiers for billions of records, developed in conjunction with the California Digital Library, which links more broadly to CrossRef, DataCite, and other organizations.

OK, a final, third point: in my view, we should be building off existing solutions, resources, and communities of practice, while recognizing the need to tend our own garden(s).
Best, Rob
Chuck,

Hilmar is not proposing a service for the management of all identifiers; he is proposing discovery of existing, preferably resolvable and dereferenceable, identifiers based on queries for specimen-record metadata such as DwC triplets, together with minting of resolvable ones when none is discoverable. Except on performance grounds (and possibly not even then), this does not even require that all the discoverable identifiers be held on the same machine as the proposed service, or even on a single machine at all.

Hilmar's proposal, which I concur is useful and simple to accomplish, is independent of the quality, syntax, specification, or utility of the returned identifiers, all of which have been much argued in this thread and on this list since the beginning of time. Producing such a service is not beyond the skills required for an assignment in an undergraduate software engineering course, and it could certainly be accomplished in a few days' hackathon such as Hilmar proposes. As with any discovery service, its ultimate utility depends on the minters promoting the discoverability of the identifiers themselves. But that too is fairly trivial and well understood, e.g. by listing them in resolvers' sitemaps in published ways that major spiders can find and index them. An example is [1].
[1] Sitemap Formats and Guidelines https://support.google.com/webmasters/answer/183668?hl=en
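[A minimal sketch of that last discoverability step, rendering minted identifier URIs as a sitemap per the sitemaps.org 0.9 schema referenced in [1]; the resolver URLs are hypothetical.]

    identifier_urls = [
        "http://resolver.example.org/id/ABC123",
        "http://resolver.example.org/id/ABC124",
    ]

    def sitemap_xml(urls):
        # Minimal urlset document that crawlers can fetch and index.
        entries = "\n".join(f"  <url><loc>{u}</loc></url>" for u in urls)
        return ('<?xml version="1.0" encoding="UTF-8"?>\n'
                '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
                + entries + "\n</urlset>")

    print(sitemap_xml(identifier_urls))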
Bob,

Well, like I said, there are many sides and many proponents. It would be great to see something accomplished, rather than endlessly debated.
Chuck
Like Rob, I can't get dragged into this discussion in full right now (I depart for an expedition Sunday; much to do between now and then). However, I will make these comments:

1) It was very clear more than ten years ago that the DwC triplet would not serve the needs of globally unique identifiers; this led to a push for proper identifiers in our community by the SEEK project, and later to two separate workshops on GUIDs supported by GBIF & TDWG. The latter yielded LSIDs (which at the time appeared to be the least of the evils, with PURLs as the next plausible option, and DOIs & other Handles a distant third; hindsight has taught us some things since then).

2) A decade later, we are still arguing about the same things (in part because people not involved in those earlier efforts are discovering the same problems that were discussed back then).

3) We have built a simple identifier cross-referencing service along the lines of what Hilmar outlined, and it has proven to be EXTREMELY powerful. We have plans to further enrich and expand the service later this year. It currently works on a two-part approach to identifiers (IdentifierDomain + Identifier), where the former is globally unique and the latter is any text string that is unique within the context of its IdentifierDomain. It would require very little additional effort to expand the service to accommodate three-part inputs (a la DwC triplets, where institutionCode and collectionCode would together uniquely represent an IdentifierDomain, and catalogNumber would represent the Identifier). Suggestions & input welcome. The service currently serves identifiers for Agents, References, and TaxonNameUsage instances, but could very easily be expanded to other objects. It currently exists in GNUB-space, but we plan to separate it out into a generalized service (consumed by GNUB) later this year.

4) Getting back to the original question: we have standardized internally on two delimiters to allow for two-tier nested arrays, such that the pipe (|) delimits primary objects and the tilde (~) delimits components within a primary object (see the sketch below). For example, a nested array of DwC triplets for the Bishop Museum fish collection would look something like this:
BPBM~I~1234|BPBM~I~9876|BPBM~I~5678
Note the difference between "I" (the collectionCode for Ichthyology) and the pipe (the primary delimiter).

I'm not recommending this as a standard; just reporting what has worked very well for us internally. We haven't yet needed to escape either the pipe or the tilde. I say we use this "internally" because externally we typically parse stuff into JSON.
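[A minimal sketch of points 3 and 4 together: split on the primary delimiter, then on the secondary one, and treat (institutionCode, collectionCode) as the IdentifierDomain for a two-part lookup; the GUID value and lookup table are hypothetical.]

    nested = "BPBM~I~1234|BPBM~I~9876|BPBM~I~5678"  # Rich's example above

    records = [tuple(item.split("~")) for item in nested.split("|")]
    # -> [('BPBM', 'I', '1234'), ('BPBM', 'I', '9876'), ('BPBM', 'I', '5678')]

    # Two-part cross-reference: (IdentifierDomain, Identifier) -> resolvable GUID
    xref = {(("BPBM", "I"), "1234"): "http://guid.example.org/0001"}  # hypothetical

    for inst, coll, cat in records:
        guid = xref.get(((inst, coll), cat))
        print(inst, coll, cat, "->", guid or "not yet cross-referenced")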
Aloha, Rich
Richard L. Pyle, PhD
Database Coordinator for Natural Sciences
Associate Zoologist in Ichthyology
Dive Safety Officer
Department of Natural Sciences, Bishop Museum
1525 Bernice St., Honolulu, HI 96817
Ph: (808) 848-4115, Fax: (808) 847-8252
email: deepreef@bishopmuseum.org
http://hbs.bishopmuseum.org/staff/pylerichard.html
-----Original Message----- From: tdwg-content-bounces@lists.tdwg.org [mailto:tdwg-content- bounces@lists.tdwg.org] On Behalf Of Bob Morris Sent: Monday, May 05, 2014 6:46 AM To: Chuck Miller Cc: tdwg-content@lists.tdwg.org; John Deck; tomc@cs.uoregon.edu Subject: Re: [tdwg-content] delimiter characters for concatenated IDs
Chuck
Hilmar is not proposing a service for management of all identifiers, he is proposing discovery of existing, preferably resolvable and dereferanceable, identifiers based on queries for specimen record metadata such as DwC triplets, together with minting of resolvable ones when none is discoverable. Except on performance grounds---and possibly not even then--- this does not even require all the discoverable identifiers be held on the same machine as the proposed service is hosted, nor even on a single machine at all.
Hilmar's proposal, which I concur is useful and simple to accomplish, is independent of the quality, syntax, specification or utility of the returned identifiers, all of which are much argued in this thread and in this list from the beginning of time. Producing such a service is not beyond the skills required for an assignment in an undergraduate software engineering course and certainly could be accomplished in a few days' hackathon such as Hilmar proposes. As with any discovery service, its ultimate utility depends on the minters promoting underlying discoverability of the identifiers themselves. But that too is fairly trivial and well-understood, e.g. by the listing of them in resolvers' SiteMaps in published ways that major spiders can find and index them. An example is [1].
[1] Sitemap Formats and Guidelines https://support.google.com/webmasters/answer/183668?hl=en
On Mon, May 5, 2014 at 10:54 AM, Chuck Miller Chuck.Miller@mobot.org wrote:
Hilmar,
A “global” resolver that manages globally unique resolvable identifiers for every single specimen record in the world (billions?) as a web-service should be operated by a hosting facility with a global charter and globally funded resources. That is the definition of GBIF to my understanding. What other specimen/observation
repository has greater critical mass to “mint”
and maintain GUIDs for all the world?
Chuck
From: hilmar.lapp@gmail.com [mailto:hilmar.lapp@gmail.com] On Behalf Of Hilmar Lapp Sent: Monday, May 05, 2014 9:47 AM To: Robert Guralnick Cc: Chuck Miller; tdwg-content@lists.tdwg.org; John Deck; tomc@cs.uoregon.edu
Subject: Re: [tdwg-content] delimiter characters for concatenated IDs
I couldn't agree more.
I would also ask why there still isn't a global resolver as a web-service that takes specimen metadata as input (such as the DwC triplet) and returns globally unique resolvable identifiers, minting them if necessary. If the technologically savvy people of this community came together, this could be built at least as a prototype in a couple of days. As I've suggested to iDigBio before, they could hold a hackathon on this, commit to hosting and further developing the outcome, and the problem would be solved once and for all. It would
arguably be fully within their mandate.
If instead of the many workshops that have been held on talking about the problem we as a community would finally will ourselves to actually solving it, that part really isn't so difficult.
-hilmar
On Mon, May 5, 2014 at 10:23 AM, Robert Guralnick Robert.Guralnick@colorado.edu wrote:
We've been examining the use (ad mis-use) of the DwC triplet, and how
that
propagates out of local portals and platforms into other ones. The end message from this work (and I am happy to share the manuscript and all the datasets we have compiled and examined) is that it is a _terrible_ choice for a global unique identifier.
There are so many better choices, that don't rely on delimiters or on what is ultimately a non-globally unique, non persistent, non resolvable choice for a (permanent, resolvable, globally unique) identifier. As opposed to having this conversation, I wonder why we aren't having one about ALL the other more rational choices...
Best, Rob
On Mon, May 5, 2014 at 8:14 AM, Chuck Miller Chuck.Miller@mobot.org
wrote:
Markus,
Didn’t we reach a general consensus within the last couple of years that the vertical pipe (|) was the preferred concatenation symbol?
Chuck
From: tdwg-content-bounces@lists.tdwg.org [mailto:tdwg-content-bounces@lists.tdwg.org] On Behalf Of Markus Döring Sent: Monday, May 05, 2014 8:49 AM To: "Dröge, Gabriele" Cc: tdwg-content@lists.tdwg.org Subject: Re: [tdwg-content] delimiter characters for concatenated IDs
Hi Gabi,
can you explain a little more what you are trying to do giving an example maybe?
It appears to me you are creating (globally) unique identifiers on the basis of various existing fields which is fine. But when you use the identifier to create resource relations they should be considered opaque and you should not need to parse out the underlying pieces again. So in that scenario the character used to concatenate the triplet does not really matter for the end user as long as its unique and points to some existing resource, indicated by the occurrenceID in case of occurrences or the materialSampleID for samples.
Best,
Markus
--
Hilmar Lapp -:- informatics.nescent.org/wiki -:- lappland.io
-- Robert A. Morris
Emeritus Professor of Computer Science UMASS-Boston 100 Morrissey Blvd Boston, MA 02125-3390
Filtered Push Project Harvard University Herbaria Harvard University
email: morris.bob@gmail.com web: http://efg.cs.umb.edu/ web: http://wiki.filteredpush.org http://www.cs.umb.edu/~ram === The content of this communication is made entirely on my own behalf and in no way should be deemed to express official positions of The University of Massachusetts at Boston or Harvard University.
Hi Hilmar, what about this as a start:
http://api.gbif.org/v0.9/occurrence/search?institutionCode=B&collectionC...
... and some supporting services to dig into the content for each parameter: http://api.gbif.org/v0.9/occurrence/search/institution_code?q=B&limit=10... http://api.gbif.org/v0.9/occurrence/search/collection_code?q=B&limit=100 http://api.gbif.org/v0.9/occurrence/search/catalog_number?q=122&limit=10...
It would be dead simple to add that method to the GBIF portal, taking the triplet as path parameters, e.g. http://www.gbif.org/occurrence/{instCode}/{collCode}/{catalogNumbe... and then do a redirect to the occurrence detail page, return a 404, or some disambiguation page. Think that's useful?
Markus
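From the client side, the lookup Markus sketches could look like the following. This is a sketch only: the endpoint and parameter names are taken from his examples above, and the example triplet values and response handling are assumptions.

import requests

# Client-side sketch of the triplet lookup against the GBIF occurrence
# search API. Endpoint and parameters follow Markus's examples; the
# triplet values are placeholders.
def lookup_triplet(inst_code, coll_code, cat_number):
    resp = requests.get(
        "http://api.gbif.org/v0.9/occurrence/search",
        params={"institutionCode": inst_code,
                "collectionCode": coll_code,
                "catalogNumber": cat_number})
    resp.raise_for_status()
    return resp.json().get("results", [])

matches = lookup_triplet("B", "Herbarium", "B 10 123456")
if len(matches) == 1:
    print("resolved to occurrence", matches[0].get("key"))  # redirect case
elif not matches:
    print("no match")                # Markus's 404 case
else:
    print(len(matches), "matches")   # disambiguation-page case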
Some quick points.
1. Unless I’m mistaken, there seems to be some conflation of two separate questions, namely:
(a) what is the best delimiter to use when concatenating strings to make an identifier (e.g., the Darwin Core Triplet)
(b) what is the best delimiter to use when putting multiple values in the same field (which is what I think Chuck is referring to when he recommends the pipe symbol “|”)
I think Gabriele is asking (a).
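The difference between (a) and (b) is easy to see side by side (values invented for illustration):

# Toy contrast of the two delimiter questions. Values are invented.
# (a) a delimiter INSIDE one identifier, fused from several parts:
triplet_id = ":".join(["YPM", "MAM", "140180"])   # -> "YPM:MAM:140180"
# The result is a single value; downstream code should never re-split it.

# (b) a delimiter BETWEEN several values packed into one field:
related_ids = " | ".join(["YPM:MAM:140180", "urn:uuid:0000-example"])
# Here consumers ARE expected to split on the pipe to recover each value:
recovered = [v.strip() for v in related_ids.split("|")]
assert recovered == ["YPM:MAM:140180", "urn:uuid:0000-example"]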
Gabriele wants a solution “now”. If it’s simply a case of a convention to create an identifier string that may or may not have any meaning or persistence, then any solution will do. You could follow what the NCBI have done, for example, and use Darwin Core Triplets such as YPM:MAM:140180 (which can be resolved, at present at least, at http://peabody.research.yale.edu/cgi-bin/Query.Ledger?LE=mam&SU=0&ID... ). This solves things for “now”, but what about tomorrow?
Contrary to Hilmar, there is more to this than simply a quick hackathon. Yes, a service that takes metadata and returns one or more identifiers is a good idea and easy to create (there will often be more than one because museum codes are not unique). But who maintains this service? Who maintains the identifiers? Who do I complain to if they break? How do we ensure that they persist when, say, a museum closes down, moves its collection, changes its web technology? Who provides the tools that add value to the identifiers? (There’s no point having them if they are not useful.)
We have an obvious role model for how to do this stuff well, and that is CrossRef. Regardless of what you think of DOIs, CrossRef is a model of exactly the sort of thing we need. There’s more infrastructure here than simply a look-up service. As an aside, the lookup based on metadata idea is equivalent to OpenURL in the bibliographic world, which was all the rage for a while until people wanted something better, and now we have DOIs. We seem determined to reinvent the painful steps others have taken, rather than learn from others and actually solve the problem. It really is painful to watch.
Yes, GBIF is the obvious place to centralise a lot of this, but it would require that GBIF can maintain stable ids for specimens. So far it can’t do this, principally because it relies on metadata provided by museums and herbaria to recognise whether a record is new or an existing one, and, guess what, the metadata keeps changing :(
If you want GBIF to do this, are you happy that every specimen gets a GBIF URL? If not, what are you going to suggest? Perhaps every institution mints its own URL, say like Peabody has above. Anyone want to place a bet on how long http://peabody.research.yale.edu/cgi-bin/Query.Ledger?LE=mam&SU=0&ID... is going to survive as a resolvable URL? Anybody know how I can get machine-readable data from that URL? Some botanical institutions are minting fairly clean-looking URLs, but how do we discover these? How do we find URLs for every collection?
My prediction is that eventually we will learn from the experience of academic publishers, who went through pretty much all of this hurt a decade ago, facing pretty much exactly the same issues (although in their case even more pressing because actual money was at stake), and who came up with a solution that has clearly worked (in their case DOIs plus CrossRef services). We will finally realise that this requires resources, and that it requires thinking strategically about what we want (what’s the bigger picture?) rather than relying on local, small-scale, half-baked solutions. Until then, we can’t have nice shiny things.
Regards
Rod
On Mon, May 5, 2014 at 1:29 PM, Roderic Page r.page@bio.gla.ac.uk wrote:
Contrary to Hilmar, there is more to this than simply a quick hackathon. Yes, a service that takes metadata and returns one or more identifiers is a good idea and easy to create (there will often be more than one because museum codes are not unique). But who maintains this service? Who maintains the identifiers? Who do I complain to if they break? How do we ensure that they persist when, say, a museum closes down, moves its collection, changes its web technology? Who provides the tools that add value to the identifiers? (There’s no point having them if they are not useful.)
Jonathan Rees pointed this out to me too off-list. Just for the record, this isn't contrary but fully in line with what I was saying (or trying to say). Yes, I didn't elaborate that part, assuming, perhaps rather erroneously, that all this goes without saying, but I did mention that one part of this becoming a real solution has to be an institution with an in-scope cyberinfrastructure mandate that going in would make a commitment to sustain the resolver, including working with partners on the above slew of questions. The institution I gave was iDigBio; perhaps for some reason that would not be a good choice, but whether they are or not wasn't my point.
I will add one point to this, though. It seems to me that by continuing to argue that we can't go ahead with building a resolver that works (as far as technical requirements are concerned) before we have first fully addressed the institutional and social long-term sustainability problem, we are and have been making this one big hairy problem on which we can't make any practical headway, rather than breaking it down into parts, some of which (namely the primarily technical ones) are actually fairly straightforward to solve. As a result, to this day we don't have even a solution that, while not yet sustainable, at least proves to everyone how critical it is, and that the community can rally behind. Perhaps that's naïve, but I do think that once there's a solution the community rallies behind, ways to sustain it will be found.
-hilmar
Darn it, Hilmar. I promised to not say more, but I am compelled to now because of this cogent and rational response. Nico Cellinese, John Deck and I are assembling folks for a pre-TDWG meeting to work towards a common solution, as much as it can be done and recognizing that there may be different needs and use cases. The goal of that is to produce a community developed manuscript that can be published, and that details our best efforts to develop common solution(s) that we can rally behind.
The meeting was originally conceived, for many reasons, as invitation-only, but we are interested in reporting out the work accomplished on the final afternoon of this meeting, and in opening up the conversation more broadly. If people are interested, I am happy to pass along more about the intent, and to remind people when we might have that open session for reporting and feedback.
Best, Rob
Hi Hilmar,
I’m not arguing that we shouldn’t build a resolver (I have one that I use, Rich has mentioned he’s got one, Markus has one at GBIF, etc.).
Nor do I think we should wait for institutional and social commitment (because then we’d never get anything done).
But I do think it would be useful to think it through. For example, it’s easy to create a URL for a specimen. Easy peasy. OK, how do I discover that URL? How do I discover these for all specimens? Sounds like I need a centralised discovery service like you’ve described.
How do I handle changes in those URLs? I built a specimen-code-to-GBIF resolver for BioStor so that I could link to specimens; GBIF then changed lots of those URLs and all my work was undone (boy does GBIF suck sometimes). In other words, if I map codes to URLs, I need to handle cases where they change.
If URLs can change, is there a way to defend against that (this is one reason for DOIs, or other methods of indirection, such as PURLs).
If providers change, will the URLs change? Is there a way to defend against that (again, DOIs handle this nicely by virtue of (a) indirection, and (b) lack of branding).
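The indirection Rod keeps returning to (DOIs, PURLs) is mechanically simple: a stable identifier resolves via a mapping that can be updated when the underlying URL changes, so published links survive provider reorganisations. A toy sketch, using the Arctos specimen DOI cited elsewhere in this thread; the "new" URL is hypothetical.

# Toy sketch of identifier indirection, the mechanism behind DOIs/PURLs.
# The stable identifier never changes; only this one mapping is edited
# when a provider reorganises its web presence.
location = {
    "doi:10.7299/X7VQ32SJ": "http://arctos.database.museum/guid/UAM:Ento:230092",
}

def resolve(stable_id):
    return location[stable_id]  # stable ID -> current landing URL

# Museum migrates its site: one row changes, and every published citation
# that used the DOI keeps resolving. (New URL is hypothetical.)
location["doi:10.7299/X7VQ32SJ"] = "https://collections.example.org/ento/230092"
print(resolve("doi:10.7299/X7VQ32SJ"))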
How can I encourage people to use the specimen service? What can I do to make them think it will persist? Can I convince academic publishers to trust it enough to link to it in articles? What’s the pitch to Pensoft, to Magnolia Press, to Springer and Elsevier?
Is there some way to make the service itself become trusted? For example, if I look at a journal and see that it has DOIs issued by CrossRef, I take that journal more seriously than if it’s just got simple URLs. I know that papers in that journal will be linked into the citation network, and I also know that there is a backup plan if the journal goes under (because you need that to have DOIs in CrossRef). Likewise, I think Figshare got a big boost when it started minting DOIs (wow, a DOI, I know DOIs, you mean I can now cite stuff I’ve uploaded there?).
How can museums and herbaria be persuaded to keep their identifiers stable? What incentives can we provide (e.g., citation metrics for collections)? What system would enable us to do this? What about tracing funding (e.g., the NSF paid for these n papers, and they cite these y specimens, from these z collections, so science paid for by the NSF requires these collections to exist).
I guess I’m arguing that we should think all this through, because a specimen-code-to-specimen-URL service is a small piece of the puzzle. Now, I’m desperately trying not to simply say what I think is blindingly obvious here (put DOIs on specimens, add metadata to specimen and specimen citation services, and we are done), but I think if we sit back and look at where we want to be, this is exactly what we need (or something functionally equivalent). Until we see the bigger picture, we will be stuck in amateur hour.
Take a look at:
http://search.crossref.org http://www.crossref.org/fundref/ http://support.crossref.org/ https://prospect.crossref.org/splash/
Isn’t this the kind of stuff we’d like to do? If so, let’s work out what’s needed and make it happen.
In short, I think we constantly solve an immediate problem in the quickest way we know how, without thinking it through. I’d argue that if we think about the bigger picture (what do we want to be able to do, what are the questions we want to be able to ask) then things become clearer. This is independent of getting everyone’s agreement (but it would help if we made their agreement seem a no-brainer by providing solutions to things that cause them pain).
Regards
Rod
Hi Rod,
I agree GBIF has trouble keeping identifiers stable for *some* records, but in general we do a much better job than the original publishers in the first place. We try hard to keep GBIF ids stable even if publishers change collection codes, register datasets twice, or do other things that break a simple automated mapping of source records to existing GBIF ids. Also, the stable identifier in GBIF has never been the URL; it is the local GBIF integer alone. The GBIF services that consume those ids have changed over the years, but it's pretty trivial to adjust if you use the GBIF ids instead of the URLs. If there is a clear need for stable URLs instead, I am sure we can get that working easily.
The two real issues for GBIF are a) duplicates and b) records with varying local identifiers of any sort (triplet, occurrenceID or whatever else).
When it comes to the varying source identifiers I always liked the idea of flagging those records and datasets as unstable, so it is obvious to users. This is not 100% safe, but the most troublesome datasets change all of their ids, and that is easily detectable. Also, with a service like that it would become more obvious to publishers how important stable source ids are.
Before jumping on DOIs as the next big thing I would really like to understand what needs the community has around specimen ids. Gabi clearly has a very real use case, are there others we know about?
Markus
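Markus's flagging idea is straightforward to prototype: compare the local identifiers a dataset supplies between two successive crawls, and flag the dataset when too few survive. A minimal sketch, with an invented threshold and invented ids:

# Minimal sketch of flagging datasets with unstable local identifiers:
# compare the id sets from two successive crawls. Threshold is invented.
def survival_rate(previous_ids, current_ids):
    """Fraction of previously seen local ids that survived this crawl."""
    if not previous_ids:
        return 1.0
    return len(previous_ids & current_ids) / len(previous_ids)

def is_unstable(previous_ids, current_ids, threshold=0.9):
    return survival_rate(previous_ids, current_ids) < threshold

crawl_1 = {"spec-001", "spec-002", "spec-003"}
crawl_2 = {"SPEC_0001", "SPEC_0002", "spec-003"}  # publisher reformatted ids
print(is_unstable(crawl_1, crawl_2))  # True: surface a warning to users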
How long will GBIF be around, Markus? Do we know? Should GBIF manage identifiers? iDigBio? Should any single organization in our community have that role? I don't know, but it seems to me that much depends on - sorry to say - currently frail systems that exist in our community without clear long term sustainability plans.
So, I will repeat what Rod has said: why not put DOIs on specimens, and add metadata to specimen and specimen citation services? Don't we gain something immediately?
Although GBIF and its local solutions may not be the long-term solution, couldn't it point the way here? It could simply start with GBIF advocating DOIs for publisher datasets (not yet the records, though there are some neat ways to think about how to do that) that persist when publishers create Darwin Core archives and push them to GBIF.
Best, Rob
Hi Markus,
I have three use cases:
1. Linking sequences in GenBank to voucher specimens. Lots of voucher specimens are listed in GenBank but not linked to digital records for those specimens. These links are useful in two directions, one is to link GBIF to genomic data, the second is to enhance data in both databases, see http://iphylo.blogspot.co.uk/2012/02/linking-gbif-and-genbank.html (e.g., by adding missing georeferencing that is available in one database but not the other).
2. Linking to specimens cited in the literature. I’ve done some work on this in BioStor, see http://iphylo.blogspot.co.uk/2012/02/linking-gbif-and-biodiversity-heritage.... One immediate benefit of this is that GBIF could display the scientific literature associated with a specimen, so we get access to the evidence supporting identification, georeferencing, etc.
3. Citation metrics for collections, see http://iphylo.blogspot.co.uk/2013/05/the-impact-of-museum-collections-one.ht... and http://iphylo.blogspot.co.uk/2012/02/gbif-specimens-in-biostor-who-are-top.h... Based on citations of specimens in the literature, and in databases such as GenBank (i.e., basically combining 1 + 2 above), we can demonstrate the value of a collection.
All of these use cases depend on GBIF occurrenceIds remaining stable; I have often ranted on iPhylo when this doesn't happen: http://iphylo.blogspot.co.uk/2012/07/dear-gbif-please-stop-changing.html
Regards
Rod
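Rod's first use case can be sketched end to end: many GenBank records carry a triplet-like specimen_voucher string that can be split and matched against an occurrence search. Everything below is illustrative; real voucher strings are far messier than this clean example (taken from the NCBI example earlier in the thread), and parsing them robustly is most of the work.

import requests

# Illustrative sketch of linking a GenBank voucher string to a digital
# specimen record via occurrence search. The endpoint follows Markus's
# earlier examples; real specimen_voucher values are far messier.
def parse_voucher(voucher):
    parts = voucher.split(":")
    return parts if len(parts) == 3 else None  # only clean triplets here

voucher = "YPM:MAM:140180"  # as cited in a GenBank specimen_voucher field
parts = parse_voucher(voucher)
if parts:
    inst, coll, cat = parts
    resp = requests.get("http://api.gbif.org/v0.9/occurrence/search",
                        params={"institutionCode": inst,
                                "collectionCode": coll,
                                "catalogNumber": cat})
    for occ in resp.json().get("results", []):
        # A hit ties the sequence to the specimen record, enabling the
        # two-way enrichment (e.g. georeferences) Rod describes.
        print(occ.get("key"))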
+1 on DOIs, and on ARKs (see https://wiki.ucop.edu/display/Curation/ARK), and I'll also mention IGSNs (see http://www.geosamples.org/). IGSNs are rapidly gaining traction for geo-samples. I don't know of anyone using them for bio-samples, but they offer many features that we've been asking for as well. What our community considers a sample (or observation) is diverse enough that multiple ID systems are probably inevitable and perhaps even warranted.
Whatever the ID system, the data providers (museums, field researchers, labs, etc.) must adopt that identifier and use it whenever linking to downstream sequence, image, and sub-sampling repository agencies. This is easy to say in theory but difficult to do in reality, because the decision to adopt long-term, stable identifiers is often an institutional one, and the technology is still new and argued about, in particular on this fine list. Further, those agencies that receive data associated with a GUID must honor that source GUID when passing it to consumers and other aggregators, who must also have some level of confidence in the source GUIDs. Thus, a primary issue that we're confronted with here is trust.
Having Hilmar's hackathon support several possible GUID schemes (each with its own long-term persistence strategy), sponsored by a well-known global institution affiliated with biodiversity informatics that could offer technical guidance to data providers, good name branding, and the nuts-and-bolts expertise to demonstrate good shepherding of source GUIDs through a data aggregation chain, would be ideal. I nominate GBIF :)
John Deck
On Mon, May 5, 2014 at 1:09 PM, Roderic Page r.page@bio.gla.ac.uk wrote:
Hi Markus,
I have three use cases that
- Linking sequences in GenBank to voucher specimens. Lots of voucher
specimens are listed in GenBank but not linked to digital records for those specimens. These links are useful in two directions, one is to link GBIF to genomic data, the second is to enhance data in both databases, see http://iphylo.blogspot.co.uk/2012/02/linking-gbif-and-genbank.html (e.g., by adding missing georeferencing that is available in one database but not the other).
- Linking to specimens cited in the literature. I've done some work on
this in BioStor, see http://iphylo.blogspot.co.uk/2012/02/linking-gbif-and-biodiversity-heritage.... One immediate benefit of this is that GBIF could display the scientific literature associated with a specimen, so we get access to the evidence supporting identification, georeferencing, etc.
- Citation metrics for collections, see
http://iphylo.blogspot.co.uk/2013/05/the-impact-of-museum-collections-one.ht... http://iphylo.blogspot.co.uk/2012/02/gbif-specimens-in-biostor-who-are-top.h... on citation sod specimens in the literature, and in databases such as GenBank (i.e., basically combining 1 + 2 above) we can demonstrate the value of a collection.
All of these use cases depend on GBIF occurenceIds remaining stable, I have often ranted on iPHylo when this doesn't happen: http://iphylo.blogspot.co.uk/2012/07/dear-gbif-please-stop-changing.html
Regards
Rod
On 5 May 2014, at 20:51, Markus Döring mdoering@gbif.org wrote:
Hi Rod,
I agree GBIF has troubles to keep identifiers stable for *some* records, but in general we do a much better job than the original publishers in the first place. We try hard to keep GBIF ids stable even if publishers change collection codes, registered datasets twice or do other things to break a simple automated way of mapping source records to existing GBIF ids. Also the stable identifier in GBIF never has been the URL, but it is the local GBIF integer alone. The GBIF services that consume those ids have changed over the years, but its pretty trivial to adjust if you use the GBIF ids instead of the URLs. If there is a clear need to have stable URLs instead I am sure we can get that working easily.
The two real issues for GBIF are a) duplicates and b) records with varying local identifiers of any sort (triplet, occurrenceID or whatever else).
When it comes to the varying source identifiers I always liked the idea of flagging those records and datasets as unstable, so it is obvious to users. This is not a 100% safe, but most terrible datasets change all of their ids and that is easily detectable. Also with a service like that it would become more obvious to publishers how important stable source ids are.
Before jumping on DOIs as the next big thing I would really like to understand what needs the community has around specimen ids. Gabi clearly has a very real use case, are there others we know about?
Markus
On 05 May 2014, at 21:05, Roderic Page r.page@bio.gla.ac.uk wrote:
Hi Hilmar,
I'm not arguing that we shouldn't build a resolver (I have one that I use, Rich has mentioned he's got one, Markus has one at GBIF, etc.).
Nor do I think we should wait for institutional and social commitment (because then we'd never get anything done).
But I do think it would be useful to think it through. For example, it's easy to create a URL for a specimen. Easy peasy. OK, how do I discover that URL? How do I discover these for all specimens? Sounds like I need a centralised discover service like you'e described.
How do I handle changes in those URLs? I built a specimen code to GBIF resolver for BioStor so that I could link to specimens, GBIF changed lots of those URLs, all my work was undone, boy does GBIF suck sometimes. For example, if I map codes to URLs, I need to handle cases when they change.
If URLs can change, is there a way to defend against that (this is one reason for DOIs, or other methods of indirection, such as PURLs).
If providers change, will the URLs change? Is there a way to defend against that (again, DOIs handle this nicely by virtue of (a) indirection, and (b) lack of branding).
How can I encourage people to use the specimen service? What can I do to make them think it will persist? Can I convince academic publishers to trust it enough to link to it in articles? What's the pitch to Pensoft, to Magnolai Press, to Springer and Elsevier?
Is there some way to make the service itself become trusted? For example if I look at a journal and see that it has DOIs issued by CrossRef, I take that journal more seriously than if it's just got simple URLs. I know that papers in that journal will be linked into the citation network, I also know that there is a backup plan if the journal goes under (because you need that to have DOIs in CrossRef). Likewise, I think Figshare got a big boost when it stared minting DOIs (wow, a DOI, I know DOIs, you mean I can now cite stuff I've uploaded there?).
How can museums and herbaria be persuaded to keep their identifiers stable? What incentives can we provide (e.g., citation metrics for collections)? What system would enable us to do this? What about tracing funding (e.g., the NSF paid for these n papers, and they cite these y specimens, from these z collections, so science paid for by the NSF requires these collections to exist).
I guess I'm arguing that we should think all this through, because a specimen code to specimen URL is a small piece of the puzzle. Now, I'm desperately trying not to simply say what I think is blindingly obvious here (put DOIs on specimens, add metadata to specimen and specimen citation services, and we are done), but I think if we sit back and look at where we want to be, this is exactly what we need (or something functionally equivalent). Until we see the bigger picture, we will be stuck in amateur hour.
Take a look at:
http://search.crossref.org http://www.crossref.org/fundref/ http://support.crossref.org/ https://prospect.crossref.org/splash/
Isn't this the kind of stuff we'd like to do? If so, let's work out what's needed and make it happen.
In short, I think we constantly solve an immediate problem in the quickest way we know how, without thinking it through. I'd argue that if we think about the bigger picture (what do we want to be able to, what are the questions we want to be able to ask) then things become clearer. This is independent of getting everyone's agreement (but it would help if we made their agreement seem a no brainer by providing solutions to things that cause them pain).
Regards
Rod
On 5 May 2014, at 19:14, Hilmar Lapp hlapp@nescent.org wrote:
On Mon, May 5, 2014 at 1:29 PM, Roderic Page r.page@bio.gla.ac.uk wrote:
Contrary to Hilmar, there is more to this than simply a quick hackathon. Yes, a service that takes metadata and returns one or more identifiers is a good idea and easy to create (there will often be more than one because museum codes are not unique). But who maintains this service? Who maintains the identifiers? Who do I complain to if they break? How do we ensure that they persist when, say, a museum closes down, moves its collection, changes it's web technology? Who provides the tools that add value to the identifiers? (there's no point having them if they are not useful)
Jonathan Rees pointed this out to me too off-list. Just for the record, this isn't contrary but fully in line with what I was saying (or trying to say). Yes, I didn't elaborate that part, assuming, perhaps rather erroneously, that all this goes without saying, but I did mention that one part of this becoming a real solution has to be an institution with an in-scope cyberinfrastructure mandate that going in would make a commitment to sustain the resolver, including working with partners on the above slew of questions. The institution I gave was iDigBio; perhaps for some reason that would not be a good choice, but whether they are or not wasn't my point.
I will add one point to this, though. It seems to me that by continuing to argue that we can't go ahead with building a resolver that works (as far as technical requirements are concerned) before we haven't first fully addressed the institutional and social long-term sustainability commitment problem, we are and have been making this one big hairy problem that we can't make any practical pragmatic headway about, rather than breaking it down into parts, some of which (namely the primarily technical ones) are actually fairly straightforward to solve. As a result, to this day we don't have some solution that even though it's not very sustainable yet, at least proves to everyone how critical it is, and that the community can rally behind. Perhaps that's naïve, but I do think that once there's a solution the community rallies behind, ways to sustain it will be found.
-hilmar
Hilmar Lapp -:- informatics.nescent.org/wiki -:- lappland.io
I'm a big fan of not reinventing the wheel, and as such find the idea of using DOIs appealing. I think they pretty much follow all of the "rules" set out in the TDWG GUID Applicability Standard. They also play nicely in the Linked Data universe in their HTTP URI form, i.e. they redirect to HTML or RDF depending on the request header.
But I have a question for someone who understands how DOIs work better than I do. The HTML representation seems to arise by redirection to whatever is the current web page for the resource. You can see this by pasting this DOI for a specimen into a browser: http://dx.doi.org/10.7299/X7VQ32SJ which redirects to http://arctos.database.museum/guid/UAM:Ento:230092 when HTML is requested by a client. However, when the client requests RDF, one gets redirected to a DataCite metadata page: http://data.datacite.org/10.7299/X7VQ32SJ . Can the creator of the DOI redirect to any desired URI for the RDF?
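For anyone who wants to see the negotiation in action, here is a minimal sketch using only the Python standard library; the DOI and the two redirect targets are the ones cited above, everything else is illustrative:

    import urllib.request

    DOI_URL = "http://dx.doi.org/10.7299/X7VQ32SJ"

    def resolve(accept_header):
        # The DOI proxy answers with an HTTP redirect; urllib follows it,
        # so the final URL tells us where this representation lives.
        req = urllib.request.Request(DOI_URL, headers={"Accept": accept_header})
        with urllib.request.urlopen(req) as response:
            return response.geturl()

    print(resolve("text/html"))    # the Arctos specimen page
    print(resolve("text/turtle"))  # the DataCite metadata record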
The resulting RDF metadata doesn't have any of the kind of useful information about the specimen that you get on the web page, but rather looks like what you would expect for a publication (creator, publisher, date, etc.):
<http://dx.doi.org/10.7299/X7VQ32SJ>
    <http://purl.org/dc/terms/creator> "Derek S. Sikes" ;
    <http://purl.org/dc/terms/date> "2004" ;
    <http://purl.org/dc/terms/identifier> "10.7299/X7VQ32SJ" ;
    <http://purl.org/dc/terms/publisher> "University of Alaska Museum" ;
    <http://purl.org/dc/terms/title> "UAM:Ento:230092 - Grylloblatta campodeiformis" ;
    <http://www.w3.org/2002/07/owl#sameAs> "info:doi/10.7299/X7VQ32SJ" , "doi:10.7299/X7VQ32SJ" .
Can one control what kinds of metadata are provided in "DataCite's metadata"? Assuming that we get our act together and adopt an RDF guide for Darwin Core, it would be nice for the RDF metadata to look more like the description of a specimen and less like the description of a book. But maybe that's just a function of where the data provider chooses to redirect RDF requests.
Steve
John Deck wrote:
+1 on DOIs, and on ARKs (see: https://wiki.ucop.edu/display/Curation/ARK ), and I'll also mention IGSNs (see http://www.geosamples.org/). IGSNs are rapidly gaining traction for geo-samples. I don't know of anyone using them for bio-samples, but they offer many features that we've been asking for as well. What our community considers a sample (or observation) is diverse enough that multiple ID systems are probably inevitable and perhaps even warranted.
Whatever the ID system, the data providers (museums, field researchers, labs, etc.) must adopt that identifier and use it whenever linking to downstream sequence, image, and sub-sampling repository agencies. This is great to say in theory but difficult to do in reality, because the decision to adopt long-term, stable identifiers is often an institutional one, and the technology is still new and argued about, in particular on this fine list. Further, those agencies that receive data associated with a GUID must honor that source GUID when passing it to consumers and other aggregators, who must also have some level of confidence in the source GUIDs. Thus, a primary issue that we're confronted with here is trust.
Having Hilmar's hackathon support several possible GUID schemes (each with its own long-term persistence strategy), sponsored by a well-known global institution affiliated with biodiversity informatics that could offer technical guidance to data providers, good name branding, and the nuts-and-bolts expertise to demonstrate good shepherding of source GUIDs through a data aggregation chain, would be ideal. I nominate GBIF :)
John Deck
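A resolver front door for several GUID schemes, as John suggests, could be quite small. A hypothetical sketch in Python; the prefix checks and resolver endpoints below are assumptions for illustration, not an agreed service:

    def resolution_url(guid):
        # Dispatch on the identifier scheme and hand off to that scheme's
        # own resolver; the GUID itself is never rewritten, only prefixed.
        if guid.lower().startswith("doi:"):
            return "http://dx.doi.org/" + guid[4:]
        if guid.startswith("ark:"):
            return "http://n2t.net/" + guid  # ARKs resolve via the N2T resolver
        if guid.upper().startswith("IGSN:"):
            return "http://hdl.handle.net/10273/" + guid[5:]  # IGSNs sit under Handle prefix 10273
        if guid.startswith("http://") or guid.startswith("https://"):
            return guid  # already an actionable HTTP URI
        raise ValueError("unrecognised GUID scheme: " + guid)

    print(resolution_url("doi:10.7299/X7VQ32SJ"))
    print(resolution_url("ark:/13030/tf5p30086k"))  # an example ARK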
On Mon, May 5, 2014 at 1:09 PM, Roderic Page <r.page@bio.gla.ac.uk> wrote:
Hi Markus,

I have three use cases:

1. Linking sequences in GenBank to voucher specimens. Lots of voucher specimens are listed in GenBank but not linked to digital records for those specimens. These links are useful in two directions: one is to link GBIF to genomic data, the second is to enhance data in both databases, see http://iphylo.blogspot.co.uk/2012/02/linking-gbif-and-genbank.html (e.g., by adding missing georeferencing that is available in one database but not the other).

2. Linking to specimens cited in the literature. I've done some work on this in BioStor, see http://iphylo.blogspot.co.uk/2012/02/linking-gbif-and-biodiversity-heritage.html One immediate benefit of this is that GBIF could display the scientific literature associated with a specimen, so we get access to the evidence supporting identification, georeferencing, etc.

3. Citation metrics for collections, see http://iphylo.blogspot.co.uk/2013/05/the-impact-of-museum-collections-one.html and http://iphylo.blogspot.co.uk/2012/02/gbif-specimens-in-biostor-who-are-top.html Based on citations of specimens in the literature, and in databases such as GenBank (i.e., basically combining 1 + 2 above), we can demonstrate the value of a collection.

All of these use cases depend on GBIF occurrenceIds remaining stable; I have often ranted on iPhylo when this doesn't happen: http://iphylo.blogspot.co.uk/2012/07/dear-gbif-please-stop-changing.html

Regards

Rod
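Use cases 1 and 2 both reduce to looking records up by the traditional triplet. A sketch against the GBIF occurrence search API (these filter parameters exist in the v1 API; the triplet values are those of the Arctos specimen discussed elsewhere in this thread); note the lookup can return several records, museum codes not being unique:

    import json
    import urllib.parse
    import urllib.request

    def gbif_occurrence_keys(institution_code, collection_code, catalog_number):
        # Ask GBIF for occurrences matching the triplet and return the
        # stable integer keys (see Markus's point below about GBIF ids).
        params = urllib.parse.urlencode({
            "institutionCode": institution_code,
            "collectionCode": collection_code,
            "catalogNumber": catalog_number,
        })
        url = "http://api.gbif.org/v1/occurrence/search?" + params
        with urllib.request.urlopen(url) as response:
            return [record["key"] for record in json.load(response)["results"]]

    print(gbif_occurrence_keys("UAM", "Ento", "230092"))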
On 5 May 2014, at 20:51, Markus Döring <mdoering@gbif.org> wrote:

Hi Rod,

I agree GBIF has trouble keeping identifiers stable for *some* records, but in general we do a much better job than the original publishers in the first place. We try hard to keep GBIF ids stable even if publishers change collection codes, register datasets twice, or do other things to break a simple automated way of mapping source records to existing GBIF ids. Also, the stable identifier in GBIF has never been the URL; it is the local GBIF integer alone. The GBIF services that consume those ids have changed over the years, but it's pretty trivial to adjust if you use the GBIF ids instead of the URLs. If there is a clear need to have stable URLs instead, I am sure we can get that working easily.

The two real issues for GBIF are a) duplicates and b) records with varying local identifiers of any sort (triplet, occurrenceID or whatever else).

When it comes to the varying source identifiers, I have always liked the idea of flagging those records and datasets as unstable, so it is obvious to users. This is not 100% safe, but the most terrible datasets change all of their ids, and that is easily detectable. Also, with a service like that it would become more obvious to publishers how important stable source ids are.

Before jumping on DOIs as the next big thing, I would really like to understand what needs the community has around specimen ids. Gabi clearly has a very real use case; are there others we know about?

Markus
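Markus's flagging idea is cheap to compute if you keep the set of local identifiers from successive harvests of a dataset. A minimal sketch (the 0.9 threshold is an arbitrary assumption):

    def id_stability(previous_ids, current_ids):
        # Fraction of the previous harvest's identifiers that survive
        # into the current one; 1.0 means perfectly stable ids.
        previous_ids, current_ids = set(previous_ids), set(current_ids)
        if not previous_ids:
            return 1.0
        return len(previous_ids & current_ids) / len(previous_ids)

    def is_unstable(previous_ids, current_ids, threshold=0.9):
        return id_stability(previous_ids, current_ids) < threshold

    # A dataset that renamed two of its three records gets flagged.
    print(is_unstable({"A1", "A2", "A3"}, {"B1", "B2", "A3"}))  # True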
The biggest appeal to me of the linked data framework is connecting data across domains... you have bio-, geo-, eco-, and -omic data, not to mention all their various media representations. It's hard to believe that DOIs solve all of these needs (just wait till someone wants to assign DOIs to loci from NextGen sequencing, or delves into transcriptomics with this). I'd hope that GUID services could provide high-level, consistent metadata (such as the DataCite metadata, but maybe a bit more, like type), and provide a clear articulation of the service that stands behind the identifiers, no matter what you're dealing with.
As far as delivering more specific RDF, I'm more inclined to think along the lines of your last sentence: "But maybe that's just a function of where the data provider chooses to redirect RDF requests" and let the providers themselves describe it.
John
The irony is that this is what LSIDs were supposed to do: they had semantics for describing the nature of the data, and how to access it (HTTP, FTP, SOAP, etc.). In the end, too many moving parts and a failure to keep things simple (i.e., resolve in a browser, be easy to serve) killed them. We need to keep things simple and useful if we are to avoid that trap again.
Regards
Rod
Hi Steve,
My understanding is that the non-HTML content is decided at the level of the registration agency. For a bibliographic DOI registered with CrossRef, the HTML redirect goes to whatever the publisher provides CrossRef (e.g., the article landing page); other content (including RDF) is served by CrossRef based on the metadata they hold for each article. Likewise, DataCite will serve metadata based on what they have. Hence, metadata from CrossRef and DataCite look rather different.
So, this is something that would need to be worked out at the level of the registration agency (see http://www.crossref.org/CrossTech/2010/03/dois_and_linked_data_some_conc.htm... and http://crosstech.crossref.org/2011/04/content_negotiation_for_crossr.html for background).
Hence, if GBIF were to be a DOI registration agency, it could serve Darwin Core RDF (and JSON and whatever else it wants). This is a strong argument for GBIF doing this, rather than using DataCite (which serves very generic metadata).
Regards
Rod
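To make Rod's last point concrete, here is a sketch (using the rdflib library) of the kind of specimen-shaped RDF a domain registration agency could serve for the same DOI, with Darwin Core terms in place of the bibliographic ones. The term choices are illustrative only, since no RDF guide for Darwin Core has been adopted yet, as Steve notes:

    from rdflib import Graph, Literal, Namespace, URIRef

    DWC = Namespace("http://rs.tdwg.org/dwc/terms/")

    g = Graph()
    g.bind("dwc", DWC)
    specimen = URIRef("http://dx.doi.org/10.7299/X7VQ32SJ")
    g.add((specimen, DWC.institutionCode, Literal("UAM")))
    g.add((specimen, DWC.collectionCode, Literal("Ento")))
    g.add((specimen, DWC.catalogNumber, Literal("230092")))
    g.add((specimen, DWC.scientificName, Literal("Grylloblatta campodeiformis")))

    print(g.serialize(format="turtle"))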
On 6 May 2014, at 01:42, Steve Baskauf <steve.baskauf@vanderbilt.edumailto:steve.baskauf@vanderbilt.edu> wrote:
I'm a big fan of not reinventing the wheel, and as such find the idea of using DOIs appealing. I think they pretty much follow all of the "rules" set out in the TDWG GUID Applicability Standard. They also play nicely in the Linked Data universe in their HTTP URI form, i.e. they redirect to HTML or RDF depending on the request header.
But I have a question for someone who understands how DOIs work better than I do. The HTML representation seems to arise by redirection to whatever is the current web page for the resource. You can see this by pasting this DOI for a specimen into a browser: http://dx.doi.org/10.7299/X7VQ32SJ which redirects to http://arctos.database.museum/guid/UAM:Ento:230092 when HTML is requested by a client. However, when the client requests RDF, one gets redirected to a DataCite metadata page: http://data.datacite.org/10.7299/X7VQ32SJ . Can the creator of the DOI redirect to any desired URI for the RDF?
The resulting RDF metadata doesn't have any of the kind useful information about the specimen that you get on the web page but rather looks like what you would expect for a publication (creator, publisher, date, etc.):
<http://dx.doi.org/10.7299/X7VQ32SJhttp://linkeddata.informatik.hu-berlin.de/uridbg/index.php?url=http://dx.doi.org/10.7299/X7VQ32SJ&acceptheader=text%2Fturtle%3Bq%3D1%2Capplication%2Fx-turtle%3Bq%3D0.5&useragentheader=> <http://purl.org/dc/terms/creatorhttp://linkeddata.informatik.hu-berlin.de/uridbg/index.php?url=http://purl.org/dc/terms/creator&acceptheader=text%2Fturtle%3Bq%3D1%2Capplication%2Fx-turtle%3Bq%3D0.5&useragentheader=> "Derek S. Sikes" ; <http://purl.org/dc/terms/datehttp://linkeddata.informatik.hu-berlin.de/uridbg/index.php?url=http://purl.org/dc/terms/date&acceptheader=text%2Fturtle%3Bq%3D1%2Capplication%2Fx-turtle%3Bq%3D0.5&useragentheader=> "2004" ; <http://purl.org/dc/terms/identifierhttp://linkeddata.informatik.hu-berlin.de/uridbg/index.php?url=http://purl.org/dc/terms/identifier&acceptheader=text%2Fturtle%3Bq%3D1%2Capplication%2Fx-turtle%3Bq%3D0.5&useragentheader=> "10.7299/X7VQ32SJ" ; <http://purl.org/dc/terms/publisherhttp://linkeddata.informatik.hu-berlin.de/uridbg/index.php?url=http://purl.org/dc/terms/publisher&acceptheader=text%2Fturtle%3Bq%3D1%2Capplication%2Fx-turtle%3Bq%3D0.5&useragentheader=> "University of Alaska Museum" ; <http://purl.org/dc/terms/titlehttp://linkeddata.informatik.hu-berlin.de/uridbg/index.php?url=http://purl.org/dc/terms/title&acceptheader=text%2Fturtle%3Bq%3D1%2Capplication%2Fx-turtle%3Bq%3D0.5&useragentheader=> "UAM:Ento:230092 - Grylloblatta campodeiformis" ; <http://www.w3.org/2002/07/owl#sameAshttp://linkeddata.informatik.hu-berlin.de/uridbg/index.php?url=http://www.w3.org/2002/07/owl#sameAs&acceptheader=text%2Fturtle%3Bq%3D1%2Capplication%2Fx-turtle%3Bq%3D0.5&useragentheader=> "info:doi/10.7299/X7VQ32SJ" , "doi:10.7299/X7VQ32SJ" .
Can one control what kinds of metadata are provided in "DataCite's metadata"? Assuming that we get our act together and adopt an RDF guide for Darwin Core, it would be nice for the RDF metadata to look more like the description of a specimen and less like the description of a book. But maybe that's just a function of where the data provider choses to redirect RDF requests.
Steve
John Deck wrote: +1 on DOIs, and on ARKS (see: https://wiki.ucop.edu/display/Curation/ARK ), and also i'll mention IGSN:'s (see http://www.geosamples.org/) IGSN: is rapidly gaining traction for geo-samples. I don't know of anyone using them for bio-samples but they offer many features that we've been asking for as well. What our community considers a sample (or observation) is diverse enough that multiple ID systems are probably inevitable and perhaps even warranted.
Whatever the ID system, the data providers (museums, field researchers, labs, etc..) must adopt that identifier and use it whenever linking to downstream sequence, image, and sub-sampling repository agencies. This is great to say this in theory but difficult to do in reality because the decision to adopt long term and stable identifiers is often an institutional one, and the technology is still new and argued about, in particular, on this fine list. Further, those agencies that receive data associated with a GUID must honor that source GUID when passing to consumers and other aggregators, who must also have some level of confidence in the source GUIDs as well. Thus, a primary issue that we're confronted with here is trust.
Having Hilmar's hackathon support several possible GUID schemes (each with their own long term persistence strategy), and sponsored by a well known global institution affiliated with biodiversity informatics that could offer technical guidance to data providers, good name branding, and the nuts and bolts expertise to demonstrate good shepherding of source GUIDs through a data aggregation chain would be ideal. I nominate GBIF :)
John Deck
On Mon, May 5, 2014 at 1:09 PM, Roderic Page <r.page@bio.gla.ac.ukmailto:r.page@bio.gla.ac.uk> wrote: Hi Markus,
I have three use cases that
1. Linking sequences in GenBank to voucher specimens. Lots of voucher specimens are listed in GenBank but not linked to digital records for those specimens. These links are useful in two directions, one is to link GBIF to genomic data, the second is to enhance data in both databases, see http://iphylo.blogspot.co.uk/2012/02/linking-gbif-and-genbank.html (e.g., by adding missing georeferencing that is available in one database but not the other).
2. Linking to specimens cited in the literature. I’ve done some work on this in BioStor, see http://iphylo.blogspot.co.uk/2012/02/linking-gbif-and-biodiversity-heritage.... One immediate benefit of this is that GBIF could display the scientific literature associated with a specimen, so we get access to the evidence supporting identification, georeferencing, etc.
3. Citation metrics for collections, see http://iphylo.blogspot.co.uk/2013/05/the-impact-of-museum-collections-one.ht... and http://iphylo.blogspot.co.uk/2012/02/gbif-specimens-in-biostor-who-are-top.h... Based on citation sod specimens in the literature, and in databases such as GenBank (i.e., basically combining 1 + 2 above) we can demonstrate the value of a collection.
All of these use cases depend on GBIF occurenceIds remaining stable, I have often ranted on iPHylo when this doesn’t happen: http://iphylo.blogspot.co.uk/2012/07/dear-gbif-please-stop-changing.html
Regards
Rod
On 5 May 2014, at 20:51, Markus Döring <mdoering@gbif.orgmailto:mdoering@gbif.org> wrote:
Hi Rod,
I agree GBIF has troubles to keep identifiers stable for *some* records, but in general we do a much better job than the original publishers in the first place. We try hard to keep GBIF ids stable even if publishers change collection codes, registered datasets twice or do other things to break a simple automated way of mapping source records to existing GBIF ids. Also the stable identifier in GBIF never has been the URL, but it is the local GBIF integer alone. The GBIF services that consume those ids have changed over the years, but its pretty trivial to adjust if you use the GBIF ids instead of the URLs. If there is a clear need to have stable URLs instead I am sure we can get that working easily.
The two real issues for GBIF are a) duplicates and b) records with varying local identifiers of any sort (triplet, occurrenceID or whatever else).
When it comes to the varying source identifiers I always liked the idea of flagging those records and datasets as unstable, so it is obvious to users. This is not a 100% safe, but most terrible datasets change all of their ids and that is easily detectable. Also with a service like that it would become more obvious to publishers how important stable source ids are.
Before jumping on DOIs as the next big thing I would really like to understand what needs the community has around specimen ids. Gabi clearly has a very real use case, are there others we know about?
Markus
On 05 May 2014, at 21:05, Roderic Page <r.page@bio.gla.ac.ukmailto:r.page@bio.gla.ac.uk> wrote:
Hi Hilmar,
I’m not arguing that we shouldn’t build a resolver (I have one that I use, Rich has mentioned he’s got one, Markus has one at GBIF, etc.).
Nor do I think we should wait for institutional and social commitment (because then we’d never get anything done).
But I do think it would be useful to think it through. For example, it’s easy to create a URL for a specimen. Easy peasy. OK, how do I discover that URL? How do I discover these for all specimens? Sounds like I need a centralised discover service like you’e described.
How do I handle changes in those URLs? I built a specimen code to GBIF resolver for BioStor so that I could link to specimens, GBIF changed lots of those URLs, all my work was undone, boy does GBIF suck sometimes. For example, if I map codes to URLs, I need to handle cases when they change.
If URLs can change, is there a way to defend against that (this is one reason for DOIs, or other methods of indirection, such as PURLs).
If providers change, will the URLs change? Is there a way to defend against that (again, DOIs handle this nicely by virtue of (a) indirection, and (b) lack of branding).
How can I encourage people to use the specimen service? What can I do to make them think it will persist? Can I convince academic publishers to trust it enough to link to it in articles? What’s the pitch to Pensoft, to Magnolia Press, to Springer and Elsevier?
Is there some way to make the service itself become trusted? For example, if I look at a journal and see that it has DOIs issued by CrossRef, I take that journal more seriously than if it’s just got simple URLs. I know that papers in that journal will be linked into the citation network, and I also know that there is a backup plan if the journal goes under (because you need that to have DOIs in CrossRef). Likewise, I think Figshare got a big boost when it started minting DOIs (wow, a DOI, I know DOIs, you mean I can now cite stuff I’ve uploaded there?).
How can museums and herbaria be persuaded to keep their identifiers stable? What incentives can we provide (e.g., citation metrics for collections)? What system would enable us to do this? What about tracing funding (e.g., the NSF paid for these n papers, and they cite these y specimens, from these z collections, so science paid for by the NSF requires these collections to exist).
I guess I’m arguing that we should think all this through, because a specimen code to specimen URL is a small piece of the puzzle. Now, I’m desperately trying not to simply say what I think is blindingly obvious here (put DOIs on specimens, add metadata to specimen and specimen citation services, and we are done), but I think if we sit back and look at where we want to be, this is exactly what we need (or something functionally equivalent). Until we see the bigger picture, we will be stuck in amateur hour.
Take a look at:
http://search.crossref.org/ http://www.crossref.org/fundref/ http://support.crossref.org/ https://prospect.crossref.org/splash/
Isn’t this the kind of stuff we’d like to do? If so, let’s work out what’s needed and make it happen.
In short, I think we constantly solve an immediate problem in the quickest way we know how, without thinking it through. I’d argue that if we think about the bigger picture (what do we want to be able to do, what questions do we want to be able to ask) then things become clearer. This is independent of getting everyone’s agreement (but it would help if we made their agreement seem a no-brainer by providing solutions to things that cause them pain).
Regards
Rod
On 5 May 2014, at 19:14, Hilmar Lapp <hlapp@nescent.org> wrote:
On Mon, May 5, 2014 at 1:29 PM, Roderic Page <r.page@bio.gla.ac.uk> wrote: Contrary to Hilmar, there is more to this than simply a quick hackathon. Yes, a service that takes metadata and returns one or more identifiers is a good idea and easy to create (there will often be more than one because museum codes are not unique). But who maintains this service? Who maintains the identifiers? Who do I complain to if they break? How do we ensure that they persist when, say, a museum closes down, moves its collection, or changes its web technology? Who provides the tools that add value to the identifiers? (There’s no point having them if they are not useful.)
Jonathan Rees pointed this out to me too off-list. Just for the record, this isn't contrary but fully in line with what I was saying (or trying to say). Yes, I didn't elaborate that part, assuming, perhaps rather erroneously, that all this goes without saying, but I did mention that one part of this becoming a real solution has to be an institution with an in-scope cyberinfrastructure mandate that would, going in, make a commitment to sustain the resolver, including working with partners on the above slew of questions. The institution I named was iDigBio; perhaps for some reason that would not be a good choice, but whether they are or not wasn't my point.
I will add one point to this, though. It seems to me that by continuing to argue that we can't go ahead with building a resolver that works (as far as technical requirements are concerned) before we have first fully addressed the institutional and social long-term sustainability commitment problem, we are and have been making this one big hairy problem that we can't make any practical, pragmatic headway on, rather than breaking it down into parts, some of which (namely the primarily technical ones) are actually fairly straightforward to solve. As a result, to this day we don't have a solution that, even though not yet very sustainable, at least proves to everyone how critical it is, and that the community can rally behind. Perhaps that's naïve, but I do think that once there's a solution the community rallies behind, ways to sustain it will be found.
-hilmar -- Hilmar Lapp -:- informatics.nescent.org/wiki -:- lappland.io
Every registration agency has its own set of standard metadata which members register for every DOI, but the content-negotiation strategy does allow for a richer metadata response. By default it is the registration agency's resolver that responds with RDF (and thus only with the metadata it knows of), but members (the entities registering DOIs) can register their own content-negotiation resolver, which would allow them to return richer metadata. We have, for example, considered doing this for Dryad ( http://datadryad.org), but it hasn't risen to high-enough priority yet.
Hence, if GBIF were to register DOIs for specimens through DataCite (rather than being its own RA), then GBIF could still operate its own resolver for returning DwC metadata for RDF queries.
That doesn't mean there couldn't still be good arguments for GBIF serving as an RA.
-hilmar
On Tue, May 6, 2014 at 5:53 AM, Roderic Page <Roderic.Page@glasgow.ac.uk> wrote:
Hi Steve,
My understanding is that the non-HTML content is decided at the level of the registration agency. For a bibliographic DOI registered with CrossRef, the HTML redirect goes to whatever the publisher provides CrossRef (e.g., the article landing page); other content (including RDF) is served by CrossRef based on the metadata they hold for each article. Likewise, DataCite will serve metadata based on what they have. Hence, metadata from CrossRef and DataCite look rather different.
So, this is something that would need to be worked out at the level of registration agency (see http://www.crossref.org/CrossTech/2010/03/dois_and_linked_data_some_conc.htm... and http://crosstech.crossref.org/2011/04/content_negotiation_for_crossr.html for background).
Hence, if GBIF were to be a DOI registration agency, they could serve Darwin Core RDF (and JSON and whatever else they want). This is a strong argument for GBIF doing this, rather than using DataCite (which serves very generic metadata).
Regards
Rod
On 6 May 2014, at 01:42, Steve Baskauf steve.baskauf@vanderbilt.edu wrote:
I'm a big fan of not reinventing the wheel, and as such find the idea of using DOIs appealing. I think they pretty much follow all of the "rules" set out in the TDWG GUID Applicability Standard. They also play nicely in the Linked Data universe in their HTTP URI form, i.e. they redirect to HTML or RDF depending on the request header.
But I have a question for someone who understands how DOIs work better than I do. The HTML representation seems to arise by redirection to whatever is the current web page for the resource. You can see this by pasting this DOI for a specimen into a browser: http://dx.doi.org/10.7299/X7VQ32SJ which redirects to http://arctos.database.museum/guid/UAM:Ento:230092 when HTML is requested by a client. However, when the client requests RDF, one gets redirected to a DataCite metadata page: http://data.datacite.org/10.7299/X7VQ32SJ . Can the creator of the DOI redirect to any desired URI for the RDF?
The resulting RDF metadata doesn't have any of the kind of useful information about the specimen that you get on the web page, but rather looks like what you would expect for a publication (creator, publisher, date, etc.):
<http://dx.doi.org/10.7299/X7VQ32SJ>
    <http://purl.org/dc/terms/creator> "Derek S. Sikes" ;
    <http://purl.org/dc/terms/date> "2004" ;
    <http://purl.org/dc/terms/identifier> "10.7299/X7VQ32SJ" ;
    <http://purl.org/dc/terms/publisher> "University of Alaska Museum" ;
    <http://purl.org/dc/terms/title> "UAM:Ento:230092 - Grylloblatta campodeiformis" ;
    <http://www.w3.org/2002/07/owl#sameAs> "info:doi/10.7299/X7VQ32SJ" , "doi:10.7299/X7VQ32SJ" .
Can one control what kinds of metadata are provided in "DataCite's metadata"? Assuming that we get our act together and adopt an RDF guide for Darwin Core, it would be nice for the RDF metadata to look more like the description of a specimen and less like the description of a book. But maybe that's just a function of where the data provider chooses to redirect RDF requests.
Steve
John Deck wrote:
+1 on DOIs, and on ARKs (see https://wiki.ucop.edu/display/Curation/ARK), and I'll also mention IGSNs (see http://www.geosamples.org/). IGSN is rapidly gaining traction for geo-samples. I don't know of anyone using them for bio-samples, but they offer many features that we've been asking for as well. What our community considers a sample (or observation) is diverse enough that multiple ID systems are probably inevitable and perhaps even warranted.
Whatever the ID system, the data providers (museums, field researchers, labs, etc.) must adopt that identifier and use it whenever linking to downstream sequence, image, and sub-sampling repository agencies. This is great in theory but difficult to do in reality, because the decision to adopt long-term, stable identifiers is often an institutional one, and the technology is still new and argued about, in particular on this fine list. Further, those agencies that receive data associated with a GUID must honor that source GUID when passing it to consumers and other aggregators, who must also have some level of confidence in the source GUIDs. Thus, a primary issue that we're confronted with here is trust.
Having Hilmar's hackathon support several possible GUID schemes (each with its own long-term persistence strategy), sponsored by a well-known global institution affiliated with biodiversity informatics that could offer technical guidance to data providers, good name branding, and the nuts-and-bolts expertise to demonstrate good shepherding of source GUIDs through a data aggregation chain, would be ideal. I nominate GBIF :)
John Deck
Hi all,
Suppose GBIF or some other body were interested in offering such a central service as proposed in this thread. Can we articulate what we envisage the process would be?
a) client has a specimen record they wish to stamp with an identifier
b) client requests DOI (or other format) from the issuing service and provides the minimum metadata in a DwC-esque profile, potentially with a preferred suffix
c) service provides identifier, and client stores this along with their digital record
d) from this point on, the DOI identifies the record
(a rough client-side sketch of this flow follows at the end of this message)
If so, what would happen on resolution? Does the client provide the target URL during minting which will be the redirection target on resolution? Does the service have to monitor the availability and return the cached copy on outage?
In such a model, effectively we would have a central specimen registration service where data owners push individual specimen records. Is that something we envisage the community would accept? Presumably the minimum metadata would include things like dwc:scientificName - would someone register a DOI pushing that for specimens of a new species before they have published on the name?
This model will not in itself stop duplicate IDs. The scientists assembling datasets of specimens referenced in a paper might submit those referenced specimens for DOIs, while the original specimen curators might also submit the same records - thus the specimen is identified twice. Which piece of the infrastructure would capture that relationship?
What seems most important to me when I think this through is that the identifier needs to be minted as early on as possible in the record life - before it is shared with others. Which leads us back to the question of whether we envisage people adopting a model where they effectively submit their record data in order to get an identifier. If not, at least if we got stable IDs on records in whatever form, we can manage the resolvability bit later, and identify duplicates.
It would be interesting to hear how others imagine such a service operating.
Thanks, Tim
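To make Tim's a)-d) concrete, here is a rough client-side sketch. The endpoint, payload fields, and response shape are hypothetical placeholders for whatever such a registration service would actually expose; real minting APIs (DataCite's, for example) differ in detail:

    # Sketch of steps a)-d): push minimal DwC-ish metadata for one record,
    # receive a DOI, and store it with the local record. The endpoint and
    # field names are hypothetical.
    import json
    import urllib.request

    MINT_ENDPOINT = "https://ids.example.org/mint"  # hypothetical service

    def mint_identifier(record, preferred_suffix=None):
        """Steps b) and c): request a DOI for one specimen record."""
        payload = {
            "institutionCode": record["institutionCode"],
            "collectionCode": record["collectionCode"],
            "catalogNumber": record["catalogNumber"],
            "target": record["landingPageUrl"],  # redirection target on resolution
        }
        if preferred_suffix:
            payload["suffix"] = preferred_suffix
        req = urllib.request.Request(
            MINT_ENDPOINT,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)["doi"]

    # Step a) is the client-side record; step d) stores the DOI back onto it.
    record = {"institutionCode": "UAM", "collectionCode": "Ento",
              "catalogNumber": "230092",
              "landingPageUrl": "http://arctos.database.museum/guid/UAM:Ento:230092"}
    record["doi"] = mint_identifier(record)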
On Tue, May 6, 2014 at 10:10 AM, Tim Robertson [GBIF] <trobertson@gbif.org> wrote:
a) client has a specimen record they wish to stamp with an identifier
b) client requests DOI (or other format) from the issuing service and provides the minimum metadata in a DwC-esque profile, potentially with a preferred suffix
c) service provides identifier, and client stores this along with their digital record
d) from this point on, the DOI identifies the record
Pretty much, I think.
If so, what would happen on resolution? Does the client provide the target URL during minting which will be the redirection target on resolution?
Yes.
Does the service have to monitor the availability and return the cached copy on outage?
No. This is why you sometimes get an error when resolving a DOI for an article.
The thing with CrossRef is that by agreeing to the terms of service publishers are obligated to maintain the DOI metadata record, including a properly resolving landing page. Repeat-offending publishers get dinged. DataCite also has a dashboard that shows which members have how many DOIs for which they fail resolution.
What seems most important to me when I think this through is that the identifier needs to be minted as early on as possible in the record life - before it is shared with others. Which leads us back to the question of whether we envisage people adopting a model where they effectively submit their record data in order to get an identifier. If not, at least if we got stable IDs on records in whatever form, we can manage the resolvability bit later, and identify duplicates.
Note that you can mint the DOI before registering it. It won't resolve until it's registered, but there is no requirement that publication must precede deciding on the DOI.
In other words, the client decides on what the DOI is to be, not the registration agency.
-hilmar
Does all of this scale to millions or 100s of millions of records? “DataCite also has a dashboard that shows which members have how many DOIs for which they fail resolution.” If the number of failed occurrence record DOIs for a member is, say, 100,000 out of 10 million, who has the resources to sort through 100,000 records and fix them? It’s one thing to have a journal article with a DOI; it’s a totally different thing to have a million records with DOIs.
And the key question of them all – Who pays for the 100s of millions of DOIs if DataCite or CrossRef mint them? If GBIF could be set up as an RA paying a single annual fee to DataCite, then the DOI cost problem could possibly be resolved.
Chuck
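Chuck's triage question presupposes being able to audit resolution at all, which is at least mechanically simple. A naive sketch (sequential on purpose; at millions of records you would batch, parallelise, and rate-limit, or rely on the registration agency's own failure reports rather than polling):

    # Sketch: audit a batch of DOIs by asking the proxy whether each one
    # still resolves. Deliberately naive; not how you would hit 10M records.
    import urllib.error
    import urllib.request

    def resolves(doi, timeout=10.0):
        """True if the DOI proxy redirects somewhere that answers."""
        req = urllib.request.Request("http://dx.doi.org/" + doi, method="HEAD")
        try:
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                return resp.status < 400
        except (urllib.error.HTTPError, urllib.error.URLError):
            return False

    dois = ["10.7299/X7VQ32SJ"]  # in practice: streamed from a full dump
    failed = [d for d in dois if not resolves(d)]
    print(len(failed), "of", len(dois), "DOIs failed to resolve")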
Chuck, The scaling issue is one of the reasons BiSciCol started working with ARKs. But key to any solution is having a process for maintaining the identifiers, and that is best done by a 3rd party and costs $. Good idea on sharing that burden!
Sent from my iPhone
Hi Tim,
On 6 May 2014, at 15:10, Tim Robertson [GBIF] <trobertson@gbif.org> wrote:
Hi all,
Suppose GBIF or some other body were interested in offering such a central service as proposed in this thread. Can we articulate what we envisage the process would be?
a) client has a specimen record they wish to stamp with an identifier
A crucial first step is having data providers generate and maintain ids that are unique within their organisation. By maintain I mean they don’t change just because someone decides to use new software, or to rename "HERP” as “HERPS”, etc.
In the academic publishing world you will see all sorts of conventions for naming articles, e.g.
bms.2012.1078
S0025315407056196
16085914.2012.668850
0278-0372(2000)020[0603:DIMAWI]2.0.CO;2
that are designed to create identifiers unique within a publisher (some of these are obviously based on article metadata). These can then be migrated to DOIs by prepending a DOI prefix, e.g.
http://dx.doi.org/10.5343/bms.2012.1078
http://dx.doi.org/10.1651/0278-0372(2000)020%5B0603:DIMAWI%5D2.0.CO;2 (yuck)
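The convention Rod describes is trivially mechanisable, which is rather the point. A sketch (10.5555 is a placeholder prefix, not anyone's real assignment; the percent-encoding matters for legacy suffixes like the bracketed one above):

    # Sketch: locally unique id -> DOI -> proxy URL, by prepending a
    # registrant prefix. "10.5555" is a hypothetical placeholder prefix.
    import urllib.parse

    PREFIX = "10.5555"

    def to_doi(local_id):
        """Make a locally unique id globally unique by prefixing it."""
        return PREFIX + "/" + local_id

    def to_url(doi):
        """Build a resolvable proxy URL, encoding awkward suffix characters."""
        prefix, suffix = doi.split("/", 1)
        return "http://dx.doi.org/" + prefix + "/" + urllib.parse.quote(suffix, safe="")

    print(to_url(to_doi("bms.2012.1078")))
    print(to_url(to_doi("0278-0372(2000)020[0603:DIMAWI]2.0.CO;2")))  # the "yuck" case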
b) client requests DOI (or other format) from the issuing service and provides the minimum metadata in a DwC-esque profile, potentially with a preferred suffix
Yes. Isn’t this pretty much what happens already, in the sense that data providers send GBIF data in an agreed format?
c) service provides identifier, and client stores this along with their digital record
Service provider mints the DOI, which ideally is as simple as putting a 10.xxx prefix in front of their locally unique identifier (if the 10.xxx prefix is unique to that client), or adding an additional string to make it globally unique.
d) from this point on, the DOI identifies the record
Yes
If so, what would happen on resolution? Does the client provide the target URL during minting which will be the redirection target on resolution? Does the service have to monitor the availability and return the cached copy on outage?
Yes, typically the client provides the URL. Alternatively, the client could delegate that to GBIF (“we can’t handle this right now, can GBIF please host HTML for the occurrence”), in which case GBIF would add its own URL.
As Hilmar has said, CrossRef will penalise providers that fail to resolve. But just as importantly, they provide a support tool where you can report a DOI that doesn’t resolve. I think an obvious approach here is for GBIF to provide a fallback. One way to do this would be to have a resolver, e.g. doi.gbif.org, that will, by default, go to the client’s page, and otherwise resolve to GBIF’s version of the HTML for a DOI.
In such a model, effectively we would have a central specimen registration service where data owners push individual specimen records. Is that something we envisage the community would accept? Presumably the minimum metadata would include things like dwc:scientificName - would someone register a DOI pushing that for specimens of a new species before they have published on the name?
Isn’t this pretty much GBIF’s current model? Data owners push data to GBIF, GBIF expects certain standard fields, and assigns an occurrence id. What changes is that the id becomes a DOI.
This model will not in itself stop duplicate IDs. The scientists assembling datasets of specimens referenced in a paper might submit those references specimens for DOIs, while the original specimen curators might also submit the same records - thus the specimen is identified twice. Which piece of the infrastructure would capture that relationship?
No, this is equivalent to me writing a paper, citing some articles, and saying I’d like to mint DOIs for those articles! Either the DOIs exist (in which case I or the publisher of my article can link to them) or they don’t, in which case I get a “naked” citation string. Either way, authors of papers don’t get to mint DOIs for specimens.
One model is that DOIs are minted for specimens in a collection (the “publisher” or “data owner”). For example, only the AMNH can apply for DOIs for its specimens. I guess this does raise some issues about how easy it is to identify who actually “owns” the specimen (i.e., is the primary source of data for that specimen).
What seems most important to me when I think this through is that the identifier needs to be minted as early on as possible in the record life - before it is shared with others. Which leads us back to the question of whether we envisage people adopting a model where they effectively submit their record data in order to get an identifier. If not, at least if we got stable IDs on records in whatever form, we can manage the resolvability bit later, and identify duplicates.
So, a first step is for providers submitting data to GBIF to have locally unique, stable identifiers. You’d think this would be an obvious and trivial thing, but I suspect it won’t be...
Regards
Rod
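Rod's doi.gbif.org fallback can likewise be sketched in a few lines: prefer the provider's registered landing page, and redirect to a GBIF-hosted copy when it is down. Everything here, URL pattern included, is hypothetical:

    # Sketch of the fallback resolver idea: try the provider's URL first,
    # otherwise send the user to an aggregator-hosted copy of the record.
    import urllib.error
    import urllib.request

    GBIF_FALLBACK = "http://doi.gbif.org/fallback/{doi}"  # hypothetical pattern

    def alive(url, timeout=5.0):
        """Cheap liveness probe using a HEAD request."""
        try:
            req = urllib.request.Request(url, method="HEAD")
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                return resp.status < 400
        except (urllib.error.HTTPError, urllib.error.URLError):
            return False

    def redirect_target(doi, provider_url):
        """Where the resolver should send the user for this DOI."""
        if alive(provider_url):
            return provider_url
        return GBIF_FALLBACK.format(doi=doi)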
It would be interesting to hear how others imagine such a service operating.
Thanks, Tim
On 06 May 2014, at 15:34, Hilmar Lapp <hlapp@nescent.orgmailto:hlapp@nescent.org> wrote:
Every registration agency has its own set of standard metadata which members register for every DOI, but the content-negotiation strategy does allow for a richer metadata response. By default it is the registration agency's resolver that responds with RDF (and thus only with the metadata it knows of), but members (the entities registering DOIs) can register their own content-negotiation resolver, which would allow them to return richer metadata. We have, for example, considered doing this for Dryad (http://datadryad.orghttp://datadryad.org/), but it hasn't risen to high-enough priority yet.
Hence, if GBIF were to register DOIs for specimens through DataCite (rather than being its own RA), then GBIF could still operate its own resolver for returning DwC metadata for RDF queries.
That doesn't mean there couldn't still be good arguments for GBIF serving as a RA.
-hilmar
On Tue, May 6, 2014 at 5:53 AM, Roderic Page <Roderic.Page@glasgow.ac.ukmailto:Roderic.Page@glasgow.ac.uk> wrote: Hi Steve,
My understanding is that the non-HTML content is decided at the level of registration agency. For a bibliographic DOI registered with CrossRef, the HTML redirect goes to whatever the publisher provides CrossRef (e.g., the article landing page), other content (including RDF) is served by CrossRef based on the metadata they hold for each article. Likewise, DataCite will serve metadata based on what they have. Hence, metadata from CrossRef and DatacIte look rather different.
So, this is something that would need to be worked out at the level of registration agency (see http://www.crossref.org/CrossTech/2010/03/dois_and_linked_data_some_conc.htm... and http://crosstech.crossref.org/2011/04/content_negotiation_for_crossr.html for background).
Hence, if GBIF were to be a DOI registration agency they could serve Darwin Core RDF (and JSON and whoever else they want). This is a strong argument for GBIF doing this, rather than using DataCite (which serves very generic metadata).
Regards
Rod
On 6 May 2014, at 01:42, Steve Baskauf <steve.baskauf@vanderbilt.edumailto:steve.baskauf@vanderbilt.edu> wrote:
I'm a big fan of not reinventing the wheel, and as such find the idea of using DOIs appealing. I think they pretty much follow all of the "rules" set out in the TDWG GUID Applicability Standard. They also play nicely in the Linked Data universe in their HTTP URI form, i.e. they redirect to HTML or RDF depending on the request header.
But I have a question for someone who understands how DOIs work better than I do. The HTML representation seems to arise by redirection to whatever is the current web page for the resource. You can see this by pasting this DOI for a specimen into a browser: http://dx.doi.org/10.7299/X7VQ32SJ which redirects to http://arctos.database.museum/guid/UAM:Ento:230092 when HTML is requested by a client. However, when the client requests RDF, one gets redirected to a DataCite metadata page: http://data.datacite.org/10.7299/X7VQ32SJ . Can the creator of the DOI redirect to any desired URI for the RDF?
The resulting RDF metadata doesn't have any of the kind useful information about the specimen that you get on the web page but rather looks like what you would expect for a publication (creator, publisher, date, etc.):
<http://dx.doi.org/10.7299/X7VQ32SJhttp://linkeddata.informatik.hu-berlin.de/uridbg/index.php?url=http://dx.doi.org/10.7299/X7VQ32SJ&acceptheader=text%2Fturtle%3Bq%3D1%2Capplication%2Fx-turtle%3Bq%3D0.5&useragentheader=> <http://purl.org/dc/terms/creatorhttp://linkeddata.informatik.hu-berlin.de/uridbg/index.php?url=http://purl.org/dc/terms/creator&acceptheader=text%2Fturtle%3Bq%3D1%2Capplication%2Fx-turtle%3Bq%3D0.5&useragentheader=> "Derek S. Sikes" ; <http://purl.org/dc/terms/datehttp://linkeddata.informatik.hu-berlin.de/uridbg/index.php?url=http://purl.org/dc/terms/date&acceptheader=text%2Fturtle%3Bq%3D1%2Capplication%2Fx-turtle%3Bq%3D0.5&useragentheader=> "2004" ; <http://purl.org/dc/terms/identifierhttp://linkeddata.informatik.hu-berlin.de/uridbg/index.php?url=http://purl.org/dc/terms/identifier&acceptheader=text%2Fturtle%3Bq%3D1%2Capplication%2Fx-turtle%3Bq%3D0.5&useragentheader=> "10.7299/X7VQ32SJ" ; <http://purl.org/dc/terms/publisherhttp://linkeddata.informatik.hu-berlin.de/uridbg/index.php?url=http://purl.org/dc/terms/publisher&acceptheader=text%2Fturtle%3Bq%3D1%2Capplication%2Fx-turtle%3Bq%3D0.5&useragentheader=> "University of Alaska Museum" ; <http://purl.org/dc/terms/titlehttp://linkeddata.informatik.hu-berlin.de/uridbg/index.php?url=http://purl.org/dc/terms/title&acceptheader=text%2Fturtle%3Bq%3D1%2Capplication%2Fx-turtle%3Bq%3D0.5&useragentheader=> "UAM:Ento:230092 - Grylloblatta campodeiformis" ; <http://www.w3.org/2002/07/owl#sameAshttp://linkeddata.informatik.hu-berlin.de/uridbg/index.php?url=http://www.w3.org/2002/07/owl#sameAs&acceptheader=text%2Fturtle%3Bq%3D1%2Capplication%2Fx-turtle%3Bq%3D0.5&useragentheader=> "info:doi/10.7299/X7VQ32SJ" , "doi:10.7299/X7VQ32SJ" .
Can one control what kinds of metadata are provided in "DataCite's metadata"? Assuming that we get our act together and adopt an RDF guide for Darwin Core, it would be nice for the RDF metadata to look more like the description of a specimen and less like the description of a book. But maybe that's just a function of where the data provider chooses to redirect RDF requests.
Steve
John Deck wrote:

+1 on DOIs and on ARKs (see https://wiki.ucop.edu/display/Curation/ARK), and I'll also mention IGSNs (see http://www.geosamples.org/), which are rapidly gaining traction for geo-samples. I don't know of anyone using them for bio-samples, but they offer many features that we've been asking for as well. What our community considers a sample (or observation) is diverse enough that multiple ID systems are probably inevitable and perhaps even warranted.
Whatever the ID system, the data providers (museums, field researchers, labs, etc.) must adopt that identifier and use it whenever linking to downstream sequence, image, and sub-sampling repository agencies. This is great to say in theory but difficult to do in reality, because the decision to adopt long-term, stable identifiers is often an institutional one, and the technology is still new and argued about, in particular on this fine list. Further, those agencies that receive data associated with a GUID must honor that source GUID when passing it to consumers and other aggregators, who must also have some level of confidence in the source GUIDs. Thus, a primary issue that we're confronted with here is trust.
Having Hilmar's hackathon support several possible GUID schemes (each with its own long-term persistence strategy), sponsored by a well-known global institution affiliated with biodiversity informatics that could offer technical guidance to data providers, good name branding, and the nuts-and-bolts expertise to demonstrate good shepherding of source GUIDs through a data-aggregation chain, would be ideal. I nominate GBIF :)
John Deck
On Mon, May 5, 2014 at 1:09 PM, Roderic Page <r.page@bio.gla.ac.uk> wrote:

Hi Markus,
I have three use cases:
1. Linking sequences in GenBank to voucher specimens. Lots of voucher specimens are listed in GenBank but not linked to digital records for those specimens. These links are useful in two directions, one is to link GBIF to genomic data, the second is to enhance data in both databases, see http://iphylo.blogspot.co.uk/2012/02/linking-gbif-and-genbank.html (e.g., by adding missing georeferencing that is available in one database but not the other).
2. Linking to specimens cited in the literature. I've done some work on this in BioStor, see http://iphylo.blogspot.co.uk/2012/02/linking-gbif-and-biodiversity-heritage.html. One immediate benefit of this is that GBIF could display the scientific literature associated with a specimen, so we get access to the evidence supporting identification, georeferencing, etc.
3. Citation metrics for collections, see http://iphylo.blogspot.co.uk/2013/05/the-impact-of-museum-collections-one.html and http://iphylo.blogspot.co.uk/2012/02/gbif-specimens-in-biostor-who-are-top.html. Based on citations of specimens in the literature, and in databases such as GenBank (i.e., basically combining 1 + 2 above), we can demonstrate the value of a collection.
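As a sketch of how simple the counting side of (3) becomes once the links exist - the logic is real, the tuples below are invented for illustration:

    from collections import Counter

    # (citing_work, specimen_id, collection) links harvested from papers and GenBank;
    # these example tuples are made up.
    citations = [
        ("doi:10.1234/paperA", "UAM:Ento:230092", "UAM"),
        ("doi:10.1234/paperB", "UAM:Ento:230092", "UAM"),
        ("doi:10.1234/paperB", "MCZ:Herp:12345", "MCZ"),
    ]

    per_collection = Counter(collection for _, _, collection in citations)
    print(per_collection.most_common())  # e.g. [('UAM', 2), ('MCZ', 1)]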
All of these use cases depend on GBIF occurrenceIds remaining stable; I have often ranted on iPhylo when this doesn't happen: http://iphylo.blogspot.co.uk/2012/07/dear-gbif-please-stop-changing.html
Regards
Rod
On 5 May 2014, at 20:51, Markus Döring <mdoering@gbif.org> wrote:
Hi Rod,
I agree GBIF has trouble keeping identifiers stable for *some* records, but in general we do a much better job than the original publishers did in the first place. We try hard to keep GBIF ids stable even if publishers change collection codes, register datasets twice, or do other things that break a simple automated mapping of source records to existing GBIF ids. Also, the stable identifier in GBIF has never been the URL; it is the local GBIF integer alone. The GBIF services that consume those ids have changed over the years, but it's pretty trivial to adjust if you use the GBIF ids instead of the URLs. If there is a clear need to have stable URLs instead, I am sure we can get that working easily.
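To make that concrete: because the integer is the identifier, a consumer only ever stores the integer and rebuilds the URL from whatever the current service template is. A sketch, assuming the present api.gbif.org/v1 layout (which may of course itself evolve):

    import requests

    OCCURRENCE_URL = "http://api.gbif.org/v1/occurrence/{id}"  # template, not the identifier

    def fetch_occurrence(gbif_id: int) -> dict:
        # Store gbif_id; the URL is disposable and rebuilt on demand.
        return requests.get(OCCURRENCE_URL.format(id=gbif_id)).json()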
The two real issues for GBIF are a) duplicates and b) records with varying local identifiers of any sort (triplet, occurrenceID or whatever else).
When it comes to the varying source identifiers, I always liked the idea of flagging those records and datasets as unstable, so it is obvious to users. This is not 100% safe, but the worst datasets change all of their ids, and that is easily detectable. Also, with a service like that it would become more obvious to publishers how important stable source ids are.
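Detecting that kind of wholesale churn is cheap: compare the id sets of two successive crawls of a dataset and flag it when the overlap collapses. A sketch; the 50% threshold is made up:

    def id_survival(old_ids, new_ids):
        """Fraction of previously seen ids still present in the new crawl."""
        old_ids, new_ids = set(old_ids), set(new_ids)
        return len(old_ids & new_ids) / len(old_ids) if old_ids else 1.0

    def looks_unstable(old_ids, new_ids, threshold=0.5):
        # If fewer than half the ids survive, the publisher probably re-minted them.
        return id_survival(old_ids, new_ids) < threshold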
Before jumping on DOIs as the next big thing I would really like to understand what needs the community has around specimen ids. Gabi clearly has a very real use case, are there others we know about?
Markus
On 05 May 2014, at 21:05, Roderic Page <r.page@bio.gla.ac.uk> wrote:
Hi Hilmar,
I’m not arguing that we shouldn’t build a resolver (I have one that I use, Rich has mentioned he’s got one, Markus has one at GBIF, etc.).
Nor do I think we should wait for institutional and social commitment (because then we’d never get anything done).
But I do think it would be useful to think it through. For example, it's easy to create a URL for a specimen. Easy peasy. OK, how do I discover that URL? How do I discover these for all specimens? Sounds like I need a centralised discovery service like you've described.
How do I handle changes in those URLs? I built a specimen-code-to-GBIF resolver for BioStor so that I could link to specimens; GBIF changed lots of those URLs; all my work was undone (boy, does GBIF suck sometimes). So if I map codes to URLs, I need to handle cases when they change.
If URLs can change, is there a way to defend against that (this is one reason for DOIs, or other methods of indirection, such as PURLs).
If providers change, will the URLs change? Is there a way to defend against that (again, DOIs handle this nicely by virtue of (a) indirection, and (b) lack of branding).
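The indirection point is worth making concrete: it is just one extra lookup, and a provider migration touches one row rather than every citation in every paper. A toy sketch (the new URL is invented):

    # Toy indirection table: persistent id -> current landing URL.
    redirects = {
        "10.7299/X7VQ32SJ": "http://arctos.database.museum/guid/UAM:Ento:230092",
    }

    def resolve(persistent_id):
        # Papers cite only persistent_id; this table is the only thing that moves.
        return redirects[persistent_id]

    # Provider migrates its web stack: one update, no published link breaks.
    redirects["10.7299/X7VQ32SJ"] = "https://arctos-new.example.org/guid/UAM:Ento:230092"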
How can I encourage people to use the specimen service? What can I do to make them think it will persist? Can I convince academic publishers to trust it enough to link to it in articles? What's the pitch to Pensoft, to Magnolia Press, to Springer and Elsevier?
Is there some way to make the service itself become trusted? For example, if I look at a journal and see that it has DOIs issued by CrossRef, I take that journal more seriously than if it's just got simple URLs. I know that papers in that journal will be linked into the citation network, and I also know that there is a backup plan if the journal goes under (because you need that to have DOIs in CrossRef). Likewise, I think Figshare got a big boost when it started minting DOIs (wow, a DOI, I know DOIs, you mean I can now cite stuff I've uploaded there?).
How can museums and herbaria be persuaded to keep their identifiers stable? What incentives can we provide (e.g., citation metrics for collections)? What system would enable us to do this? What about tracing funding (e.g., the NSF paid for these n papers, and they cite these y specimens, from these z collections, so science paid for by the NSF requires these collections to exist).
I guess I’m arguing that we should think all this through, because a specimen code to specimen URL is a small piece of the puzzle. Now, I’m desperately trying not to simply say what I think is blindingly obvious here (put DOIs on specimens, add metadata to specimen and specimen citation services, and we are done), but I think if we sit back and look at where we want to be, this is exactly what we need (or something functionally equivalent). Until we see the bigger picture, we will be stuck in amateur hour.
Take a look at:
http://search.crossref.org/
http://www.crossref.org/fundref/
http://support.crossref.org/
https://prospect.crossref.org/splash/
Isn’t this the kind of stuff we’d like to do? If so, let’s work out what’s needed and make it happen.
In short, I think we constantly solve an immediate problem in the quickest way we know how, without thinking it through. I'd argue that if we think about the bigger picture (what do we want to be able to do, what are the questions we want to be able to ask) then things become clearer. This is independent of getting everyone's agreement (but it would help if we made their agreement seem a no-brainer by providing solutions to things that cause them pain).
Regards
Rod
On 5 May 2014, at 19:14, Hilmar Lapp <hlapp@nescent.org> wrote:
On Mon, May 5, 2014 at 1:29 PM, Roderic Page <r.page@bio.gla.ac.uk> wrote: Contrary to Hilmar, there is more to this than simply a quick hackathon. Yes, a service that takes metadata and returns one or more identifiers is a good idea and easy to create (there will often be more than one because museum codes are not unique). But who maintains this service? Who maintains the identifiers? Who do I complain to if they break? How do we ensure that they persist when, say, a museum closes down, moves its collection, changes its web technology? Who provides the tools that add value to the identifiers? (There's no point having them if they are not useful.)
Jonathan Rees pointed this out to me too off-list. Just for the record, this isn't contrary but fully in line with what I was saying (or trying to say). Yes, I didn't elaborate that part, assuming, perhaps rather erroneously, that all this goes without saying, but I did mention that one part of this becoming a real solution has to be an institution with an in-scope cyberinfrastructure mandate that would, going in, make a commitment to sustain the resolver, including working with partners on the above slew of questions. The institution I named was iDigBio; perhaps for some reason that would not be a good choice, but whether they are or not wasn't my point.
I will add one point to this, though. It seems to me that by continuing to argue that we can't go ahead with building a resolver that works (as far as technical requirements are concerned) until we have first fully addressed the institutional and social long-term sustainability problem, we have made this one big hairy problem on which we can't make any practical headway, rather than breaking it down into parts, some of which (namely the primarily technical ones) are actually fairly straightforward to solve. As a result, to this day we don't have a solution that, even though not yet sustainable, at least proves to everyone how critical it is, and that the community can rally behind. Perhaps that's naïve, but I do think that once there's a solution the community rallies behind, ways to sustain it will be found.
-hilmar -- Hilmar Lapp -:- informatics.nescent.org/wiki -:- lappland.io
-- John Deck (541) 321-0689
-- Steven J. Baskauf, Ph.D., Senior Lecturer Vanderbilt University Dept. of Biological Sciences
postal mail address: PMB 351634 Nashville, TN 37235-1634, U.S.A.
delivery address: 2125 Stevenson Center 1161 21st Ave., S. Nashville, TN 37235
office: 2128 Stevenson Center phone: (615) 343-4582, fax: (615) 322-4942 If you fax, please phone or email so that I will know to look for it. http://bioimages.vanderbilt.edu http://vanderbilt.edu/trees
--------------------------------------------------------- Roderic Page Professor of Taxonomy Institute of Biodiversity, Animal Health and Comparative Medicine College of Medical, Veterinary and Life Sciences Graham Kerr Building University of Glasgow Glasgow G12 8QQ, UK
Email: r.page@bio.gla.ac.uk Tel: +44 141 330 4778 Fax: +44 141 330 2792 Skype: rdmpage Facebook: http://www.facebook.com/rdmpage LinkedIn: http://uk.linkedin.com/in/rdmpage Twitter: http://twitter.com/rdmpage Wikipedia: http://en.wikipedia.org/wiki/Roderic_D._M._Page Citations: http://scholar.google.co.uk/citations?hl=en&user=4Z5WABAAAAAJ ORCID: http://orcid.org/0000-0002-7101-9767
Hi Tim,
What you outline below (a service that issues community-wide actionable and persistent identifiers) is what I have been advocating since before the first TDWG/GBIF GUID workshops. I still believe it would be a very useful service; however, it seems we are really talking about two different things here.
The first, which is what you outline, is a service that mints new identifiers for objects that do not already have good identifiers.
The other service, which is what I was addressing earlier, is a centralized mechanism for cross-linking *existing* identifiers to each other. Several organizations already have an internal system for doing this. I already mentioned the GNUB version (which allows the little icons of related records to show up on ZooBank pages). EoL also has one (e.g., http://eol.org/pages/992573/resources/partner_links), and so does NCBI (http://www.ncbi.nlm.nih.gov/projects/linkout/). What I think we REALLY need is a single, centralized system that manages cross-links among identifiers and (separately) identifier dereferencing services. The reality is that we already have MANY identifiers minted for the same object (e.g., http://zoobank.org/2C6327E1-5560-4DB4-B9CA-76A0FA03D975) and, sadly, there will no doubt be more redundant identifiers minted in the future.
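Mechanically, by the way, such a cross-linking registry is tiny: a union-find over identifier strings, so that any two identifiers asserted to denote the same object fall into one cluster. A sketch (the identifiers below are invented):

    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups near O(1)
            x = parent[x]
        return x

    def same_object(a, b):
        # Assert that identifiers a and b denote the same specimen.
        parent[find(a)] = find(b)

    same_object("inst-a:CAT123", "doi:10.9999/EXAMPLE")          # invented ids
    print(find("inst-a:CAT123") == find("doi:10.9999/EXAMPLE"))  # True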
While I think it would be GREAT if GBIF did offer a service to mint proper/actionable identifiers for the community, ultimately this may end up representing yet another identifier. It kind of reminds me of that standards joke: http://www.howtogeek.com/geekers/up/sshot50509a6b8cb11.jpg Just replace the word "standards" with "identifiers", and we're in the same place.
And I'll make this point one more time: almost all of our problems with identifiers revolve around the fact that we have drunk the TBL Kool-Aid and assumed/insisted on conflating the role of object identification with the role of metadata dereferencing (i.e., actionability). Sigh...
Aloha,
Rich
Richard L. Pyle, PhD Database Coordinator for Natural Sciences Associate Zoologist in Ichthyology Dive Safety Officer Department of Natural Sciences, Bishop Museum 1525 Bernice St., Honolulu, HI 96817 Ph: (808)848-4115, Fax: (808)847-8252 email: deepreef@bishopmuseum.org http://hbs.bishopmuseum.org/staff/pylerichard.html
From: tdwg-content-bounces@lists.tdwg.org [mailto:tdwg-content-bounces@lists.tdwg.org] On Behalf Of Tim Robertson [GBIF] Sent: Tuesday, May 06, 2014 4:11 AM To: Hilmar Lapp; Roderic Page; tdwg-content@lists.tdwg.org; "Markus Döring (GBIF)"; tomc@cs.uoregon.edu Subject: Re: [tdwg-content] delimiter characters for concatenated IDs
Hi all,
Supposing GBIF or some other body were interested in offering such a central service as proposed in this thread. Can we articulate what we envisage would be the process?
a) client has a specimen record they wish to stamp with an identifier
b) client requests DOI (or other format) from the issuing service and provides the minimum metadata in a DwC-esque profile, potentially with a preferred suffix
c) service provides identifier, and client stores this along with their digital record
d) from this point on, the DOI identifies the record
If so, what would happen on resolution? Does the client provide the target URL during minting, which then becomes the redirection target on resolution? Does the service have to monitor availability and return a cached copy during an outage?
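To make the question concrete, here is how I picture steps (a)-(d) from the client side; the endpoint and field names are entirely invented, since no such service exists yet:

    import requests

    MINT_ENDPOINT = "https://ids.example.org/mint"  # hypothetical issuing service

    record = {
        "institutionCode": "UAM",
        "collectionCode": "Ento",
        "catalogNumber": "230092",
        "scientificName": "Grylloblatta campodeiformis",
        "target": "http://arctos.database.museum/guid/UAM:Ento:230092",  # redirect target?
    }

    resp = requests.post(MINT_ENDPOINT, json=record).json()
    doi = resp["doi"]  # stored by the client alongside the digital record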
In such a model, effectively we would have a central specimen registration service where data owners push individual specimen records. Is that something we envisage the community would accept? Presumably the minimum metadata would include things like dwc:scientificName - would someone register a DOI pushing that for specimens of a new species before they have published on the name?
This model will not in itself stop duplicate IDs. The scientists assembling datasets of specimens referenced in a paper might submit those referenced specimens for DOIs, while the original specimen curators might also submit the same records; thus the specimen is identified twice. Which piece of the infrastructure would capture that relationship?
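One partial answer: the issuing service could fingerprint the normalised triplet at mint time and return the existing identifier on a match instead of minting a second one. A sketch; the normalisation rules are obviously debatable:

    import hashlib

    issued = {}  # fingerprint -> already-minted identifier

    def fingerprint(institution, collection, catalog):
        key = "|".join(s.strip().lower() for s in (institution, collection, catalog))
        return hashlib.sha256(key.encode("utf-8")).hexdigest()

    def mint_or_reuse(institution, collection, catalog, mint_new):
        fp = fingerprint(institution, collection, catalog)
        if fp not in issued:
            issued[fp] = mint_new()  # only mint when the triplet is genuinely new
        return issued[fp]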
What seems most important to me when I think this through is that the identifier needs to be minted as early as possible in the record's life - before it is shared with others. Which leads us back to the question of whether we envisage people adopting a model where they effectively submit their record data in order to get an identifier. If not, then at least if we get stable IDs on records in whatever form, we can manage the resolvability bit later and identify duplicates.
It would be interesting to hear how others imagine such a service operating.
Thanks,
Tim
On 06 May 2014, at 15:34, Hilmar Lapp hlapp@nescent.org wrote:
Every registration agency has its own set of standard metadata which members register for every DOI, but the content-negotiation strategy does allow for a richer metadata response. By default it is the registration agency's resolver that responds with RDF (and thus only with the metadata it knows of), but members (the entities registering DOIs) can register their own content-negotiation resolver, which would allow them to return richer metadata. We have, for example, considered doing this for Dryad (http://datadryad.org), but it hasn't risen to high-enough priority yet.
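Mechanically, such a member resolver is just a handler that branches on the Accept header. A sketch with Flask; the two helper functions stand in for real lookups and are invented here:

    from flask import Flask, Response, redirect, request

    app = Flask(__name__)

    def landing_page_for(doi):   # hypothetical lookup of the provider's landing page
        return "http://arctos.database.museum/guid/UAM:Ento:230092"

    def dwc_turtle_for(doi):     # hypothetical richer, DwC-flavoured description
        return '<http://dx.doi.org/%s> a <http://rs.tdwg.org/dwc/terms/Occurrence> .' % doi

    @app.route("/doi/<path:doi>")
    def resolve(doi):
        accept = request.headers.get("Accept", "")
        if "turtle" in accept or "rdf" in accept:
            return Response(dwc_turtle_for(doi), mimetype="text/turtle")
        return redirect(landing_page_for(doi), code=303)  # browsers get the landing page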
Hence, if GBIF were to register DOIs for specimens through DataCite (rather than being its own RA), then GBIF could still operate its own resolver for returning DwC metadata for RDF queries.
That doesn't mean there couldn't still be good arguments for GBIF serving as an RA.
-hilmar
I like the idea of having layers of fallback with DOIs. If GBIF can provide richer metadata, great. If not, somebody else provides minimal metadata. If Linked Data and RDF implode, we still get HTML in a browser. If HTTP becomes defunct, we still have a globally unique string.
The other thing I like about DOIs is that they must be paid for. Whenever somebody has money in the game, they tend to be more serious about their responsibilities. However, I have never heard what the "cost" is per DOI. $1? $0.10? $0.01? $0.00001?
With regard to John Deck's comments about assigning DOIs to other things like loci, I don't think we are required to solve every problem at once. If we could only solve the problem of identifiers for specimens now, that would still be huge and would facilitate progress on other fronts, like linking literature and specimens, or taxa and specimens. It would also be a lift from the gloom-and-doom attitude that the identifier issue is hopeless. Because DOIs are team players in the Linked Data world, one could use different kinds of identifiers with other types of resources and they would still "link" perfectly fine in terms of both RDF and HTML.
Steve
Hi John,
Regarding the hackathon, I don’t want to be negative about it (and I agree that it’s the sort of thing GBIF should be involved in).
I guess I'd want a hackathon that didn't just make a resolver, but which modelled the whole process (e.g., the resolver, minting identifiers, keeping records linked to identifiers over time, sending those to GBIF, tools for journals and authors to link to the specimens, tools to retrieve citation counts so providers get metrics on use, etc.).
In other words, let’s mock up the 4-5 key bits of the overall package and show how it works. This is why I keep banging on about CrossRef, it is not simply slapping a DOI on an article, there’s a bunch of things that go with that that make the whole thing work so well.
Trust is a key issue, as is commitment. A key reason DOIs work is that (a) people citing them have an expectation they will be around for a while, and (b) providers are freed from exposing whatever URL they use today for their content; they can change everything (server, domain, technology, etc.) without worrying about breaking links for users. This is a huge win for them; otherwise any URL for a specimen they mint, they have to maintain forever (e.g., http://peabody.research.yale.edu/cgi-bin/Query.Ledger?LE=mam&SU=0&ID... ). I think this is one reason people are reticent about creating links for specimens; it's a hell of a commitment to say we will support those in perpetuity.

Regards
Rod
On 5 May 2014, at 23:57, John Deck <jdeck@berkeley.edumailto:jdeck@berkeley.edu> wrote:
+1 on DOIs, and on ARKS (see: https://wiki.ucop.edu/display/Curation/ARK ), and also i'll mention IGSN:'s (see http://www.geosamples.org/) IGSN: is rapidly gaining traction for geo-samples. I don't know of anyone using them for bio-samples but they offer many features that we've been asking for as well. What our community considers a sample (or observation) is diverse enough that multiple ID systems are probably inevitable and perhaps even warranted.
Whatever the ID system, the data providers (museums, field researchers, labs, etc..) must adopt that identifier and use it whenever linking to downstream sequence, image, and sub-sampling repository agencies. This is great to say this in theory but difficult to do in reality because the decision to adopt long term and stable identifiers is often an institutional one, and the technology is still new and argued about, in particular, on this fine list. Further, those agencies that receive data associated with a GUID must honor that source GUID when passing to consumers and other aggregators, who must also have some level of confidence in the source GUIDs as well. Thus, a primary issue that we're confronted with here is trust.
Having Hilmar's hackathon support several possible GUID schemes (each with their own long term persistence strategy), and sponsored by a well known global institution affiliated with biodiversity informatics that could offer technical guidance to data providers, good name branding, and the nuts and bolts expertise to demonstrate good shepherding of source GUIDs through a data aggregation chain would be ideal. I nominate GBIF :)
John Deck
On Mon, May 5, 2014 at 1:09 PM, Roderic Page <r.page@bio.gla.ac.ukmailto:r.page@bio.gla.ac.uk> wrote: Hi Markus,
I have three use cases that
1. Linking sequences in GenBank to voucher specimens. Lots of voucher specimens are listed in GenBank but not linked to digital records for those specimens. These links are useful in two directions, one is to link GBIF to genomic data, the second is to enhance data in both databases, see http://iphylo.blogspot.co.uk/2012/02/linking-gbif-and-genbank.html (e.g., by adding missing georeferencing that is available in one database but not the other).
2. Linking to specimens cited in the literature. I’ve done some work on this in BioStor, see http://iphylo.blogspot.co.uk/2012/02/linking-gbif-and-biodiversity-heritage.... One immediate benefit of this is that GBIF could display the scientific literature associated with a specimen, so we get access to the evidence supporting identification, georeferencing, etc.
3. Citation metrics for collections, see http://iphylo.blogspot.co.uk/2013/05/the-impact-of-museum-collections-one.ht... and http://iphylo.blogspot.co.uk/2012/02/gbif-specimens-in-biostor-who-are-top.h... Based on citation sod specimens in the literature, and in databases such as GenBank (i.e., basically combining 1 + 2 above) we can demonstrate the value of a collection.
All of these use cases depend on GBIF occurrenceIds remaining stable; I have often ranted on iPhylo when this doesn’t happen: http://iphylo.blogspot.co.uk/2012/07/dear-gbif-please-stop-changing.html
Regards
Rod
On 5 May 2014, at 20:51, Markus Döring <mdoering@gbif.org> wrote:
Hi Rod,
I agree GBIF has trouble keeping identifiers stable for *some* records, but in general we do a much better job than the original publishers in the first place. We try hard to keep GBIF ids stable even if publishers change collection codes, register datasets twice, or do other things that break a simple automated way of mapping source records to existing GBIF ids. Also, the stable identifier in GBIF has never been the URL; it is the local GBIF integer alone. The GBIF services that consume those ids have changed over the years, but it’s pretty trivial to adjust if you use the GBIF ids instead of the URLs. If there is a clear need to have stable URLs instead, I am sure we can get that working easily.
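Markus's point, that the durable token is the GBIF integer rather than any particular URL, is easy to act on in client code: store only the integer and derive the service URL at call time, so a change of GBIF web technology touches one line. A minimal sketch in Python; the endpoint layout reflects the current GBIF occurrence API but should be treated as an assumption, and the key in the example is hypothetical.

    import requests

    # The durable token is the integer; the URL template is disposable.
    GBIF_OCCURRENCE_API = "http://api.gbif.org/v1/occurrence/{key}"

    def fetch_occurrence(gbif_key: int) -> dict:
        """Fetch an occurrence record by its stable GBIF integer id."""
        response = requests.get(GBIF_OCCURRENCE_API.format(key=gbif_key), timeout=30)
        response.raise_for_status()
        return response.json()

    record = fetch_occurrence(912345)  # hypothetical key
    print(record.get("institutionCode"), record.get("catalogNumber"))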
The two real issues for GBIF are a) duplicates and b) records with varying local identifiers of any sort (triplet, occurrenceID or whatever else).
When it comes to the varying source identifiers, I have always liked the idea of flagging those records and datasets as unstable, so that it is obvious to users. This is not 100% safe, but the worst datasets change all of their ids, and that is easily detectable. Also, with a service like that it would become more obvious to publishers how important stable source ids are.
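The flagging idea is cheap to prototype: compare the set of source ids seen in consecutive harvests of a dataset and flag the dataset when the overlap collapses. A minimal Python sketch, with a made-up threshold:

    def id_stability(previous_ids: set, current_ids: set) -> float:
        """Fraction of previously seen source ids that survived into the current harvest."""
        if not previous_ids:
            return 1.0
        return len(previous_ids & current_ids) / len(previous_ids)

    def flag_unstable(previous_ids: set, current_ids: set, threshold: float = 0.9) -> bool:
        """True if the dataset should be flagged as having unstable identifiers."""
        return id_stability(previous_ids, current_ids) < threshold

    # A dataset that regenerated all of its ids between harvests is easy to catch:
    assert flag_unstable({"a1", "a2", "a3"}, {"x9", "y8", "z7"}) is True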
Before jumping on DOIs as the next big thing, I would really like to understand what needs the community has around specimen ids. Gabi clearly has a very real use case; are there others we know about?
Markus
On 05 May 2014, at 21:05, Roderic Page <r.page@bio.gla.ac.uk> wrote:
Hi Hilmar,
I’m not arguing that we shouldn’t build a resolver (I have one that I use, Rich has mentioned he’s got one, Markus has one at GBIF, etc.).
Nor do I think we should wait for institutional and social commitment (because then we’d never get anything done).
But I do think it would be useful to think it through. For example, it’s easy to create a URL for a specimen. Easy peasy. OK, how do I discover that URL? How do I discover these for all specimens? Sounds like I need a centralised discovery service like you’ve described.
How do I handle changes in those URLs? I built a specimen-code-to-GBIF resolver for BioStor so that I could link to specimens; GBIF changed lots of those URLs, and all my work was undone (boy does GBIF suck sometimes). So, if I map codes to URLs, I need to handle cases where they change.
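One pragmatic defence for a mapping like the BioStor one is to treat every cached code-to-URL entry as revalidatable rather than permanent. A Python sketch of that idea follows; the cached URL is hypothetical, and the re-resolution step is deliberately a stub, since it depends on whatever discovery service is available.

    import requests

    url_cache = {"FMNH 266214": "http://example.org/specimen/old-path"}  # hypothetical mapping

    def re_resolve(code: str) -> str:
        """Stub: look the code up again via a metadata-based discovery service."""
        raise NotImplementedError("depends on an external resolver service")

    def get_specimen_url(code: str) -> str:
        """Return a working URL for a specimen code, re-resolving if the cached one died."""
        url = url_cache.get(code)
        if url is not None:
            try:
                if requests.head(url, allow_redirects=True, timeout=10).status_code < 400:
                    return url
            except requests.RequestException:
                pass  # cached URL is dead or unreachable; fall through to re-resolution
        url = re_resolve(code)
        url_cache[code] = url
        return url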
If URLs can change, is there a way to defend against that (this is one reason for DOIs, or other methods of indirection, such as PURLs).
If providers change, will the URLs change? Is there a way to defend against that (again, DOIs handle this nicely by virtue of (a) indirection, and (b) lack of branding).
How can I encourage people to use the specimen service? What can I do to make them think it will persist? Can I convince academic publishers to trust it enough to link to it in articles? What’s the pitch to Pensoft, to Magnolia Press, to Springer and Elsevier?
Is there some way to make the service itself become trusted? For example, if I look at a journal and see that it has DOIs issued by CrossRef, I take that journal more seriously than if it’s just got simple URLs. I know that papers in that journal will be linked into the citation network, and I also know that there is a backup plan if the journal goes under (because you need that to have DOIs in CrossRef). Likewise, I think Figshare got a big boost when it started minting DOIs (wow, a DOI, I know DOIs, you mean I can now cite stuff I’ve uploaded there?).
How can museums and herbaria be persuaded to keep their identifiers stable? What incentives can we provide (e.g., citation metrics for collections)? What system would enable us to do this? What about tracing funding (e.g., the NSF paid for these n papers, and they cite these y specimens, from these z collections, so science paid for by the NSF requires these collections to exist).
I guess I’m arguing that we should think all this through, because a specimen-code-to-specimen-URL service is a small piece of the puzzle. Now, I’m desperately trying not to simply say what I think is blindingly obvious here (put DOIs on specimens, add metadata to specimen and specimen citation services, and we are done), but I think if we sit back and look at where we want to be, this is exactly what we need (or something functionally equivalent). Until we see the bigger picture, we will be stuck in amateur hour.
Take a look at:
http://search.crossref.org/
http://www.crossref.org/fundref/
http://support.crossref.org/
https://prospect.crossref.org/splash/
Isn’t this the kind of stuff we’d like to do? If so, let’s work out what’s needed and make it happen.
In short, I think we constantly solve an immediate problem in the quickest way we know how, without thinking it through. I’d argue that if we think about the bigger picture (what do we want to be able to do, what are the questions we want to be able to ask) then things become clearer. This is independent of getting everyone’s agreement (but it would help if we made their agreement seem a no-brainer by providing solutions to things that cause them pain).
Regards
Rod
On 5 May 2014, at 19:14, Hilmar Lapp <hlapp@nescent.org> wrote:
On Mon, May 5, 2014 at 1:29 PM, Roderic Page <r.page@bio.gla.ac.uk> wrote:

Contrary to Hilmar, there is more to this than simply a quick hackathon. Yes, a service that takes metadata and returns one or more identifiers is a good idea and easy to create (there will often be more than one because museum codes are not unique). But who maintains this service? Who maintains the identifiers? Who do I complain to if they break? How do we ensure that they persist when, say, a museum closes down, moves its collection, or changes its web technology? Who provides the tools that add value to the identifiers? (There’s no point having them if they are not useful.)
Jonathan Rees pointed this out to me too off-list. Just for the record, this isn't contrary but fully in line with what I was saying (or trying to say). Yes, I didn't elaborate that part, assuming, perhaps rather erroneously, that all this goes without saying, but I did mention that one part of this becoming a real solution has to be an institution with an in-scope cyberinfrastructure mandate that would, going in, make a commitment to sustain the resolver, including working with partners on the above slew of questions. The institution I gave was iDigBio; perhaps for some reason that would not be a good choice, but whether they are or not wasn't my point.
I will add one point to this, though. It seems to me that by continuing to argue that we can't go ahead with building a resolver that works (as far as technical requirements are concerned) before we have first fully addressed the institutional and social long-term sustainability problem, we are and have been making this one big hairy problem that we can't make any practical headway on, rather than breaking it down into parts, some of which (namely the primarily technical ones) are actually fairly straightforward to solve. As a result, to this day we don't have a solution that, even though not very sustainable yet, at least proves to everyone how critical it is, and that the community can rally behind. Perhaps that's naïve, but I do think that once there's a solution the community rallies behind, ways to sustain it will be found.
-hilmar -- Hilmar Lapp -:- informatics.nescent.org/wiki -:- lappland.io
-- John Deck (541) 321-0689
Hi Rod,
those are excellent use cases, which GGBN also aims to support. I just have a comment on the first one. NCBI and CBOL defined a mechanism to add voucher ids to a GenBank record, which is great. But unfortunately they use different values than GBIF does. Their institution code is based on GRBio (http://grbio.org/). So, e.g., my institution is listed there as “B”, but we use “BGBM” for GBIF records, and so do many institutions. The same happens for the Catalogue Number. Therefore this in-principle good idea does not work for dynamic linkage to GBIF records. One of the next steps on my to-do list is to talk to CBOL and GenBank about this issue, and additionally to establish dynamic linkage between the sequence record and the corresponding DNA and tissue samples at GGBN.
For my current problem, I guess the short-term solution is to define GGBN conventions on delimiters and to switch to a better solution when one becomes available. We would be happy to be part of those discussions and to help.
Best, Gabi
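Until a better identifier scheme exists, a delimiter convention like the one Gabi mentions only round-trips safely if the delimiter is escaped when it occurs inside a field. Here is a Python sketch using "||" as an example multi-character delimiter with a backslash escape; the convention itself is illustrative, not an agreed GGBN rule.

    DELIM = "||"    # example multi-character delimiter
    ESCAPE = "\\"   # backslash escapes delimiter characters inside fields

    def join_id(*fields: str) -> str:
        """Concatenate fields (e.g. institutionCode, collectionCode, catalogNumber)
        into one string, escaping any embedded delimiter characters."""
        escaped = [f.replace(ESCAPE, ESCAPE + ESCAPE).replace("|", ESCAPE + "|")
                   for f in fields]
        return DELIM.join(escaped)

    def split_id(value: str) -> list:
        """Inverse of join_id."""
        fields, current, i = [], "", 0
        while i < len(value):
            if value[i] == ESCAPE and i + 1 < len(value):
                current += value[i + 1]   # take the escaped character literally
                i += 2
            elif value.startswith(DELIM, i):
                fields.append(current)    # unescaped delimiter: field boundary
                current = ""
                i += len(DELIM)
            else:
                current += value[i]
                i += 1
        fields.append(current)
        return fields

    triple = join_id("BGBM", "Herbarium", "B 10 0094759")
    assert split_id(triple) == ["BGBM", "Herbarium", "B 10 0094759"]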
Hi Gabi,
This is pretty much exactly why linking based on metadata will ultimately fail. The academic publishing community went through this step with OpenURL (which is still used by lots of libraries to locate digital copies of publications), before moving to stable, universally used identifiers (DOIs).
Regards
Rod
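The contrast Rod draws can be shown in a few lines: a metadata-based link must be reconstructed from descriptive fields, and breaks as soon as any party disagrees about those fields, while an identifier-based link is a single opaque string handed to one resolver. In the Python sketch below the OpenURL parameter names only approximate the 0.1 convention, the resolver host is hypothetical, and the DOI is a made-up example.

    from urllib.parse import urlencode

    # Metadata-based linking (OpenURL-style): rebuilt from descriptive fields,
    # so the link is only as stable as everyone's agreement on the metadata.
    citation = {"genre": "article", "issn": "0024-4082",
                "volume": "171", "spage": "277", "date": "2014"}
    openurl = "http://resolver.example.org/openurl?" + urlencode(citation)

    # Identifier-based linking (DOI-style): one opaque string, one resolver.
    doi = "10.1234/example.5678"  # made-up DOI for illustration
    doi_url = "https://doi.org/" + doi

    print(openurl)
    print(doi_url)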
Hi Gabi,
That's indeed an odd acronym for BGBM. Have you tried to edit the record so it uses BGBM instead? It seems editable (though presumably edits go through approval), and GRBio expressly solicits the community's help in curating the accuracy of its records.
-hilmar
This illustrates the problems of constructing identifiers from metadata.
The “Darwin Core Triple” comes, AFAIK, from US vertebrate collections, where specimens are typically identified by acronym + catalogue number. This is often not unique (the same acronym + catalogue number combination, such as FMNH 266214, may be used for a frog, a bird, a mollusc, or a spider within the same museum), hence we add a “collection code” to make them distinct. Unfortunately, these are rarely used outside museum databases and GBIF (hardly any papers that cite specimens use the three-part codes). If you’re a zoologist seeing, say, FMNH 266214, you “know” which specimen is being referred to from the taxonomic context.
My understanding of herbaria is that there are (as for zoological collections) long-standing abbreviations (see https://en.wikipedia.org/wiki/List_of_herbaria#Europe ), so if you’re a botanist then B100094759 tells you that the herbarium specimen comes from Berlin. Hence B100094759 is enough (and is not, as such, an “odd acronym”). Roger Hyam’s proposal http://stories.rbge.org.uk/archives/1284 to make “cool URIs” out of these (e.g., http://data.rbge.org.uk/herb/E00421509 ) is based on this idea. Within that collection the catalogue number is unique (zoologists are rarely so lucky).
So, do we now recreate these URIs using Darwin Core Triples? Does B or BGBM identify a specimen from Berlin? If the prefix “B” is enough to identify a plant specimen as coming from Berlin, why do we then add “BGBM”? Would botanists used to citing B100094759 in their papers and floras recognise something like BGBM:B100094759 (which is somewhat redundant)?
In short, yuck.
Regards
Rod
--------------------------------------------------------- Roderic Page Professor of Taxonomy Institute of Biodiversity, Animal Health and Comparative Medicine College of Medical, Veterinary and Life Sciences Graham Kerr Building University of Glasgow Glasgow G12 8QQ, UK
Email: r.page@bio.gla.ac.uk Tel: +44 141 330 4778 Fax: +44 141 330 2792 Skype: rdmpage Facebook: http://www.facebook.com/rdmpage LinkedIn: http://uk.linkedin.com/in/rdmpage Twitter: http://twitter.com/rdmpage Blog: http://en.wikipedia.org/wiki/Roderic_D._M._Page Citations: http://scholar.google.co.uk/citations?hl=en&user=4Z5WABAAAAAJ ORCID: http://orcid.org/0000-0002-7101-9767
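Roger Hyam's "cool URI" pattern, and the redundancy Rod objects to, can both be shown in a short sketch. The RBGE URL template matches Rod's example above; the triplet values are illustrative.

    # Herbarium barcodes such as E00421509 or B100094759 already begin with the
    # herbarium's Index Herbariorum code, so within botany the barcode alone
    # identifies both the collection and the specimen.
    RBGE_URI = "http://data.rbge.org.uk/herb/{barcode}"  # pattern from Rod's example

    def cool_uri(barcode: str) -> str:
        return RBGE_URI.format(barcode=barcode)

    print(cool_uri("E00421509"))  # -> http://data.rbge.org.uk/herb/E00421509

    # The triplet form repeats information the barcode already carries:
    triplet = ("BGBM", "B", "B100094759")  # institution, collection, catalogNumber
    # "B100094759" alone tells a botanist it is a Berlin sheet; prefixing it
    # with "BGBM" adds nothing a botanist would recognise.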
Hi Gabi, I'm the main moderator for GRBio.org, so hopefully I can provide some help. First of all, I want to acknowledge that GRBio does not provide all of the solutions to the globally unique, persistent identifier issues that have come up throughout this thread. However, until that solution is built, DwC triplets continue to be used "in the wild". GRBio provides a way to find more information about, for example, those mysterious Institution and Collection Codes in the specimen_voucher field of a sequence record in GenBank.
Specifically for your institution, we have an institution-level record for B ( http://grbio.org/institution/botanischer-garten-und-botanisches-museum-berli... ) that corresponds to the B entry in Index Herbariorum ( http://sweetgum.nybg.org/ih/herbarium.php?irn=124103 ). You'll see, towards the bottom of the B GRBio record, that we've inherited the LSIDs that Roger Hyam minted as part of the Biodiversity Collections Index project.
If you decide to publish your specimens and/or sequences using the BGBM Institution Code, be sure to add an entry to GRBio so that someone who comes across your publication will know where to find more information.
I hope this helps,
--Mike
Mike Trizna Data Development Specialist Consortium for the Barcode of Life 202.633.0810 (telephone); 202.633.2938 (fax)
On Tue, May 6, 2014 at 9:16 AM, Hilmar Lapp hlapp@nescent.org wrote:
Hi Gabi,
That's indeed an odd acronym for BGBM. Have you tried to edit the record so it uses BGBM instead? It seems editable (though presumably edits go through approval), and GRBio expressly solicits the community to help with curating the accuracy of their records.
-hilmar
On Tue, May 6, 2014 at 5:11 AM, "Dröge, Gabriele" g.droege@bgbm.orgwrote:
Hi Rod,
those are excellent use cases, that GGBN also aims to reach.
I just have a comment on the first one. NCBI and CBOL defined a mechanism to add voucher ids to a GenBank record, which is great. But unfortunately they use different values than GBIF does. Their institution code is based on GRBio (http://grbio.org/). So e.g. my institution is listed there with “B”, but we use “BGBM” for GBIF records and so do many institutions. Same happens for Catalogue Number. Therefore the in principal good idea does not work for a dynamic linkage to GBIF records.
One of my next steps on my to do list to talk to CBOL and GenBank about this issue, additionally establishing dynamic linkage between the sequence record and corresponding DNA and tissue samples at GGBN.
For my current problem I guess the short term solution is to define GGBN conventions on delimiters and switch to a better solution when it might be available. We would be happy to be part of those discussion and to help.
Best,
Gabi
*Von:* tdwg-content-bounces@lists.tdwg.org [mailto: tdwg-content-bounces@lists.tdwg.org] *Im Auftrag von *Roderic Page *Gesendet:* Montag, 5. Mai 2014 22:10 *An:* Markus Döring *Cc:* tdwg-content@lists.tdwg.org; Miller, Chuck; tomc@cs.uoregon.edu; John Deck *Betreff:* Re: [tdwg-content] delimiter characters for concatenated IDs
Hi Markus,
I have three use cases:

- Linking sequences in GenBank to voucher specimens. Lots of voucher specimens are listed in GenBank but not linked to digital records for those specimens. These links are useful in two directions: one is to link GBIF to genomic data, the second is to enhance data in both databases, see http://iphylo.blogspot.co.uk/2012/02/linking-gbif-and-genbank.html (e.g., by adding missing georeferencing that is available in one database but not the other).

- Linking to specimens cited in the literature. I’ve done some work on this in BioStor, see http://iphylo.blogspot.co.uk/2012/02/linking-gbif-and-biodiversity-heritage.... One immediate benefit of this is that GBIF could display the scientific literature associated with a specimen, so we get access to the evidence supporting identification, georeferencing, etc.

- Citation metrics for collections, see http://iphylo.blogspot.co.uk/2013/05/the-impact-of-museum-collections-one.ht... and http://iphylo.blogspot.co.uk/2012/02/gbif-specimens-in-biostor-who-are-top.h... By tracking citations of specimens in the literature and in databases such as GenBank (i.e., basically combining 1 and 2 above), we can demonstrate the value of a collection.
All of these use cases depend on GBIF occurrenceIDs remaining stable; I have often ranted on iPhylo when this doesn’t happen: http://iphylo.blogspot.co.uk/2012/07/dear-gbif-please-stop-changing.html
Regards
Rod
On 5 May 2014, at 20:51, Markus Döring mdoering@gbif.org wrote:
Hi Rod,
I agree GBIF has trouble keeping identifiers stable for *some* records, but in general we do a much better job than the original publishers in the first place. We try hard to keep GBIF ids stable even if publishers change collection codes, register datasets twice, or do other things that break a simple automated way of mapping source records to existing GBIF ids. Also, the stable identifier in GBIF has never been the URL; it is the local GBIF integer alone. The GBIF services that consume those ids have changed over the years, but it’s pretty trivial to adjust if you use the GBIF ids instead of the URLs. If there is a clear need to have stable URLs instead, I am sure we can get that working easily.
The two real issues for GBIF are a) duplicates and b) records with varying local identifiers of any sort (triplet, occurrenceID or whatever else).
When it comes to the varying source identifiers, I have always liked the idea of flagging those records and datasets as unstable, so it is obvious to users. This is not 100% safe, but the most troublesome datasets change all of their ids, and that is easily detectable.
Also with a service like that it would become more obvious to publishers how important stable source ids are.
Before jumping on DOIs as the next big thing I would really like to understand what needs the community has around specimen ids.
Gabi clearly has a very real use case, are there others we know about?
Markus
On 05 May 2014, at 21:05, Roderic Page r.page@bio.gla.ac.uk wrote:
Hi Hilmar,
I’m not arguing that we shouldn’t build a resolver (I have one that I use, Rich has mentioned he’s got one, Markus has one at GBIF, etc.).
Nor do I think we should wait for institutional and social commitment (because then we’d never get anything done).
But I do think it would be useful to think it through. For example, it’s easy to create a URL for a specimen. Easy peasy. OK, how do I discover that URL? How do I discover these for all specimens? Sounds like I need a centralised discovery service like you’ve described.
How do I handle changes in those URLs? I built a specimen-code-to-GBIF resolver for BioStor so that I could link to specimens; GBIF changed lots of those URLs, all my work was undone, boy does GBIF suck sometimes. For example, if I map codes to URLs, I need to handle cases where they change.
If URLs can change, is there a way to defend against that (this is one reason for DOIs, or other methods of indirection, such as PURLs).
If providers change, will the URLs change? Is there a way to defend against that (again, DOIs handle this nicely by virtue of (a) indirection, and (b) lack of branding).
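To make the indirection idea concrete: a resolver is, at its simplest, a lookup table from stable identifiers to current locations, so when a provider moves, only the table row changes and every published citation keeps working. A minimal Python sketch follows; the identifier scheme, URL, and function name are invented for illustration, not any existing service's API:

    # Minimal sketch of identifier indirection (the idea behind DOIs and
    # PURLs). Clients cite only the stable identifier; the resolver maps
    # it to whatever URL is current. The entry below is invented data.
    CURRENT_LOCATION = {
        "urn:catalog:BGBM:Herbarium:B100004712":
            "http://example.org/specimens/B100004712",
    }

    def resolve(stable_id):
        """Return the current URL for a stable identifier."""
        try:
            return CURRENT_LOCATION[stable_id]
        except KeyError:
            raise LookupError("unknown identifier: " + stable_id)

    # When the provider changes its web technology, only the mapping is
    # updated; the stable_id cited in papers never changes.
    print(resolve("urn:catalog:BGBM:Herbarium:B100004712"))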
How can I encourage people to use the specimen service? What can I do to make them think it will persist? Can I convince academic publishers to trust it enough to link to it in articles? What’s the pitch to Pensoft, to Magnolia Press, to Springer and Elsevier?
Is there some way to make the service itself become trusted? For example, if I look at a journal and see that it has DOIs issued by CrossRef, I take that journal more seriously than if it’s just got simple URLs. I know that papers in that journal will be linked into the citation network, and I also know that there is a backup plan if the journal goes under (because you need that to have DOIs in CrossRef). Likewise, I think Figshare got a big boost when it started minting DOIs (wow, a DOI, I know DOIs, you mean I can now cite stuff I’ve uploaded there?).
How can museums and herbaria be persuaded to keep their identifiers stable? What incentives can we provide (e.g., citation metrics for collections)? What system would enable us to do this? What about tracing funding (e.g., the NSF paid for these n papers, and they cite these y specimens, from these z collections, so science paid for by the NSF requires these collections to exist).
I guess I’m arguing that we should think all this through, because a specimen-code-to-specimen-URL service is a small piece of the puzzle. Now, I’m desperately trying not to simply say what I think is blindingly obvious here (put DOIs on specimens, add metadata to specimen and specimen citation services, and we are done), but I think if we sit back and look at where we want to be, this is exactly what we need (or something functionally equivalent). Until we see the bigger picture, we will be stuck in amateur hour.
Take a look at:
http://www.crossref.org/fundref/
https://prospect.crossref.org/splash/
Isn’t this the kind of stuff we’d like to do? If so, let’s work out what’s needed and make it happen.
In short, I think we constantly solve an immediate problem in the quickest way we know how, without thinking it through. I’d argue that if we think about the bigger picture (what do we want to be able to do, what are the questions we want to be able to ask) then things become clearer. This is independent of getting everyone’s agreement (but it would help if we made their agreement seem a no-brainer by providing solutions to things that cause them pain).
Regards
Rod
On 5 May 2014, at 19:14, Hilmar Lapp hlapp@nescent.org wrote:
On Mon, May 5, 2014 at 1:29 PM, Roderic Page r.page@bio.gla.ac.uk wrote:
Contrary to Hilmar, there is more to this than simply a quick hackathon. Yes, a service that takes metadata and returns one or more identifiers is a good idea and easy to create (there will often be more than one because museum codes are not unique). But who maintains this service? Who maintains the identifiers? Who do I complain to if they break? How do we ensure that they persist when, say, a museum closes down, moves its collection, or changes its web technology? Who provides the tools that add value to the identifiers? (There’s no point having them if they are not useful.)
Jonathan Rees pointed this out to me too, off-list. Just for the record, this isn't contrary to, but fully in line with, what I was saying (or trying to say). Yes, I didn't elaborate that part, assuming, perhaps rather erroneously, that all this goes without saying; but I did mention that one part of this becoming a real solution has to be an institution with an in-scope cyberinfrastructure mandate that, going in, would make a commitment to sustain the resolver, including working with partners on the above slew of questions. The example I gave was iDigBio; perhaps for some reason they would not be a good choice, but whether they are or not wasn't my point.
I will add one point to this, though. It seems to me that by continuing to argue that we can't go ahead with building a resolver that works (as far as technical requirements are concerned) before we have fully addressed the institutional and social long-term sustainability problem, we are, and have been, making this one big hairy problem that we can't make any practical headway on, rather than breaking it down into parts, some of which (namely the primarily technical ones) are actually fairly straightforward to solve. As a result, to this day we don't have a solution that, even though not yet sustainable, at least proves to everyone how critical it is, and that the community can rally behind. Perhaps that's naïve, but I do think that once there's a solution the community rallies behind, ways to sustain it will be found.
-hilmar
--
Hilmar Lapp -:- informatics.nescent.org/wiki -:- lappland.io
Rob, the question/debate about the “best” GUID is a complex one that appears unending after about 7 years and counting. Is there any aspect of this question that does not have two (or three) sides, with proponents (some strong and vocal) on each side? We still don’t have a community plurality on any “best” approach, much less a majority. We have a few voices, but we need a chorus.
Chuck
From: robgur@gmail.com [mailto:robgur@gmail.com] On Behalf Of Robert Guralnick Sent: Monday, May 05, 2014 9:23 AM To: Chuck Miller Cc: Markus Döring; Dröge, Gabriele; tdwg-content@lists.tdwg.org; John Deck; tomc@cs.uoregon.edu; Nico Cellinese Subject: Re: [tdwg-content] delimiter characters for concatenated IDs
We've been examining the use (and mis-use) of the DwC triplet, and how it propagates out of local portals and platforms into other ones. The end message from this work (and I am happy to share the manuscript and all the datasets we have compiled and examined) is that it is a _terrible_ choice for a globally unique identifier.
There are so many better choices that don't rely on delimiters, or on what is ultimately a non-globally-unique, non-persistent, non-resolvable choice for a (permanent, resolvable, globally unique) identifier. Instead of having this conversation, I wonder why we aren't having one about ALL the other, more rational choices...
Best, Rob
On Mon, May 5, 2014 at 8:14 AM, Chuck Miller <Chuck.Miller@mobot.org> wrote: Markus, Didn’t we reach a general consensus within the last couple of years that the vertical pipe (|) was the preferred concatenation symbol?
Chuck
From: tdwg-content-bounces@lists.tdwg.org [mailto:tdwg-content-bounces@lists.tdwg.org] On Behalf Of Markus Döring Sent: Monday, May 05, 2014 8:49 AM To: "Dröge, Gabriele" Cc: tdwg-content@lists.tdwg.org Subject: Re: [tdwg-content] delimiter characters for concatenated IDs
Hi Gabi, can you explain a little more what you are trying to do, perhaps giving an example?

It appears to me you are creating (globally) unique identifiers on the basis of various existing fields, which is fine. But when you use the identifier to create resource relations, it should be considered opaque, and you should not need to parse out the underlying pieces again. So in that scenario the character used to concatenate the triplet does not really matter for the end user, as long as the result is unique and points to some existing resource, indicated by the occurrenceID in the case of occurrences or the materialSampleID for samples.
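One way to make such a concatenated identifier unambiguous, whatever delimiter is chosen, is to percent-encode each component before joining, so the delimiter can never occur inside a component. The Python sketch below is only an illustration of that idea, not a GGBN or Darwin Core convention, and the sample values are invented:

    # Sketch: build an opaque ID from the DwC triplet by percent-encoding
    # each component before joining with "|". A pipe inside a component
    # becomes %7C, so the joined string is unambiguous -- but, as argued
    # above, consumers should still treat it as opaque.
    from urllib.parse import quote, unquote

    def make_id(institution_code, collection_code, catalog_number):
        parts = (institution_code, collection_code, catalog_number)
        return "|".join(quote(p, safe="") for p in parts)

    def split_id(concatenated):
        # For the producer's own round-tripping only; consumers should
        # never need to parse the identifier.
        return tuple(unquote(p) for p in concatenated.split("|"))

    cid = make_id("BGBM", "DNA Bank", "DB|4711")  # pipe inside the number
    assert cid == "BGBM|DNA%20Bank|DB%7C4711"
    assert split_id(cid) == ("BGBM", "DNA Bank", "DB|4711")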
Best, Markus
Hi everyone,
thanks for your responses, it seems to be a hot topic ;) I agree a global web service would be fantastic, but in the first place it would only work for GBIF records if located at GBIF. Triple IDs change quite often and are not unique worldwide. The occurrenceID should be unique, but only a few providers use it at all, let alone as a truly unique identifier.

If you ask me, I am fine with using the triple ID together with an access point (BioCASE URL or DwC-A URL), but unfortunately the relatedResource class in Darwin Core does not allow this. It only contains relatedResourceID. So we have no choice but to use a concatenated version of the triple ID.

We need to refer from DNA to Tissue to Specimen to whatever else. Any single object/record can be located in different institutions/databases, and only the Specimen data are provided to GBIF.

The pipe (|) is used in several Catalogue Numbers and therefore can’t be used. That § cannot be found on English keyboards is, for me, an advantage, because then it is also less likely to appear in the triple ID.
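Before fixing a convention, it is cheap to scan the actual identifier values for each candidate delimiter; a delimiter is only safe if it never occurs inside any component it separates. A small Python sketch of such a check, with invented sample values:

    # Sketch: test candidate delimiters against catalogue numbers.
    # The sample values below are invented, not real GGBN data.
    candidates = ["|", "||", "\\", "§§", "§|"]
    catalogue_numbers = ["B 10 0004712", "DB|4711", "Type #3 (dupl.)"]

    for delim in candidates:
        clashes = [v for v in catalogue_numbers if delim in v]
        if clashes:
            print(repr(delim), "collides with", clashes)
        else:
            print(repr(delim), "is safe for this sample")

Note that a scan like this only proves the absence of collisions in today's data; escaping or encoding the components is the only way to make a delimiter safe against future values.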
So I think it would be great if we could discuss this at TDWG this year, but we (GGBN) need a solution now. So either we build our own, or we find a more generic one very soon. I agree that we are quite close to a solution and need a suitable roadmap to realize it.
So I guess we should try to propose another workshop at TDWG if there are still free slots available.
Best, Gabi
participants (17)

- "Dröge, Gabriele"
- Bob Morris
- Cellinese, Nico
- Chuck Miller
- Hilmar Lapp
- John Deck
- John Deck
- Kampmeier, Gail E
- Markus Döring
- Markus Döring
- Mike Trizna
- Richard Pyle
- Robert Guralnick
- Roderic Page
- Roderic Page
- Steve Baskauf
- Tim Robertson [GBIF]