Re: [tdwg-content] delimiter characters for concatenated IDs

5 May 2014

Like Rob, I can't get dragged into this discussion in full right now (departing for an expedition Sunday; much to do between now and then). However, I will make these comments:

1) More than ten years ago it was already very clear that the DwC triplet would not serve the needs of globally unique identifiers, which led to a push for proper identifiers in our community by the SEEK project, and later to two separate workshops on GUIDs supported by GBIF & TDWG. The latter yielded LSIDs (which, at the time, appeared to be the least of evils, with PURLs as the next plausible option, and DOIs & other Handles a distant third; hindsight has taught us some things since then).

2) A decade later, we are still arguing about the same things (in part because people not involved with those earlier efforts are discovering the same problems that were discussed back then).

3) We have built a simple identifier cross-referencing service along the lines of what Hilmar outlined, and it has proven to be EXTREMELY powerful. We have plans to further enrich and expand the service later this year. It currently works on a two-part approach to identifiers (IdentifierDomain + Identifier), where the former is globally unique and the latter is any text string that is unique within the context of the IdentifierDomain. It would require very little additional effort to expand the service to accommodate three-part inputs (à la DwC triplets, where the institutionCode and collectionCode would together uniquely represent an IdentifierDomain, and catalogNumber would represent the Identifier). Suggestions & input welcome. Our service is currently serving identifiers for Agents, References, and TaxonNameUsage instances, but could very easily be expanded to other objects. It currently exists in GNUB-space, but we plan to separate it out into a generalized service (consumed by GNUB) later this year.

4) Getting back to the original question; we have standardized internally on two delimiters to allow for two-tier nested arrays, such that the pipe (|) serves the function of delimiting primary objects, and the tilde (~) serves the function of delimiting components within primary objects.  For example, a nested array of DwC triplets for the Bishop Museum fish collection would look something like this:

BPBM~I~1234|BPBM~I~9876|BPBM~I~5678

Note the difference between "I" (collectionCode for Ichthyology) and the pipe (primary delimiter).

Not recommending this as a standard; just reporting what has worked very well for us internally. We haven't yet needed to escape either the pipe or the tilde. When I say we use this "internally", it's because externally we typically parse stuff in JSON.
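
The two-tier scheme in point 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not our internal code: the function names `encode` and `decode` are hypothetical, and since no escaping rule is defined above, the sketch simply rejects any component that already contains a pipe or a tilde.

```python
PRIMARY = "|"    # separates primary objects (here, one DwC triplet each)
COMPONENT = "~"  # separates components within one primary object


def encode(triplets):
    """Join (institutionCode, collectionCode, catalogNumber) tuples into one string."""
    for triplet in triplets:
        for part in triplet:
            # Assumption: no escaping is defined, so delimiters inside a value are an error.
            if PRIMARY in part or COMPONENT in part:
                raise ValueError(f"component contains a delimiter: {part!r}")
    return PRIMARY.join(COMPONENT.join(triplet) for triplet in triplets)


def decode(text):
    """Split a delimited string back into a list of component tuples."""
    return [tuple(obj.split(COMPONENT)) for obj in text.split(PRIMARY)]


triplets = [("BPBM", "I", "1234"), ("BPBM", "I", "9876"), ("BPBM", "I", "5678")]
encoded = encode(triplets)
print(encoded)          # BPBM~I~1234|BPBM~I~9876|BPBM~I~5678
print(decode(encoded))
```

Rejecting delimiter characters up front keeps the round trip lossless; a collection whose catalog numbers can contain "|" or "~" would need an escaping convention instead.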

Aloha,
Rich

Richard L. Pyle, PhD
Database Coordinator for Natural Sciences
Associate Zoologist in Ichthyology
Dive Safety Officer
Department of Natural Sciences, Bishop Museum
1525 Bernice St., Honolulu, HI 96817
Ph: (808)848-4115, Fax: (808)847-8252
email: deepreef@bishopmuseum.org
http://hbs.bishopmuseum.org/staff/pylerichard.html
...
-----Original Message-----
From: tdwg-content-bounces@lists.tdwg.org [mailto:tdwg-content-bounces@lists.tdwg.org] On Behalf Of Bob Morris
Sent: Monday, May 05, 2014 6:46 AM
To: Chuck Miller
Cc: tdwg-content@lists.tdwg.org; John Deck; tomc@cs.uoregon.edu
Subject: Re: [tdwg-content] delimiter characters for concatenated IDs
Chuck
Hilmar is not proposing a service for management of all identifiers, he is
proposing discovery of existing, preferably resolvable and dereferenceable,
identifiers based on queries for specimen record metadata such as DwC
triplets, together with minting of resolvable ones when none is discoverable.
Except on performance grounds---and possibly not even then--- this does
not even require all the discoverable identifiers be held on the same
machine as the proposed service is hosted, nor even on a single machine at
all.
Hilmar's proposal, which I concur is useful and simple to accomplish, is
independent of the quality, syntax, specification or utility of the returned
identifiers, all of which are much argued in this thread and in this list from the
beginning of time.  Producing such a service is not beyond the skills required
for an assignment in an undergraduate software engineering course and
certainly could be accomplished in a few days' hackathon such as Hilmar
proposes.  As with any discovery service, its ultimate utility depends on the
minters promoting underlying discoverability of the identifiers themselves.
But that too is fairly trivial and well-understood, e.g. by the listing of them in
resolvers' SiteMaps in published ways that major spiders can find and index
them.  An example is [1].
[1] Sitemap Formats and Guidelines
https://support.google.com/webmasters/answer/183668?hl=en
On Mon, May 5, 2014 at 10:54 AM, Chuck Miller <Chuck.Miller@mobot.org>
wrote:
...
Hilmar,
A “global” resolver that manages globally unique resolvable
identifiers for every single specimen record in the world (billions?)
as a web-service should be operated by a hosting facility with a
global charter and globally funded resources.  That is the definition
of GBIF to my understanding.  What other specimen/observation
repository has greater critical mass to “mint”
and maintain GUIDs for all the world?
Chuck
From: hilmar.lapp@gmail.com [mailto:hilmar.lapp@gmail.com] On Behalf
Of Hilmar Lapp
Sent: Monday, May 05, 2014 9:47 AM
To: Robert Guralnick
Cc: Chuck Miller; tdwg-content@lists.tdwg.org; John Deck;
tomc@cs.uoregon.edu
Subject: Re: [tdwg-content] delimiter characters for concatenated IDs
I couldn't agree more.
I would also ask why there still isn't a global resolver as a
web-service that takes specimen metadata as input (such as the DwC
triplet) and returns globally unique resolvable identifiers, minting
them if necessary. If the technologically savvy people of this
community came together, this could be built at least as a prototype
in a couple of days. As I've suggested to iDigBio before, they could
hold a hackathon on this, commit to hosting and further developing the
outcome, and the problem would be solved once and for all. It would
arguably be fully within their mandate.
If instead of the many workshops that have been held on talking about
the problem we as a community would finally will ourselves to actually
solving it, that part really isn't so difficult.
-hilmar
On Mon, May 5, 2014 at 10:23 AM, Robert Guralnick
<Robert.Guralnick@colorado.edu> wrote:
We've been examining the use (and misuse) of the DwC triplet, and how that
propagates out of local portals and platforms into other ones. The end
message from this work (and I am happy to share the manuscript and all
the datasets we have compiled and examined) is that it is a _terrible_
choice for a global unique identifier.
There are so many better choices, that don't rely on delimiters or
on what is ultimately a non-globally unique, non-persistent, non-resolvable
choice for a (permanent, resolvable, globally unique)
identifier.  As opposed to having this conversation, I wonder why we
aren't having one about ALL the other more rational choices...
Best, Rob
On Mon, May 5, 2014 at 8:14 AM, Chuck Miller <Chuck.Miller@mobot.org>
wrote:
Markus,
Didn’t we reach a general consensus within the last couple of years
that the vertical pipe (|) was the preferred concatenation symbol?
Chuck
From: tdwg-content-bounces@lists.tdwg.org
[mailto:tdwg-content-bounces@lists.tdwg.org] On Behalf Of Markus
Döring
Sent: Monday, May 05, 2014 8:49 AM
To: "Dröge, Gabriele"
Cc: tdwg-content@lists.tdwg.org
Subject: Re: [tdwg-content] delimiter characters for concatenated IDs
Hi Gabi,
can you explain a little more what you are trying to do giving an
example maybe?
It appears to me you are creating (globally) unique identifiers on the
basis of various existing fields which is fine. But when you use the
identifier to create resource relations they should be considered
opaque and you should not need to parse out the underlying pieces
again. So in that scenario the character used to concatenate the
triplet does not really matter for the end user as long as it is unique
and points to some existing resource, indicated by the occurrenceID in
case of occurrences or the materialSampleID for samples.
Best,
Markus
On 05 May 2014, at 15:24, Dröge, Gabriele <g.droege@BGBM.ORG> wrote:
Hi everyone,
I guess there might have been some discussions about proper delimiter
characters in the past that I have missed.
In several projects, first of all in GGBN (Global Genome Biodiversity
Network, http://www.ggbn.org), there is a need for making a decision
now. We need to reference between different records and databases and
within Darwin Core we want to use the relatedResourceID to do so.
During our GGBN workshop at TDWG last year we agreed on concatenating
the traditional triple ID (Catalogue Number, Collection Code,
Institution Code) and adding further parameters if required (e.g.
GUID, access point). We have checked those parameters and can
definitely not use a single character as delimiter.
So my question to you is whether there are already some suggestions on
using two characters together as delimiters. It would be great if we
could find a solution more than one community could agree on.
Otherwise I would like to open the discussion and suggest "\\", "||",
"\|", "§|", "§§", or "\§".
Best wishes,
Gabi
-----------------------------------------------------------------
Gabriele Droege
Coordinator - DNA Bank Network
Global Genome Biodiversity Network (GGBN)
Berlin-Dahlem DNA Bank
Women's Officer ZE BGBM
Botanic Garden and Botanical Museum Berlin-Dahlem
Freie Universität Berlin
Koenigin-Luise-Str. 6-8
14195 Berlin
Germany
+49 30 838 50 139
www.dnabank-network.org
www.ggbn.org
www.bgbm.org
_______________________________________________
tdwg-content mailing list
tdwg-content@lists.tdwg.org
http://lists.tdwg.org/mailman/listinfo/tdwg-content
--
Hilmar Lapp -:- informatics.nescent.org/wiki -:- lappland.io
--
Robert A. Morris
Emeritus Professor  of Computer Science
UMASS-Boston
100 Morrissey Blvd
Boston, MA 02125-3390
Filtered Push Project
Harvard University Herbaria
Harvard University
email: morris.bob@gmail.com
web: http://efg.cs.umb.edu/
web: http://wiki.filteredpush.org
http://www.cs.umb.edu/~ram
===
The content of this communication is made entirely on my own behalf and in
no way should be deemed to express official positions of The University of
Massachusetts at Boston or Harvard University.
