[tdwg-tag] Specimen identifiers [SEC=UNCLASSIFIED]
steve.baskauf at vanderbilt.edu
Mon Feb 27 22:17:48 CET 2012
In all of this discussion I am surprised that there has been no mention
of the Biodiversity Collections Index (BCI;
http://www.biodiversitycollectionsindex.org/). To my knowledge, it has
never been "down" for any significant period of time and has an
extremely comprehensive listing of collections. Any collection that
isn't there can be added in a matter of a few minutes.
URLs are globally unique because a centralized authority (ICANN)
makes sure that no two entities can have the same
domain name. It is the responsibility of the domain owner to not have
two URLs that are the same within that domain. In other words, the
domain owner makes sure that they identify their resources using locally
unique identifiers which in combination with the domain name creates a
globally unique identifier.
BCI essentially performs an analogous function to ICANN in the
biodiversity informatics community. It assigns a unique number to each
collection and ensures that no two collections can have the same
number. It slaps that number onto the end of the string
"urn:lsid:biocol.org:col:" to create an LSID and onto the end of
"http://biocol.org/urn:lsid:biocol.org:col:" to create an HTTP URI, both
of which are globally unique, actionable (in their own ways), and
persistent.
All of the hand wringing about people changing their collection codes or
institution codes, or about two institutions in different fields (or
units within the same institution) having the same institution codes
goes away if we simply use the BCI-assigned number to identify the
collection. Within a particular collection, it is the institution's
responsibility to create and maintain locally unique identifiers for
their specimens. BCI has a systematic way to relate subcollections
within collections (each with their own identifier) and a large
institution with subcollections would just have to delegate at what
level the coordination of locally unique identifiers would be done.
Nobody outside the institution can do it for them - they just need to
bear the responsibility to stick with a system and not change it.
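
As a minimal sketch of the two BCI identifier forms described above (the
function names are my own for illustration, not part of any BCI API; only
the two prefix strings come from the text):

```python
# Prefixes quoted from the message; the function names are hypothetical.
LSID_PREFIX = "urn:lsid:biocol.org:col:"
HTTP_PREFIX = "http://biocol.org/urn:lsid:biocol.org:col:"


def bci_lsid(collection_number: int) -> str:
    """Build the LSID form by appending the BCI collection number."""
    return f"{LSID_PREFIX}{collection_number}"


def bci_http_uri(collection_number: int) -> str:
    """Build the HTTP URI form by appending the same number."""
    return f"{HTTP_PREFIX}{collection_number}"


if __name__ == "__main__":
    # 15590 is the BCI number used for the LSU herbarium in this message.
    print(bci_lsid(15590))
    print(bci_http_uri(15590))
```

Both forms differ only in the prefix, so the BCI-assigned number is the
one piece that has to stay stable.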
I mention this because there are really three categories of institutions:
1. Those with enough stability and the financial and IT resources to
generate and provide dereferencing for their own actionable GUIDs.
2. Those with the ability to generate and maintain a database of
non-HTTP-dereferenceable globally unique identifiers (I'm thinking about
UUIDs or UUIDs that are part of LSIDs) and to associate them with
specimens in their database, but which do not have the IT infrastructure
or the inclination to provide actionability for their globally unique
identifiers.
3. Those who have a system of assigning locally unique identifiers (I'm
thinking bar codes) to their specimens but who because of small size
will probably never have sophisticated IT capabilities nor the ability
to provide dereferencing for actionable GUIDs.
Either category 2 or 3 would include institutions that do not have
control over a stable domain name or which have institutional
restrictions on the use of domain names that would preclude use of their
domain name as part of an HTTP URI.
Category 1 institutions create HTTP URI GUIDs using their domain names
and do whatever they want as far as the locally unique part of their
GUID is concerned. Their freedom comes with the responsibility of
providing dereferencing under their domain name forever.
Category 2 and 3 institutions create globally unique and persistent, but
not (yet) dereferenceable identifiers with the hope of transforming them
into HTTP URIs at a later time. Category 2 institutions have this
already in the form of their UUIDs. Category 3 institutions create
their own globally unique identifiers by means of a simple rule: "place
the BCI number for our collection, followed by a slash, in front of our
locally unique identifier" (e.g. "15590/" for the LSU herbarium +
"LSU00000434" for the barcode to create "15590/LSU00000434" as an
identifier for the specimen shown at
Category 3 institutions go to BCI and write in the "note" for their
collection what their rule is and then anybody who knows the barcode (or
accession number or whatever kind of locally unique number they commit
to) for the specimen knows the non-actionable globally unique
identifier. If the institution already consistently uses a "Darwin Core
triple" (institutionCode:collectionCode:catalogNumber) as a "poor-man's
GUID" in their database, they could slap "the BCI number for our
collection, followed by a slash" in front of it to guarantee that it
didn't clash with any other Darwin Core triples.
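
The category 3 rule could be sketched roughly as follows. The function
name and the validation checks are my own assumptions; only the "BCI
number, slash, local identifier" pattern and the LSU example come from
the text above.

```python
def specimen_identifier(bci_number: int, local_id: str) -> str:
    """Prefix a locally unique identifier with the collection's BCI number.

    Hypothetical helper: it only applies the rule "BCI number, then a
    slash, then the local identifier" and performs no registry lookup.
    """
    local_id = local_id.strip()
    if not local_id:
        raise ValueError("local identifier must not be empty")
    if bci_number <= 0:
        raise ValueError("BCI number must be a positive integer")
    return f"{bci_number}/{local_id}"


if __name__ == "__main__":
    # The LSU herbarium example from the message.
    print(specimen_identifier(15590, "LSU00000434"))
```

The same call works if the local identifier is an existing Darwin Core
triple, since the rule does not care what the local part looks like as
long as the institution keeps it unique and unchanged.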
As for the transformation of the non-actionable globally unique
identifiers created by category 2 and 3 institutions into actionable
ones, a benevolent large institution (let us assume GBIF) that is willing
to take on the job of providing dereferencing services for the category
2 and 3 institutions acquires "http://purl.org/specimen/" (or some other
purl.org name if that's already taken) to use as the means to
create the HTTP-proxied forms of the non-actionable globally unique
identifiers. I suggest using a purl.org prefix rather than using a
subdomain of gbif.org in the event that in the next hundred years GBIF
loses their funding or gets tired of providing this service. (See
http://www.nbii.gov/termination/index.html for an example of how a big
program with a nearly 20 year history can disappear in a puff of
political idiocy.) If necessary, the "http://purl.org/specimen/" prefix
could get passed over to some other big benevolent institution without
requiring GBIF to give control of part of their domain to a non-GBIF
entity.
Now we have another simple rule. If we discover an identifier that has
http:// at its front end, we dereference it to access its metadata. If
we discover an identifier which we think represents a specimen that does
not begin with "http://", we try putting "http://purl.org/specimen/" on
the front of it. If nothing happens we are no worse off than before.
If we are lucky, we get metadata. Preferably the proxy system would get
established quickly and we would tell the category 3 institutions to "place
'http://purl.org/specimen/' + the BCI number for our collection,
followed by a slash, in front of our locally unique identifier". But if
in typical TDWG fashion it takes five years to decide to do this, the
small institution still has an identifier (in the form of the
non-actionable identifier) guaranteed to be globally unique among
identifiers generated by institutions who agree to abide by this set of
rules. In any case, we don't risk mucking up the Linked Data cloud with
a bunch of synonymous URIs that need to be linked with owl:sameAs, since
the UUIDs and category 3 globally unique identifiers can't be used as
URI references in RDF. One could later write in RDF:
to make sure that semantic clients understand that the non-URI globally
unique identifier is associated with the proxied version.
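
The client-side rule described above might look something like this. It
is only a sketch: "http://purl.org/specimen/" is the prefix proposed in
this message, not a service known to exist, and the function name is
hypothetical. The code builds the URI to try but does not make any
network request.

```python
# Proxy prefix proposed in the message; assumed, not a live service.
PROXY_PREFIX = "http://purl.org/specimen/"


def candidate_uri(identifier: str) -> str:
    """Return the URI a client would try to dereference.

    Identifiers already starting with "http://" are returned unchanged.
    Anything else (a UUID, or a "BCI number/local id" string) gets the
    proxy prefix placed in front of it.
    """
    identifier = identifier.strip()
    if identifier.startswith("http://"):
        return identifier
    return PROXY_PREFIX + identifier


if __name__ == "__main__":
    print(candidate_uri("15590/LSU00000434"))
    print(candidate_uri("http://biocol.org/urn:lsid:biocol.org:col:15590"))
```

If the proxied URI does not resolve, the client is no worse off than
before, which is the point of the rule.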
There would be technical details to figure out how the information about
the specimens would be transferred between the smaller data-providing
institution and the benevolent provider of dereferencing, but people are
already doing that with GBIF so it doesn't seem so impossible to imagine
that this could be worked out.
The unveiling of BCI was done with great fanfare and it is one of the
few biodiversity-related resources which actually follows all of the
rules about persistent, actionable, and unique identifiers. Yet it
rarely gets mentioned any more. Let's leverage it.
On 2/26/2012 9:27 PM, Paul Murray wrote:
> On 25/02/2012, at 4:29 AM, Dean Pentcheff wrote:
>> This is directly in response to Rod's response to Paul. I think the two of you may have just articulated nearly the same idea, though you seem not to think you did.
>> Paul envisions institutions each declaring their own URI-creating formula (to resolve down to a specimen at that institution), promulgated at a "forum" location.
>> Rod envisions URI formulation as happening at a GBIFesque centralized site.
>> If Paul's forum were GBIF (or similar), with an added function that GBIF (or similar) renegotiates any institutional declaration that collides with a pre-existing declaration, does that map to the same thing for both of you?
> Well, if institutions are assigning URIs with their own domain names in them, or if GBIF is handing out URI prefixes that the institutions use, then collisions wouldn't be an issue.
> As a technical person, perhaps I don't quite see things from the point of view of institutions whose interest in the web stops at having a pretty website, as someone suggested. It seems to me the easiest thing in the world to spark up a server and say "these are our URIs". But if people are outsourcing their web presence, then I can appreciate that creating a SemWeb presence might not seem as easy a thing to do to them. This is also the case for people who live in large institutions with byzantine rules about what may and may not go on the corporate websites.
> If there are places where the issuing of ids to specimens is as chaotic as Rod describes, well - I think the flip side of what I was saying earlier, that people that create the numbers can easily create URIs, is that if the people who create the numbers have bits and bobs all over the place, then an external institution like GBIF is not going to be able to sort it out remotely. Someone has to be on the ground, treading the dusty caverns under the museum, their feeble yellowish torch beam counterpoint to the flickering and burned-out bluish fluorescent lights above, flicking the spiders away and copying labels into their iPad and working out what's what, trying not to accidentally kick over the skeletons.
> Or the equivalent in cyberspace - the forgotten databases with their cryptic column names distant echoes of those hidden recesses where the specimen boxes are packed.
> A start might be:
> * GBIF issues URI prefixes to people/institutions that want them. A system for doing this would need to be decided on, and that will involve (shudder) people.
> * GBIF advises the institution on setting up the namespace under that, trying to make the point that URIs should be persistent, unique, all those good things
> * GBIF acts as a registry for these namespaces, a place to declare "if you have a specimen record from collection X, then for sem-web purposes the URI should look like *this*" - allowing all that legacy data to be knitted together.
> The GBIF webserver might manage incoming http requests by
> * holding some very basic, minimal data - even just a dcterms:title and nothing else
> * or, 303 redirecting to the institution's own webserver (much in the manner of a PURL server) according to rules expressed simply as a regular expression find/replace.
> * or, fetching the RDF from the institution's server, and ADDING some RDF facts of its own to the result
> This third option means that the GBIF database can serve as a central spot where movements of specimens (i.e., the assignment of a new accession number) can be put. Hopefully not the only spot, though. Best practice is always to serve up the initial and the immediately prior URI along with any URI you give to the specimen. (This only makes sense for RDF, though: you can't just "add" things to a nicely formatted HTML page.)
> To make all this happen, you would want some sort of usable machine-to-machine service, you'll have to manage authentication (Passwords are a bit of a pain - perhaps a cryptographic certificate given out when the namespace prefix is assigned? Easy enough to do.). You'll want a test/staging service and a real service …
> It's a fair bit of work, come to think of it, just on the technical side, and this is without starting on the "part-of" issues.
> (Perhaps "uri.gbif.org" as the virtual host name? http://uri.gbif.org/institution-code/collection-id/number. We'd also like a URI for "the list of institutions" and for each institution "the list of collections". Perhaps reserve "meta"? Thus http://uri.gbif.org/uq/meta, http://uri.gbif.org/uq/collectionX/meta as the well-known locations for README information.)
> (Allocation of URIs would cover more than just specimens. Here at biodiversity.org.au, we use dotted names rather than slashes for our namespaces, meaning that our URIs have natural LSID equivalents. I think LSID components can have slashes, so urn:lsid:uri.gbif.org:uq/collectionX:12345)
Steven J. Baskauf, Ph.D., Senior Lecturer
Vanderbilt University Dept. of Biological Sciences
postal mail address:
VU Station B 351634
Nashville, TN 37235-1634, U.S.A.
2125 Stevenson Center
1161 21st Ave., S.
Nashville, TN 37235
office: 2128 Stevenson Center
phone: (615) 343-4582, fax: (615) 343-6707