[tdwg-tag] Specimen identifiers [SEC=UNCLASSIFIED]

Mon Feb 27 22:17:48 CET 2012

In all of this discussion I am surprised that there has been no mention 
of Biodiversity Collections Index (BCI; 
http://www.biodiversitycollectionsindex.org/).  To my knowledge, it has 
never been "down" for any significant period of time and has an 
extremely comprehensive listing of collections.  Any collection that 
isn't there can be added in a matter of a few minutes.

URLs are globally unique because a centralized authority (ICANN) makes 
sure that no two entities can have the same domain name.  It is the 
responsibility of the domain owner not to have two URLs that are the 
same within that domain.  In other words, the domain owner identifies 
their resources using locally unique identifiers, which in combination 
with the domain name create a globally unique identifier.

BCI essentially performs an analogous function to ICANN in the 
biodiversity informatics community.  It assigns a unique number to each 
collection and ensures that no two collections can have the same 
number.  It slaps that number onto the end of the string 
"urn:lsid:biocol.org:col:" to create an LSID and onto the end of 
"http://biocol.org/urn:lsid:biocol.org:col:" to create an HTTP URI, both 
of which are globally unique, actionable (in their own ways), and 
persistent.
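
As a minimal sketch of the two constructions just described (Python; the 
function names are hypothetical and only the two prefix strings come from 
this message), the number 15590 is the BCI number used for the LSU 
herbarium later in this post:

```python
# Prefixes quoted in the message; everything else here is illustrative.
LSID_PREFIX = "urn:lsid:biocol.org:col:"
HTTP_PREFIX = "http://biocol.org/urn:lsid:biocol.org:col:"


def bci_lsid(collection_number: int) -> str:
    """Return the LSID form of a BCI collection identifier."""
    return f"{LSID_PREFIX}{collection_number}"


def bci_http_uri(collection_number: int) -> str:
    """Return the HTTP URI form of a BCI collection identifier."""
    return f"{HTTP_PREFIX}{collection_number}"


if __name__ == "__main__":
    print(bci_lsid(15590))      # urn:lsid:biocol.org:col:15590
    print(bci_http_uri(15590))  # http://biocol.org/urn:lsid:biocol.org:col:15590
```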

All of the hand wringing about people changing their collection codes or 
institution codes, or about two institutions in different fields (or 
units within the same institution) having the same institution codes 
goes away if we simply use the BCI-assigned number to identify the 
collection.  Within a particular collection, it is the institution's 
responsibility to create and maintain locally unique identifiers for 
their specimens.  BCI has a systematic way to relate subcollections 
within collections (each with their own identifier), and a large 
institution with subcollections would just have to decide at what 
level the coordination of locally unique identifiers would be done.  
Nobody outside the institution can do it for them - they just need to 
bear the responsibility to stick with a system and not change it.

I mention this because there are really three categories of 
specimen-containing institutions:
1. Those with enough stability and the financial and IT resources to 
generate and provide dereferencing for their own actionable GUIDs.
2. Those with the ability to generate and maintain a database of 
non-HTTP-dereferenceable globally unique identifiers (I'm thinking about 
UUIDs or UUIDs that are part of LSIDs) and to associate them with 
specimens in their database, but which do not have the IT infrastructure 
or the inclination to provide actionability for their globally unique 
identifiers.
3. Those who have a system of assigning locally unique identifiers (I'm 
thinking bar codes) to their specimens but who because of small size 
will probably never have sophisticated IT capabilities nor the ability 
to provide dereferencing for actionable GUIDs.

Either category 2 or category 3 could include institutions that do not 
have control over a stable domain name, or that have institutional 
restrictions on the use of domain names that would preclude use of their 
domain name as part of an HTTP URI.

Category 1 institutions create HTTP URI GUIDs using their domain names 
and do whatever they want as far as the locally unique part of their 
GUID is concerned.  Their freedom comes with the responsibility of 
providing dereferencing under their domain name forever.

Category 2 and 3 institutions create globally unique and persistent, but 
not (yet) dereferenceable identifiers with the hope of transforming them 
into HTTP URIs at a later time.  Category 2 institutions have this 
already in the form of their UUIDs.  Category 3 institutions create 
their own globally unique identifiers by means of a simple rule: "place 
the BCI number for our collection, followed by a slash, in front of our 
locally unique identifier" (e.g. "15590/" for the LSU herbarium + 
"LSU00000434" for the barcode to create "15590/LSU00000434" as an 
identifier for the specimen shown at 
http://images.cyberfloralouisiana.com/images/specimensheets/lsu/0/0/4/34/LSU00000434_l.jpg).  
Category 3 institutions go to BCI and write in the "note" for their 
collection what their rule is and then anybody who knows the barcode (or 
accession number or whatever kind of locally unique number they commit 
to) for the specimen knows the non-actionable globally unique 
identifier.  If the institution already consistently uses a "Darwin Core 
triple" (institutionCode:collectionCode:catalogNumber) as a "poor-man's 
GUID" in their database, they could slap "the BCI number for our 
collection, followed by a slash" in front of it to guarantee that it 
does not clash with any other Darwin Core triples.
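
The category 3 rule above could be sketched as follows (Python; the 
function name and the empty-value check are assumptions for illustration, 
and the colon-separated triple in the second call is a made-up value, not 
a real record):

```python
def specimen_identifier(bci_number: int, local_id: str) -> str:
    """Place the BCI collection number and a slash in front of a local ID.

    `local_id` is whatever locally unique string the institution has
    committed to: a barcode, an accession number, or an existing
    colon-separated triple.
    """
    local_id = local_id.strip()
    if not local_id:
        raise ValueError("local identifier must not be empty")
    return f"{bci_number}/{local_id}"


if __name__ == "__main__":
    # The LSU herbarium example from the message.
    print(specimen_identifier(15590, "LSU00000434"))  # 15590/LSU00000434
    # Hypothetical triple, shown only to illustrate the prefixing.
    print(specimen_identifier(15590, "INST:COLL:12345"))  # 15590/INST:COLL:12345
```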

As for the transformation of the non-actionable globally unique 
identifiers created by category 2 and 3 institutions into actionable 
ones, a benevolent large institution (let us assume GBIF) that is willing 
to take on the job of providing dereferencing services for the category 
2 and 3 institutions acquires "http://purl.org/specimen/" (or some other 
purl.org name if that's already taken) to use as the means to 
create the HTTP-proxied forms of the non-actionable globally unique 
identifiers.  I suggest using a purl.org prefix rather than a 
subdomain of gbif.org in case, sometime in the next hundred years, GBIF 
loses its funding or gets tired of providing this service.  (See 
http://www.nbii.gov/termination/index.html for an example of how a big 
program with a nearly 20 year history can disappear in a puff of 
political idiocy.)  If necessary, the "http://purl.org/specimen/" prefix 
could get passed over to some other big benevolent institution without 
requiring GBIF to give control of part of their domain to a non-GBIF 
entity.

Now we have another simple rule.  If we discover an identifier that has 
http:// at its front end, we dereference it to access its metadata.  If 
we discover an identifier which we think represents a specimen that does 
not begin with "http://", we try putting "http://purl.org/specimen/" on 
the front of it.  If nothing happens we are no worse off than before.  
If we are lucky, we get metadata.  Preferably the proxy system would get 
established quickly and we would tell the category 3 institutions to 
place "http://purl.org/specimen/" + "the BCI number for our collection, 
followed by a slash, in front of our locally unique identifier".  But if 
in typical TDWG fashion it takes five years to decide to do this, the 
small institution still has an identifier (in the form of the 
non-actionable identifier) guaranteed to be globally unique among 
identifiers generated by institutions who agree to abide by this set of 
rules.  In any case, we don't risk mucking up the Linked Data cloud with 
a bunch of synonymous URIs that need to be linked with owl:sameAs, since 
the UUIDs and category 3 globally unique identifiers can't be used as 
URI references in RDF.  One could later write in RDF:

<rdf:Description  rdf:about="http://purl.org/specimen/15590/LSU00000434">
      <dc:identifier>15590/LSU00000434</dc:identifier>
</rdf:Description>

to make sure that semantic clients understand that the non-URI globally 
unique identifier is associated with the proxied version.
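
The "try the proxy prefix" rule described above could look roughly like 
this on the client side (a Python sketch under stated assumptions: 
"http://purl.org/specimen/" is only the prefix proposed in this message, 
not an established service; the function names, the Accept header, and 
the acceptance of "https://" are my own additions):

```python
import urllib.request
from typing import Optional

# Proposed in this message; assumed here, not an existing resolver.
PROXY_PREFIX = "http://purl.org/specimen/"


def candidate_uri(identifier: str) -> str:
    """Return the URI to try for an identifier found in the wild."""
    identifier = identifier.strip()
    # Identifiers that are already HTTP URIs are dereferenced as they are.
    if identifier.lower().startswith(("http://", "https://")):
        return identifier
    # Anything else gets the proxy prefix placed in front of it.
    return PROXY_PREFIX + identifier


def fetch_metadata(identifier: str, timeout: float = 10.0) -> Optional[str]:
    """Try to dereference the identifier; return None if nothing happens."""
    request = urllib.request.Request(
        candidate_uri(identifier),
        headers={"Accept": "application/rdf+xml"},  # assumed media type
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):
        # No metadata: we are no worse off than before.
        return None


if __name__ == "__main__":
    print(candidate_uri("15590/LSU00000434"))
    # http://purl.org/specimen/15590/LSU00000434
```

A failed lookup returns None rather than raising, which matches the 
"no worse off than before" behaviour of the rule.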

There would be technical details to figure out how the information about 
the specimens would be transferred between the smaller data-providing 
institution and the benevolent provider of dereferencing, but people are 
already doing that with GBIF so it doesn't seem so impossible to imagine 
that this could be worked out.

The unveiling of BCI was done with great fanfare and it is one of the 
few biodiversity-related resources which actually follows all of the 
rules about persistent, actionable, and unique identifiers.  Yet it 
rarely gets mentioned any more.  Let's leverage it.

Steve

On 2/26/2012 9:27 PM, Paul Murray wrote:
> On 25/02/2012, at 4:29 AM, Dean Pentcheff wrote:
>
>> This is directly in response to Rod's response to Paul. I think the two of you may have just articulated nearly the same idea, though you seem not to think you did.
>>
>> Paul envisions institutions each declaring their own URI-creating formula (to resolve down to a specimen at that institution), promulgated at a "forum" location.
>>
>> Rod envisions URI formulation as happening at a GBIFesque centralized site.
>>
>> If Paul's forum were GBIF (or similar), with an added function that GBIF (or similar) renegotiates any institutional declaration that collides with a pre-existing declaration, does that map to the same thing for both of you?
> Well, if institutions are assigning URIs with their own domain names in them, or if GBIF is handing out URI prefixes that the institutions use, then collisions wouldn't be an issue.
>
> As a technical person, perhaps I don't quite see things from the point of view of institutions whose interest in the web stops at having a pretty website, as someone suggested. It seems to me the easiest thing in the world to spark up a server and say "these are our URIs". But if people are outsourcing their web presence, then I can appreciate that creating a SemWeb presence might not seem as easy a thing to do to them. This is also the case for people who live in large institutions with byzantine rules about what may and may not go on the corporate websites.
>
> If there are places where the issuing of ids to specimens is as chaotic as Rod describes, well - I think the flip side of what I was saying earlier, that people that create the numbers can easily create URIs, is that if the people who create the numbers have bits and bobs all over the place, then an external institution like GBIF is not going to be able to sort it out remotely. Someone has to be on the ground, treading the dusty caverns under the museum, their feeble yellowish torch beam counterpoint to the flickering and burned-out bluish fluorescent lights above, flicking the spiders away and copying labels into their iPad and working out what's what, trying not to accidentally kick over the skeletons.
>
> Or the equivalent in cyberspace - the forgotten databases with their cryptic column names distant echoes of those hidden recesses where the specimen boxes are packed.
>
> A start might be:
>
> * GBIF issues URI prefixes to people/institutions that want them. A system for doing this would need to be decided on, and that will involve (shudder) people.
> * GBIF advises the institution on setting up the namespace under that, trying to make the point that URIs should be persistent, unique, all those good things
> * GBIF acts as a registry for these namespaces, a place to declare "if you have a specimen record from collection X, then for sem-web purposes the URI should look like *this*" - allowing all that legacy data to be knitted together.
>
> The GBIF webserver might manage incoming http requests by
> * holding some very basic, minimal data - even just a dcterms:title and nothing else
> * or, 303 redirecting to the institution's own webserver (much in the manner of a PURL server) according to rules expressed simply as a regular expression find/replace.
> * or, fetching the RDF from the institutions' server, and ADDING some RDF facts of its own to the result
>
> This third option means that the GBIF database can serve as a central spot where movements of specimens (ie, the assignment of a new accession number) can be put. Hopefully not the only spot, though. Best practice is always to serve up the initial and the immediately prior URI along with any URI you give to the specimen. (this only makes sense for RDF, though: you can't just "add" things to a nicely formatted HTML page).
>
> To make all this happen, you would want some sort of usable machine-to-machine service, you'll have to manage authentication (Passwords are a bit of a pain - perhaps a cryptographic certificate given out when the namespace prefix is assigned? Easy enough to do.). You'll want a test/staging service and a real service …
>
> It's a fair bit of work, come to think of it, just on the technical side, and this is without starting on the "part-of" issues.
>
> ----------------
> (Perhaps "uri.gbif.org" as the virtual host name? http://uri.gbif.org/institution-code/collection-id/number. We'd also like a URI for "the list of institutions" and for each institution "the list of collections". Perhaps reserve "meta"? Thus http://uri.gbif.org/uq/meta, http://uri.gbif.org/uq/collectionX/meta as the well-known locations for README information.)
>
> (Allocation of URIs would cover more than just specimens. Here at biodiversity.org.au, we use dotted names rather than slashes for our namespaces, meaning that our URIs have natural LSID equivalents. I think LSID components can have slashes, so urn:lsid:uri.gbif.org:uq/collectionX:12345)
>

-- 
Steven J. Baskauf, Ph.D., Senior Lecturer
Vanderbilt University Dept. of Biological Sciences

postal mail address:
VU Station B 351634
Nashville, TN  37235-1634,  U.S.A.

delivery address:
2125 Stevenson Center
1161 21st Ave., S.
Nashville, TN 37235

office: 2128 Stevenson Center
phone: (615) 343-4582,  fax: (615) 343-6707
http://bioimages.vanderbilt.edu