[tdwg-tag] Specimen identifiers [SEC=UNCLASSIFIED]

Tue Feb 28 11:23:47 CET 2012

I'm trying not to get sucked into this discussion but thank you for all the kind words about BCI - flattery will get you almost anywhere!

I'll say my tuppence worth but I have not followed everything so please excuse me if I am out of line.

I am just working on a contribution to a paper that I hope will sum up these thoughts.

Basically I am nervous about any middleman approach to issuing identifiers for specimens. For publications it is different as one may be able to retrieve the actual works from several places. If a DOI resolves to metadata about the work that is often enough because the metadata can be used to retrieve the actual publication from a library somewhere even if the publisher's site is gone.

If you are reading a paper that talks about a specimen and you want to find out more about it, you invariably already have the key metadata in the paper (location and recent determination). What you want to do is actually see the specimen from the authoritative source. To do this you need to resolve an identifier back to that source. It doesn't matter if you have a middleman running a DOI/LSID/PURL service; you still need the target *data* HTTP URI to be live or the link is "broken".

Collections must maintain live HTTP URIs for each specimen for click through to raw data to work. There are no quick third party fixes. 

Most specimens are in big collections, and doing this is a matter of education and resource prioritisation, not a total lack of resources.

Talking about middleman solutions just muddies the water because managers begin to think they can outsource the solution and it will go away. They can't. Maintaining an online catalogue is now a core curation task.

None of this precludes the fact that we need big indexes and services linking things together, but again this is different from publications: specimens don't contain many links to other specimens, whereas linking to other works is a major feature of publications. Specimens tend to have pointers to them and not to point to other things.

Enough already. I have some deadlines.

Roger

On 27 Feb 2012, at 21:59, Roderic Page wrote:

> Dear Steve,
> 
> I like BCI -- Roger Hyam did a very nice job creating this service. Indeed, I think Roger was offering to set up something rather like what you describe (see http://www.biocol.org/static/bcisgs.html ).
> 
> BCI would be one way to create a namespace for specimen identifiers. As always, there's more than one such tool in our community. The Repository of Biological Repositories (http://biorepositories.org/) is a similar service from the barcoding community, and I gather there are moves to try and integrate these two resources (sigh). The other consideration would be how the BCI identifiers actually map to digital resources at the institutions (for example do the BCI identifiers map onto the dataset ids that GBIF has for each collection?).
> 
> Let's hope that implementing resolvable specimen identifiers does not take the typical five years to actually happen...
> 
> Regards
> 
> Rod
> 
> On 27 Feb 2012, at 21:17, Steve Baskauf wrote:
> 
>> In all of this discussion I am surprised that there has been no mention 
>> of Biodiversity Collections Index (BCI; 
>> http://www.biodiversitycollectionsindex.org/).  To my knowledge, it has 
>> never been "down" for any significant period of time and has an 
>> extremely comprehensive listing of collections.  Any collection that 
>> isn't there can be added in a matter of a few minutes.
>> 
>> The reason why URLs are globally unique is because a centralized 
>> authority (ICANN) makes sure that no two entities can have the same 
>> domain name.  It is the responsibility of the domain owner to not have 
>> two URLs that are the same within that domain.  In other words, the 
>> domain owner makes sure that they identify their resources using locally 
>> unique identifiers which in combination with the domain name creates a 
>> globally unique identifier.
>> 
>> BCI essentially performs an analogous function to ICANN in the 
>> biodiversity informatics community.  It assigns a unique number to each 
>> collection and ensures that no two collections can have the same 
>> number.  It slaps that number onto the end of the string 
>> "urn:lsid:biocol.org:col:" to create an LSID and onto the end of 
>> "http://biocol.org/urn:lsid:biocol.org:col:" to create an HTTP URI, both 
>> of which are globally unique, actionable (in their own ways), and 
>> persistent.
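
As a minimal sketch of the construction just described (Python, purely illustrative): the two prefixes are the ones quoted above, while the function name and the validation are assumptions made for the example, not part of any BCI API.

```python
from typing import Tuple

# Prefixes as quoted in the message above.
BCI_LSID_PREFIX = "urn:lsid:biocol.org:col:"
BCI_HTTP_PREFIX = "http://biocol.org/urn:lsid:biocol.org:col:"


def bci_identifiers(collection_number: int) -> Tuple[str, str]:
    """Return (LSID, HTTP URI) for a BCI collection number.

    Hypothetical helper: it only appends the number to each prefix.
    """
    if collection_number <= 0:
        raise ValueError("collection number must be a positive integer")
    suffix = str(collection_number)
    return BCI_LSID_PREFIX + suffix, BCI_HTTP_PREFIX + suffix


# 15590 is the number given later in this message for the LSU herbarium.
lsid, http_uri = bci_identifiers(15590)
print(lsid)
print(http_uri)
```

Uniqueness comes entirely from the registry never issuing the same number twice; the code adds nothing beyond string concatenation.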
>> 
>> All of the hand wringing about people changing their collection codes or 
>> institution codes, or about two institutions in different fields (or 
>> units within the same institution) having the same institution codes 
>> goes away if we simply use the BCI-assigned number to identify the 
>> collection.  Within a particular collection, it is the institution's 
>> responsibility to create and maintain locally unique identifiers for 
>> their specimens.  BCI has a systematic way to relate subcollections 
>> within collections (each with their own identifier) and a large 
>> institution with subcollections would just have to delegate at what 
>> level the coordination of locally unique identifiers would be done.  
>> Nobody outside the institution can do it for them - they just need to 
>> bear the responsibility to stick with a system and not change it.
>> 
>> I mention this because there are really three categories of 
>> specimen-containing institutions:
>> 1. Those with enough stability and the financial and IT resources to 
>> generate and provide dereferencing for their own actionable GUIDs.
>> 2. Those with the ability to generate and maintain a database of 
>> non-HTTP-dereferenceable globally unique identifiers (I'm thinking about 
>> UUIDs or UUIDs that are part of LSIDs) and to associate them with 
>> specimens in their database, but which do not have the IT infrastructure 
>> or the inclination to provide actionability for their globally unique 
>> identifiers.
>> 3. Those who have a system of assigning locally unique identifiers (I'm 
>> thinking bar codes) to their specimens but who because of small size 
>> will probably never have sophisticated IT capabilities nor the ability 
>> to provide dereferencing for actionable GUIDs.
>> 
>> Either category 2 or category 3 would include institutions that do not have 
>> control over a stable domain name or which have institutional 
>> restrictions on the use of domain names that would preclude use of their 
>> domain name as part of an HTTP URI.
>> 
>> Category 1 institutions create HTTP URI GUIDs using their domain names 
>> and do whatever they want as far as the locally unique part of their 
>> GUID is concerned.  Their freedom comes with the responsibility of 
>> providing dereferencing under their domain name forever.
>> 
>> Category 2 and 3 institutions create globally unique and persistent, but 
>> not (yet) dereferenceable identifiers with the hope of transforming them 
>> into HTTP URIs at a later time.  Category 2 institutions have this 
>> already in the form of their UUIDs.  Category 3 institutions create 
>> their own globally unique identifiers by means of a simple rule: "place 
>> the BCI number for our collection, followed by a slash, in front of our 
>> locally unique identifier" (e.g. "15590/" for the LSU herbarium + 
>> "LSU00000434" for the barcode to create "15590/LSU00000434" as an 
>> identifier for the specimen shown at 
>> http://images.cyberfloralouisiana.com/images/specimensheets/lsu/0/0/4/34/LSU00000434_l.jpg).  
>> Category 3 institutions go to BCI and write in the "note" for their 
>> collection what their rule is and then anybody who knows the barcode (or 
>> accession number or whatever kind of locally unique number they commit 
>> to) for the specimen knows the non-actionable globally unique 
>> identifier.  If the institution already consistently uses a "Darwin Core 
>> triple" (institutionCode:collectionCode:catalogNumber) as a "poor-man's 
>> GUID" in their database, they could slap "the BCI number for our 
>> collection, followed by a slash" in front of it to guarantee that it 
>> didn't clash with any other Darwin Core triples.
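
The category 3 rule above could be sketched roughly as follows (Python). The function name and the whitespace handling are assumptions for illustration; the collection number and barcode are the ones from the LSU example.

```python
def specimen_identifier(bci_number: int, local_id: str) -> str:
    """Prefix a locally unique identifier with the BCI collection number.

    Hypothetical helper implementing the rule "BCI number, then a slash,
    then the local identifier". The local identifier can be a barcode,
    an accession number, or a Darwin Core triple.
    """
    local_id = local_id.strip()
    if not local_id:
        raise ValueError("local identifier must not be empty")
    return f"{bci_number}/{local_id}"


# The LSU herbarium example from the message.
print(specimen_identifier(15590, "LSU00000434"))
```

The result is globally unique only among institutions that follow the same rule and never reuse or change their local identifiers.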
>> 
>> As for the transformation of the non-actionable globally unique 
>> identifiers created by category 2 and 3 institutions into actionable 
>> ones, a benevolent large institution (let us assume GBIF) who is willing 
>> to take on the job of providing dereferencing services for the category 
>> 2 and 3 institutions acquires "http://purl.org/specimen/" (or some other 
>> purl.org name if that's already taken) to use as the means to 
>> create the HTTP-proxied forms of the non-actionable globally unique 
>> identifiers.  I suggest using a purl.org prefix rather than using a 
>> subdomain of gbif.org in the event that in the next hundred years GBIF 
>> loses their funding or gets tired of providing this service.  (See 
>> http://www.nbii.gov/termination/index.html for an example of how a big 
>> program with a nearly 20 year history can disappear in a puff of 
>> political idiocy.)  If necessary, the "http://purl.org/specimen/" prefix 
>> could get passed over to some other big benevolent institution without 
>> requiring GBIF to give control of part of their domain to a non-GBIF 
>> entity.
>> 
>> Now we have another simple rule.  If we discover an identifier that has 
>> http:// at its front end, we dereference it to access its metadata.  If 
>> we discover an identifier which we think represents a specimen that does 
>> not begin with "http://", we try putting "http://purl.org/specimen/" on 
>> the front of it.  If nothing happens we are no worse off than before.  
>> If we are lucky, we get metadata.  Preferably the proxy system would get 
>> established quickly and we would tell the category 3 institutions to 
>> "place 'http://purl.org/specimen/' + the BCI number for our collection, 
>> followed by a slash, in front of our locally unique identifier".  But if 
>> in typical TDWG fashion it takes five years to decide to do this,  the 
>> small institution still has an identifier (in the form of the 
>> non-actionable identifier) guaranteed to be globally unique among 
>> identifiers generated by institutions who agree to abide by this set of 
>> rules.  In any case, we don't risk mucking up the Linked Data cloud with 
>> a bunch of synonymous URIs that need to be linked with owl:sameAs, since 
>> the UUIDs and category 3 globally unique identifiers can't be used as 
>> URI references in RDF.  One could later write in RDF:
>> 
>> <rdf:Description  rdf:about="http://purl.org/specimen/15590/LSU00000434">
>>      <dc:identifier>15590/LSU00000434</dc:identifier>
>> </rdf:Description>
>> 
>> to make sure that semantic clients understand that the non-URI globally 
>> unique identifier is associated with the proxied version.
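
The "simple rule" for clients described above might look something like this (Python). It is a sketch under stated assumptions: the purl.org prefix is the one proposed in this message, not a service known to exist, and accepting "https://" alongside "http://" is an addition made for the example.

```python
# Proxy prefix proposed in the message; assumed here, not a live service.
PROXY_PREFIX = "http://purl.org/specimen/"


def actionable_uri(identifier: str) -> str:
    """Return the URI a client would try to dereference.

    HTTP(S) URIs are returned unchanged. Anything else is treated as a
    non-actionable specimen identifier and gets the proxy prefix.
    """
    identifier = identifier.strip()
    if identifier.lower().startswith(("http://", "https://")):
        return identifier
    return PROXY_PREFIX + identifier


print(actionable_uri("15590/LSU00000434"))
print(actionable_uri("http://purl.org/specimen/15590/LSU00000434"))
```

Whether the resulting URI actually resolves is a separate question: if nothing answers, the client is no worse off than before.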
>> 
>> There would be technical details to figure out how the information about 
>> the specimens would be transferred between the smaller data-providing 
>> institution and the benevolent provider of dereferencing, but people are 
>> already doing that with GBIF so it doesn't seem so impossible to imagine 
>> that this could be worked out.
>> 
>> The unveiling of BCI was done with great fanfare and it is one of the 
>> few biodiversity-related resources which actually follows all of the 
>> rules about persistent, actionable, and unique identifiers.  Yet it 
>> rarely gets mentioned any more.  Let's leverage it.
>> 
>> Steve
>> 
>> On 2/26/2012 9:27 PM, Paul Murray wrote:
>>> On 25/02/2012, at 4:29 AM, Dean Pentcheff wrote:
>>> 
>>>> This is directly in response to Rod's response to Paul. I think the two of you may have just articulated nearly the same idea, though you seem not to think you did.
>>>> 
>>>> Paul envisions institutions each declaring their own URI-creating formula (to resolve down to a specimen at that institution), promulgated at a "forum" location.
>>>> 
>>>> Rod envisions URI formulation as happening at a GBIFesque centralized site.
>>>> 
>>>> If Paul's forum were GBIF (or similar), with an added function that GBIF (or similar) renegotiates any institutional declaration that collides with a pre-existing declaration, does that map to the same thing for both of you?
>>> Well, if institutions are assigning URIs with their own domain names in them, or if GBIF is handing out URI prefixes that the institutions use, then collisions wouldn't be an issue.
>>> 
>>> As a technical person, perhaps I don't quite see things from the point of view of institutions whose interest in the web stops at having a pretty website, as someone suggested. It seems to me the easiest thing in the world to spark up a server and say "these are our URIs". But if people are outsourcing their web presence, then I can appreciate that creating a SemWeb presence might not seem as easy a thing to do to them. This is also the case for people who live in large institutions with byzantine rules about what may and may not go on the corporate websites.
>>> 
>>> If there are places where the issuing of ids to specimens is as chaotic as Rod describes, well - I think the flip side of what I was saying earlier, that people that create the numbers can easily create URIs, is that if the people who create the numbers have bits and bobs all over the place, then an external institution like GBIF is not going to be able to sort it out remotely. Someone has to be on the ground, treading the dusty caverns under the museum, their feeble yellowish torch beam counterpoint to the flickering and burned-out bluish fluorescent lights above, flicking the spiders away and copying labels into their iPad and working out what's what, trying not to accidentally kick over the skeletons.
>>> 
>>> Or the equivalent in cyberspace - the forgotten databases with their cryptic column names distant echoes of those hidden recesses where the specimen boxes are packed.
>>> 
>>> A start might be:
>>> 
>>> * GBIF issues URI prefixes to people/institutions that want them. A system for doing this would need to be decided on, and that will involve (shudder) people.
>>> * GBIF advises the institution on setting up the namespace under that, trying to make the point that URIs should be persistent, unique, all those good things
>>> * GBIF acts as a registry for these namespaces, a place to declare "if you have a specimen record from collection X, then for sem-web purposes the URI should look like *this*" - allowing all that legacy data to be knitted together.
>>> 
>>> The GBIF webserver might manage incoming http requests by
>>> * holding some very basic, minimal data - even just a dcterms:title and nothing else
>>> * or, 303 redirecting to the institution's own webserver (much in the manner of a PURL server) according to rules expressed simply as a regular expression find/replace.
>>> * or, fetching the RDF from the institution's server, and ADDING some RDF facts of its own to the result
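
The second option, redirect rules expressed as a regular expression find/replace, could be sketched along these lines (Python). The rule table, the request path layout, and the target host specimens.example.org are all hypothetical; a real server would answer with a 303 See Other and put the computed target in the Location header.

```python
import re
from typing import Optional

# Hypothetical rule table: (pattern on the request path, replacement URI).
REDIRECT_RULES = [
    (
        re.compile(r"^/uq/collectionX/(\d+)$"),
        r"http://specimens.example.org/collectionX/\1",
    ),
]


def redirect_target(path: str) -> Optional[str]:
    """Return the URI to redirect to, or None if no rule matches."""
    for pattern, replacement in REDIRECT_RULES:
        if pattern.match(path):
            # The pattern is anchored at both ends, so sub() rewrites
            # the whole path into the target URI.
            return pattern.sub(replacement, path)
    return None


print(redirect_target("/uq/collectionX/12345"))
print(redirect_target("/unknown/path"))
```

Keeping the rules as data means an institution can change its target server by editing one table entry while the public URIs stay the same.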
>>> 
>>> This third option means that the GBIF database can serve as a central spot where movements of specimens (i.e., the assignment of a new accession number) can be recorded. Hopefully not the only spot, though. Best practice is always to serve up the initial and the immediately prior URI along with any URI you give to the specimen. (This only makes sense for RDF, though: you can't just "add" things to a nicely formatted HTML page.)
>>> 
>>> To make all this happen, you would want some sort of usable machine-to-machine service, and you'll have to manage authentication (passwords are a bit of a pain; perhaps a cryptographic certificate given out when the namespace prefix is assigned? Easy enough to do). You'll want a test/staging service and a real service …
>>> 
>>> It's a fair bit of work, come to think of it, just on the technical side, and this is without starting on the "part-of" issues.
>>> 
>>> ----------------
>>> (Perhaps "uri.gbif.org" as the virtual host name? http://uri.gbif.org/institution-code/collection-id/number. We'd also like a URI for "the list of institutions" and for each institution "the list of collections". Perhaps reserve "meta"? Thus http://uri.gbif.org/uq/meta, http://uri.gbif.org/uq/collectionX/meta as the well-known locations for README information.)
>>> 
>>> (Allocation of URIs would cover more than just specimens. Here at biodiversity.org.au, we use dotted names rather than slashes for our namespaces, meaning that our URIs have natural LSID equivalents. I think LSID components can have slashes, so urn:lsid:uri.gbif.org:uq/collectionX:12345)
>>> 
>>> 
>>> 
>>> 
>>> _______________________________________________
>>> tdwg-tag mailing list
>>> tdwg-tag at lists.tdwg.org
>>> http://lists.tdwg.org/mailman/listinfo/tdwg-tag
>>> 
>>> .
>>> 
>> 
>> -- 
>> Steven J. Baskauf, Ph.D., Senior Lecturer
>> Vanderbilt University Dept. of Biological Sciences
>> 
>> postal mail address:
>> VU Station B 351634
>> Nashville, TN  37235-1634,  U.S.A.
>> 
>> delivery address:
>> 2125 Stevenson Center
>> 1161 21st Ave., S.
>> Nashville, TN 37235
>> 
>> office: 2128 Stevenson Center
>> phone: (615) 343-4582,  fax: (615) 343-6707
>> http://bioimages.vanderbilt.edu
>> 
>> 
>> 
> 
> ---------------------------------------------------------
> Roderic Page
> Professor of Taxonomy
> Institute of Biodiversity, Animal Health and Comparative Medicine
> College of Medical, Veterinary and Life Sciences
> Graham Kerr Building
> University of Glasgow
> Glasgow G12 8QQ, UK
> 
> Email: r.page at bio.gla.ac.uk
> Tel: +44 141 330 4778
> Fax: +44 141 330 2792
> AIM: rodpage1962 at aim.com
> Facebook: http://www.facebook.com/profile.php?id=1112517192
> Twitter: http://twitter.com/rdmpage
> Blog: http://iphylo.blogspot.com
> Home page: http://taxonomy.zoology.gla.ac.uk/rod/rod.html
> 
