[tdwg-guid] First step in implementing LSIDs

Sat Jun 9 15:01:29 CEST 2007

Thinking a bit more about this issue, I've decided that perhaps it  
doesn't matter quite as much as I thought. I've posted my argument on  
my blog (http://iphylo.blogspot.com/2007/06/rethinking-lsids-versus-http-uri.html),
partly because it has links and a picture. The text
is pasted below for the sake of archiving it in this list.

Regards

Rod

The TDWG-GUID mailing list for this month has a discussion of whether  
TDWG should commit to LSIDs as the GUID of choice. Since the first  
GUID workshop TDWG has pretty much been going down this route,  
despite a growing chorus of voices (including mine) that LSIDs are  
not first class citizens of the Web, and don't play well with the  
Semantic Web.

Leaving aside political considerations (this stuff needs to be  
implemented as soon as possible, concerns that if TDWG advocates HTTP  
URIs people will just treat them as URLs and miss the significance of  
persistence and RDF, worries that biodiversity will be ghettoised if  
it doesn't conform to what is going on elsewhere), I think there is a
way to resolve this that may keep most people happy (or at least,  
they could live with it). My perspective is driven by trying to  
separate needs of primary data providers from application developers,  
and issues of digital preservation.

I'll try and spell out the argument below, but to cut to the chase, I  
will argue

    1. A GUID system needs to provide a globally unique identifier  
for an object, and a means of retrieving information about that object.

    2. Any of the current technologies we've discussed (LSIDs, DOIs,  
Handles) do this (to varying degrees), hence any would do as a GUID.

    3. Most applications that use these GUIDs will use Semantic Web  
tools, and hence will use HTTP URIs.

    4. These HTTP URIs will be unique to the application; the GUIDs,
however, will be shared.

    5. No third party application can serve an HTTP URI that doesn't  
belong to its domain.

    6. Digital preservation will rely on widely distributed copies of  
data, these cannot have the same HTTP URI.

From this I think that both parties to this debate are right, and we
will end up using both LSIDs and HTTP URIs, and that's OK.  
Application developers will use HTTP URIs, but will use clients that  
can handle the various kinds of GUIDs. Data providers will use the  
GUID technology that is easiest for them to get up and running (for  
specimens this is likely to be LSIDs; for literature some providers
may use Handles via DSpace, some may use URLs).
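
As a rough illustration of what "clients that can handle the various kinds of GUIDs" might mean in practice, here is a minimal Python sketch. The function names (parse_guid, resolver_url) and the simplified patterns are assumptions made for this example, not part of any TDWG specification. It relies only on the public doi.org and hdl.handle.net proxies for DOIs and Handles; LSIDs are recognised but not resolved, because that needs a proper LSID client.

```python
import re
from urllib.parse import quote

# Simplified patterns for the identifier schemes discussed in this post.
# They are illustrative and do not validate every legal form.
_PATTERNS = [
    ("http", re.compile(r"^(https?://\S+)$", re.IGNORECASE)),
    ("lsid", re.compile(
        r"^(urn:lsid:[^:\s]+:[^:\s]+:[^:\s]+(?::[^:\s]+)?)$", re.IGNORECASE)),
    ("doi", re.compile(r"^(?:doi:|info:doi/)?(10\.\d{4,9}/\S+)$", re.IGNORECASE)),
    ("handle", re.compile(r"^hdl:(\S+/\S+)$", re.IGNORECASE)),
    ("pmid", re.compile(r"^(?:pmid:|info:pmid/)(\d+)$", re.IGNORECASE)),
]


def parse_guid(guid):
    """Return (scheme, value) for a GUID string, or ("unknown", text)."""
    text = guid.strip()
    for scheme, pattern in _PATTERNS:
        match = pattern.match(text)
        if match:
            return scheme, match.group(1)
    return "unknown", text


def resolver_url(guid):
    """Return an HTTP URL that dereferences the GUID, or None if this
    sketch has no HTTP route for it (e.g. LSIDs need an LSID client)."""
    scheme, value = parse_guid(guid)
    if scheme == "http":
        return value
    if scheme == "doi":
        return "https://doi.org/" + quote(value, safe="/")
    if scheme == "handle":
        return "https://hdl.handle.net/" + quote(value, safe="/")
    return None
```

The point of the sketch is that the application, not the data provider, carries the knowledge of how each kind of GUID is dereferenced.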

Individual objects get GUIDs

If individual objects get GUIDs, then this has implications for HTTP  
URIs. If the HTTP URI is the GUID, an object can only be served from  
one place. It may be cached elsewhere, but that cached copy can't  
have the same HTTP URI. Any database that makes use of the HTTP URI  
cannot serve that HTTP URI itself; it needs to refer to it in some
way. This being the case, whether the GUID is a HTTP URI or not  
starts to look a lot less important, because there is only one place  
we can get the original data from -- the original data provider. Any  
application that builds on this data will need its own identifier if
people are going to make use of that application's output.

Connotea as an example

As a concrete example, consider Connotea. This application uses  
dereferenceable GUIDs such as DOIs and Pubmed ids to retrieve
publications. DOIs and Pubmed ids are not HTTP URIs, and hence aren't  
first class citizens of the Web. But Connotea serves its own records  
as HTTP URIs, and URIs with the prefix "rss" return RDF (like this)  
and hence can be used "as is" by Semantic Web tools such as SPARQL.

If we look at some Connotea RDF, we see that it contains the original  
DOIs and Pubmed ids.

This means that if two Connotea users bookmark the same paper, we  
could deduce that they are the same paper by comparing the embedded  
GUIDs. In the same way, we could combine RDF from Connotea and  
another application (such as bioGUID) that has information on the  
same paper. Why not use the original GUIDs? Well, for starters there  
are two of them (info:pmid/17079492 and info:doi/10.1073/pnas.0605858103),
so which to use? Secondly, they aren't HTTP URIs, and if
they were we'd go straight to CrossRef or NCBI, not Connotea. Lastly,  
we lose the important information that the bookmarks are different
-- they were made by two different people (or agents).
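
The deduction described above (records with different HTTP URIs refer to the same paper if they embed a common GUID) can be sketched in a few lines of Python. This is only an illustration: the record URIs under example.org are made up, the function name is an assumption, and real RDF would have to be parsed first to extract the embedded GUIDs.

```python
from collections import defaultdict


def group_by_shared_guid(records):
    """Group record URIs that share at least one embedded GUID.

    `records` maps each record's own HTTP URI to the set of GUIDs
    embedded in it. Returns a sorted list of sorted URI lists.
    """
    parent = {uri: uri for uri in records}

    def find(uri):
        # Follow parent links to the group representative,
        # shortening the path as we go.
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]
            uri = parent[uri]
        return uri

    first_seen = {}  # GUID -> first record URI that carried it
    for uri, guids in records.items():
        for guid in guids:
            if guid in first_seen:
                parent[find(uri)] = find(first_seen[guid])
            else:
                first_seen[guid] = uri

    groups = defaultdict(set)
    for uri in records:
        groups[find(uri)].add(uri)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical record URIs; the GUIDs are the ones quoted in this post.
bookmarks = {
    "http://connotea.example.org/user/alice/1": {
        "info:pmid/17079492", "info:doi/10.1073/pnas.0605858103"},
    "http://connotea.example.org/user/bob/2": {
        "info:doi/10.1073/pnas.0605858103"},
    "http://bioguid.example.org/paper/3": {"info:pmid/17079492"},
    "http://connotea.example.org/user/bob/4": {
        "info:doi/10.1186/1471-2105-6-48"},
}
same_paper = group_by_shared_guid(bookmarks)
```

Each record keeps its own HTTP URI, so the fact that Alice and Bob made separate bookmarks is preserved, while the shared GUIDs tell us the first three records are about one paper.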

Applications will use HTTP URIs

We want to go to Connotea (and Connotea wants us to go to it) because  
it gives us additional information, such as the tags added by users.  
Likewise, bioGUID adds links to sequences referred to in the paper.  
Web applications that build on GUIDs want to add value, and need to  
add value partly because the quality of the original data may suck.  
For example, metadata provided by CrossRef is limited, DiGIR  
providers manage to mangle even basic things like dates, and in my  
experience many records provided by DiGIR sources that lack  
geocoordinates have, in fact, been georeferenced (based on reading  
papers about those specimens). The metadata associated with Handles  
is often appallingly bad, and don't get me started on what utter  
gibberish GenBank has in its specimen voucher fields.

Hence, applications will want to edit much of this data to correct  
and improve it, and to make that edited version available they will  
need their own identifiers, i.e. HTTP URIs. This ranges from social  
bookmarking tools like Connotea, to massive databases like FreeBase.

Digital durability

Digital preservation is also relevant. How do we ensure our digital  
records are durable? Well, we can't ensure this (see Clay Shirky's  
talk at LongNow), but one way to make them more durable is massive  
redundancy -- multiple copies in many places. Indeed, given the  
limited functionality of the current GBIF portal, I would argue that  
GBIF's main role at present is to make specimen data more durable.
DiGIR providers are not online 24/7, but if their data are in GBIF  
those data are still available. Of course, GBIF could not use the  
same GUID as the URI for that data; like Connotea, it would have to
store the original GUID in the GBIF copy of the record.

In the same way, the taxonomic literature of ants is unlikely to  
disappear anytime soon, because a single paper can be in multiple  
places. For example, Engel et al.'s paper on ants in Cretaceous Amber  
is available in at least four places:

     * BioOne (doi:10.1206/0003-0082(2005)485[0001:PNAICA]2.0.CO;2)

     * AMNH DSpace (hdl:2246/5676)

     * AntBase (http://antbase.org/ants/publications/20967/20967.pdf)

     * Internet Archive (http://www.archive.org/details/ants_20967)

Which of the four HTTP URIs you can click on should be the GUID for  
this paper? -- none of them.

LSIDs and the Semantic Web

LSIDs don't play well with the Semantic Web. My feeling is that we  
should just accept this and move on. I suspect that most users will  
not interact directly with LSID servers, they will use applications  
and portals, and these will serve HTTP URIs which are ideal for  
Semantic Web applications. Efforts to make LSIDs compliant by  
inserting owl:sameAs statements and rewriting rdf:resource attributes  
using a HTTP proxy seem to me to be misguided, if for no other reason  
than one of the strengths of the LSID protocol (no single point of  
failure, other than the DNS) is massively compromised because if the  
HTTP proxy goes down (or if the domain name tdwg.org is sold) links  
between the LSID metadata records will break.

Having a service such as a HTTP proxy that can resolve LSIDs on the  
fly and rewrite the metadata to become HTTP-resolvable is fine, but  
to impose an ugly (and possibly short term) hack on the data  
providers strikes me as unwise. The only reason for attempting this  
is if we think the original LSID record will be used directly by  
Semantic Web applications. I would argue that in reality, such
applications may harvest these records, but they will make them  
available to others as part of a record with a HTTP URI (see Connotea  
example).
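
To make the distinction concrete, here is a minimal Python sketch of rewriting on the fly: the provider's metadata contains only bare LSIDs, and a proxy (or any consuming application) prefixes them when serving the record. The proxy address http://proxy.example.org/ and the ex:relatedTo property are placeholders invented for this example, and the regular expression is a simplification rather than a full LSID parser.

```python
import re

# Matches an LSID embedded in text, stopping at whitespace, quotes
# or angle brackets. Simplified for illustration.
LSID_IN_TEXT = re.compile(r"urn:lsid:[^\s\"'<>]+", re.IGNORECASE)


def rewrite_lsids(metadata, proxy_base):
    """Return a copy of `metadata` with every LSID prefixed by `proxy_base`.

    Intended for metadata that contains bare LSIDs only; applying it
    twice would add the prefix twice.
    """
    return LSID_IN_TEXT.sub(lambda m: proxy_base + m.group(0), metadata)


def restore_lsids(metadata, proxy_base):
    """Undo rewrite_lsids for the same `proxy_base`."""
    return metadata.replace(proxy_base + "urn:lsid:", "urn:lsid:")


# Hypothetical metadata fragment as stored by the data provider.
stored = '<ex:relatedTo rdf:resource="urn:lsid:example.org:names:123"/>'
served = rewrite_lsids(stored, "http://proxy.example.org/")
```

Because the proxy address is a parameter applied at serving time, a change of proxy touches no stored records, whereas links written into the providers' own metadata would all embed the proxy's domain name.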

Conclusions

I think my concerns about LSIDs (and I was an early advocate of  
LSIDs, see doi:10.1186/1471-2105-6-48) stem from trying to marry them  
to the Semantic Web, which seems the obvious technology for
constructing applications to query lots of distributed metadata. But  
I wonder if the mantra of "dereferenceable identifiers" can sometimes
get in the way. ISBNs given to books are not, of themselves,  
dereferenceable, but serve very well as identifiers of books (same  
ISBN, same book), and there are tools that can retrieve metadata  
given an ISBN (e.g., LibraryThing).

In a world of multiple GUIDs for the same thing, and multiple  
applications wanting to talk about the same thing, I think clearly  
separating identifiers from HTTP URIs is useful. For an application  
such as Connotea, a data aggregator such as GBIF, a database like
FreeBase, or a repository like the Internet Archive, HTTP URIs are  
the obvious choice (if I use a Connotea HTTP URI I want Connotea's
data on a particular paper). For GUID providers, there may be other  
issues to consider.

Note that I'm not saying that we can't use HTTP URIs as GUIDs. In  
some, perhaps many cases they may well be the best option as they are  
easy to set up. It's just that I accept that not all GUIDs need be  
HTTP URIs. Given the arguments above, I think the key thing is to  
have stable identifiers for which we can retrieve associated  
metadata. Data providers can focus on providing those, application  
developers can focus on linking them and their associated metadata  
together, and repackaging the results for consumption by the cloud.

----------------------------------------
Professor Roderic D. M. Page
Editor, Systematic Biology
DEEB, IBLS
Graham Kerr Building
University of Glasgow
Glasgow G12 8QP
United Kingdom

Phone: +44 141 330 4778
Fax: +44 141 330 2792
email: r.page at bio.gla.ac.uk
web: http://taxonomy.zoology.gla.ac.uk/rod/rod.html
iChat: aim://rodpage1962
reprints: http://taxonomy.zoology.gla.ac.uk/rod/pubs.html

Subscribe to Systematic Biology through the Society of Systematic
Biologists Website: http://systematicbiology.org
Search for taxon names: http://darwin.zoology.gla.ac.uk/~rpage/portal/
Find out what we know about a species: http://ispecies.org
Rod's rants on phyloinformatics: http://iphylo.blogspot.com
Rod's rants on ants: http://semant.blogspot.com
