[tdwg-guid] First step in implementing LSIDs
r.page at bio.gla.ac.uk
Sat Jun 9 15:01:29 CEST 2007
Thinking a bit more about this issue, I've decided that perhaps it
doesn't matter quite as much as I thought. I've posted my argument on
my blog (http://iphylo.blogspot.com/2007/06/rethinking-lsids-versus-http-uri.html),
partly because it has links and a picture. The text
is pasted below for the sake of archiving it in this list.
The TDWG-GUID mailing list for this month has a discussion of whether
TDWG should commit to LSIDs as the GUID of choice. Since the first
GUID workshop TDWG has pretty much been going down this route,
despite a growing chorus of voices (including mine) that LSIDs are
not first class citizens of the Web, and don't play well with the
Semantic Web.
Leaving aside political considerations (this stuff needs to be
implemented as soon as possible, concerns that if TDWG advocates HTTP
URIs people will just treat them as URLs and miss the significance of
persistence and RDF, worries that biodiversity will be ghettoised if
it doesn't conform to what is going on elsewhere), I think there is a
way to resolve this that may keep most people happy (or at least,
they could live with it). My perspective is driven by trying to
separate needs of primary data providers from application developers,
and issues of digital preservation.
I'll try and spell out the argument below, but to cut to the chase, I
argue that:
1. A GUID system needs to provide a globally unique identifier
for an object, and a means of retrieving information about that object.
2. Any of the current technologies we've discussed (LSIDs, DOIs,
Handles) do this (to varying degrees), hence any would do as a GUID.
3. Most applications that use these GUIDs will use Semantic Web
tools, and hence will use HTTP URIs.
4. These HTTP URIs will be unique to the application; the GUIDs,
however, will be shared.
5. No third party application can serve an HTTP URI that doesn't
belong to its domain.
6. Digital preservation will rely on widely distributed copies of
data; these cannot have the same HTTP URI.
From this I think that both parties to this debate are right, and we
will end up using both LSIDs and HTTP URIs, and that's OK.
Application developers will use HTTP URIs, but will use clients that
can handle the various kinds of GUIDs. Data providers will use the
GUID technology that is easiest for them to get up and running (for
specimens this is likely to be LSIDs, for literature some providers
may use Handles via DSpace, some may use URLs).
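As a rough illustration of what such a client might look like, the
Python sketch below classifies a few GUID syntaxes and, where a public
HTTP resolver exists, maps the GUID to a URI. The resolver prefixes
(doi.org, hdl.handle.net), the function names, and the example LSID
are assumptions made for illustration, not anything TDWG has agreed
on. LSIDs are only parsed here, since actually resolving one needs a
dedicated LSID client.

```python
from typing import Optional
from urllib.parse import quote

# Assumed public resolver prefixes; replace with whichever resolvers you trust.
HTTP_RESOLVERS = {
    "doi": "https://doi.org/",
    "hdl": "https://hdl.handle.net/",
}

# (prefix, kind) pairs, checked in order; matching is case-insensitive.
PREFIXES = [
    ("urn:lsid:", "lsid"),
    ("info:doi/", "doi"),
    ("info:pmid/", "pmid"),
    ("doi:", "doi"),
    ("hdl:", "hdl"),
]


def classify_guid(guid: str) -> tuple:
    """Return (kind, local_part) for a GUID string."""
    lowered = guid.lower()
    for prefix, kind in PREFIXES:
        if lowered.startswith(prefix):
            return kind, guid[len(prefix):]
    if lowered.startswith(("http://", "https://")):
        return "url", guid
    raise ValueError(f"unrecognised GUID syntax: {guid!r}")


def parse_lsid(guid: str) -> dict:
    """Split an LSID into authority, namespace, object and optional revision."""
    kind, local = classify_guid(guid)
    if kind != "lsid":
        raise ValueError(f"not an LSID: {guid!r}")
    parts = local.split(":")
    if len(parts) not in (3, 4):
        raise ValueError(f"malformed LSID: {guid!r}")
    return {
        "authority": parts[0],
        "namespace": parts[1],
        "object": parts[2],
        "revision": parts[3] if len(parts) == 4 else None,
    }


def to_http_uri(guid: str) -> Optional[str]:
    """Return an HTTP URI for the GUID, or None if a dedicated client is needed."""
    kind, local = classify_guid(guid)
    if kind == "url":
        return local
    prefix = HTTP_RESOLVERS.get(kind)
    if prefix is None:
        return None  # LSIDs and PubMed IDs have no resolver in this sketch
    return prefix + quote(local, safe="/")
```

For example, to_http_uri("hdl:2246/5676") gives
https://hdl.handle.net/2246/5676, while a hypothetical LSID such as
urn:lsid:example.org:specimens:12345 yields None and would be handed
to an LSID client instead.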
Individual objects get GUIDs
If individual objects get GUIDs, then this has implications for HTTP
URIs. If the HTTP URI is the GUID, an object can only be served from
one place. It may be cached elsewhere, but that cached copy can't
have the same HTTP URI. Any database that makes use of the HTTP URI
cannot serve that HTTP URI itself; it needs to refer to it in some
way. This being the case, whether the GUID is a HTTP URI or not
starts to look a lot less important, because there is only one place
we can get the original data from -- the original data provider. Any
application that builds on this data will need its own identifier if
people are going to make use of that application's output.
Connotea as an example
As a concrete example, consider Connotea. This application uses
dereferenceable GUIDs such as DOIs and PubMed IDs to retrieve
publications. DOIs and PubMed IDs are not HTTP URIs, and hence aren't
first class citizens of the Web. But Connotea serves its own records
as HTTP URIs, and URIs with the prefix "rss" return RDF (like this)
and hence can be used "as is" by Semantic Web tools such as SPARQL.
If we look at some Connotea RDF, we see that it contains the original
DOIs and PubMed IDs.
This means that if two Connotea users bookmark the same paper, we
could deduce that they are the same paper by comparing the embedded
GUIDs. In the same way, we could combine RDF from Connotea and
another application (such as bioGUID) that has information on the
same paper. Why not use the original GUIDs? Well, for starters there
are two of them (info:pmid/17079492 and
info:doi/10.1073/pnas.0605858103), so which to use? Secondly, they aren't
HTTP URIs, and if they were we'd go straight to CrossRef or NCBI, not
Connotea. Lastly, we lose the important information that the bookmarks
are different -- they were made by two different people (or agents).
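To make the "compare the embedded GUIDs" step concrete, here is a
small Python sketch. The bookmark URIs under example.org and the
record layout are invented for illustration (they are not Connotea's
actual URIs or RDF), and the fourth DOI is a placeholder. Each
bookmark keeps its own HTTP URI, and bookmarks are grouped into the
same paper whenever they share a GUID, directly or through another
bookmark, using a simple union-find. It assumes every bookmark
carries at least one GUID.

```python
from collections import defaultdict

# Hypothetical bookmark records: each has its own HTTP URI plus embedded GUIDs.
BOOKMARKS = [
    {"uri": "http://example.org/user/alice/bookmark/1",
     "guids": {"info:doi/10.1073/pnas.0605858103"}},
    {"uri": "http://example.org/user/bob/bookmark/7",
     "guids": {"info:pmid/17079492", "info:doi/10.1073/pnas.0605858103"}},
    {"uri": "http://example.org/user/carol/bookmark/3",
     "guids": {"info:pmid/17079492"}},
    {"uri": "http://example.org/user/dave/bookmark/9",
     "guids": {"info:doi/10.9999/placeholder"}},  # placeholder DOI
]


def group_by_shared_guid(bookmarks):
    """Group bookmark URIs whose GUID sets are linked, directly or transitively."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # GUIDs that appear together on one bookmark identify the same paper.
    for bookmark in bookmarks:
        guids = sorted(bookmark["guids"])
        for other in guids[1:]:
            union(guids[0], other)

    # Collect bookmark URIs under the representative GUID of their group.
    groups = defaultdict(list)
    for bookmark in bookmarks:
        root = find(sorted(bookmark["guids"])[0])
        groups[root].append(bookmark["uri"])
    return sorted(sorted(uris) for uris in groups.values())
```

Here the first three bookmarks end up in one group even though the
first and third share no GUID directly: the second bookmark carries
both the DOI and the PubMed ID and so links them. The three bookmark
URIs remain distinct, which preserves the fact that different people
made them.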
Applications will use HTTP URIs
We want to go to Connotea (and Connotea wants us to go to it) because
it gives us additional information, such as the tags added by users.
Likewise, bioGUID adds links to sequences referred to in the paper.
Web applications that build on GUIDs want to add value, and need to
add value partly because the quality of the original data may suck.
For example, metadata provided by CrossRef is limited, DiGIR
providers manage to mangle even basic things like dates, and in my
experience many records provided by DiGIR sources that lack
geocoordinates have, in fact, been georeferenced (based on reading
papers about those specimens). The metadata associated with Handles
is often appallingly bad, and don't get me started on what utter
gibberish GenBank has in its specimen voucher fields.
Hence, applications will want to edit much of this data to correct
and improve it, and to make that edited version available they will
need their own identifiers, i.e. HTTP URIs. This ranges from social
bookmarking tools like Connotea, to massive databases like FreeBase.
Digital preservation is also relevant. How do we ensure our digital
records are durable? Well, we can't ensure this (see Clay Shirky's
talk at LongNow), but one way to make them more durable is massive
redundancy -- multiple copies in many places. Indeed, given the
limited functionality of the current GBIF portal, I would argue that
GBIF's main role at present is to make specimen data more durable.
DiGIR providers are not online 24/7, but if their data are in GBIF
those data are still available. Of course, GBIF could not use the
same GUID as the URI for that data; like Connotea, it would have to
store the original GUID in the GBIF copy of the record.
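A minimal Python sketch of that arrangement follows. Everything in it
is hypothetical (the Aggregator class, the aggregator.example.org
base URI, the example LSID and the record fields); it is not how GBIF
is implemented. It only shows a cached copy being served under the
aggregator's own URI while carrying the provider's GUID, so the copy
can still be found by GUID when the provider is offline.

```python
from dataclasses import dataclass


@dataclass
class Record:
    uri: str          # where this copy is served from (the aggregator's own URI)
    source_guid: str  # GUID assigned by the original data provider
    data: dict


class Aggregator:
    """Hypothetical aggregator that caches provider records under its own URIs."""

    def __init__(self, base_uri: str):
        self.base_uri = base_uri.rstrip("/")
        self._by_guid = {}

    def harvest(self, guid: str, data: dict) -> Record:
        # Re-harvesting the same GUID keeps the aggregator URI stable.
        existing = self._by_guid.get(guid)
        uri = existing.uri if existing else (
            f"{self.base_uri}/record/{len(self._by_guid) + 1}"
        )
        record = Record(uri=uri, source_guid=guid, data=dict(data))
        self._by_guid[guid] = record
        return record

    def find(self, guid: str):
        return self._by_guid.get(guid)


def fetch(guid: str, provider: dict, provider_online: bool, aggregator: Aggregator):
    """Return (served_from, data): the provider if online, else the cached copy."""
    if provider_online and guid in provider:
        return guid, provider[guid]
    cached = aggregator.find(guid)
    if cached is None:
        raise LookupError(f"no copy available for {guid}")
    return cached.uri, cached.data
```

The design point is simply that the lookup key is the shared GUID,
while the thing returned always says which HTTP URI actually served
it.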
In the same way, the taxonomic literature of ants is unlikely to
disappear anytime soon, because a single paper can be in multiple
places. For example, Engel et al.'s paper on ants in Cretaceous Amber
is available in at least four places:
* BioOne (doi:10.1206/0003-0082(2005)485[0001:PNAICA]2.0.CO;2)
* AMNH DSpace (hdl:2246/5676)
* AntBase (http://antbase.org/ants/publications/20967/20967.pdf)
* Internet Archive (http://www.archive.org/details/ants_20967)
Which of the four HTTP URIs you can click on should be the GUID for
this paper? -- none of them.
LSIDs and the Semantic Web
LSIDs don't play well with the Semantic Web. My feeling is that we
should just accept this and move on. I suspect that most users will
not interact directly with LSID servers, they will use applications
and portals, and these will serve HTTP URIs which are ideal for
Semantic Web applications. Efforts to make LSIDs compliant by
inserting owl:sameAs statements and rewriting rdf:resource attributes
using a HTTP proxy seem to me to be misguided, if for no other reason
than one of the strengths of the LSID protocol (no single point of
failure, other than the DNS) is massively compromised because if the
HTTP proxy goes down (or if the domain name tdwg.org is sold) links
between the LSID metadata records will break.
Having a service such as a HTTP proxy that can resolve LSIDs on the
fly and rewrite the metadata to become HTTP-resolvable is fine, but
to impose an ugly (and possibly short term) hack on the data
providers strikes me as unwise. The only reason for attempting this
is if we think the original LSID record will be used directly by
Semantic web applications. I would argue that in reality, such
applications may harvest these records, but they will make them
available to others as part of a record with a HTTP URI (see the Connotea
example above).
I think my concerns about LSIDs (and I was an early advocate of
LSIDs, see doi:10.1186/1471-2105-6-48) stem from trying to marry them
to the Semantic web, which seems the obvious technology for
constructing applications to query lots of distributed metadata. But
I wonder if the mantra of "dereferenceable identifiers" can sometimes
get in the way. ISBNs given to books are not, of themselves,
dereferenceable, but serve very well as identifiers of books (same
ISBN, same book), and there are tools that can retrieve metadata
given an ISBN (e.g., LibraryThing).
In a world of multiple GUIDs for the same thing, and multiple
applications wanting to talk about the same thing, I think clearly
separating identifiers from HTTP URIs is useful. For an application
such as Connotea, a data aggregator such as GBIF, a database like
FreeBase, or a repository like the Internet Archive, HTTP URIs are
the obvious choice (If I use a Connotea HTTP URI I want Connotea's
data on a particular paper). For GUID providers, there may be other
issues to consider.
Note that I'm not saying that we can't use HTTP URIs as GUIDs. In
some, perhaps many cases they may well be the best option as they are
easy to set up. It's just that I accept that not all GUIDs need be
HTTP URIs. Given the arguments above, I think the key thing is to
have stable identifiers for which we can retrieve associated
metadata. Data providers can focus on providing those, application
developers can focus on linking them and their associated metadata
together, and repackaging the results for consumption by the cloud.
Professor Roderic D. M. Page
Editor, Systematic Biology
Graham Kerr Building
University of Glasgow
Glasgow G12 8QP
Phone: +44 141 330 4778
Fax: +44 141 330 2792
email: r.page at bio.gla.ac.uk
Subscribe to Systematic Biology through the Society of Systematic
Biologists Website: http://systematicbiology.org
Search for taxon names: http://darwin.zoology.gla.ac.uk/~rpage/portal/
Find out what we know about a species: http://ispecies.org
Rod's rants on phyloinformatics: http://iphylo.blogspot.com
Rod's rants on ants: http://semant.blogspot.com