random notes on LSIDs
thau at LEARNINGSITE.COM
Mon Sep 27 11:04:38 CEST 2004
Sorry to come into this late - and forgive me if I'm covering trodden
ground here, I just re-joined the list and may have missed a few posts.
I've been looking at various GUID systems for SEEK - primarily the Handle
System underlying DOI, and LSIDs. It looks like I'll be giving an
introduction to GUIDs and focusing on LSIDs at TDWG in New Zealand in two
weeks. Judging by the conversation here, I think I'll be keeping my
introduction brief to allow maximal time for discussion!
There are a few things about LSIDs that I want to point out. First, as
someone has mentioned, some of my early ramblings on GUIDs can be found
Some of these files got munged in CVS, but I've fixed them. These files
are over six months old, and between then and now I've become more of an
LSID fan. So, if you do bother reading that stuff, realize it's old and
outdated and incomplete in too many places. I'm going to write a revised
document encompassing and expanding all of that and I'll post here when
Now for some miscellaneous points about LSIDs:
* What happens when an LSID Authority goes away?
As I think Dave V. and others have pointed out, when an LSID is resolved,
DNS is used to find the LSID authority. The LSID authority then provides
information about how the LSID can be served up (e.g. HTTP, SOAP, FTP),
and where to get the data behind the LSID and associated metadata. If I
start serving up LSIDs with the authority learningsite.com and later
decide that I'm sick of serving up LSIDs, somebody else can take over
serving up the data and the metadata. However, I (or they) still bear the
responsibility of running the authority which points to the data. If my
LSIDs have an authority like lsid.learningsite.com
(urn:lsid:lsid.learningsite.com:foo:bar) then someone else can take over
the authority by taking over lsid.learningsite.com and I can still have
www.learningsite.com, mail.learningsite.com, etc... for myself. So, with a
little planning, it's not so hard to deal with an authority going away as
long as the people running it are responsible.
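To make the authority-handover point concrete, here is a minimal sketch of pulling the authority out of an LSID and forming the DNS name a resolver would look up. It assumes the SRV-record convention (`_lsid._tcp.<authority>`) as I understand it from the LSID resolution documents; the names `Lsid`, `parse_lsid` and `srv_query_name` are made up for illustration and are not part of any LSID toolkit. No DNS query is actually performed.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    """Parts of an LSID (hypothetical helper type, not from the LSID toolkit)."""
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(text: str) -> Lsid:
    """Split urn:lsid:authority:namespace:object[:revision] into its parts."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    if any(part == "" for part in parts[2:5]):
        raise ValueError(f"empty component in LSID: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    # DNS names are case-insensitive, so normalise the authority.
    return Lsid(parts[2].lower(), parts[3], parts[4], revision)


def srv_query_name(lsid: Lsid) -> str:
    """DNS name to query for the SRV record locating the authority service.

    Assumption: the _lsid._tcp service label; check the LSID spec before
    relying on it.
    """
    return f"_lsid._tcp.{lsid.authority}"


lsid = parse_lsid("urn:lsid:lsid.learningsite.com:foo:bar")
print(srv_query_name(lsid))
```

Because only lsid.learningsite.com appears in the lookup, whoever controls that one DNS name controls the authority, which is why a dedicated subdomain makes a later handover straightforward.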
* data and metadata
With LSIDs there's a big difference between the data and metadata of an
LSID - and I think this is going to be the biggest challenge in deciding
how to use them in our context. What's the data? What's the metadata?
With gene sequences, the datum is the sequence, the metadata are things
like contact information, who did the sequencing, taxonomic information
about the thing sequenced, etc. There's an LTER site using LSIDs for
their data sets. The LSID data is the data set itself, and the metadata
is what you'd expect - a description of the data set, who the
investigators were, that sort of thing. NCBI has pubmed LSIDs - they're
not serving up the articles yet, but there's associated metadata in there.
For these things the division between data and metadata is fairly clear.
However, what is the data for taxa? What is the metadata?
Here's another interesting thing about data and metadata in LSIDs. When
you issue an LSID you're promising that the DATA behind that LSID never
changes. Additionally, there's only one authority ultimately responsible
for pointing to the data, and that never changes (although as above,
someone else can take the authority over). However, for metadata, there
are no such promises. Metadata can change. Furthermore, organizations
other than the authority can provide metadata, as long as the authority
agrees to it and adds them to a list of authorized metadata providers. I
don't know if this is such a great idea, but it's in the specification.
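The two promises can be sketched as a small data structure: the data bytes are fixed at creation, while metadata may be changed by the authority or by providers the authority has explicitly authorized. This is only an illustration of the contract described above, not how the LSID server code does it; `LsidRecord` and its methods are hypothetical names.

```python
class LsidRecord:
    """Illustrative model: immutable data, mutable metadata (hypothetical class)."""

    def __init__(self, lsid: str, authority: str, data: bytes):
        self.lsid = lsid
        self.authority = authority
        self._data = bytes(data)          # fixed for the life of the LSID
        self.metadata = {}                # free to change over time
        self._providers = {authority}     # who may supply metadata

    @property
    def data(self) -> bytes:
        # Read-only property: there is deliberately no setter.
        return self._data

    def authorize_provider(self, provider: str) -> None:
        """The authority adds a third party to its list of metadata providers."""
        self._providers.add(provider)

    def update_metadata(self, provider: str, key: str, value: str) -> None:
        """Accept metadata only from the authority or an authorized provider."""
        if provider not in self._providers:
            raise PermissionError(f"{provider} is not an authorized metadata provider")
        self.metadata[key] = value


record = LsidRecord("urn:lsid:lsid.learningsite.com:foo:bar",
                    "lsid.learningsite.com", b"ACGT")
record.update_metadata("lsid.learningsite.com", "contact", "someone")
record.authorize_provider("other.example.org")
record.update_metadata("other.example.org", "note", "added later")
```

Trying `record.data = b"..."` raises an AttributeError, which is the code-level version of "the data never changes".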
So, what's the data? What's the metadata? This question applies to any
GUID system, really - the Handle System has the same issues, but less
clearly defined. As an aside - the Handle System is very robust and the
fee schedule is probably circumventable. However, I think LSIDs are
better suited to the direction biodiversity informatics is taking - using
XML-based standards and standard internet protocols to share data.
* client stack versus authority server
The LSID folks provide two batches of code - an authority server, for
people who want to serve up LSIDs themselves, and an LSID Client stack
- which can be used by organizations to provide access to their LSIDs
and/or proxy LSIDs provided by other organizations. It may make sense for
an organization like GBIF to build a service using the Client Stack to
support both their own LSIDs and those served by other organizations. The
Client Stack has a caching mechanism that honors expiration information
from the primary authority, so the primary authority can still update both
where the LSID may be resolved and the metadata it serves.
In this model, GBIF, or someone else, could support both their own LSIDs,
and the LSIDs of others. Furthermore, it could choose which authorities
it was going to resolve, so people who wanted to be sure to get "just the
good stuff" according to GBIF could use the GBIF service. In addition, it
could perform the de-duplication service that several people have
mentioned - trying to maintain a one-LSID-per-data-item mapping.
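A rough sketch of that proxy model follows: a resolver that only answers for authorities it has chosen to trust, and that caches answers until the expiration supplied by the primary authority passes. This is not the Client Stack API. `CachingResolver` is a hypothetical name, and `fetch` is an assumed callable standing in for the real lookup against the primary authority, returning the answer together with a time-to-live in seconds.

```python
import time
from typing import Callable, Dict, Set, Tuple


class CachingResolver:
    """Hypothetical proxy: trusted-authority filter plus expiring cache."""

    def __init__(self,
                 fetch: Callable[[str], Tuple[dict, float]],
                 trusted_authorities: Set[str],
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch                  # asks the primary authority
        self._trusted = {a.lower() for a in trusted_authorities}
        self._clock = clock
        self._cache: Dict[str, Tuple[float, dict]] = {}

    def resolve(self, lsid: str) -> dict:
        parts = lsid.split(":")
        if len(parts) < 5:
            raise ValueError(f"not an LSID: {lsid!r}")
        authority = parts[2].lower()
        if authority not in self._trusted:
            raise LookupError(f"authority {authority} is not served here")

        now = self._clock()
        cached = self._cache.get(lsid)
        if cached is not None and cached[0] > now:
            return cached[1]                 # still fresh, skip the authority

        answer, ttl_seconds = self._fetch(lsid)
        self._cache[lsid] = (now + ttl_seconds, answer)
        return answer


def demo_fetch(lsid: str) -> Tuple[dict, float]:
    # Stand-in for a real lookup; the endpoint is a made-up value.
    return {"lsid": lsid, "endpoint": "http://data.example.org/"}, 3600.0


resolver = CachingResolver(demo_fetch, {"lsid.learningsite.com"})
print(resolver.resolve("urn:lsid:lsid.learningsite.com:foo:bar"))
```

Once an entry expires, the next request goes back to the primary authority, so a change in where the data lives propagates to the proxy without anyone having to flush it by hand.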
* LSID namespaces and file formats
I don't think the namespace part of the LSID
(urn:lsid:authority:namespace:object:version) is intended to be
semantically loaded for anyone except the relevant LSID authority. There
is a way in the metadata to state what format the data comes in. The
format is not encoded in the data's LSID! Formats get LSIDs of their own:
for example, the FASTA protein sequence file format is
urn:lsid:i3c.org:formats:fasta. Clients that understand LSIDs, like the
Launchpad application, can be set to attach applications to LSID formats
so that clicking on an LSID with a given format opens up an appropriate
Sorry to be so scattershot - it's hard to come into the middle of a huge
topic like this. I'm glad to see all this discussion - it's going to make
working on my talk much easier (I think