Hello everyone,
Sorry to come into this late - and forgive me if I'm covering trodden ground here, I just re-joined the list and may have missed a few posts.
I've been looking at various GUID systems for SEEK - primarily the Handle System underlying DOI, and LSIDs. It looks like I'll be giving an introduction to GUIDs and focusing on LSIDs at TDWG in New Zealand in two weeks. Judging by the conversation here, I think I'll be keeping my introduction brief to allow maximal time for discussion!
There are a few things about LSIDs that I want to point out. First, as someone has mentioned, some of my early ramblings on GUIDs can be found here:
http://cvs.ecoinformatics.org/cvs/cvsweb.cgi/seek/projects/taxon/docs/guid/
Some of these files got munged in CVS, but I've fixed them. These files are over six months old, and between then and now I've become more of an LSID fan. So, if you do bother reading that stuff, realize it's old and outdated and incomplete in too many places. I'm going to write a revised document encompassing and expanding all of that and I'll post here when it's completed.
Now for some miscellaneous points about LSIDs:
* What happens when an LSID Authority goes away?
As I think Dave V. and others have pointed out, when an LSID is resolved, DNS is used to find the LSID authority. The LSID authority then provides information about how the LSID can be served up (e.g. HTTP, SOAP, FTP), and where to get the data behind the LSID and associated metadata. If I start serving up LSIDs with the authority learningsite.com and later decide that I'm sick of serving up LSIDs, somebody else can take over serving up the data and the metadata. However, I (or they) still bear the responsibility of running the authority which points to the data. If my LSIDs have an authority like lsid.learningsite.com (urn:lsid:lsid.learningsite.com:foo:bar) then someone else can take over the authority by taking over lsid.learningsite.com, and I can still have www.learningsite.com, mail.learningsite.com, etc. for myself. So, with a little planning, it's not so hard to deal with an authority going away, as long as the people running it are responsible.
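To make the resolution step concrete, here is a minimal Python sketch that splits an LSID into its parts and derives the DNS name a resolver would query. It assumes the SRV-record convention (`_lsid._tcp.<authority>`) as described in the LSID specification; the names `Lsid`, `parse_lsid` and `srv_query_name` are hypothetical, and no network lookup is performed.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    version: Optional[str]


def parse_lsid(text: str) -> Lsid:
    """Split urn:lsid:authority:namespace:object[:version] into its parts."""
    parts = text.split(":")
    if len(parts) not in (5, 6) or any(p == "" for p in parts[:5]):
        raise ValueError(f"not an LSID: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    version = parts[5] if len(parts) == 6 else None
    # DNS names are case-insensitive, so the authority is normalised here.
    return Lsid(parts[2].lower(), parts[3], parts[4], version)


def srv_query_name(lsid: Lsid) -> str:
    # Assumption: the authority is located through a DNS SRV record
    # published under _lsid._tcp.<authority>.
    return f"_lsid._tcp.{lsid.authority}"


lsid = parse_lsid("urn:lsid:lsid.learningsite.com:foo:bar")
print(lsid.authority)
print(srv_query_name(lsid))
```

Because only the authority part is looked up in DNS, handing lsid.learningsite.com to someone else moves the authority without touching the rest of the domain.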
* data and metadata
With LSIDs there's a big difference between the data and metadata of an LSID - and I think this is going to be the biggest challenge in deciding how to use them in our context. What's the data? What's the metadata? With gene sequences, the datum is the sequence, and the metadata are things like contact information, who did the sequencing, taxonomic information about the thing sequenced, etc. There's an LTER site using LSIDs for their data sets. The LSID data is the data set itself, and the metadata is what you'd expect - a description of the data set, who the investigators were, that sort of thing. NCBI has PubMed LSIDs - they're not serving up the articles yet, but there's associated metadata in there. For these things the division between data and metadata is fairly clear. However, what is the data for taxa? What is the metadata?
Here's another interesting thing about data and metadata in LSIDs. When you issue an LSID you're promising that the DATA behind that LSID never changes. Additionally, there's only one authority ultimately responsible for pointing to the data, and that never changes (although as above, someone else can take the authority over). However, for metadata, there are no such promises. Metadata can change. Furthermore, organizations other than the authority can provide metadata, as long as the authority agrees to it and adds them to a list of authorized metadata providers. I don't know if this is such a great idea, but it's in the specification.
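The difference between the two promises can be illustrated with a toy in-memory store, sketched below in Python. This is only an illustration of the rule described above, not how any LSID server is implemented; the class `LsidRecordStore`, its methods, and the example LSID and contact values are all hypothetical.

```python
from typing import Dict


class LsidRecordStore:
    """Toy store: data is write-once per LSID, metadata is replaceable."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}
        self._metadata: Dict[str, Dict[str, str]] = {}

    def publish(self, lsid: str, data: bytes) -> None:
        # Re-publishing identical bytes is harmless; different bytes would
        # break the promise that the data behind an LSID never changes.
        if lsid in self._data and self._data[lsid] != data:
            raise ValueError(f"data for {lsid} is already fixed")
        self._data[lsid] = data

    def set_metadata(self, lsid: str, metadata: Dict[str, str]) -> None:
        # Metadata carries no such promise, so it is simply replaced.
        if lsid not in self._data:
            raise KeyError(lsid)
        self._metadata[lsid] = dict(metadata)

    def get_data(self, lsid: str) -> bytes:
        return self._data[lsid]

    def get_metadata(self, lsid: str) -> Dict[str, str]:
        return dict(self._metadata.get(lsid, {}))


store = LsidRecordStore()
store.publish("urn:lsid:example.org:sequences:1", b"ACGT")
store.set_metadata("urn:lsid:example.org:sequences:1", {"contact": "a@example.org"})
store.set_metadata("urn:lsid:example.org:sequences:1", {"contact": "b@example.org"})
print(store.get_metadata("urn:lsid:example.org:sequences:1"))
```

Under this rule, corrected or updated data would have to be issued under a new LSID (for example with a different version part) rather than replacing what the old one points to.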
So, what's the data? What's the metadata? This question applies to any GUID system, really - the Handle System has the same issues, but less clearly defined. As an aside - the Handle System is very robust and the fee schedule is probably circumventable. However, I think LSIDs are better suited to the direction biodiversity informatics is taking - using XML-based standards and standard internet protocols to share data.
* client stack versus authority server
The LSID folks provide two batches of code - an authority server, for people who want to serve up LSIDs themselves, and an LSID Client Stack - which can be used by organizations to provide access to their LSIDs and/or proxy LSIDs provided by other organizations. It may make sense for an organization like GBIF to build a service using the Client Stack to support both their own LSIDs and those served by other organizations. The Client Stack has a caching mechanism which supports expiration information from the primary authority, so the primary authority can still update where the LSID may be resolved and the metadata of that authority.
In this model, GBIF, or someone else, could support both their own LSIDs and the LSIDs of others. Furthermore, it could choose which authorities it was going to resolve, so people who wanted to be sure to get "just the good stuff" according to GBIF could use the GBIF service. In addition, it could perform the de-duplication service that several people have mentioned - trying to maintain a one-LSID-per-data-item mapping.
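The proxy model described here (a chosen list of authorities plus a cache that honours expiration times) can be sketched in a few lines of Python. This is a rough illustration under stated assumptions and does not use the actual LSID Client Stack API: `CachingResolver` is a hypothetical class, and `fetch` is a stand-in for asking the primary authority, assumed to return a location and a time-to-live in seconds.

```python
import time
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical stand-in for a call to the primary authority:
# takes an LSID, returns (location, seconds the answer stays valid).
Fetch = Callable[[str], Tuple[str, float]]


class CachingResolver:
    def __init__(self, trusted_authorities: Iterable[str], fetch: Fetch,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._trusted = {a.lower() for a in trusted_authorities}
        self._fetch = fetch
        self._clock = clock
        self._cache: Dict[str, Tuple[str, float]] = {}

    def resolve(self, lsid: str) -> str:
        parts = lsid.split(":")
        if len(parts) < 5:
            raise ValueError(f"not an LSID: {lsid!r}")
        # Only resolve LSIDs from authorities this service has chosen.
        if parts[2].lower() not in self._trusted:
            raise PermissionError(f"authority not served here: {parts[2]}")
        now = self._clock()
        cached = self._cache.get(lsid)
        if cached is not None and cached[1] > now:
            return cached[0]
        # Cache miss or expired entry: ask the primary authority again,
        # so it can change where the LSID resolves.
        location, ttl = self._fetch(lsid)
        self._cache[lsid] = (location, now + ttl)
        return location


resolver = CachingResolver(
    ["lsid.learningsite.com"],
    fetch=lambda lsid: ("http://data.example.org/" + lsid.split(":")[4], 3600.0),
)
print(resolver.resolve("urn:lsid:lsid.learningsite.com:foo:bar"))
```

A de-duplication service would need an additional mapping from equivalent LSIDs to one preferred LSID, which is left out here.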
* lsid namespaces and file formats
I don't think the namespace part of the LSID (urn:lsid:authority:namespace:object:version) is intended to be semantically loaded except for the relevant LSID authority. There is a way in the metadata to state what format the data comes in. It's not the traditional MIME-type tag (like text/javascript) - instead the format is another LSID! For example, the FASTA protein sequence file format is: urn:lsid:i3c.org:formats:fasta. Clients that understand LSIDs, like the Launchpad application, can be set to attach applications to LSID formats so that clicking on an LSID with a given format opens up an appropriate application.
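The idea of attaching applications to format LSIDs amounts to a lookup table keyed by the format's LSID. The Python sketch below is only an illustration: real LSID metadata is not a plain dictionary, the `"format"` key and the application name `"sequence-viewer"` are hypothetical, and this is not how Launchpad is configured.

```python
from typing import Dict, Optional

# Hypothetical client-side table: format LSID -> application to launch.
FORMAT_HANDLERS: Dict[str, str] = {
    "urn:lsid:i3c.org:formats:fasta": "sequence-viewer",
}


def application_for(metadata: Dict[str, str]) -> Optional[str]:
    """Return the application registered for the data's format, if any."""
    # Assumption: the metadata has been flattened to a dict whose
    # "format" entry holds the format LSID.
    format_lsid = metadata.get("format")
    if format_lsid is None:
        return None
    return FORMAT_HANDLERS.get(format_lsid)


print(application_for({"format": "urn:lsid:i3c.org:formats:fasta"}))
print(application_for({"format": "urn:lsid:example.org:formats:unknown"}))
```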
Sorry to be so scattershot - it's hard to come into the middle of a huge topic like this. I'm glad to see all this discussion - it's going to make working on my talk much easier (I think).
Dave