Re: random notes on LSIDs
Too many streams to choose from so I will jump in here as well. Bob's comments on wiki vs. mail are well noted.
One really important attribute of LSIDs is that they *are* intended to be persistent enough to outlive the objects that they represent and/or our ability to resolve them, so that, one day, we might know when multiple documents have referenced a single object - digital or otherwise. It is this property that is particularly useful to GBIF, where there is a requirement to identify the unique objects duplicated within a system that is designed, from the outset, to replicate at all costs. But as Dave V. has earlier noted - we may still be confusing our need for LSIDs with our GUID requirement. DiGIR and BioCASE already provide mechanisms for index maintenance, meta-data query, data retrieval etc.
The LSID specification, noted by Thau below, states that as well as being persistent an LSID will always resolve to the *"same set of bytes"* [concrete] or an *empty set* [abstract] (I have assumed that this meant *exactly the same*, but I could be wrong). The implication is that DiGIR, ABCD and HISPID representations of the same object all require unique LSIDs. The specification documents provide examples to illustrate *hierarchical* ways of registering LSIDs for a single object in multiple formats. Versioning is in there to cover changes to an object - which must still be registered - but a change in format could require yet another LSID.
Each LSID references a static object - once one has one ... unless one has more than one, which is also possible.
Nevertheless, there are still other ways in which LSIDs may prove useful to biodiversity informatics. Dave V. has mentioned the possible use of an LSID that resolves to a DiGIR query - a static object that may be used to retrieve from a dynamic data set. And Thau has mentioned the fact that while the object behind an LSID must not change, meta-data for that object may be provided from many sources and in different formats. Perhaps there is scope here for the object behind an LSID to be only that relatively stable component of the contributor data set that will be used to build the global index (or query), with LSID registration and resolution used to simplify index maintenance and with the getMetaData methods providing access to complete records or their access points. Duplicate records registered with abstract LSIDs resolving to a parent LSID...
In herbaria the physical objects are commonly duplicated: supposedly duplicate vouchers from a single collecting event - the primary data source. Is our aim to provide access to the data gathered during that event (with the benefit of taxonomic hindsight) or to the objects that result, or both? How do we identify replicate material? Will an LSID help with resolution of ambiguous events?
greg
On Tue, 2004-09-28 at 04:04, dave thau wrote:
Hello everyone,
Sorry to come into this late - and forgive me if I'm covering trodden ground here, I just re-joined the list and may have missed a few posts.
I've been looking at various GUID systems for SEEK - primarily the Handle System underlying DOI, and LSIDs. It looks like I'll be giving an introduction to GUIDs and focusing on LSIDs at TDWG in New Zealand in two weeks. Judging by the conversation here, I think I'll be keeping my introduction brief to allow maximal time for discussion!
There are a few things about LSIDs that I want to point out. First.... as someone has mentioned, some of my early ramblings on GUIDs can be found here:
http://cvs.ecoinformatics.org/cvs/cvsweb.cgi/seek/projects/taxon/docs/guid/
Some of these files got munged in CVS, but I've fixed them. These files are over six months old, and between then and now I've become more of an LSID fan. So, if you do bother reading that stuff, realize it's old and outdated and incomplete in too many places. I'm going to write a revised document encompassing and expanding all of that and I'll post here when it's completed.
Now for some miscellaneous points about LSIDs
- What happens when an LSID Authority goes away?
As I think Dave V. and others have pointed out, when an LSID is resolved, DNS is used to find the LSID authority. The LSID authority then provides information about how the LSID can be served up (e.g. HTTP, SOAP, FTP), and where to get the data behind the LSID and associated metadata. If I start serving up LSIDs with the authority learningsite.com and later decide that I'm sick of serving up LSIDs, somebody else can take over serving up the data and the metadata. However, I (or they) still bear the responsibility of running the authority which points to the data. If my LSIDs have an authority like lsid.learningsite.com (urn:lsid:lsid.learningsite.com:foo:bar) then someone else can take over the authority by taking over lsid.learningsite.com and I can still have www.learningsite.com, mail.learningsite.com, etc. for myself. So, with a little planning, it's not so hard to deal with an authority going away as long as the people running it are responsible.
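To make the first step concrete, here is a minimal Python sketch of splitting an LSID into its parts and working out which DNS name a resolver would look up. It is not taken from the LSID toolkit: the names Lsid, parse_lsid and srv_query_name are made up for illustration, and the "_lsid._tcp" SRV prefix is my understanding of the resolution spec rather than something stated in this thread. No network call is made.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    """The parts of urn:lsid:authority:namespace:object[:version]."""
    authority: str
    namespace: str
    object_id: str
    version: Optional[str]


def parse_lsid(text: str) -> Lsid:
    """Split an LSID string into its components (hypothetical helper)."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    authority, namespace, object_id = parts[2], parts[3], parts[4]
    if not (authority and namespace and object_id):
        raise ValueError(f"empty component in LSID: {text!r}")
    version = parts[5] if len(parts) == 6 else None
    # DNS names are case-insensitive, so the authority is lowercased here.
    return Lsid(authority.lower(), namespace, object_id, version)


def srv_query_name(lsid: Lsid) -> str:
    """DNS name to query for the authority's SRV record.

    Assumption: the "_lsid._tcp." prefix follows the LSID resolution spec
    as I understand it; check the spec before relying on it.
    """
    return f"_lsid._tcp.{lsid.authority}"


if __name__ == "__main__":
    lsid = parse_lsid("urn:lsid:lsid.learningsite.com:foo:bar")
    print(lsid)
    print(srv_query_name(lsid))
```

Because only the DNS entry for lsid.learningsite.com is consulted, handing that one name to someone else is enough to hand over the authority.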
- data and metadata
With LSIDs there's a big difference between the data and metadata of an LSID - and I think this is going to be the biggest challenge in deciding how to use them in our context. What's the data? What's the metadata? With gene sequences, the datum is the sequence, the metadata are things like contact information, who did the sequencing, taxonomic information about the thing sequenced, etc. There's an LTER site using LSIDs for their data sets. The LSID data is the data set itself, and the metadata is what you'd expect - a description of the data set, who the investigators were, that sort of thing. NCBI has PubMed LSIDs - they're not serving up the articles yet, but there's associated metadata in there. For these things the division between data and metadata is fairly clear. However, what is the data for taxa? What is the metadata?
Here's another interesting thing about data and metadata in LSIDs. When you issue an LSID you're promising that the DATA behind that LSID never changes. Additionally, there's only one authority ultimately responsible for pointing to the data, and that never changes (although as above, someone else can take the authority over). However, for metadata, there are no such promises. Metadata can change. Furthermore, organizations other than the authority can provide metadata, as long as the authority agrees to it and adds them to a list of authorized metadata providers. I don't know if this is such a great idea, but it's in the specification.
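A small Python sketch of that asymmetry may help. It is only an illustration under my own assumptions, not how any LSID server is implemented: the LsidRecord class, its method names, and the use of a SHA-256 digest to check that the bytes have not changed are all made up for the example.

```python
import hashlib
from typing import Dict, Iterable


class LsidRecord:
    """Hypothetical model: fixed data bytes, replaceable metadata."""

    def __init__(self, lsid: str, data: bytes,
                 metadata_providers: Iterable[str] = ()) -> None:
        self.lsid = lsid
        self._data = bytes(data)
        # Digest recorded at issue time; lets a client check "same bytes".
        self.digest = hashlib.sha256(self._data).hexdigest()
        # Providers the authority has agreed to list.
        self.authorized = set(metadata_providers)
        self.metadata: Dict[str, dict] = {}

    @property
    def data(self) -> bytes:
        # Read-only: there is deliberately no setter for the data.
        return self._data

    def set_metadata(self, provider: str, document: dict) -> None:
        """Metadata may change, but only from authorized providers."""
        if provider not in self.authorized:
            raise PermissionError(f"{provider} is not an authorized provider")
        self.metadata[provider] = document

    def matches(self, candidate: bytes) -> bool:
        """True if candidate is byte-for-byte what was issued."""
        return hashlib.sha256(candidate).hexdigest() == self.digest


if __name__ == "__main__":
    rec = LsidRecord("urn:lsid:lsid.learningsite.com:foo:bar",
                     b"ACGT", metadata_providers=["learningsite.com"])
    rec.set_metadata("learningsite.com", {"contact": "first version"})
    rec.set_metadata("learningsite.com", {"contact": "updated"})
    print(rec.matches(b"ACGT"), rec.metadata)
```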
So, what's the data? What's the metadata? This question applies to any GUID system, really - the Handle System has the same issues, but less clearly defined. As an aside - the Handle System is very robust and the fee schedule is probably circumventable. However, I think LSIDs are better suited to the direction biodiversity informatics is taking - using XML-based standards and standard internet protocols to share data.
- client stack versus authority server
The LSID folks provide two batches of code - an authority server, for people who want to serve up LSIDs themselves, and an LSID Client Stack - which can be used by organizations to provide access to their LSIDs and/or proxy LSIDs provided by other organizations. It may make sense for an organization like GBIF to build a service using the Client Stack to support both their own LSIDs and those served by other organizations. The Client Stack has a caching mechanism which supports expiration information from the primary authority, so the primary authority can update where the LSID may be resolved and metadata of that authority.
In this model, GBIF, or someone else, could support both their own LSIDs, and the LSIDs of others. Furthermore, it could choose which authorities it was going to resolve, so people who wanted to be sure to get "just the good stuff" according to GBIF could use the GBIF service. In addition, it could perform the de-duplication service that several people have mentioned - trying to maintain a one LSID per data item mapping.
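The shape of such a service can be sketched in a few lines of Python. This is not the Client Stack's API: CachingResolver, the fetch callback and the idea that the primary authority hands back a time-to-live in seconds are assumptions made for the sketch. It shows the two behaviours described above, namely only resolving chosen authorities and honouring expiry so the primary authority can change where an LSID points.

```python
import time
from typing import Callable, Dict, Iterable, Optional, Tuple

# fetch(lsid) -> (location, seconds the answer may be cached); hypothetical.
Fetch = Callable[[str], Tuple[str, float]]


class CachingResolver:
    """Hypothetical proxy resolver with an authority allow-list."""

    def __init__(self, fetch: Fetch, trusted: Iterable[str],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._trusted = {a.lower() for a in trusted}
        self._clock = clock
        self._cache: Dict[str, Tuple[str, float]] = {}

    def resolve(self, lsid: str) -> Optional[str]:
        parts = lsid.split(":")
        if len(parts) < 5:
            raise ValueError(f"not an LSID: {lsid!r}")
        if parts[2].lower() not in self._trusted:
            return None  # authority not on the service's list
        now = self._clock()
        hit = self._cache.get(lsid)
        if hit is not None and hit[1] > now:
            return hit[0]  # cached answer has not expired yet
        location, ttl = self._fetch(lsid)  # ask the primary authority again
        self._cache[lsid] = (location, now + ttl)
        return location


if __name__ == "__main__":
    def fake_fetch(lsid: str) -> Tuple[str, float]:
        return ("http://data.example.org/" + lsid.split(":")[4], 60.0)

    resolver = CachingResolver(fake_fetch, trusted=["lsid.learningsite.com"])
    print(resolver.resolve("urn:lsid:lsid.learningsite.com:foo:bar"))
    print(resolver.resolve("urn:lsid:untrusted.example.net:foo:bar"))
```

De-duplication would be a further layer on top (mapping several LSIDs to one preferred LSID) and is left out here.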
- lsid namespaces and file formats
I don't think the namespace part of the LSID (urn:lsid:authority:namespace:object:version) is intended to be semantically loaded except for the relevant lsid authority. There is a way in the metadata to state what format the data comes in. It's not the traditional text/javascript mime-type tag - instead the format is another LSID! For example the FASTA protein sequence file format is: urn:lsid:i3c.org:formats:fasta. Clients that understand LSIDs, like the Launchpad application, can be set to attach applications to LSID formats so that clicking on an LSID with a given format opens up an appropriate application.
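Attaching applications to format LSIDs amounts to a lookup table keyed by the format LSID. The Python below is a rough sketch of that idea, not how Launchpad works internally: the register decorator, the handler table and the toy FASTA handler are all invented for illustration, and only the format LSID itself comes from the example above.

```python
from typing import Callable, Dict

Handler = Callable[[bytes], str]

# Hypothetical table mapping a format LSID to the code that opens it.
FORMAT_HANDLERS: Dict[str, Handler] = {}


def register(format_lsid: str) -> Callable[[Handler], Handler]:
    """Decorator that attaches a handler to a format LSID."""
    def decorator(handler: Handler) -> Handler:
        FORMAT_HANDLERS[format_lsid] = handler
        return handler
    return decorator


@register("urn:lsid:i3c.org:formats:fasta")
def open_fasta(data: bytes) -> str:
    # Toy handler: count the '>' header lines in a FASTA file.
    count = sum(1 for line in data.splitlines() if line.startswith(b">"))
    return f"FASTA file with {count} sequence(s)"


def open_with_format(format_lsid: str, data: bytes) -> str:
    """Dispatch on the format LSID found in an object's metadata."""
    handler = FORMAT_HANDLERS.get(format_lsid)
    if handler is None:
        raise LookupError(f"no application attached to {format_lsid}")
    return handler(data)


if __name__ == "__main__":
    sample = b">p1\nMKV\n>p2\nAAA\n"
    print(open_with_format("urn:lsid:i3c.org:formats:fasta", sample))
```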
Sorry to be so scattershot - it's hard to come into the middle of a huge topic like this. I'm glad to see all this discussion - it's going to make working on my talk much easier (I think).
Dave