random notes on LSIDs

Wed Sep 29 18:42:06 CEST 2004

Too many streams to choose from so I will jump in here as well.
Bob's comments on wiki vs. mail are well noted.

One really important attribute of LSIDs is that they *are* intended to
be persistent enough to outlive the objects that they represent and/or
our ability to resolve them. So that, one day, we might know when
multiple documents have referenced a single object - digital or
otherwise.  It is this property that is particularly useful to GBIF
where there is a requirement to identify the unique objects duplicated
within a system that is designed, from the outset, to replicate at all
costs. But as Dave V. has earlier noted - we may still be confusing our
need for LSIDs with our GUID requirement.  DiGIR and BioCASe already
provide mechanisms for index maintenance, meta-data query, data
retrieval etc.

The LSID specification, noted by Thau below, states that as well as
being persistent an LSID will always resolve to the *"same set of
bytes"* [concrete] or an *empty set* [abstract] (I have assumed that
this meant *exactly the same*, but I could be wrong).  The implication
is that DiGIR, ABCD and HISPID representations of the same object all
require unique LSIDs. The specification documents provide examples to
illustrate *hierarchical* ways of registering LSIDs for a single object
in multiple formats.  Versioning is in there to cover changes to an
object - which must still be registered - but a change in format could
require yet another LSID.
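
To make the format point concrete, here is a rough Python sketch (my
own illustration, not taken from the specification).  It splits an LSID
into the parts of the urn:lsid:authority:namespace:object:version form
and fingerprints the bytes behind it, so that two serialisations of one
record show up as two different byte sets needing two LSIDs.  The
example.org authority, the object names and the record contents are all
hypothetical, and SHA-256 is only my stand-in for "exactly the same
bytes".

```python
import hashlib
from typing import Dict, List, NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    version: Optional[str]


def parse_lsid(text: str) -> Lsid:
    """Split urn:lsid:authority:namespace:object[:version] into parts."""
    parts = text.split(":")
    prefix = [p.lower() for p in parts[:2]]
    if len(parts) not in (5, 6) or prefix != ["urn", "lsid"]:
        raise ValueError(f"not an LSID: {text!r}")
    version = parts[5] if len(parts) == 6 else None
    return Lsid(parts[2], parts[3], parts[4], version)


def fingerprint(data: bytes) -> str:
    """SHA-256 of the exact bytes; any change of format changes it."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical: one specimen record serialised in two made-up formats,
# each registered under its own LSID.
REGISTRY: Dict[str, bytes] = {
    "urn:lsid:example.org:specimens:s0001-formatA:1":
        b"<record><id>s0001</id></record>",
    "urn:lsid:example.org:specimens:s0001-formatB:1":
        b"<Unit><UnitID>s0001</UnitID></Unit>",
}


def lsids_sharing_bytes(registry: Dict[str, bytes]) -> Dict[str, List[str]]:
    """Group LSIDs by the fingerprint of the bytes they resolve to."""
    groups: Dict[str, List[str]] = {}
    for lsid, data in registry.items():
        groups.setdefault(fingerprint(data), []).append(lsid)
    return groups
```

Calling lsids_sharing_bytes(REGISTRY) gives two groups of one LSID
each: same specimen, different bytes, so on my reading each format
keeps its own identifier.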

Each LSID references a static object - once one has one ...  unless one
has more than one, which is also possible.

Nevertheless, there are still other ways in which LSIDs may prove
useful to biodiversity informatics. Dave V. has mentioned the possible
use of an LSID that resolves to a DiGIR query - a static object that may
be used to retrieve from a dynamic data set.  And Thau has mentioned the
fact that while the object behind an LSID must not change, meta-data for
that object may be provided from many sources and in different formats.
Perhaps there is scope here for the object behind an LSID to be only
that relatively stable component of the contributor data set that will
be used to build the global index (or query), with LSID registration and
resolution used to simplify index maintenance and with the getMetaData
methods providing access to complete records or their access points. 
Duplicate records registered with abstract LSIDs resolving to a parent
LSID...
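
A very rough Python sketch of that last thought, under heavy
assumptions: suppose the metadata for each duplicate's abstract LSID
carried a pointer to a parent LSID.  Nothing here comes from the
specification; the "parent" mapping, the function names and every
identifier are hypothetical.

```python
from typing import Dict, Iterable, List

# Hypothetical metadata: each duplicate's abstract LSID points at the
# LSID chosen to stand for the shared record.
PARENT_OF: Dict[str, str] = {
    "urn:lsid:example.org:specimens:dup-a":
        "urn:lsid:example.org:specimens:sheet-1",
    "urn:lsid:example.net:specimens:dup-b":
        "urn:lsid:example.org:specimens:sheet-1",
}


def resolve_parent(lsid: str, parent_of: Dict[str, str]) -> str:
    """Follow parent pointers until an LSID with no parent is reached."""
    seen = set()
    while lsid in parent_of:
        if lsid in seen:
            raise ValueError(f"parent cycle at {lsid!r}")
        seen.add(lsid)
        lsid = parent_of[lsid]
    return lsid


def group_duplicates(lsids: Iterable[str],
                     parent_of: Dict[str, str]) -> Dict[str, List[str]]:
    """Group LSIDs under the parent they ultimately resolve to."""
    groups: Dict[str, List[str]] = {}
    for lsid in lsids:
        groups.setdefault(resolve_parent(lsid, parent_of), []).append(lsid)
    return groups
```

An index built this way would count dup-a and dup-b once, under
sheet-1, while each contributor keeps its own identifier.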

In herbaria the physical objects are commonly duplicated. Supposedly
duplicate vouchers from a single collecting event -  the primary data
source.  Is our aim to provide access to the data gathered during that
event (with the benefit of taxonomic hind-sight) or to the objects that
result, or both?  How do we identify replicate material? Will an LSID help
with resolution of ambiguous events?

greg

On Tue, 2004-09-28 at 04:04, dave thau wrote:
> Hello everyone,
> 
> Sorry to come into this late - and forgive me if I'm covering trodden
> ground here, I just re-joined the list and may have missed a few posts.
> 
> I've been looking at various GUID systems for SEEK - primarily the Handle
> System underlying DOI, and LSIDs.  It looks like I'll be giving an
> introduction to GUIDs and focusing on LSIDs at TDWG in New Zealand in two
> weeks.  Judging by the conversation here, I think I'll be keeping my
> introduction brief to allow maximal time for discussion!
> 
> There are a few things about LSIDs that I want to point out.   First....
> as someone has mentioned, some of my early ramblings on GUIDs can be found
> here:
> 
> http://cvs.ecoinformatics.org/cvs/cvsweb.cgi/seek/projects/taxon/docs/guid/
> 
> Some of these files got munged in CVS, but I've fixed them.  These files
> are over six months old, and between then and now I've become more of an
> LSID fan.  So, if you do bother reading that stuff, realize it's old and
> outdated and incomplete in too many places.  I'm going to write a revised
> document encompassing and expanding all of that and I'll post here when
> it's completed.
> 
> Now for some miscellaneous points about LSIDs
> 
> * What happens when an LSID Authority goes away?
> 
> As I think Dave V. and others have pointed out, when an LSID is resolved,
> DNS is used to find the LSID authority.  The LSID authority then provides
> information about how the LSID can be served up (e.g. HTTP, SOAP, FTP),
> and where to get the data behind  the LSID and associated metadata.  If I
> start serving up LSIDs with the authority learningsite.com and later
> decide that I'm sick of serving up LSIDs, somebody else can take over
> serving up the data and the metadata.  However, I (or they) still bear the
> responsibility of running the authority which points to the data. If my
> lsids have an authority like lsid.learningsite.com
> (urn:lsid:lsid.learningsite.com:foo:bar) then someone else can take over
> the authority by taking over lsid.learningsite.com and I can still have
> www.learningsite.com, mail.learningsite.com, etc... for myself. So, with a
> little planning, it's not so hard to deal with an authority going away as
> long as the people running it are responsible.
> 
> * data and metadata
> 
> With LSIDs there's a big difference between the data and metadata of an
> LSID - and I think this is going to be the biggest challenge in deciding
> how to use them in our context.  What's the data?  What's the metadata?
> With gene sequences, the datum is the sequence, the metadata are things
> like contact information, who did the sequencing, taxonomic information
> about the thing sequenced, etc.  There's an LTER site using LSIDs for
> their data sets.  The LSID data is the data set itself, and the metadata
> is what you'd expect - a description of the data set, who the
> investigators were, that sort of thing.  NCBI has pubmed LSIDs - they're
> not serving up the articles yet, but there's associated metadata in there.
> For these things the division between data and metadata is fairly clear.
> However, what is the data for taxa?  What is the metadata?
> 
> Here's another interesting thing about data and metadata in LSIDs.  When
> you issue an LSID you're promising that the DATA behind that LSID never
> changes.  Additionally, there's only one authority ultimately responsible
> for pointing to the  data, and that never changes (although as above,
> someone else can take the authority over).  However, for metadata, there
> are no such promises.  Metadata can change.  Furthermore, organizations
> other than the authority can provide metadata, as long as the authority
> agrees to it and adds them to a list of authorized metadata providers.  I
> don’t know if this is such a great idea, but it’s in the specification.
> 
> So, what's the data?  What's the metadata?  This question applies to any
> GUID system, really - the Handle System has the same issues, but less
> clearly defined.  As an aside - the Handle System is very robust and the
> fee schedule is probably circumventable.  However, I think LSIDs are
> better suited to the direction biodiversity informatics is taking - using
> XML-based standards and standard internet protocols to share data.
> 
> * client stack versus authority server
> 
> The LSID folks provide two batches of code - an authority server, for
> people who want to serve up LSIDs themselves, and an LSID Client stack
> - which can be used by organizations to provide access to their LSIDs
> and/or proxy LSIDs provided by other organizations.  It may make sense for
> an organization like GBIF to build a service using the Client Stack to
> support both their own LSIDs and those served by other organizations.  The
> Client Stack has a caching mechanism which supports expiration information
> from the primary authority, so the primary authority can update where the
> LSID may be resolved and metadata of that authority.
> 
> In this model, GBIF, or someone else, could support both their own LSIDs,
> and the LSIDs of others.  Furthermore, it could choose which authorities
> it was going to resolve, so people who wanted to be sure to get "just the
> good stuff" according to GBIF could use the GBIF service.  In addition, it
> could perform the de-duplication service that several people have
> mentioned - trying to maintain a one LSID per data item mapping.
> 
> * lsid namespaces and file formats
> 
> I don't think the namespace part of the LSID
> (urn:lsid:authority:namespace:object:version) is intended to be
> semantically loaded except for the relevant lsid authority.  There is a
> way in the metadata to state what format the data comes in.  It's not the
> traditional text/javascript mime-type tag - instead the format is another
> LSID!  For example the FASTA protein sequence file format is:
> urn:lsid:i3c.org:formats:fasta.  Clients that understand LSIDs, like the
> Launchpad application, can be set to attach applications to LSID formats
> so that clicking on an LSID with a given format opens up an appropriate
> application.
> 
> Sorry to be so scattershot - it's hard to come into the middle of a huge
> topic like this.  I’m glad to see all this discussion – it’s going to make
> working on my talk much easier (I think)
> 
> Dave
-- 
Greg Whitbread <ghw at anbg.gov.au>
+61-2-62509482
ANBG/CPBR/ANH