Re: random notes on LSIDs

29 Sep 2004

Too many streams to choose from, so I will jump in here as well.
Bob's comments on wiki vs. mail are well noted.

One really important attribute of LSIDs is that they *are* intended to
be persistent enough to outlive the objects that they represent and/or
our ability to resolve them. So that, one day, we might know when
multiple documents have referenced a single object - digital or
otherwise.  It is this property that is particularly useful to GBIF
where there is a requirement to identify the unique objects duplicated
within a system that is designed, from the outset, to replicate at all
costs. But as Dave V. has earlier noted - we may still be confusing our
need for LSIDs with our GUID requirement.  DiGIR and BioCASE already
provide mechanisms for index maintenance, meta-data query, data
retrieval etc.  

The LSID specification, noted by Thau below, states that as well as
being persistent an LSID will always resolve to the *"same set of
bytes"* [concrete] or an *empty set* [abstract] (I have assumed that
this meant *exactly the same*, but I could be wrong).  The implication
is that DiGIR, ABCD and HISPID representations of the same object all
require unique LSIDs. The specification documents provide examples to
illustrate *hierarchical* ways of registering LSIDs for a single object
in multiple formats.  Versioning is in there to cover changes to an
object - which must still be registered - but a change in format could
require yet another LSID.
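
To make the "same set of bytes" reading concrete, here is a minimal
Python sketch. It assumes that byte-for-byte identity is what the
specification intends (my assumption above), and the two sample records
and the helper names are invented for illustration only. It compares
digests of two serialisations of one record to show why each format
would need its own LSID under that reading.

```python
import hashlib


def byte_identity(payload: bytes) -> str:
    """Return a SHA-256 digest standing in for 'the same set of bytes'."""
    return hashlib.sha256(payload).hexdigest()


def needs_separate_lsid(a: bytes, b: bytes) -> bool:
    """True when two payloads differ, so one LSID cannot cover both."""
    return byte_identity(a) != byte_identity(b)


# Two hypothetical serialisations of the same specimen record (invented data).
as_csv = b"catalog,collector\nEXAMPLE-1,A. Collector\n"
as_xml = (
    b"<record><catalog>EXAMPLE-1</catalog>"
    b"<collector>A. Collector</collector></record>"
)

if __name__ == "__main__":
    # Same information, different bytes: a strict reading wants two LSIDs.
    print(needs_separate_lsid(as_csv, as_xml))
    # Identical bytes can share one LSID.
    print(needs_separate_lsid(as_csv, as_csv))
```

Under this reading even a whitespace change in the serialised record
counts as a different object.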

Each LSID references a static object - once one has one ...  unless one
has more than one, which is also possible.

Nevertheless, there are still other ways in which LSIDs may prove
useful to biodiversity informatics. Dave V. has mentioned the possible
use of an LSID that resolves to a DiGIR query - a static object that may
be used to retrieve from a dynamic data set.  And Thau has mentioned the
fact that while the object behind an LSID must not change, meta-data for
that object may be provided from many sources and in different formats.
Perhaps there is scope here for the object behind an LSID to be only
that relatively stable component of the contributor data set that will
be used to build the global index (or query), with LSID registration and
resolution used to simplify index maintenance and with the getMetaData
methods providing access to complete records or their access points. 
Duplicate records registered with abstract LSIDs resolving to a parent
LSID...

In herbaria the physical objects are commonly duplicated. Supposedly
duplicate vouchers from a single collecting event -  the primary data
source.  Is our aim to provide access to the data gathered during that
event (with the benefit of taxonomic hind-sight) or to the objects that
result, or both?  How do we identify replicate material? Will an LSID help
with resolution of ambiguous events?

greg

On Tue, 2004-09-28 at 04:04, dave thau wrote:
...
Hello everyone,
Sorry to come into this late - and forgive me if I'm covering trodden
ground here, I just re-joined the list and may have missed a few posts.
I've been looking at various GUID systems for SEEK - primarily the Handle
System underlying DOI, and LSIDs.  It looks like I'll be giving an
introduction to GUIDs and focusing on LSIDs at TDWG in New Zealand in two
weeks.  Judging by the conversation here, I think I'll be keeping my
introduction brief to allow maximal time for discussion!
There are a few things about LSIDs that I want to point out.   First....
as someone has mentioned, some of my early ramblings on GUIDs can be found
here:
http://cvs.ecoinformatics.org/cvs/cvsweb.cgi/seek/projects/taxon/docs/guid/
Some of these files got munged in CVS, but I've fixed them.  These files
are over six months old, and between then and now I've become more of an
LSID fan.  So, if you do bother reading that stuff, realize it's old and
outdated and incomplete in too many places.  I'm going to write a revised
document encompassing and expanding all of that and I'll post here when
it's completed.
Now for some miscellaneous points about LSIDs
* What happens when an LSID Authority goes away?
As I think Dave V. and others have pointed out, when an LSID is resolved,
DNS is used to find the LSID authority.  The LSID authority then provides
information about how the LSID can be served up (e.g. HTTP, SOAP, FTP),
and where to get the data behind  the LSID and associated metadata.  If I
start serving up LSIDs with the authority learningsite.com and later
decide that I'm sick of serving up LSIDs, somebody else can take over
serving up the data and the metadata.  However, I (or they) still bear the
responsibility of running the authority which points to the data. If my
LSIDs have an authority like lsid.learningsite.com
(urn:lsid:lsid.learningsite.com:foo:bar) then someone else can take over
the authority by taking over lsid.learningsite.com and I can  still have
www.learningsite.com, mail.learningsite.com, etc... for myself. So, with a
little planning, it's not so hard to deal with an authority going away as
long as the people running it are responsible.
* data and metadata
With LSIDs there's a big difference between the data and metadata of an
LSID - and I think this is going to be the biggest challenge in deciding
how to use them in our context.  What's the data?  What's the metadata?
With gene sequences, the datum is the sequence, the metadata are things
like contact information, who did the sequencing, taxonomic information
about the thing sequenced, etc.  There's an LTER site using LSIDs for
their data sets.  The LSID data is the data set itself, and the metadata
is what you'd expect - a description of the data set, who the
investigators were, that sort of thing.  NCBI has pubmed LSIDs - they're
not serving up the articles yet, but there's associated metadata in there.
For these things the division between data and metadata is fairly clear.
However, what is the data for taxa?  What is the metadata?
Here's another interesting thing about data and metadata in LSIDs.  When
you issue an LSID you're promising that the DATA behind that LSID never
changes.  Additionally, there's only one authority ultimately responsible
for pointing to the  data, and that never changes (although as above,
someone else can take the authority over).  However, for metadata, there
are no such promises.  Metadata can change.  Furthermore, organizations
other than the authority can provide metadata, as long as the authority
agrees to it and adds them to a list of authorized metadata providers.  I
don't know if this is such a great idea, but it's in the specification.
So, what's the data?  What's the metadata?  This question applies to any
GUID system, really - the Handle System has the same issues, but less
clearly defined.  As an aside - the Handle System is very robust and the
fee schedule is probably circumventable.  However, I think LSIDs are
better suited to the direction biodiversity informatics is taking - using
XML-based standards and standard internet protocols to share data.
* client stack versus authority server
The LSID folks provide two batches of code - an authority server, for
people who want to serve up LSIDs themselves, and an LSID Client stack
- which can be used by organizations to provide access to their LSIDs
and/or proxy LSIDs provided by other organizations.  It may make sense for
an organization like GBIF to build a service using the Client Stack to
support both their own LSIDs and those served by other organizations.  The
Client Stack has a caching mechanism which supports expiration information
from the primary authority, so the primary authority can update where the
LSID may be resolved and metadata of that authority.
In this model, GBIF, or someone else, could support both their own LSIDs,
and the LSIDs of others.  Furthermore, it could choose which authorities
it was going to resolve, so people who wanted to be sure to get "just the
good stuff" according to GBIF could use the GBIF service.  In addition, it
could perform the de-duplication service that several people have
mentioned - trying to maintain a one LSID per data item mapping.
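
As a rough illustration of the caching idea described above, here is a
minimal Python sketch of a resolver cache that honours an expiry
supplied by the primary authority. It is not the Client Stack's actual
API: the class name ExpiringCache, the fetch callback and its
(value, ttl_seconds) return shape are all assumptions made for this
example.

```python
import time
from typing import Callable, Dict, Tuple


class ExpiringCache:
    """Cache resolution results until the authority-supplied expiry passes."""

    def __init__(
        self,
        fetch: Callable[[str], Tuple[str, float]],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        # fetch is a hypothetical callback: lsid -> (value, ttl_seconds).
        self._fetch = fetch
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}

    def get(self, lsid: str) -> str:
        now = self._clock()
        entry = self._entries.get(lsid)
        if entry is not None and entry[1] > now:
            # Still fresh: answer locally without asking the authority.
            return entry[0]
        # Missing or expired: ask the primary authority again.
        value, ttl = self._fetch(lsid)
        self._entries[lsid] = (value, now + ttl)
        return value
```

Because stale entries are re-fetched, the primary authority can change
where an LSID is resolved and a proxy built this way would pick the
change up once the old entry expires.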
* lsid namespaces and file formats
I don't think the namespace part of the LSID
(urn:lsid:authority:namespace:object:version) is intended to be
semantically loaded except for the relevant lsid authority.  There is a
way in the metadata to state what format the data comes in.  It's not the
traditional text/javascript mime-type tag - instead the format is another
LSID!  For example the FASTA protein sequence file format is:
urn:lsid:i3c.org:formats:fasta.  Clients that understand LSIDs, like the
Launchpad application, can be set to attach applications to LSID formats
so that clicking on an LSID with a given format opens up an appropriate
application.
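
The layout mentioned above (urn:lsid:authority:namespace:object:version)
can be sketched as a small Python parser. This is only an illustration
under simple assumptions: the version part is treated as optional, the
names Lsid and parse_lsid are invented here, and no character escaping
or further validation from the specification is attempted.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    version: Optional[str]


def parse_lsid(text: str) -> Lsid:
    """Split an LSID string into its colon-separated components."""
    parts = text.split(":")
    if (
        len(parts) not in (5, 6)
        or parts[0].lower() != "urn"
        or parts[1].lower() != "lsid"
    ):
        raise ValueError(f"not an LSID: {text!r}")
    authority, namespace, object_id = parts[2:5]
    if not (authority and namespace and object_id):
        raise ValueError(f"empty LSID component in: {text!r}")
    # A missing or empty sixth part is treated as "no version".
    version = parts[5] if len(parts) == 6 and parts[5] else None
    return Lsid(authority, namespace, object_id, version)


if __name__ == "__main__":
    # The format identifier quoted above is itself an LSID.
    print(parse_lsid("urn:lsid:i3c.org:formats:fasta"))
```

Only the authority part is used to find who serves the LSID; the
namespace and object parts are handed to that authority as they are.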
Sorry to be so scattershot - it's hard to come into the middle of a huge
topic like this.  I'm glad to see all this discussion - it's going to make
working on my talk much easier (I think)
Dave
-- 
Greg Whitbread <ghw@anbg.gov.au>
+61-2-62509482
ANBG/CPBR/ANH
