[tdwg-tag] SourceForge LSID project websites broken - role for TDWG?

Tue Apr 7 07:55:04 CEST 2009

Thanks, Kathi.

I appreciate your comments and understand your concerns.  This certainly is a social problem - no technology solution will take it away.  A large proportion (though certainly not all) of the issues surrounding LSIDs will arise with any technology which tries to address the problem.

I seem to be in the minority in believing that we can use LSIDs as one part of a strategy to develop a community infrastructure for our data.  However, we do need to start somewhere if we want to do anything about the persistence of our data.  We need some foundations before we can properly worry about "intelligent caching and harvesting mechanisms" (which I agree we need).

So - here is my outline for how I think we could move forward from these discussions:

1. An identifier scheme which aims to provide some long term persistence probably needs to embody at least three key facts: who generated/published the data object, what data collection this object belongs to, which data object from the specified data collection this one is.  These correspond roughly to the Darwin Core InstitutionCode/CollectionCode/CatalogueNumber triple and to the three main substitutable elements in an LSID.  Some systems such as DOI may obscure the whoGeneratedTheData part somewhat.  Some systems such as DOI and PURL may not always have an explicit whatCollectionItBelongsTo part, but dealing with collections promises to be an organisational simplification for most purposes.

2. TDWG should recommend the LSID as one suitable model for constructing GUIDs (i.e. "urn:lsid:<whoGeneratedTheDataObject>:<whatCollectionItBelongsTo>:<whichItemInTheCollectionItIs>").  We could propose (or adopt) some other syntax for this, but this gives us a neat enough way to encapsulate what we need to know.  The "urn:lsid:" part can be seen as a useful flag that this is indeed to be considered as an identifier.

3. Where feasible, TDWG should recommend that these LSIDs should be associated with a resolver implementing the standard LSID mechanism.  Frankly I am a lot less bothered by the resolvability of most identifiers than I am about their consistent use, so I have no problem with the idea of assigning LSIDs to things which do not currently resolve.

4. TDWG should require that a path exist to retrieve the associated data using an HTTP resolver to proxy the LSID (i.e. http://whoGeneratedTheDataObject.org/<optional_path_elements>/<lsid>), and our practice should be to consider this proxified version identical to the bare LSID for comparison purposes.  For LSIDs resolvable using the standard LSID mechanism, this path can be http://lsid.tdwg.org/<lsid>.  In cases in which the data are only accessible via HTTP, we have broken the LSID specification - although it seems there may be nobody other than us to care about that fact.

5. All references to LSIDs within RDF documents should use the proxified form.

6. TDWG and its partners should establish a PURL-like service which makes it easy to register data sets to be associated with identifiers of this form.  In other words, a service should exist (around a domain secured for this purpose into the future) which associates data providers with an appropriate whoGeneratedTheDataObject element and associates their data collections with an appropriate whatCollectionItBelongsTo element and associated URL pattern for retrieving RDF data for the individual data objects.  The exact details could vary, but assume that TDWG sets up this service at http://lsid.tdwg.org/ and that CSIRO wishes to register the ANIC data collection and to have individual specimen records associated with LSID-based identifiers.  Assume further that ANIC has a script on its servers which can return the RDF data for these specimens, say at http://www.csiro.au/anic/specimens/<catalogueNumber>.  The registration process could result in the LSID urn:lsid:tdwg.org:csiro.anic:12345 and the HTTP URI http://lsid.tdwg.org/urn:lsid:tdwg.org:csiro.anic:12345 both being mapped through to http://www.csiro.au/anic/specimens/12345.  It would probably be preferable for the LSID in this case to be urn:lsid:csiro.tdwg.org:anic:12345 (which would make relocation of all LSID services for a single data provider easy, but could require large numbers of SRV records to be managed by TDWG).  (I would note that it would be easy for the infrastructure to allow the data provider to choose whether the whole LSID or just the final ID element should be passed to the final URL.)

7. TDWG and its partners should use this same infrastructure to handle alternative resolution paths as required in the future - if alternative identifier schemes become the preferred option.  This infrastructure could also add significant other functions, including e.g. 1) intelligent caching of data, 2) validation of RDF data, and 3) simultaneous registration of DOIs associated with metadata for each data collection to make it easier for them to be cited by journal articles.

8. Any provider may opt at any time to use alternative HTTP-resolvable identifiers in place of LSIDs (e.g. DOIs, handles, PURLs), but must consider the technological and social implications of keeping these identifiers alive into the future.
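
As a rough illustration of points 2 and 4, the sketch below splits an LSID into its three main elements and treats a proxified HTTP form as identical to the bare LSID for comparison.  The names Lsid, bare_lsid, parse_lsid and same_identifier are hypothetical, chosen only for this example, and the comparison is a plain string comparison of the elements with no case normalisation, which is an assumption rather than anything agreed.

```python
import re
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str            # whoGeneratedTheDataObject
    namespace: str            # whatCollectionItBelongsTo
    object_id: str            # whichItemInTheCollectionItIs
    revision: Optional[str]   # optional trailing element allowed by the LSID syntax


def bare_lsid(identifier: str) -> str:
    """Return the 'urn:lsid:...' part, dropping any HTTP proxy prefix."""
    match = re.search(r"urn:lsid:", identifier, re.IGNORECASE)
    if match is None:
        raise ValueError(f"no LSID found in {identifier!r}")
    return identifier[match.start():]


def parse_lsid(identifier: str) -> Lsid:
    """Split a bare or proxified LSID into its elements."""
    parts = bare_lsid(identifier).split(":")
    # parts[0:2] are 'urn' and 'lsid'; the next three elements are mandatory.
    if len(parts) not in (5, 6) or not all(parts[2:5]):
        raise ValueError(f"malformed LSID: {identifier!r}")
    revision = parts[5] if len(parts) == 6 and parts[5] else None
    return Lsid(parts[2], parts[3], parts[4], revision)


def same_identifier(a: str, b: str) -> bool:
    """Treat the proxified and bare forms of an LSID as the same identifier."""
    return parse_lsid(a) == parse_lsid(b)


if __name__ == "__main__":
    bare = "urn:lsid:tdwg.org:csiro.anic:12345"
    proxied = "http://lsid.tdwg.org/" + bare
    print(parse_lsid(bare))
    print(same_identifier(bare, proxied))
```

Comparing parsed elements rather than raw strings means that the choice of proxy host or path does not affect whether two references are considered to be the same object.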

As far as I can see, this approach allows us to develop a community-based approach to managing identifiers in a way which builds on LSIDs for those who have already minted them.  It would be easy for us to reinvent this as a PURL-based approach in the future.  The costs should not be great and it gives us a better chance of avoiding the confusion of random-URLs-pointing-at-random-data-formats being offered as semantically useful GUIDs.
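
To make the registration service in point 6 a little more concrete, here is a minimal sketch of the mapping it would maintain, using the ANIC example above.  The REGISTRY table, the resolve function and the "{id}"/"{lsid}" placeholders are all assumptions made for illustration; a real service would need persistent storage, and would issue HTTP redirects rather than return strings.

```python
from urllib.parse import quote

# Hypothetical registry: (authority, namespace) -> URL pattern supplied by the
# data provider at registration time. "{id}" receives only the final ID element,
# "{lsid}" the whole LSID, so the provider can choose which one is passed on.
REGISTRY = {
    ("tdwg.org", "csiro.anic"): "http://www.csiro.au/anic/specimens/{id}",
}


def resolve(identifier: str) -> str:
    """Map a bare or proxified LSID to the provider URL registered for it."""
    start = identifier.find("urn:lsid:")
    if start < 0:
        raise ValueError(f"no LSID found in {identifier!r}")
    lsid = identifier[start:]
    parts = lsid.split(":")
    if len(parts) < 5 or not all(parts[2:5]):
        raise ValueError(f"malformed LSID: {identifier!r}")
    authority, namespace, object_id = parts[2], parts[3], parts[4]
    try:
        pattern = REGISTRY[(authority, namespace)]
    except KeyError:
        raise LookupError(f"no data collection registered for {authority}:{namespace}")
    # Percent-encode the substituted values so they are safe inside a URL.
    return pattern.format(id=quote(object_id, safe=""), lsid=quote(lsid, safe=":"))


if __name__ == "__main__":
    print(resolve("urn:lsid:tdwg.org:csiro.anic:12345"))
    print(resolve("http://lsid.tdwg.org/urn:lsid:tdwg.org:csiro.anic:12345"))
```

Because only the registry entry knows the provider's URL pattern, relocating a data collection would mean updating one entry rather than changing any identifier already in circulation.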

Whatever happens, TDWG needs to finalise an applicability statement for how LSIDs should be used by those providers who have chosen or who will choose to use them for biodiversity data.  This does not mandate that everyone MUST use LSIDs.

Does this seem worth pursuing?

Donald

Donald Hobern, Director, Atlas of Living Australia
CSIRO Entomology, GPO Box 1700, Canberra, ACT 2601
Phone: (02) 62464352 Mobile: 0437990208 
Email: Donald.Hobern at csiro.au
Web: http://www.ala.org.au/ 

-----Original Message-----
Date: Mon, 6 Apr 2009 10:15:00 +0200
From: Schleidt Katharina <katharina.schleidt at umweltbundesamt.at>
Subject: Re: [tdwg-tag] SourceForge LSID project websites broken -
	role for TDWG?
To: Roger Hyam <rogerhyam at mac.com>, Peter DeVries
	<pete.devries at gmail.com>
Cc: "tdwg-tag at lists.tdwg.org" <tdwg-tag at lists.tdwg.org>
Message-ID:
	<8638F29270898544933A7663226809E5EAAA6060 at PCMAIL3.umweltbundesamt.at>
Content-Type: text/plain; charset="utf-8"

Hi all,

I admit I'm glad that this topic does seem to be back in discussion. I've been worried about LSIDs from the outset, but did not have the time or resources at the time of decision to do anything about it. Most of this discussion reflects what we've been discussing here in Vienna ever since the topic came up. Here is an excerpt from a recent mail of mine:

*   I have never been a proponent of LSIDs. More to the point, I have been against their adoption from the outset. The reasons for this are:

o   It's misusing a technical solution as an answer for a social problem. LSIDs entail a list of (quite necessary) requirements, such as persistent IDs and dependable availability of online references, but the technology can in no way guarantee these; it just nicely covers the problem up

o   I do not see the technology being supported. IBM dropped it, and Cambridge Semantics Inc. also seems to have gone other ways

o   An example of the lack of dependability of LSID servers seems to me to be the eternal problem with the TDWG LSID Server

o   I'm worried that a group such as TDWG, which doesn't have the backing to push through technology development, is moving towards requiring all adopters to implement non-mainstream technology in order to maintain compatibility

We've come to the conclusion, as mentioned several times in this thread, that what we really need is the commitment to persistence, and no technology will support us in that. Why waste nonexistent funds sorting out an esoteric technology nobody is supporting; why not just buy a domain, pass a hat and set up a trust fund with 1000 € (or $), and agree to have this domain available through some institution (e.g. a university) for the next 100 years. After that, my non-existent great-grandchildren can sort out the rest!

@Matt: http://www4.wiwiss.fu-berlin.de/bizer/pub/lod-datasets_2009-03-05.html is online again! And a short absence/down-time will happen in all distributed technologies. If anything, I believe that we should worry more about intelligent caching and harvesting mechanisms!

:)

kathi