Re: [tdwg-tag] SourceForge LSID project websites broken - role for TDWG?

30 Mar 2009

      Some thoughts on where I think we should be going with LSIDs...

My take is still that having something that is a standard identifier scheme is a major benefit and it is not a big stretch for us to recognise that <ResolverURL><Identifier> should be considered the same as <Identifier> for purposes of inferring identity.  I still believe that it would be a total disaster for us simply to say that we advocate the use of PURL-style identifiers, since this is a slippery slope with no separation between a good identifier and a web-server-du-jour URL with no planned persistence.  Vince Smith's blog is a reminder that relying even on institutional domain names is risky.
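
To make the identity rule concrete, here is a minimal sketch of how a client might treat <ResolverURL><Identifier> and <Identifier> as the same thing. The function names and the regular expression are my own illustrative assumptions, not part of any LSID library, and the pattern only covers the simple urn:lsid:authority:namespace:object[:revision] shape.

```python
import re

# Assumed shape: urn:lsid:<authority>:<namespace>:<object>[:<revision>]
LSID_RE = re.compile(r"urn:lsid:[^:/\s]+:[^:\s]+:[^:\s]+(?::[^:\s]+)?")


def canonical_identifier(identifier):
    """Return the bare LSID embedded in the identifier, if there is one.

    Any resolver or proxy URL in front of the LSID is dropped. Identifiers
    that contain no LSID are returned unchanged.
    """
    match = LSID_RE.search(identifier)
    return match.group(0) if match else identifier


def same_object(a, b):
    """Two identifiers denote the same object if their canonical forms match."""
    return canonical_identifier(a) == canonical_identifier(b)
```

With this, same_object("http://lsid.tdwg.org/urn:lsid:lsid.tdwg.org:nhm.sphingidae:12345", "urn:lsid:lsid.tdwg.org:nhm.sphingidae:12345") holds regardless of which proxy was prepended.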

I believe that we need to get behind providing solid infrastructure for LSIDs.  This should be aimed at making it as easy as possible for any provider to set up LSIDs without touching their own DNS records unless they want to do so.  We should have a central service which does the central registration part of what DOI.org does for DOIs, BUT NOT AT THE LEVEL OF INDIVIDUAL LSIDs.  We are trying to establish something like this here in Australia under the taxonomy.org.au domain.  This could be done at a global, national or community level.  What is needed is simply the following:

1. A trusted party (TDWG, GBIF, EOL, ALA, etc.) commits to handle the DNS resolution side of LSIDs for any data providers wishing to use its services. 

2. Any provider can register a dataset with the trusted party and will receive a corresponding LSID namespace to use in their identifiers.  To register the dataset they need to be able to provide a parameterised URL which takes a single parameter - either the full LSID for the record, or just the final record-id part of the LSID (this could be a configuration choice when registering the data set) - and which returns the corresponding data record as RDF (if we don't drive the use of structured data with GUIDs, we are not solving anything, but there could be an option to return the records to the trusted party in some simpler format and for the trusted party to generate the RDF). 

3. The trusted party registers itself in DNS as the resolver for LSIDs for its domain and hosts a resolver implementation which extracts the LSID namespace from LSIDs and forwards the request to the appropriate data provider with either the record-id or the whole LSID as a parameter.  The trusted party also hosts an HTTP LSID proxy and prepends the proxy URL to all identifiers in RDF documents. 

A working example, with TDWG as the trusted party and an NHM Sphingidae data set as the data to be shared:

A. TDWG registers lsid.tdwg.org in DNS as an LSID service. 

B. NHM registers a TAPIR database (or whatever CGI interface they like) for their Sphingidae database using the namespace "nhm.sphingidae" and the endpoint "http://nhm.ac.uk/tapir/sphingidae?op=s&...&darwin:GUID=%S" (where %S is to be replaced with the actual request GUID). 

C. NHM populates its records with GUID values of the pattern "urn:lsid:lsid.tdwg.org:nhm.sphingidae:<record-id>" 

D. A user follows a link http://lsid.tdwg.org/urn:lsid:lsid.tdwg.org:nhm.sphingidae:12345 and hits the LSID resolver.  The LSID resolver maps "nhm.sphingidae" to the NHM endpoint and requests the record, which it then returns to the user. 
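
The forwarding in step D could be sketched roughly as follows. This is only an illustration under stated assumptions: the registry contents, the provider.example endpoints and the function names are hypothetical (the real NHM endpoint is not reproduced here), and a real resolver would go on to fetch the URL and return the record.

```python
from urllib.parse import quote

# Hypothetical registry filled in when a provider registers a data set:
# namespace -> (endpoint template with %S placeholder, send the full LSID?)
REGISTRY = {
    "nhm.sphingidae": ("http://provider.example/sphingidae?guid=%S", True),
    "example.recordid": ("http://provider.example/other?id=%S", False),
}


def parse_lsid(lsid):
    """Split an LSID into authority, namespace and object (record-id)."""
    parts = lsid.split(":")
    if len(parts) not in (5, 6) or [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
        raise ValueError(f"not an LSID: {lsid!r}")
    return {"authority": parts[2], "namespace": parts[3], "object": parts[4]}


def provider_url(request_path):
    """Map a proxy request path such as '/urn:lsid:...' to the provider URL."""
    lsid = request_path.lstrip("/")
    fields = parse_lsid(lsid)
    if fields["namespace"] not in REGISTRY:
        raise LookupError(f"unregistered namespace: {fields['namespace']!r}")
    template, send_full_lsid = REGISTRY[fields["namespace"]]
    # The registration choice decides whether the provider sees the whole
    # LSID or only the final record-id part.
    parameter = lsid if send_full_lsid else fields["object"]
    return template.replace("%S", quote(parameter, safe=""))
```

The point of keeping the template and the full-LSID/record-id choice in the registry is that the provider only has to expose one parameterised URL and never touches DNS.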

This could be enhanced in many different ways to make it more robust and flexible:

1. As mentioned, the trusted party could map other formats to RDF (indeed it could have templates for embedding data in Darwin Core, etc.). 

2. The trusted party could automatically prepend LSIDs in response data with references to its own proxy so that early 21st century WWW technology works as expected. 

3. The trusted party could add additional services around hosted copies of the data and could manage a metadata record for the resource. 

4. The trusted party could in fact use DOIs for the namespace part (in other words the NHM example would end up using something like urn:lsid:lsid.tdwg.org:10.1000/987:12345 as the identifier).  If the 10.1000/987 DOI served as a citable identifier for the dataset and could be resolved to get the metadata for the dataset, it could be elegant on several different levels. 
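
For enhancement 2, a rough sketch of how the trusted party might prepend its proxy to LSIDs found in response data. The proxy URL is taken from the example above; the regular expression and the function name are illustrative assumptions, and a production service would more likely rewrite the parsed RDF than raw text.

```python
import re

PROXY = "http://lsid.tdwg.org/"

# Match LSIDs that are not already preceded by a URL path separator,
# so that identifiers which are already proxied are left alone.
BARE_LSID_RE = re.compile(r"(?<![/\w])urn:lsid:[^\s\"'<>]+")


def add_proxy(text):
    """Prefix every bare LSID in the text with the proxy URL."""
    return BARE_LSID_RE.sub(lambda m: PROXY + m.group(0), text)
```

Because already-proxied identifiers are skipped, running add_proxy twice over the same document gives the same result as running it once.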

This is really all so easy.  As mentioned, taxonomy.org.au has been going through the teething pains of doing this for an Australian therevid data set held in Mandala in Illinois.  I would hope that we could quickly roll this out as a service for any Australian data providers and then try deploying a similar set-up with TDWG, GBIF or EOL.

At least that's what I think...

One other basic point is that, if we abandon LSIDs and still want a GUID solution with some promise that the data could be relocated, we need a system which somewhere embeds the concepts of provider, dataset and record and can use these to track down the record.  This means we can't allow a total free-for-all on identifiers and need either a robust heavy central registry of records like DOI or need to have a standard place for these three elements in the GUID.  Once we get that far, we may as well adopt LSIDs even if we choose (as the major party using the model) to extend or even replace the models for resolving them.

Donald

Donald Hobern, Director, Atlas of Living Australia
CSIRO Entomology, GPO Box 1700, Canberra, ACT 2601
Phone: (02) 62464352 Mobile: 0437990208 
Email: Donald.Hobern@csiro.au
Web: http://www.ala.org.au/ 

-----Original Message-----
From: tdwg-tag-bounces@lists.tdwg.org
[mailto:tdwg-tag-bounces@lists.tdwg.org] On Behalf Of renato@cria.org.br
Sent: Tuesday, 31 March 2009 12:29 PM
To: tdwg-tag@lists.tdwg.org
Subject: Re: [tdwg-tag] SourceForge LSID project websites broken - role for
TDWG?

I think I would prefer to see a different solution. Dropping LSIDs
altogether seems a bit drastic after all the work that was done. If we had a
perfect GUID technology, I would understand this kind of decision, but we
all know we don't have such a thing. On the other hand, focusing exclusively
on LSIDs could prevent some of our data providers from serving and
maintaining GUIDs. So why not just offer an alternative?

If clients will need to deal with different types of GUIDs anyway,
especially if they will have to interact with different types of
providers, the matter of having to agree on and to adopt a single GUID
technology becomes less important. We already live in a world where
different types of GUIDs are being provided.

Personally I've always preferred PURLs for their simplicity and compliance
with existing tools and technologies, although I know they have drawbacks. If
some of our data providers can reliably serve LSIDs - great. But if LSIDs
are too complicated for other data providers, I don't see any problem with
our community creating an additional applicability statement for another
GUID technology. The most important thing, in my opinion, is to agree on
the data models/vocabularies that our GUIDs will resolve to, no matter the
resolution mechanism used. But that's another story...

Best Regards,
--
Renato
...
Perhaps the question is whether LSIDs are a hurdle to adoption of the
use of GUIDs or an aid to it.
DOIs are not just a technology; they are a business model plus a
technology (they use HANDLE for the technology). It is worth the
client overcoming technical difficulties in their use because of the
value added by the publisher paying for the associated
infrastructure.  I would argue that DOIs/HANDLE are, in fact, a
complete pain because they don't integrate well with semantic web
technologies but that they are carried along purely by the business
model.
In advocating the use of LSIDs we are advocating the pain without the
benefits. Just like DOIs they are awkward and non-standard to set up.
They need to be constantly explained. They don't work in semantic web
technologies. They don't even integrate with XML (could you host an
XML Schema on an LSID?). All this would be OK if they had an
associated business model - but they don't.
My personal belief is that we should either put together a business
model (with the financial backing of big projects and within the next
few months) where some core services are provided by a third party or
we should drop LSIDs altogether. Alas I fear the big projects are more
interested in data volume and pretty pictures than doing good science
and providing basic services (I am being contentious for emphasis so
don't take it personally).
From the technical perspective this:
urn:lsid:zoobank.org:act:8BDC0735-FEA4-4298-83FA-D04F67C3FBEC
is far harder than this:
http://purl.zoobank.org/8BDC0735-FEA4-4298-83FA-D04F67C3FBEC
so we need a good business case for doing the former. What is it?
All the best,
Roger
On 23 Mar 2009, at 01:58, Kevin Richards wrote:
...
As convener of the GUID subgroup of TDWG TAG, I thought I should add
some comments.
The debate over LSIDs, their suitability, technical issues, etc, has
been going on for some years now in the TDWG community (and also
within a few other communities - especially the HCLS Health Care and
Life Science semantic web group).  Most issues have been raised and
dealt with, and as with most technologies, there is no perfect
solution for a GUID technology.  To review these discussions see the
TDWG pages at http://wiki.tdwg.org/GUID/ and
http://www.tdwg.org/activities/guid/documents/. Documents that cover an
introduction to GUIDs/LSIDs, applicability statements, and technical
issues can be found there.
I feel we are getting to a stage with LSIDs where a lot of people in
this community have had some sort of dealing with the technology
(whether it is setting up an LSID resolver, or using and resolving
them through client software) and we therefore have a good range
of experiences, knowledge and conclusions about the use of LSIDs.
As part of the TDWG meeting in Montpellier this year, we hope to
hold a session on "LSIDs in Practice" which should give us a good
indication of any LSID issues, and how they have been dealt with in
practice.
Also, there are several activities going on that should aid with the
adoption of LSIDs, such as development of software tools and
services, and as we speak the LSID web site is being transferred to
a TDWG server to be hosted there (it has been a bit of a technical
hurdle for some of us to get this web site moved, so you may need to
bear with us for a little while).
Generally the technical issues of LSIDs are relatively minor.  The
more obvious issues (such as persistence - i.e. that an LSID will be
resolvable indefinitely, and that community support and technological
aids will always be available) tend to be community/social issues.
What really makes the success of any initiative is the community
support and drive behind the initiative, and the same is true with
whatever technologies we adopt in the TDWG community.  The important
thing therefore is that we start using the GUIDs, linking them up
with other GUIDs/data, distributing them, promoting "authoritative"
GUIDs, and then I really believe any remaining issues will be easily
overcome.
Thanks
Kevin

