Re: [tdwg-tag] SourceForge LSID project websites broken - role for TDWG?

7 Apr 2009

Thanks, Roger.

Certainly SRV records don't help with the actual persistence, although they may help with relocatability.  The real issue underlying your point on persistence is whether or not we are interested in offering our data for integration, e.g. using semantic web technologies.  If so, we need to address the underlying social issues.  I agree that a central caching system would be the right way to go to make this all efficient and stable - like GenBank.  However, our community has always been dubious about such centralisation.  Ultimately, the issue is a cost-benefit question: the costs of integration weighed against the real applications to which the data are put.

Donald

Donald Hobern, Director, Atlas of Living Australia
CSIRO Entomology, GPO Box 1700, Canberra, ACT 2601
Phone: (02) 62464352 Mobile: 0437990208 
Email: Donald.Hobern@csiro.au
Web: http://www.ala.org.au/ 

-----Original Message-----
From: Roger Hyam [mailto:rogerhyam@mac.com] 
Sent: Tuesday, 7 April 2009 6:39 PM
To: Hobern, Donald (Entomology, Black Mountain)
Cc: hlapp@duke.edu; tdwg-tag@lists.tdwg.org
Subject: Re: [tdwg-tag] SourceForge LSID project websites broken - role for TDWG?

Just hosting SRV records or supplying a redirect service does not
actually provide any persistence at all to the data/metadata.
A GUID that persistently resolves to a 500 error rather than a
"not found" is not helpful.

I have said in the past "If persistence is important to you then keep  
your own copy." This is how it has worked for 100s of years in the  
library community. If the reason for having a centralised resolution  
mechanism is to try and support persistence then the centralised  
service should actually cache metadata (not data). I would imagine a  
scalable infrastructure would be quite simple to implement. Data could  
be stored in a Lucene index or Hadoop cluster or something. It would  
only be a very large hash table and only keep the latest version of  
the RDF.
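
A minimal sketch of the kind of cache described above, assuming an in-memory dictionary stands in for the "very large hash table". The class and method names (MetadataCache, store, lookup) are hypothetical and not part of any TDWG or LSID software; a real service would sit on persistent storage such as the Lucene index or Hadoop cluster mentioned above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Optional


@dataclass
class CachedMetadata:
    rdf: str              # latest RDF document seen for this identifier
    fetched_at: datetime  # when the cache stored it


class MetadataCache:
    """Hypothetical central cache: one entry per GUID, latest RDF only."""

    def __init__(self) -> None:
        self._entries: Dict[str, CachedMetadata] = {}

    def store(self, guid: str, rdf: str) -> None:
        # Overwrite any earlier version; no history is kept.
        self._entries[guid] = CachedMetadata(rdf, datetime.now(timezone.utc))

    def lookup(self, guid: str) -> Optional[str]:
        # Return the cached RDF, or None if the GUID was never seen.
        entry = self._entries.get(guid)
        return entry.rdf if entry is not None else None
```

The point of the sketch is only that the cache holds metadata, not data, and that a lookup can still succeed when the original provider is unreachable.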

Without some kind of persistence mechanism, the only advantage of LSIDs
is that they *look* like they are supposed to be persistent.
Unfortunately, because many people are using UUIDs as their object
identifiers, LSIDs actually look like something you wouldn't want to
look at, let alone expose to a user! CoL actually hide them because
they look like this:

urn:lsid:catalogueoflife.org:taxon:d755ba3e-29c1-102b-9a4a-00304854f820:ac2009

No normal person is going to read this or type it in. I am afraid that  
when people started using UUIDs in LSIDs it blew the sociological  
argument for LSIDs out of the water for me. I had carefully designed  
BCI identifiers to be human readable and writable like this:

urn:lsid:biocol.org:col:15670

Which would work as a footnote in a paper, but the only way a UUID can
work in that context is if it is hyperlinked, and to be hyperlinked it
will have to be an HTTP URL underneath, which begs the question of why
we are displaying a non-human-readable string as the human-readable
part of a hyperlink! So we hide the LSID completely and have no
sociological advantage.
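
To make the contrast between the two identifiers concrete, here is a rough sketch that splits an LSID into its colon-separated parts and checks whether the object part is a UUID. It assumes the usual urn:lsid:authority:namespace:object[:revision] layout; the names Lsid, parse_lsid and object_is_uuid are made up for illustration.

```python
import uuid
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(text: str) -> Lsid:
    """Split an LSID into its parts; the revision is optional."""
    parts = text.split(":")
    if (len(parts) not in (5, 6)
            or parts[0].lower() != "urn"
            or parts[1].lower() != "lsid"):
        raise ValueError(f"not an LSID: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return Lsid(parts[2], parts[3], parts[4], revision)


def object_is_uuid(lsid: Lsid) -> bool:
    """True if the object part parses as a UUID (not human-friendly)."""
    try:
        uuid.UUID(lsid.object_id)
    except ValueError:
        return False
    return True


col = parse_lsid(
    "urn:lsid:catalogueoflife.org:taxon:"
    "d755ba3e-29c1-102b-9a4a-00304854f820:ac2009"
)
bci = parse_lsid("urn:lsid:biocol.org:col:15670")
```

Here object_is_uuid(col) is true and object_is_uuid(bci) is false, which is the difference between an identifier that has to be hidden and one that can sit in a footnote.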

I understand why people used UUIDs. There are good technical reasons  
especially in distributed systems.

If LSIDs are a brand then they need a "unique selling proposition" and  
that implies something behind them beyond what can be had for free  
from other brands. You must use LSIDs because.... "We recommend them"  
is not an adequate answer.

Another point that worries me is that all discussion of LSIDs is about  
how to publish them not how to consume them. LSIDs are better than  
HTTP URIs for the client because... (I still can't answer this question)

Currently the reason for me tagging my data with GUIDs has to be  
because it enables users to access and exploit my data in cost  
effective ways they couldn't before whilst crediting me with producing  
it so that I can attract funding to my organisation to curate and  
collect more data.

The reason for clients using GUIDs is that it enables them to mix and
match data in ways they couldn't before so as to produce more, higher
quality scientific publications and so attract funding and kudos.

These are the selling points for GUIDs. How well do LSIDs enable them?

To summarise this overlong post: we have to have a service that adds
*real value* (of an order of magnitude that CrossRef adds to DOIs) to
LSID usage. Without this we are better off sticking with today's
standard web technologies.

Sorry for so many words. I don't have time to write less today.

Roger

On 7 Apr 2009, at 08:17, Donald.Hobern@csiro.au wrote:
...
Thanks, Hilmar.
I agree that using tdwg.org as the authority for the LSID is less
than ideal - hence my recommendation later that we should consider
instead using e.g. csiro.tdwg.org (and I don't think it should be
tdwg.org - perhaps something more neutral like csiro.bio-id.org).  My
concern there was the proliferation of SRV records if we support the
LSID protocol.
You are also correct that the big issue with this is the question of  
ownership.  Quite frankly, if we had believed in 2006 that  
institutions would be prepared to cede responsibility for handling  
their identifiers to a third party, the recommendations from the  
TDWG workshops would probably have been rather different.  Part of  
the reason for adopting LSIDs was because institutions did not seem  
to want to use an identifier which might imply that a third-party  
was responsible for the data.
The PURL form would have some benefits and would be a perfectly  
consistent alternative.  I seem to be the only person who wants to  
avoid an outright capitulation to using HTTP URIs to identify  
objects in our domain.  However, in case anyone cares, here again  
are my reasons why I prefer HTTP-wrapped non-HTTP identifiers over  
plain HTTP URIs:
1. The "urn:lsid:" part of the identifier serves as a clear  
statement of intent which is not present with an HTTP URI.  We could  
mandate that ONLY http://purl.tdwg.org/ URIs count as GUIDs in our  
domain and that e.g. http://www.csiro.au/ URIs cannot do so, but  
that seems an arrogant and arbitrary rule.  However, if we simply  
encourage everyone to use PURL URIs from any domain, what separates  
such a URI from any HTTP URL with no planned persistence?  I see  
this as a short cut to casual assignment of volatile identifiers  
based on web application structures and hence to rapid identifier rot.
2. I still feel intense discomfort (pace the W3C) over adopting  
identifiers prefixed HTTP:// for objects such as type specimens  
which have had an important place in the literature for decades and  
which can expect still to be referenced in 50 years' time.  Even
though the HTTP protocol feels like the air we breathe right now, it  
seems certain to be superseded at some point.  Do we want to use  
identifiers which will seem totally "retro" in the future?  The  
usual objection is that HTTP is certain to outlast the LSID  
protocol.  I agree fully, but the urn: prefix is making a statement  
about naming, not about technology.
If I am alone in these feelings, the suggested PURL route may be  
simpler, but we should consider what can be done to maximise the  
robustness of their use.
Best wishes,
Donald
Donald Hobern, Director, Atlas of Living Australia
CSIRO Entomology, GPO Box 1700, Canberra, ACT 2601
Phone: (02) 62464352 Mobile: 0437990208
Email: Donald.Hobern@csiro.au
Web: http://www.ala.org.au/
-----Original Message-----
From: Hilmar Lapp [mailto:hlapp@duke.edu]
Sent: Tuesday, 7 April 2009 4:54 PM
To: Hobern, Donald (Entomology, Black Mountain)
Cc: tdwg-tag@lists.tdwg.org
Subject: Re: [tdwg-tag] SourceForge LSID project websites broken -  
role for TDWG?
On Apr 7, 2009, at 1:55 AM, Donald.Hobern@csiro.au wrote:
...
Assume further that ANIC has a script on its servers which can
return the RDF data for these specimens, say at
http://www.csiro.au/anic/specimens/<catalogueNumber>.  The registration
process could result in the LSID urn:lsid:tdwg.org:csiro.anic:12345
Wouldn't that say according to your proposed usage guideline that
tdwg.org is whoGeneratedTheData and csiro.anic is
whatCollectionItBelongsTo, when in reality CSIRO generated the data
and ANIC is the collection it belongs to?
I understand why you're suggesting the LSID formatted as you do, and
you might say that the name-mangling isn't too drastic. But don't
data owners have a strong sense of ownership in their data objects and in
their collections? And more importantly, don't you think that a usage
guideline that contradicts itself (or that is bound to be internally
inconsistent) will continue to raise debate and be in the way of
broader adoption?
...
and the HTTP URI http://lsid.tdwg.org/urn:lsid:tdwg.org:csiro.anic:12345
both being mapped through to http://www.csiro.au/anic/specimens/12345.
Wouldn't http://purl.tdwg.org/CSIRO/ANIC/12345 be shorter, do more
justice to the names of whoGeneratedTheData and
whatCollectionItBelongsTo, be easier to implement, and have the same
possibilities to implement caching etc., in fact using standard
software such as mod_proxy for Apache?
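
As a rough illustration of the mapping under discussion, the sketch below resolves the bare LSID, the HTTP-wrapped LSID and the PURL form to the same provider URL. The REGISTRY table, the resolve function and the rule that turns the PURL path CSIRO/ANIC into the namespace csiro.anic are assumptions made for this example only; nothing here describes an existing TDWG service.

```python
from typing import Dict, Optional

# Hypothetical registration table: namespace -> URL template at the provider.
REGISTRY: Dict[str, str] = {
    "csiro.anic": "http://www.csiro.au/anic/specimens/{id}",
}

LSID_PROXY_PREFIX = "http://lsid.tdwg.org/"
PURL_PREFIX = "http://purl.tdwg.org/"


def resolve(identifier: str) -> Optional[str]:
    """Return the provider URL to redirect to, or None if unregistered."""
    if identifier.startswith(LSID_PROXY_PREFIX):
        # HTTP-wrapped LSID: strip the proxy prefix to get the bare LSID.
        identifier = identifier[len(LSID_PROXY_PREFIX):]

    if identifier.startswith(PURL_PREFIX):
        # PURL form (assumed rule): CSIRO/ANIC/12345 -> "csiro.anic", "12345"
        segments = identifier[len(PURL_PREFIX):].split("/")
        if len(segments) != 3:
            return None
        namespace = ".".join(segments[:2]).lower()
        object_id = segments[2]
    else:
        # Bare LSID with tdwg.org as the authority.
        parts = identifier.split(":")
        if len(parts) != 5 or parts[:3] != ["urn", "lsid", "tdwg.org"]:
            return None
        namespace, object_id = parts[3], parts[4]

    template = REGISTRY.get(namespace)
    return template.format(id=object_id) if template else None
```

All three identifier forms end up at http://www.csiro.au/anic/specimens/12345; the difference between them is in what the identifier says about ownership, not in how hard the redirect is to implement.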
Just some thoughts.
-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:- hlapp at duke dot edu :
===========================================================
_______________________________________________
tdwg-tag mailing list
tdwg-tag@lists.tdwg.org
http://lists.tdwg.org/mailman/listinfo/tdwg-tag
