[tdwg-tag] SourceForge LSID project websites broken - role for TDWG?

Tue Apr 7 11:32:36 CEST 2009

Thanks, Roger.

Certainly SRV records don't help with the actual persistence, although they may help with relocatability.  The real issue underlying your point on persistence is whether or not we are interested in offering our data for integration, e.g. using semantic web technologies.  If so, we need to address the underlying social issues.  I agree that a central caching system, like GenBank, would be the right way to make this all efficient and stable.  However, our community has always been dubious about such centralisation.  Ultimately this is a cost-benefit question: the costs of integration weighed against the real applications to which the data are put.

Donald

Donald Hobern, Director, Atlas of Living Australia
CSIRO Entomology, GPO Box 1700, Canberra, ACT 2601
Phone: (02) 62464352 Mobile: 0437990208 
Email: Donald.Hobern at csiro.au
Web: http://www.ala.org.au/ 

-----Original Message-----
From: Roger Hyam [mailto:rogerhyam at mac.com] 
Sent: Tuesday, 7 April 2009 6:39 PM
To: Hobern, Donald (Entomology, Black Mountain)
Cc: hlapp at duke.edu; tdwg-tag at lists.tdwg.org
Subject: Re: [tdwg-tag] SourceForge LSID project websites broken - role for TDWG?

Just hosting SRV records or supplying a redirect service does not  
actually provide any persistence at all to the data/metadata.  
A GUID that persistently resolves to a 500 error rather than a
"not found" is not helpful.

I have said in the past "If persistence is important to you then keep
your own copy." This is how it has worked for hundreds of years in the
library community. If the reason for having a centralised resolution  
mechanism is to try and support persistence then the centralised  
service should actually cache metadata (not data). I would imagine a  
scalable infrastructure would be quite simple to implement. Data could  
be stored in a Lucene index or Hadoop cluster or something. It would  
only be a very large hash table and only keep the latest version of  
the RDF.
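
A minimal sketch of the "very large hash table" idea above, in Python.
It is an illustration only: the class name MetadataCache and the fetch
callable are hypothetical, and a real service would presumably sit on
something like the Lucene or Hadoop back ends mentioned above rather
than an in-memory dict. The point it tries to show is that the cache
keeps just the latest RDF per GUID and serves it when the origin fails.

```python
from typing import Callable, Dict, Optional


class MetadataCache:
    """Keeps only the latest RDF seen for each GUID (a plain hash table)."""

    def __init__(self) -> None:
        self._latest: Dict[str, str] = {}

    def resolve(self, guid: str, fetch: Callable[[str], str]) -> Optional[str]:
        """Return fresh RDF if the origin answers, else the cached copy.

        `fetch` is a hypothetical callable that asks the data owner's
        server for the RDF and raises an exception on failure.
        """
        try:
            rdf = fetch(guid)
        except Exception:
            # Origin is down or broken: fall back to what we last saw.
            return self._latest.get(guid)
        # Overwrite any earlier version; no history is kept.
        self._latest[guid] = rdf
        return rdf
```

With this shape, a GUID whose origin later starts failing still
resolves to the last metadata seen, while a GUID that was never seen
resolves to nothing.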

Without some kind of persistence mechanism the only advantage of LSIDs  
is that they *look* like they are supposed to be persistent.  
Unfortunately, because many people are using UUIDs as their object
identifiers, LSIDs actually look like something you wouldn't want to
look at, let alone expose to a user! CoL actually hide them because
they look like this:

urn:lsid:catalogueoflife.org:taxon:d755ba3e-29c1-102b-9a4a-00304854f820:ac2009

No normal person is going to read this or type it in. I am afraid that  
when people started using UUIDs in LSIDs it blew the sociological  
argument for LSIDs out of the water for me. I had carefully designed  
BCI identifiers to be human readable and writable like this:

urn:lsid:biocol.org:col:15670

This would work as a footnote in a paper, but the only way a UUID can
work in that context is if it is hyperlinked, and to be hyperlinked it
will have to be an HTTP URL underneath. That raises the question of why
we are displaying a non-human-readable string as the human-readable
part of a hyperlink! So we hide the LSID completely and have no
sociological advantage.
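
To make the contrast between the two identifiers above concrete, here
is a small Python sketch that splits an LSID into its parts
(urn:lsid:authority:namespace:object, with an optional revision) and
checks whether the object part is a UUID. The names Lsid, parse_lsid
and object_is_uuid are made up for this example, and the parsing is
deliberately naive rather than a full implementation of the LSID
syntax.

```python
import uuid
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(text: str) -> Lsid:
    """Split urn:lsid:authority:namespace:object[:revision] on colons."""
    parts = text.split(":")
    if (len(parts) not in (5, 6)
            or parts[0].lower() != "urn"
            or parts[1].lower() != "lsid"):
        raise ValueError(f"not an LSID: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return Lsid(parts[2], parts[3], parts[4], revision)


def object_is_uuid(lsid: Lsid) -> bool:
    """True if the object part parses as a UUID (i.e. not human friendly)."""
    try:
        uuid.UUID(lsid.object_id)
    except ValueError:
        return False
    return True


col = parse_lsid(
    "urn:lsid:catalogueoflife.org:taxon:"
    "d755ba3e-29c1-102b-9a4a-00304854f820:ac2009"
)
bci = parse_lsid("urn:lsid:biocol.org:col:15670")
print(col.object_id, object_is_uuid(col))
print(bci.object_id, object_is_uuid(bci))
```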

I understand why people used UUIDs. There are good technical reasons  
especially in distributed systems.

If LSIDs are a brand then they need a "unique selling proposition" and  
that implies something behind them beyond what can be had for free  
from other brands. You must use LSIDs because.... "We recommend them"  
is not an adequate answer.

Another point that worries me is that all discussion of LSIDs is about  
how to publish them not how to consume them. LSIDs are better than  
HTTP URIs for the client because... (I still can't answer this question)

Currently the reason for me tagging my data with GUIDs has to be  
because it enables users to access and exploit my data in
cost-effective ways they couldn't before whilst crediting me with producing
it so that I can attract funding to my organisation to curate and  
collect more data.

The reason for clients using GUIDs is that it enables them to mix and  
match data in ways they couldn't before so as to produce more, higher  
quality scientific publications and so attract funding and kudos.

These are the selling points for GUIDs. How well do LSIDs enable them?

To summarise this overlong post: we have to have a service that adds
*real value* (of the order of magnitude that CrossRef adds to DOIs) to
LSID usage. Without this we are better off sticking with today's
standard web technologies.

Sorry for so many words. I don't have time to write less today.

Roger

On 7 Apr 2009, at 08:17, Donald.Hobern at csiro.au wrote:

> Thanks, Hilmar.
>
> I agree that using tdwg.org as the authority for the LSID is less  
> than ideal - hence my recommendation later that we should consider  
> instead using e.g. csiro.tdwg.org (and I don't think it should be
> tdwg.org - perhaps something more neutral like csiro.bio-id.org).  My
> concern there was the proliferation of SRV records if we support the  
> LSID protocol.
>
> You are also correct that the big issue with this is the question of  
> ownership.  Quite frankly, if we had believed in 2006 that  
> institutions would be prepared to cede responsibility for handling  
> their identifiers to a third party, the recommendations from the  
> TDWG workshops would probably have been rather different.  Part of  
> the reason for adopting LSIDs was because institutions did not seem  
> to want to use an identifier which might imply that a third-party  
> was responsible for the data.
>
> The PURL form would have some benefits and would be a perfectly  
> consistent alternative.  I seem to be the only person who wants to  
> avoid an outright capitulation to using HTTP URIs to identify  
> objects in our domain.  However, in case anyone cares, here again  
> are my reasons why I prefer HTTP-wrapped non-HTTP identifiers over  
> plain HTTP URIs:
>
> 1. The "urn:lsid:" part of the identifier serves as a clear  
> statement of intent which is not present with an HTTP URI.  We could  
> mandate that ONLY http://purl.tdwg.org/ URIs count as GUIDs in our  
> domain and that e.g. http://www.csiro.au/ URIs cannot do so, but  
> that seems an arrogant and arbitrary rule.  However, if we simply  
> encourage everyone to use PURL URIs from any domain, what separates  
> such a URI from any HTTP URL with no planned persistence?  I see  
> this as a short cut to casual assignment of volatile identifiers  
> based on web application structures and hence to rapid identifier rot.
>
> 2. I still feel intense discomfort (pace the W3C) over adopting  
> identifiers prefixed HTTP:// for objects such as type specimens  
> which have had an important place in the literature for decades and  
> which can expect still to be referenced in 50 years' time.  Even
> though the HTTP protocol feels like the air we breathe right now, it  
> seems certain to be superseded at some point.  Do we want to use  
> identifiers which will seem totally "retro" in the future?  The  
> usual objection is that HTTP is certain to outlast the LSID  
> protocol.  I agree fully, but the urn: prefix is making a statement  
> about naming, not about technology.
>
> If I am alone in these feelings, the suggested PURL route may be  
> simpler, but we should consider what can be done to maximise the  
> robustness of their use.
>
> Best wishes,
>
> Donald
>
>
> Donald Hobern, Director, Atlas of Living Australia
> CSIRO Entomology, GPO Box 1700, Canberra, ACT 2601
> Phone: (02) 62464352 Mobile: 0437990208
> Email: Donald.Hobern at csiro.au
> Web: http://www.ala.org.au/
>
>
> -----Original Message-----
> From: Hilmar Lapp [mailto:hlapp at duke.edu]
> Sent: Tuesday, 7 April 2009 4:54 PM
> To: Hobern, Donald (Entomology, Black Mountain)
> Cc: tdwg-tag at lists.tdwg.org
> Subject: Re: [tdwg-tag] SourceForge LSID project websites broken -  
> role for TDWG?
>
>
> On Apr 7, 2009, at 1:55 AM, Donald.Hobern at csiro.au wrote:
>
>> Assume further that ANIC has a script on its servers which can
>> return the RDF data for these specimens, say at
>> http://www.csiro.au/anic/specimens/<catalogueNumber>.  The
>> registration process could result in the LSID
>> urn:lsid:tdwg.org:csiro.anic:12345
>
> Wouldn't that say according to your proposed usage guideline that
> tdwg.org is whoGeneratedTheData and csiro.anic is
> whatCollectionItBelongsTo, when in reality CSIRO generated the data
> and ANIC is the collection it belongs to?
>
> I understand why you're suggesting the LSID formatted as you do, and
> you might say that the name-mangling isn't too drastic. But don't
> data owners have a strong sense of ownership in their data objects and in
> their collections? And more importantly, don't you think that a usage
> guideline that contradicts itself (or that is bound to be internally
> inconsistent) will continue to raise debate and be in the way of
> broader adoption?
>
>> and the HTTP URI http://lsid.tdwg.org/urn:lsid:tdwg.org:csiro.anic:12345
>> both being mapped through to
>> http://www.csiro.au/anic/specimens/12345.
>
>
> Wouldn't http://purl.tdwg.org/CSIRO/ANIC/12345 be shorter, do more
> justice to the names of whoGeneratedTheData and
> whatCollectionItBelongsTo, be easier to implement, and have the same
> possibilities to implement caching etc, in fact using standard
> software such as mod_proxy for Apache?
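
For readers comparing the two proposals in this exchange, the Python
sketch below shows how a central resolver might map the LSID, its
HTTP-wrapped form, and the PURL form to the same institutional URL.
Everything here is an assumption for illustration: the lookup tables
LSID_REGISTRY and PURL_REGISTRY and the function target_url are
hypothetical, and neither tdwg.org host is implied to run such code.

```python
from typing import Optional

TEMPLATE = "http://www.csiro.au/anic/specimens/{object_id}"

# Hypothetical registration tables kept by the central service.
LSID_REGISTRY = {("tdwg.org", "csiro.anic"): TEMPLATE}
PURL_REGISTRY = {("CSIRO", "ANIC"): TEMPLATE}

LSID_PROXY = "http://lsid.tdwg.org/"
PURL_BASE = "http://purl.tdwg.org/"


def target_url(identifier: str) -> Optional[str]:
    """Map any of the three identifier forms to the owner's URL."""
    if identifier.startswith(PURL_BASE):
        # PURL form: <base>/<owner>/<collection>/<object>
        parts = identifier[len(PURL_BASE):].split("/")
        if len(parts) != 3:
            return None
        template = PURL_REGISTRY.get((parts[0], parts[1]))
        return template.format(object_id=parts[2]) if template else None

    if identifier.startswith(LSID_PROXY):
        # HTTP-wrapped LSID: strip the proxy prefix to get the bare LSID.
        identifier = identifier[len(LSID_PROXY):]

    parts = identifier.split(":")
    if len(parts) < 5 or parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        return None
    template = LSID_REGISTRY.get((parts[2], parts[3]))
    return template.format(object_id=parts[4]) if template else None


for form in (
    "urn:lsid:tdwg.org:csiro.anic:12345",
    "http://lsid.tdwg.org/urn:lsid:tdwg.org:csiro.anic:12345",
    "http://purl.tdwg.org/CSIRO/ANIC/12345",
):
    print(form, "->", target_url(form))
```

In this sketch all three forms reduce to one table lookup plus a URL
template, so the difference between them lies in naming and ownership
rather than in how hard the redirect is to implement.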
>
> Just some thoughts.
>
> 	-hilmar
> -- 
> ===========================================================
> : Hilmar Lapp  -:-  Durham, NC  -:- hlapp at duke dot edu :
> ===========================================================
>
>
>
>
> _______________________________________________
> tdwg-tag mailing list
> tdwg-tag at lists.tdwg.org
> http://lists.tdwg.org/mailman/listinfo/tdwg-tag