Thanks, Roger.
Certainly SRV records don't help with the actual persistence, although they may help with relocatability. The real issue underlying your point on persistence is whether or not we are interested in offering our data for integration, e.g. using semantic web technologies. If so, we need to address the underlying social issues. I agree that a central caching system would be the right way to go to make this all efficient and stable - like GenBank. However our community has always been dubious about such centralisation. Ultimately the issue is a cost-benefit question about the costs of integration against the real applications to which the data are put.
Donald
Donald Hobern, Director, Atlas of Living Australia CSIRO Entomology, GPO Box 1700, Canberra, ACT 2601 Phone: (02) 62464352 Mobile: 0437990208 Email: Donald.Hobern@csiro.au Web: http://www.ala.org.au/
-----Original Message-----
From: Roger Hyam [mailto:rogerhyam@mac.com]
Sent: Tuesday, 7 April 2009 6:39 PM
To: Hobern, Donald (Entomology, Black Mountain)
Cc: hlapp@duke.edu; tdwg-tag@lists.tdwg.org
Subject: Re: [tdwg-tag] SourceForge LSID project websites broken - role for TDWG?
Just hosting SRV records or supplying a redirect service does not actually provide any persistence at all to the data/metadata. A GUID that persistently resolves to a 500 error rather than a "not found" is not helpful.
I have said in the past "If persistence is important to you then keep your own copy." This is how it has worked for 100s of years in the library community. If the reason for having a centralised resolution mechanism is to try and support persistence then the centralised service should actually cache metadata (not data). I would imagine a scalable infrastructure would be quite simple to implement. Data could be stored in a Lucene index or Hadoop cluster or something. It would only be a very large hash table and only keep the latest version of the RDF.
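To make the "very large hash table" idea concrete, here is a minimal Python sketch of such a metadata cache. It is only an illustration under stated assumptions: the names MetadataCache, CachedRecord, store and lookup are hypothetical, and the in-memory dict stands in for whatever backing store a real service would use. The one property it shows is that only the latest RDF per identifier is kept.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Optional


@dataclass
class CachedRecord:
    rdf: str               # latest RDF metadata seen for the identifier
    fetched_at: datetime   # when the cache stored this copy


class MetadataCache:
    """Hypothetical latest-version-only metadata cache keyed by GUID."""

    def __init__(self) -> None:
        # Assumption: an in-memory dict stands in for the real index or cluster.
        self._records: Dict[str, CachedRecord] = {}

    def store(self, guid: str, rdf: str) -> None:
        # Overwrite any earlier copy, so only the latest version survives.
        self._records[guid] = CachedRecord(rdf, datetime.now(timezone.utc))

    def lookup(self, guid: str) -> Optional[str]:
        # Return the cached RDF, or None if the identifier was never seen.
        record = self._records.get(guid)
        return record.rdf if record is not None else None

    def __len__(self) -> int:
        return len(self._records)
```

A caller would store() the RDF each time the original source resolves successfully and fall back to lookup() when it does not.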
Without some kind of persistence mechanism the only advantage of LSIDs is that they *look* like they are supposed to be persistent. Unfortunately, because many people are using UUIDs as their object identifiers, LSIDs actually look like something you wouldn't want to look at, let alone expose to a user! CoL actually hide them because they look like this:
urn:lsid:catalogueoflife.org:taxon:d755ba3e-29c1-102b-9a4a-00304854f820:ac2009
No normal person is going to read this or type it in. I am afraid that when people started using UUIDs in LSIDs it blew the sociological argument for LSIDs out of the water for me. I had carefully designed BCI identifiers to be human readable and writable like this:
urn:lsid:biocol.org:col:15670
That would work as a footnote in a paper, but the only way a UUID can work in that context is if it is hyperlinked, and to be hyperlinked it will have to be an HTTP URL underneath. That raises the question of why we are displaying a non-human-readable string as the human-readable part of a hyperlink! So we hide the LSID completely and have no sociological advantage.
I understand why people used UUIDs. There are good technical reasons especially in distributed systems.
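For reference, both identifiers above share the same layout, urn:lsid:authority:namespace:object with an optional trailing revision, and differ only in what is put in the object part. A rough Python sketch that splits them apart follows; parse_lsid and Lsid are hypothetical names, and the validation is deliberately minimal rather than a full implementation of the LSID syntax rules.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(text: str) -> Lsid:
    """Split an LSID into its parts (minimal checks only)."""
    parts = text.split(":")
    # Expect urn:lsid:authority:namespace:object, plus an optional revision.
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    authority, namespace, object_id = parts[2:5]
    if not (authority and namespace and object_id):
        raise ValueError(f"empty LSID component: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return Lsid(authority, namespace, object_id, revision)


col = parse_lsid(
    "urn:lsid:catalogueoflife.org:taxon:d755ba3e-29c1-102b-9a4a-00304854f820:ac2009"
)
bci = parse_lsid("urn:lsid:biocol.org:col:15670")
```

Here col.object_id is the UUID and col.revision is "ac2009", while bci.object_id is simply "15670" with no revision.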
If LSIDs are a brand then they need a "unique selling proposition" and that implies something behind them beyond what can be had for free from other brands. You must use LSIDs because.... "We recommend them" is not an adequate answer.
Another point that worries me is that all discussion of LSIDs is about how to publish them not how to consume them. LSIDs are better than HTTP URIs for the client because... (I still can't answer this question)
Currently the reason for me tagging my data with GUIDs has to be because it enables users to access and exploit my data in cost effective ways they couldn't before whilst crediting me with producing it so that I can attract funding to my organisation to curate and collect more data.
The reason for clients using GUIDs is that it enables them to mix and match data in ways they couldn't before so as to produce more, higher quality scientific publications and so attract funding and kudos.
These are the selling points for GUIDs. How well do LSIDs enable them?
To summarise this overlong post, we have to have a service that adds *real value* (of an order of magnitude that CrossRef adds to DOIs) to LSID usage. Without this we are better off sticking with today's standard web technologies.
Sorry for so many words. I don't have time to write less today.
Roger
On 7 Apr 2009, at 08:17, Donald.Hobern@csiro.au wrote:
Thanks, Hilmar.
I agree that using tdwg.org as the authority for the LSID is less than ideal - hence my recommendation later that we should consider instead using e.g. csiro.tdwg.org (and I don't think it should be tdwg.org - perhaps something more neutral like csiro.bio-id.org). My concern there was the proliferation of SRV records if we support the LSID protocol.
You are also correct that the big issue with this is the question of ownership. Quite frankly, if we had believed in 2006 that institutions would be prepared to cede responsibility for handling their identifiers to a third party, the recommendations from the TDWG workshops would probably have been rather different. Part of the reason for adopting LSIDs was because institutions did not seem to want to use an identifier which might imply that a third-party was responsible for the data.
The PURL form would have some benefits and would be a perfectly consistent alternative. I seem to be the only person who wants to avoid an outright capitulation to using HTTP URIs to identify objects in our domain. However, in case anyone cares, here again are my reasons why I prefer HTTP-wrapped non-HTTP identifiers over plain HTTP URIs:
- The "urn:lsid:" part of the identifier serves as a clear statement of intent which is not present with an HTTP URI. We could mandate that ONLY http://purl.tdwg.org/ URIs count as GUIDs in our domain and that e.g. http://www.csiro.au/ URIs cannot do so, but that seems an arrogant and arbitrary rule. However, if we simply encourage everyone to use PURL URIs from any domain, what separates such a URI from any HTTP URL with no planned persistence? I see this as a short cut to casual assignment of volatile identifiers based on web application structures and hence to rapid identifier rot.
- I still feel intense discomfort (pace the W3C) over adopting identifiers prefixed HTTP:// for objects such as type specimens which have had an important place in the literature for decades and which can expect still to be referenced in 50 years time. Even though the HTTP protocol feels like the air we breathe right now, it seems certain to be superseded at some point. Do we want to use identifiers which will seem totally "retro" in the future? The usual objection is that HTTP is certain to outlast the LSID protocol. I agree fully, but the urn: prefix is making a statement about naming, not about technology.
If I am alone in these feelings, the suggested PURL route may be simpler, but we should consider what can be done to maximise the robustness of their use.
Best wishes,
Donald
-----Original Message-----
From: Hilmar Lapp [mailto:hlapp@duke.edu]
Sent: Tuesday, 7 April 2009 4:54 PM
To: Hobern, Donald (Entomology, Black Mountain)
Cc: tdwg-tag@lists.tdwg.org
Subject: Re: [tdwg-tag] SourceForge LSID project websites broken - role for TDWG?
On Apr 7, 2009, at 1:55 AM, Donald.Hobern@csiro.au wrote:
Assume further that ANIC has a script on its servers which can return the RDF data for these specimens, say at http://www.csiro.au/anic/specimens/<catalogueNumber>. The registration process could result in the LSID urn:lsid:tdwg.org:csiro.anic:12345
Wouldn't that say according to your proposed usage guideline that tdwg.org is whoGeneratedTheData and csiro.anic is whatCollectionItBelongsTo, when in reality CSIRO generated the data and ANIC is the collection it belongs to?
I understand why you're suggesting the LSID formatted as you do, and you might say that the name-mangling isn't too drastic. But don't data owners have a strong sense of ownership in their data objects and in their collections? And more importantly, don't you think that a usage guideline that contradicts itself (or that is bound to be internally inconsistent) will continue to raise debate and be in the way of broader adoption?
and the HTTP URI http://lsid.tdwg.org/urn:lsid:tdwg.org:csiro.anic:12345 both being mapped through to http://www.csiro.au/anic/specimens/12345.
Wouldn't http://purl.tdwg.org/CSIRO/ANIC/12345 be shorter, do more justice to the names of whoGeneratedTheData and whatCollectionItBelongsTo, be easier to implement, and have the same possibilities to implement caching etc., in fact using standard software such as mod_proxy for Apache?
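To compare the two forms side by side, here is a rough Python sketch of how either URL could be mapped through to the same address at the data owner. It is only an illustration: the TARGETS registry and the function names resolve_purl and resolve_lsid_proxy are hypothetical, and a real deployment would more likely be redirect or proxy rules in the web server than application code.

```python
from urllib.parse import urlparse

# Hypothetical registry: (institution, collection) -> URL template at the owner.
TARGETS = {
    ("csiro", "anic"): "http://www.csiro.au/anic/specimens/{id}",
}


def resolve_purl(url: str) -> str:
    """Map http://purl.tdwg.org/CSIRO/ANIC/12345 to the owner's URL."""
    institution, collection, object_id = urlparse(url).path.strip("/").split("/")
    template = TARGETS[(institution.lower(), collection.lower())]
    return template.format(id=object_id)


def resolve_lsid_proxy(url: str) -> str:
    """Map http://lsid.tdwg.org/urn:lsid:tdwg.org:csiro.anic:12345 likewise."""
    lsid = urlparse(url).path.lstrip("/")
    # urn, lsid, authority, namespace, object (any revision part is ignored).
    _urn, _lsid, _authority, namespace, object_id = lsid.split(":")[:5]
    # In the proposed scheme the namespace carries institution.collection.
    institution, collection = namespace.split(".", 1)
    template = TARGETS[(institution.lower(), collection.lower())]
    return template.format(id=object_id)


purl_target = resolve_purl("http://purl.tdwg.org/CSIRO/ANIC/12345")
lsid_target = resolve_lsid_proxy(
    "http://lsid.tdwg.org/urn:lsid:tdwg.org:csiro.anic:12345"
)
```

Both calls end up at the same specimen URL; the difference is only in which part of the incoming identifier carries the institution and collection names.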
Just some thoughts.
-hilmar
=========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at duke dot edu : ===========================================================