Re: [tdwg-guid] First step in implementing LSIDs

6 Jun 2007

Finally I found some time to go through this lively thread.
I hope my post is not already outdated by someone else's ;)

Apart from the "what gets a URI" discussion, some people have
expressed doubts about LSIDs. As I have had a number of
discussions lately with people doubting that LSIDs are good for our
purposes, I would really like to question the TDWG decision to go
with LSIDs and start yet another comparison with plain http paired with
redirection, content negotiation and guidelines for using URLs. I
strongly feel that we should avoid new protocol schemes if we do not
have *very* good reasons. I will use the term URL for now to refer to
any http-based identification scheme, whether it is PURLs, our own
system or something else.

The LSID specification already tells us how to deal with persistent
identifiers. For URLs we would have to reach such an agreement ourselves.
As the "what gets a URI" confusion has shown, those guidelines are
needed in any case, no matter if we take up LSIDs or not. Even LSIDs
can be used with or without versioning, and a lot depends on
agreements in regard to the RDF behind them. So essentially we will
have to come up with our own best practices anyway.

LSIDs and HTTP URLs both rely on DNS to guarantee global uniqueness
and, even more importantly, to be resolved. They both derive their
persistence from the promise of the service provider that the domain
name is kept forever and a server is running. If the domain is lost
in 50 years *both* systems are broken.
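
To make the shared DNS dependency concrete, here is a minimal sketch
that pulls the authority out of an LSID and out of an http URL. It
assumes the usual LSID layout urn:lsid:authority:namespace:object with
an optional revision; the identifiers and the example.org domain are
made up for illustration.

```python
from urllib.parse import urlparse


def lsid_authority(lsid: str) -> str:
    """Return the authority part of an LSID.

    Assumes the layout urn:lsid:authority:namespace:object[:revision],
    where the authority is normally a DNS name.
    """
    parts = lsid.split(":")
    if len(parts) < 5 or parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {lsid!r}")
    return parts[2].lower()


def url_authority(url: str) -> str:
    """Return the host name of an http(s) URL."""
    host = urlparse(url).hostname
    if host is None:
        raise ValueError(f"no host in URL: {url!r}")
    return host


if __name__ == "__main__":
    # Hypothetical identifiers for the same specimen record.
    lsid = "urn:lsid:example.org:specimens:12345"
    url = "http://example.org/specimens/12345"
    # Both identifiers hang off the same domain name, so both break
    # if that domain is lost.
    print(lsid_authority(lsid), url_authority(url))
```

Either way the identifier is only as persistent as the domain name
inside it.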

LSIDs and the semantic web don't play nicely together per se, because
the semweb de facto requires plain http. From what I've read, the
suggestion is to use an LSID proxy that maps LSIDs to URLs. The
problem then is that all RDF statements must use the proxy URL
instead of the real LSID (otherwise you or a reasoner don't know that
the statement about the LSID and the statement about the proxy URL are
about the identical resource), so essentially no one is using the
LSIDs; they are just kept as an additional "persistent" ID. To
overcome this problem and to be able to use either the LSID or the
proxy URL, it is suggested to use an owl:sameAs statement within the
LSID metadata to link the proxy URL with the LSID, so that applications
can understand we are talking about the same thing. This
gets pretty complex already, and I would be surprised if there are
many applications out there that understand this.
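
A rough sketch of what that proxy-plus-owl:sameAs arrangement amounts
to, written as plain string handling rather than with any RDF library.
The proxy base URL and the specimen LSID are hypothetical; only the
owl:sameAs property IRI is the real one.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Hypothetical proxy; a real deployment would use its own base URL.
PROXY_BASE = "http://lsid-proxy.example.org/"


def proxy_url(lsid: str) -> str:
    """Build the http URL under which a proxy would expose an LSID."""
    return PROXY_BASE + lsid


def same_as_triple(lsid: str) -> str:
    """Return an N-Triples line stating that proxy URL and LSID are the same resource."""
    return f"<{proxy_url(lsid)}> <{OWL_SAME_AS}> <{lsid}> ."


def canonical(identifier: str, aliases: dict) -> str:
    """Map an identifier to its canonical form using known sameAs links.

    `aliases` maps each alias to the identifier chosen as canonical.
    A consumer that skips this step treats the LSID and the proxy URL
    as two unrelated resources.
    """
    return aliases.get(identifier, identifier)


if __name__ == "__main__":
    lsid = "urn:lsid:example.org:specimens:12345"
    print(same_as_triple(lsid))
    aliases = {proxy_url(lsid): lsid}
    print(canonical(proxy_url(lsid), aliases) == canonical(lsid, aliases))
```

Every application that wants to merge statements about the two
identifiers has to carry out something like the canonical() step, which
is the extra complexity I am worried about.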

Why not apply the owl:sameAs trick to URLs once we find that http is  
dead (just in case we can't do a global search-and-replace)? We could  
stay with simple URLs now, write simple software fast and get into  
the complex mess at a much later stage when we know we really need to  
- and not already from the start.

A very often raised requirement for the technology is also that it
should last for hundreds of years. I doubt anyone can predict over that
time span. But a very good reason to go with http is that there is
a *lot* of data bound to it, and if the world decides there is
something better than http, there will be many tools to migrate your
data. I feel much safer trusting the entire web community than
eventually having to get out of the LSID trap by ourselves.

Imagine if all the different research communities decided to use their
own resource identification scheme: how bad would data integration
get? We already have to deal with DOIs, but imagine chemists,
geologists, meteorologists and physicists all choosing their own
scheme, just as we are about to issue life science identifiers.
Non-http URIs put up barriers to adoption by other communities, so I am
confident that our LSIDs will be referenced much, much less than URLs.
I can already see all those proxy URLs in GenBank and the like, not the
LSIDs.

And finally yet another link to some good discussion in the W3C  
semweb lifescience list:
http://lists.w3.org/Archives/Public/public-semweb-lifesci/2005Mar/0004.html

--
Markus

On 06.06.2007, at 09:21, Dave Vieglais wrote:
...
This discussion has been very interesting reading, and though I  
agree with Donald's comments, I find myself coming to a different  
conclusion, leaning towards HTTP URIs as a preferable scheme.  The  
reasons are simple - HTTP has been around for a long time, it is  
widely implemented, and mechanisms for implementing robust services  
with that protocol are pretty well sorted out - and really there is  
nothing to stop implementation of the same functionality exhibited  
by LSIDs using HTTP.  As Rod has pointed out, http is widely used  
for referencing entities within a semantic web type of context, and  
it seems foolish to ignore the momentum in those technologies as  
they provide a great deal of the desired functionality for  
interoperability and interchange of our data.  As a result my  
preference is towards the use of http, primarily because my intents  
are to integrate data from a much broader community.  In the end  
though, it doesn't really matter which scheme is adopted by TDWG -  
we will build http resolvers regardless, since they will be  
necessary for reasons of convenience in order to utilize LSIDs in  
all but specific, custom built applications.
However, regardless of the scheme used to implement the GUIDs used  
by this community, it is critical that the identifiers are  
persistent and useful beyond the lives of whatever services are  
constructed to resolve them.  This implies some provenance  
information may need to be captured, and I would argue that the use  
of DNS alone for handling server changes as utilized by LSIDs may  
be insufficient.  The only benefit provided by DNS in this context  
is that it is acting as a single source of authority for directing  
how to locate something (in this case an ip address).  What I  
suspect is really required is a more robust, and richer mechanism  
for discovering and recording provenance.  The ideal would be a  
large, replicated, and distributed data store with a single service  
point which provided people and systems with a one-stop shop for  
discovering provenance for a GUID.  Then if a particular GUID
could not be directly resolved, the global provenance store could
be consulted and the resulting information would provide a pointer (or
perhaps a series of pointers) indicating how the GUID can now be
resolved.
By creating such provenance records and persisting them with as  
much care as the data, it seems that a system with stability beyond  
the vagaries of the internet could reasonably be constructed.
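
The fallback described above could be sketched roughly as follows. This
is only an illustration under stated assumptions: the provenance store
is modelled as an in-memory mapping from a GUID to an ordered list of
later locations, and fetch() stands in for whatever resolver is actually
used. Neither is part of any existing TDWG service.

```python
from typing import Callable, Dict, List, Optional


def resolve(
    guid: str,
    fetch: Callable[[str], Optional[str]],
    provenance: Dict[str, List[str]],
) -> Optional[str]:
    """Try the GUID directly, then each pointer recorded in the provenance store.

    `fetch` returns the record found at a location, or None if the
    location no longer resolves. `provenance` maps a GUID to the
    locations it has moved to, in the order they were recorded.
    """
    for location in [guid, *provenance.get(guid, [])]:
        record = fetch(location)
        if record is not None:
            return record
    return None


if __name__ == "__main__":
    # Hypothetical data: the original service is gone, a later host answers.
    live = {"http://new-host.example.org/specimens/12345": "specimen 12345"}
    store = {
        "urn:lsid:example.org:specimens:12345": [
            "http://old-host.example.org/specimens/12345",
            "http://new-host.example.org/specimens/12345",
        ]
    }
    print(resolve("urn:lsid:example.org:specimens:12345", live.get, store))
```

The point of the sketch is that the GUID itself never changes; only the
provenance record grows as services come and go.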
regards,
  Dave V.
On Jun 6, 2007, at 00:46, Donald Hobern wrote:
...
Yesterday was a vacation here in Denmark - otherwise I'd have  
responded a little earlier, but I'm glad to see all the comments  
from others.  I thoroughly agree with Kevin, Jason, Rich and  
Anna.  No one here believes that any particular solution is going  
to be perfect.  Our biggest need is consensus and the readiness to  
get going with a workable solution.
I do recognise the strength of Rod's arguments.  Indeed, if I were  
building some system for integrating data using semantic web  
technologies, and my only concern was ensuring the efficiency of  
synchronous connections now, I am sure I would adopt HTTP URIs for  
the purpose.  However I remain convinced (as I've stated before)  
that the needs of this community do subtly shift the balance in  
another direction.  We are interested in maintaining long-term  
connections between our objects and have a perspective which goes  
back hundreds of years.  This at least should give us pause over  
whether we want our specimens to be referenced using identifiers  
so firmly tied to the Internet of today.  More importantly, one of  
the key drivers right at the beginning of TDWG's consideration of  
GUIDs was that the community had plenty of experience of URL rot  
and didn't want to rely on everyone maintaining stable virtual  
directories on their web servers to preserve the integrity of  
object identifiers.
Both LSIDs and HTTP URIs could be made to work for us.  Both are  
totally reliant on good practice on the part of data owners.   
Personally I believe our chances of getting the community to  
consider, define and apply such practices are enhanced by the  
identifier technology being something a little more different and  
distinct than just a "special URL".
Thanks,
Donald
------------------------------------------------------------
Donald Hobern (dhobern@gbif.org)
Deputy Director for Informatics
Global Biodiversity Information Facility Secretariat
Universitetsparken 15, DK-2100 Copenhagen, Denmark
Tel: +45-35321483   Mobile: +45-28751483   Fax: +45-35321480
------------------------------------------------------------
On Jun 6, 2007, at 12:51 AM, Kevin Richards wrote:
...
I agree with Jason.  It is not the GUID that is the cause of all  
the problems here - THERE IS NOTHING WRONG WITH LSIDS - we just  
need to move on and start using them in our own context  (or any  
other suitable GUID - LSIDs are only the recommended GUID, NOT
the only permissible GUID).
If it all falls to pieces later on we could just do a search and  
replace to change all our GUIDs to some other scheme (to quote  
Bob, just serious).
I agree, it is the RDF/metadata/ontologies that are the key to  
getting things working well.
Kevin
...
...
...
"Jason Best" <jbest@brit.org> 06/06/07 8:39 AM >>>
Rod,
I've only had a chance to quickly skim the documents you  
reference, but it seems to me that the alternatives to LSIDs  
don't necessarily make the issues with which we are wrestling go  
away. We still need to decide WHAT a URI references - is it the  
metadata, the physical object etc? URIs don't explicitly require
persistence, while LSIDs do, so I see that as a positive for
adopting a standard GUID that is explicit in that regard. I think
the TDWG effort to spec an HTTP proxy for LSIDs makes it clear
that the technical hurdles of implementing an LSID resolver (SRV
records, new protocol, client limitations etc) are a bit
cumbersome, but I don't think the underlying concept is fatally
flawed. In reading these discussions, I'm starting to
believe/understand that RDF may hold the key, regardless of the GUID that
is implemented. Now I have to go read up more on RDF to see if my  
new-found belief has merit! ;)
Jason
________________________________
From: Roderic Page [mailto:r.page@bio.gla.ac.uk]
Sent: Tuesday, June 05, 2007 2:10 PM
To: Chuck Miller
Cc: Bob Morris; Kevin Richards; tdwg-guid@lists.tdwg.org;  
WEITZMAN@si.edu; Jason Best
Subject: Re: [tdwg-guid] First step in implementing LSIDs?[Scanned]
Maybe it's time to bite the bullet and consider the elephant in  
the room -- LSIDs might not be what we want. Markus Döring sent  
some nice references to the list in April, which I've repeated  
below, there is also http://dx.doi.org/10.1109/MIS.2006.62 .
I think the LSID debate is throwing up issues which have been  
addressed elsewhere (e.g., identifiers for physical things versus  
digital records), and some would argue have been solved to at  
least some people's satisfaction.
LSIDs got us thinking about RDF, which is great. But otherwise I  
think they are making things more complicated than they need to  
be. I think this community is running a grave risk of committing  
to a technology that nobody else takes that seriously (hell, even  
the http://lsid.sourceforge.net/ web site is broken).
The references posted by Markus Döring  were:
(1) http://www.dfki.uni-kl.de/dfkidok/publications/TM/07/01/tm-07-01.pdf
"Cool URIs for the Semantic Web" by Leo Sauermann DFKI GmbH,  
Richard Cyganiak Freie Universität Berlin (D2R author), Max  
Völkel FZI Karlsruhe
The authors of this document come from the semantic web community  
and discuss what kind of URIs should be used for RDF resources.
(2) http://www.w3.org/2001/tag/doc/URNsAndRegistries-50
This one here is written by the W3C and addresses the questions  
"When should URNs or URIs with novel URI schemes be used to name  
information resources for the Web?" The answers given are "Rarely  
if ever" and "Probably not". Common arguments in favor of such  
novel naming schemas are examined, and their properties compared  
with those of the existing http: URI scheme.
Regards
Rod
_______________________________________________
tdwg-guid mailing list
tdwg-guid@lists.tdwg.org
http://lists.tdwg.org/mailman/listinfo/tdwg-guid