Re: [tdwg-tag] LSIDs: web based (HTTP) resolvers and web crawlers

1 May 2009

So, can we save TDWG, GBIF, etc. the hassle of meetings and working
groups, and just do this!?

TDWG/GBIF/etc. can then focus on providing any assistance to make this  
happen, and offer tools and services to help, such as:

1. metadata validator to check vocabulary is OK (the BigDig did this  
sort of thing for Darwin Core)

2. service to suggest additional identifiers so that we can link stuff  
together (breaking the silos) (e.g., telling the provider that  
"Edinburgh J. Bot. 66(1): 110 (-113; fig. 2, map). 2009 [Mar 2009]"  
has the identifier doi:10.1017/S0960428609005320 )

3. Service to "augment" provider data (a bit like 2; see http://n2.talis.com/wiki/Augment_Service)

4. monitor service availability, provide feedback and assistance to  
providers that are struggling

5. Linked data-compliant proxy for existing non-HTTP URI GUIDs that we  
will always have to deal with (e.g., DOIs for literature, Handles for  
DSpace repositories, LSIDs that we already have)

6. Caching service (on top of which cool applications can be built to  
sell the approach and answer the "remind me again, why are we doing  
this" question).

7. Other cool stuff...

Regards

Rod

On 1 May 2009, at 09:39, Roger Hyam wrote:
...
I wish I could keep out of this debate but...
The linked data approach is an order of magnitude simpler than LSID and
very easy to layer on top of an existing LSID authority: you already
have an RDF metadata response, so you just need the redirect URL, which
can be implemented in the Apache or IIS config or a very simple script.
It doesn't have to be only Google that caches the metadata; it could
be GBIF/EoL or some other party interested in caching metadata from
biodiversity suppliers. They could even have a submission mechanism.
So the whole architecture would go:
1) work out how to get your data into RDF (tricky bit we should be  
working on as Markus points out - this could even be RDFa in a web  
page - anyone for Dreamweaver templates!!)
2) set up a 303 redirect to the RDF metadata. (very easy even on an  
ISP hosted domain or corporate internet - unlike messing with SRV  
records)
3) tell the world about it (GBIF/EoL can then scrape it and cache it  
if the license permits - and the license is in the data)
This approach is totally modular, distributed, loosely coupled and  
robust. The data supplier doesn't even need to have a search/browse  
function themselves they could just have a submission tool (SiteMap  
or RSS feed) and allow GBIF or whoever to supply those services on  
top.
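As a rough sketch of the "submission tool" idea above, a supplier could publish a sitemap listing its record URIs for an aggregator to harvest. The function name and the example.org URIs below are made up for illustration; only the element names and namespace follow the sitemaps.org format.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(record_uris):
    """Return a sitemap XML string with one <url><loc> entry per record URI."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for uri in record_uris:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = uri
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical record URIs, not real endpoints.
sitemap = build_sitemap([
    "http://example.org/specimen/E002719",
    "http://example.org/specimen/E002720",
])
print(sitemap)
```

An aggregator would then only need the sitemap's location to discover every record the supplier wants harvested.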
We handle the social side of "URLs just break" by having  
recommendations for how URLs are designed. How about this one:
10.682772.info/specimen/E002719
Does that look enough like a DOI to keep people happy? I could  
secure the 682772.info domain for £12.50/year (£125 secures it for  
the next 10 years at the least). This includes free hosting of  
scripts to do my redirection etc. This is a cheeky example but I  
hope it illustrates the point that a well designed string can also  
be a URL.  I don't include the transport protocol just as many  
quotes of DOIs don't include the doi: and all those adverts on the  
bus stops just have nike.com written on them not http://www.nike.com
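To illustrate the point, a string quoted without its protocol can be turned back into a resolvable URL mechanically. This is only a sketch; the helper name and the choice of http as the default scheme are assumptions made here, not part of any recommendation.

```python
def to_url(identifier, default_scheme="http"):
    """Prefix a scheme onto an identifier that was quoted without one."""
    identifier = identifier.strip()
    if "://" in identifier:
        return identifier  # already a full URL, leave it alone
    return f"{default_scheme}://{identifier}"


print(to_url("10.682772.info/specimen/E002719"))
```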
There is plenty of room for innovation around this simple model.  
This is the most important thing. No strict protocols just enough to  
let people add their value. People can develop data hosting and  
other tools and packages just as GBIF do today.
Now it is a long weekend for nearly everyone I guess. I must stop  
thinking about identifiers!
All the best,
Roger
On 1 May 2009, at 08:41, Peter DeVries wrote:
...
Hi Rod,
I am in favor of CouchDB-based distributed solutions. I just don't
see how LSIDs can be justified based on their costs and benefits.
The current LSIDs can still be used, but if any groups can easily
make the transition to linked data, it would be those that have
already successfully implemented LSIDs.
Without the proxy, the providers can work out a caching solution
that works well for them. The TDWG proxy has to cache all LSID
requests, not just those for IPNI. It probably caches less of the
IPNI data than IPNI would.
Also, a lot of people use simpler crawlers that may not know how to
correctly follow LSID proxies.
My .rdf files are cached by Google.
Do a Google Search on:
http://species.geospecies.org/specs/Ochlerotatus_triseriatus.rdf
or
http://species.geospecies.org/specs/Culex_pipiens.rdf
The Google cache is not ideal, but it is an accessible alternative  
version. They may be open to making it work
as a real alternative cache for linked data.
- Pete
On Fri, May 1, 2009 at 1:08 AM, Roderic Page <r.page@bio.gla.ac.uk>  
wrote:
Dear Pete,
On 1 May 2009, at 04:37, Peter DeVries wrote:
...
This seems to be another example of how the use of LSIDs creates
problems and adds costs for data providers.
I'm not sure being hit by the Google bot is due to LSIDs as such. I  
think the problems of LSIDs lie more with the overhead of fussing  
with the DNS SRV (in theory trivial, but in practice not), the need  
for software beyond a web server, and the fact they don't resolve  
by themselves in browsers without proxies (although this hasn't  
hindered DOIs becoming widespread).
...
It would be much more straightforward to adopt the linked data
standards and have this data be available in a widely supported
standard.
Here is one linked data alternative:
http://lod.ipni.org/names/783030-1       <- the entity or concept; redirects via 303 to either
http://lod.ipni.org/names/783030-1.html  <- human-readable page
http://lod.ipni.org/names/783030-1.rdf   <- RDF data
See
http://linkeddata.org/guides-and-tutorials
Test with this service
http://validator.linkeddata.org/vapour
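The redirect pattern above could be sketched roughly as follows. This is a simplified illustration, not how any existing server is implemented: the function name is hypothetical, and it only looks for the RDF/XML media type in the Accept header rather than doing full content negotiation with q-values.

```python
def choose_redirect(entity_uri, accept_header):
    """Return (status, location) for a request to the entity URI.

    Clients that ask for RDF/XML are sent to the .rdf document; everything
    else (browsers, or a missing Accept header) is sent to the .html page.
    """
    accept = (accept_header or "").lower()
    wants_rdf = "application/rdf+xml" in accept
    suffix = ".rdf" if wants_rdf else ".html"
    return 303, entity_uri + suffix


entity = "http://lod.ipni.org/names/783030-1"
print(choose_redirect(entity, "application/rdf+xml"))
print(choose_redirect(entity, "text/html,application/xhtml+xml"))
```

The entity URI itself never returns a document; it only points at one of the two representations.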
Playing nice with linked data makes sense, but we can do this with  
appropriate proxies. For example, http://validator.linkeddata.org/vapour?vocabUri=http%3A%2F%2Fbioguid.info%2F...
(if link broken in email try http://tinyurl.com/dkl755 )
Given that LSIDs are in the wild (including the scientific  
literature), we need to support them (that's the bugger with  
"persistent" identifiers, once you release them you're stuck with  
them).
That said, I'm guessing that anybody starting a new data providing  
service would be well advised to use HTTP URIs with 303 redirects,  
providing that they got the memo about cool URIs (http://www.w3.org/Provider/Style/URI).
...
There are other ways to avoid service outages and data replication.
Google and others have to deal with this problem everyday.
If you want to keep the branding on the identifier you could also
do something like this.
http://lod.ipni.org/ipni-org_names_783030-1       <- the entity or concept; 303 redirect to either
http://lod.ipni.org/ipni-org_names_783030-1.html  <- human-readable page
http://lod.ipni.org/ipni-org_names_783030-1.rdf   <- RDF data
Couldn't the free and ubiquitous Google cache provide some caching
of these normal URIs?
Firstly, is there any linked data in the Google cache? If the  
Google bot is harvesting as a web browser,  it will get 303  
redirects to HTML and not the RDF. I've had a quick look for  
DBPedia RDF in the cache and haven't found any.
Secondly, how would I get the cached copy? If I'm doing large-scale  
harvesting, I'll need programmatic access to the cache, and that's  
not really possible (especially now that Google's SOAP API is  
deprecated).
Gel jockeys don't expect to have to get GenBank sequences from  
Google's cache because GenBank keeps falling over, so why do we  
expect to have to do this? OK, our situation is different because  
we have distributed data sources, but I'd prefer something like http://www.fak3r.com/2009/04/29/resolving-lsids-wit-url-resolvers-and-couchd...
Regards
Rod
...
- Pete
On Mon, Apr 27, 2009 at 7:54 AM, Nicola Nicolson <n.nicolson@rbgkew.org.uk> wrote:
Hi,
Further to my last design question re LSID HTTP proxies (thanks  
for the responses), I wanted to raise the issue of HTTP LSID  
proxies and crawlers, in particular the crawl delay part of the  
robots exclusion protocol.
I'll outline a situation we had recently:
The GBIF portal and ZipCodeZoo site both include IPNI LSIDs in their
pages. These are presented in their proxied form using the TDWG
LSID resolver (e.g. http://lsid.tdwg.org/urn:lsid:ipni.org:names:783030-1).
Using the TDWG resolver to access the data for an IPNI LSID does
not issue any kind of HTTP redirect; instead the web resolver uses
the LSID resolution steps to get the data and presents it in its
own response (i.e. returning an HTTP 200 OK response).
The problem happens when one of these sites that includes proxied  
IPNI LSIDs is crawled by a search engine. The proxied links appear  
to belong to tdwg.org, so whatever crawl delay is agreed between  
TDWG and the crawler in question is used. The crawler has no  
knowledge that behind the scenes the TDWG resolver is hitting  
ipni.org. We (ipni.org) have agreed our own crawl limits with  
Google and the other major search engines using directives in  
robots.txt and directly agreed limits with Google (who don't use  
the robots.txt directly).
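For readers unfamiliar with the directive, here is a small sketch of how a crawl delay in robots.txt is read by a well-behaved client, using Python's standard urllib.robotparser. The robots.txt content, the 10-second value, the bot name and the example.org URL are all invented for illustration and are not IPNI's actual settings.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the values are illustrative only.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Seconds the crawler is asked to wait between requests to this host.
delay = parser.crawl_delay("ExampleBot")
# Whether this bot may fetch a given URL at all.
allowed = parser.can_fetch("ExampleBot", "http://example.org/names/783030-1")
print(delay, allowed)
```

The key point is that the delay is looked up per host: a crawler applies whatever it finds in the robots.txt of the host named in the URL it is fetching.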
On a couple of occasions in the past we have had to deny access to
the TDWG LSID resolver as it has been responsible for far more
traffic than we can support (up to 10 times the crawl limits we
have agreed with search engine bots). This is due to the pages on
the GBIF portal and/or ZipCodeZoo being crawled by a search
engine, which in turn triggers a high volume of requests from TDWG
to IPNI. The crawler itself has no knowledge that it is in effect
accessing data held at ipni.org rather than tdwg.org, as the HTTP
response is HTTP 200.
One of Rod's emails recently mentioned that we need a resolver to
act like a tinyurl or bit.ly. I have pasted below the HTTP headers
for an HTTP request to the TDWG LSID resolver, and to tinyurl /
bit.ly. To the end user it looks as though tdwg.org is the true
location of the LSID resource, whereas tinyurl and bit.ly both
just redirect traffic.
I'm just posting this for discussion really - if we are to mandate
use of web-based HTTP resolvers/proxies, they should really issue
30* redirects so that established crawl delays between producer
and consumer will be used. The alternative would be for the HTTP  
resolver to read and process the directives in robots.txt, but  
this would be difficult to implement as it is not in itself a  
crawler, just a gateway.
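A minimal sketch of the redirecting behaviour suggested above, under stated assumptions: the lookup table and its example.org URL template are hypothetical stand-ins (a real resolver would find the provider's endpoint through the LSID resolution steps), and 302 is just one of the 30* codes that could be used.

```python
# Hypothetical mapping from LSID authority to a URL template. A real
# resolver would discover the endpoint through LSID resolution instead.
AUTHORITY_TEMPLATES = {
    "ipni.org": "http://example.org/{namespace}/{object_id}",
}


def resolve(path):
    """Map a proxied LSID path to (status, headers) without fetching the data."""
    parts = path.lstrip("/").split(":")
    if len(parts) < 5 or [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
        return 400, {}
    authority, namespace, object_id = parts[2], parts[3], parts[4]
    template = AUTHORITY_TEMPLATES.get(authority.lower())
    if template is None:
        return 404, {}
    location = template.format(namespace=namespace, object_id=object_id)
    # A redirect sends the crawler to the provider's own host, so the
    # provider's crawl limits apply rather than the resolver's.
    return 302, {"Location": location}


print(resolve("/urn:lsid:ipni.org:names:783030-1"))
```

Because the resolver never fetches the data itself, it generates no traffic to the provider beyond what the redirected client chooses to send.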
I'm sure that if proxied forms of LSIDs become more prevalent this  
problem will become more widespread, so now - with the on-going  
attempt to define what services a GUID resolver should provide -  
might be a good time to plan how to fix this.
cheers,
Nicky
[nn00kg@kvstage01 ~]$ curl -I http://lsid.tdwg.org/urn:lsid:ipni.org:names:783030-1
HTTP/1.1 200 OK
Via: 1.1 KISA01
Connection: close
Proxy-Connection: close
Date: Mon, 27 Apr 2009 11:41:55 GMT
Content-Type: application/xml
Server: Apache/2.2.3 (CentOS)
[nn00kg@kvstage01 ~]$ curl -I http://tinyurl.com/czkquy
HTTP/1.1 301 Moved Permanently
Via: 1.1 KISA01
Connection: close
Proxy-Connection: close
Date: Mon, 27 Apr 2009 12:16:38 GMT
Location: http://www.ipni.org/ipni/plantNameByVersion.do?id=783030-1&version=1.4&outpu...
Content-type: text/html
Server: TinyURL/1.6
X-Powered-By: PHP/5.2.9
[nn00kg@kvstage01 ~]$ curl -I http://bit.ly/KO1Ko
HTTP/1.1 301 Moved Permanently
Via: 1.1 KISA01
Connection: Keep-Alive
Proxy-Connection: Keep-Alive
Content-Length: 287
Date: Mon, 27 Apr 2009 12:19:48 GMT
Location: http://www.ipni.org/ipni/plantNameByVersion.do?id=783030-1&version=1.4&outpu...
Content-Type: text/html;charset=utf-8
Server: nginx/0.7.42
Allow: GET, HEAD, POST
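The difference between the three responses above can also be checked mechanically. The following is only a sketch: the function name is illustrative, and the sample header text uses a placeholder example.org Location rather than the real (truncated) one.

```python
def classify(raw_headers):
    """Return 'redirect' for a 3xx response carrying a Location header, else 'direct'."""
    lines = raw_headers.strip().splitlines()
    status = int(lines[0].split()[1])  # e.g. "HTTP/1.1 301 Moved Permanently"
    names = {
        line.split(":", 1)[0].strip().lower()
        for line in lines[1:]
        if ":" in line
    }
    if 300 <= status < 400 and "location" in names:
        return "redirect"
    return "direct"


proxied = "HTTP/1.1 200 OK\nContent-Type: application/xml\nServer: Apache/2.2.3 (CentOS)"
shortened = "HTTP/1.1 301 Moved Permanently\nLocation: http://example.org/placeholder"
print(classify(proxied), classify(shortened))
```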
- Nicola Nicolson
- Science Applications Development,
- Royal Botanic Gardens, Kew,
- Richmond, Surrey, TW9 3AB, UK
- email: n.nicolson@rbgkew.org.uk
- phone: 020-8332-5766
_______________________________________________
tdwg-tag mailing list
tdwg-tag@lists.tdwg.org
http://lists.tdwg.org/mailman/listinfo/tdwg-tag
-- 
---------------------------------------------------------------
Pete DeVries
Department of Entomology
University of Wisconsin - Madison
445 Russell Laboratories
1630 Linden Drive
Madison, WI 53706
------------------------------------------------------------
---------------------------------------------------------
Roderic Page
Professor of Taxonomy
DEEB, FBLS
Graham Kerr Building
University of Glasgow
Glasgow G12 8QQ, UK
Email: r.page@bio.gla.ac.uk
Tel: +44 141 330 4778
Fax: +44 141 330 2792
AIM: rodpage1962@aim.com
Facebook: http://www.facebook.com/profile.php?id=1112517192
Twitter: http://twitter.com/rdmpage
Blog: http://iphylo.blogspot.com
Home page: http://taxonomy.zoology.gla.ac.uk/rod/rod.html