Re: [tdwg-tag] LSIDs: web based (HTTP) resolvers and web crawlers

27 Apr 2009


This confirms my growing prejudice that third party GUID providers are
bad news beyond a simple PURL service. Even then it is better if the
owner of the PURL service is the owner of the data, i.e. only use
someone else's PURL if you don't/can't own your own domain. Hands up who
is incapable of buying a domain name!

Roger


On 27 Apr 2009, at 14:30, Roderic Page wrote:
Dear Nicky,
Ouch!
I'm not sure I fully understand how 30* redirects work with respect  
to web crawlers, but I'm not sure they will help in this case.
If the TDWG LSID resolver is a full blown resolver, then for each  
request from the crawler it will be doing the full LSID resolution  
(three calls, one for authority WSDL, one for service WSDL, one for  
metadata). It may cache the WSDLs, but it will still do at least one  
call to the service (unless it has cached the metadata as well).
Is the solution to add TDWG to your robots.txt file, and have the  
LSID resolver respect the settings in that file? TDWG could also  
implement metadata caching so it wouldn't need to hammer you so much  
(i.e., when a crawler hits TDWG, TDWG would reply with the cached  
metadata).
Perhaps LSID services such as IPNI's could also implement etag  
headers, which would help avoid excessive traffic from TDWG when  
caching (TDWG could regularly cache metadata from IPNI, respecting  
the robots.txt files, and first checking whether the metadata had  
changed using etag and/or last modified headers).
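
The ETag idea above can be sketched roughly as follows. This is only an illustration, not how the TDWG resolver works: `MetadataCache` and the injected `fetch` callable are hypothetical names, and `fetch` stands in for whatever HTTP client the resolver would really use. The cache sends `If-None-Match` with the stored ETag and reuses its copy when the provider answers 304 Not Modified.

```python
from typing import Callable, Dict, Tuple

# A fetcher takes (url, request_headers) and returns
# (status, response_headers, body). It is a hypothetical stand-in
# for a real HTTP client call.
Fetcher = Callable[[str, Dict[str, str]], Tuple[int, Dict[str, str], bytes]]


class MetadataCache:
    """Caches metadata bodies by URL and revalidates them with ETags."""

    def __init__(self, fetch: Fetcher) -> None:
        self._fetch = fetch
        self._entries: Dict[str, Tuple[str, bytes]] = {}  # url -> (etag, body)

    def get(self, url: str) -> bytes:
        headers: Dict[str, str] = {}
        cached = self._entries.get(url)
        if cached is not None:
            # Ask the provider to answer 304 if our copy is still current.
            headers["If-None-Match"] = cached[0]
        status, resp_headers, body = self._fetch(url, headers)
        if status == 304 and cached is not None:
            return cached[1]  # unchanged: reuse the cached body
        etag = resp_headers.get("ETag")
        if status == 200 and etag:
            self._entries[url] = (etag, body)
        return body
```

Note that revalidation still costs the provider one (cheap) request per lookup, so a resolver would probably combine it with a freshness window during which the cached copy is served without asking at all.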
I assume the DOI resolver has similar issues. Its robots.txt file  
looks like this:
Crawl-delay: 5
Request-rate: 1/5
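
For what it's worth, a gateway that wanted to honour directives like these could read them with Python's standard `urllib.robotparser`. The sketch below is an assumption-laden illustration: the `User-agent: *` line is added here (it is not in the excerpt above) because the parser ignores directives that are not inside a user-agent group, and `crawl_limits` is a hypothetical helper name.

```python
from typing import Optional, Tuple
from urllib.robotparser import RobotFileParser

# The two directives quoted above. The "User-agent: *" line is an
# assumption added for this sketch.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Request-rate: 1/5
"""


def crawl_limits(
    robots_txt: str, agent: str = "*"
) -> Tuple[Optional[int], Optional[Tuple[int, int]]]:
    """Return (crawl delay in seconds, (requests, seconds)) for an agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    rate = parser.request_rate(agent)
    return (
        parser.crawl_delay(agent),
        (rate.requests, rate.seconds) if rate else None,
    )


delay, rate = crawl_limits(ROBOTS_TXT)
# delay is 5; rate is (1, 5), i.e. one request every five seconds
```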
Hope this makes sense; my understanding of HTTP headers, redirects  
and robots.txt is not particularly deep.
Regards
Rod
On 27 Apr 2009, at 13:54, Nicola Nicolson wrote:
Hi,
Further to my last design question re LSID HTTP proxies (thanks for  
the responses), I wanted to raise the issue of HTTP LSID proxies  
and crawlers, in particular the crawl delay part of the robots  
exclusion protocol.
I'll outline a situation we had recently:
The GBIF portal and ZipCodeZoo site both include IPNI LSIDs in their  
pages. These are presented in their proxied form using the TDWG  
LSID resolver (e.g. http://lsid.tdwg.org/urn:lsid:ipni.org:names:783030-1).  
Using the TDWG resolver to access the data for an IPNI LSID does  
not issue any kind of HTTP redirect; instead the web resolver uses  
the LSID resolution steps to get the data and presents it in its  
own response (i.e. returning an HTTP 200 OK response).
The problem happens when one of these sites that includes proxied  
IPNI LSIDs is crawled by a search engine. The proxied links appear  
to belong to tdwg.org, so whatever crawl delay is agreed between  
TDWG and the crawler in question is used. The crawler has no  
knowledge that behind the scenes the TDWG resolver is hitting  
ipni.org. We (ipni.org) have agreed our own crawl limits with  
Google and the other major search engines using directives in  
robots.txt and directly agreed limits with Google (who don't use  
the robots.txt directly).
On a couple of occasions in the past we have had to deny access to  
the TDWG LSID resolver as it has been responsible for far more  
traffic than we can support (up to 10 times the crawl limits we  
have agreed with search engine bots). This is due to the pages on the  
GBIF portal and / or zipcodezoo being crawled by a search engine,  
which in turn triggers a high volume of requests from TDWG to IPNI.  
The crawler itself has no knowledge that it is in effect accessing  
data held at ipni.org rather than tdwg.org as the HTTP response is  
HTTP 200.
One of Rod's emails recently mentioned that we need a resolver to  
act like a tinyurl or bit.ly. I have pasted below the HTTP headers  
for an HTTP request to the TDWG LSID resolver, and to tinyurl /  
bit.ly. To the end user it looks as though tdwg.org is the true  
location of the LSID resource, whereas tinyurl and bit.ly  
both just redirect traffic.
I'm just posting this for discussion really - if we are to mandate  
use of web based HTTP resolvers/proxies, they should really issue  
30* redirects so that established crawl delays between producer and  
consumer will be used. The alternative would be for the HTTP  
resolver to read and process the directives in robots.txt, but this  
would be difficult to implement as it is not in itself a crawler,  
just a gateway.
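
A redirecting gateway of the kind described above might look something like this minimal sketch. The names `redirect_for` and `locate` are hypothetical, and `locate` stands in for whatever lookup maps an LSID to the URL where the data provider serves its metadata. The choice of 302 is also an assumption; 301 (as used by tinyurl and bit.ly) or 303 would equally count as a 30* redirect.

```python
from typing import Callable, Dict, Optional, Tuple

LSID_PREFIX = "/urn:lsid:"


def redirect_for(
    path: str, locate: Callable[[str], Optional[str]]
) -> Tuple[int, Dict[str, str]]:
    """Return (status, headers) for a proxied LSID request path.

    `locate` is a hypothetical lookup from an LSID to the provider's URL.
    """
    if not path.lower().startswith(LSID_PREFIX):
        return 400, {}
    lsid = path[1:]  # drop the leading slash to recover the LSID
    target = locate(lsid)
    if target is None:
        return 404, {}
    # The client (or crawler) is sent to the provider's own host, so the
    # provider's crawl limits apply to the follow-up request.
    return 302, {"Location": target}
```

With a redirect the gateway never fetches the metadata itself, so the only traffic the provider sees is the crawler's own follow-up request, made under the provider's robots.txt.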
I'm sure that if proxied forms of LSIDs become more prevalent this  
problem will become more widespread, so now - with the on-going  
attempt to define what services a GUID resolver should provide -  
might be a good time to plan how to fix this.
cheers,
Nicky
[nn00kg@kvstage01 ~]$ curl -I http://lsid.tdwg.org/urn:lsid:ipni.org:names:783030-1
HTTP/1.1 200 OK
Via: 1.1 KISA01
Connection: close
Proxy-Connection: close
Date: Mon, 27 Apr 2009 11:41:55 GMT
Content-Type: application/xml
Server: Apache/2.2.3 (CentOS)

[nn00kg@kvstage01 ~]$ curl -I http://tinyurl.com/czkquy
HTTP/1.1 301 Moved Permanently
Via: 1.1 KISA01
Connection: close
Proxy-Connection: close
Date: Mon, 27 Apr 2009 12:16:38 GMT
Location: http://www.ipni.org/ipni/plantNameByVersion.do?id=783030-1&version=1.4&outpu...
Content-type: text/html
Server: TinyURL/1.6
X-Powered-By: PHP/5.2.9

[nn00kg@kvstage01 ~]$ curl -I http://bit.ly/KO1Ko
HTTP/1.1 301 Moved Permanently
Via: 1.1 KISA01
Connection: Keep-Alive
Proxy-Connection: Keep-Alive
Content-Length: 287
Date: Mon, 27 Apr 2009 12:19:48 GMT
Location: http://www.ipni.org/ipni/plantNameByVersion.do?id=783030-1&version=1.4&outpu...
Content-Type: text/html;charset=utf-8
Server: nginx/0.7.42
Allow: GET, HEAD, POST

- Nicola Nicolson
- Science Applications Development,
- Royal Botanic Gardens, Kew,
- Richmond, Surrey, TW9 3AB, UK
- email: n.nicolson@rbgkew.org.uk
- phone: 020-8332-5766
_______________________________________________
tdwg-tag mailing list
tdwg-tag@lists.tdwg.org
http://lists.tdwg.org/mailman/listinfo/tdwg-tag
---------------------------------------------------------
Roderic Page
Professor of Taxonomy
DEEB, FBLS
Graham Kerr Building
University of Glasgow
Glasgow G12 8QQ, UK
Email: r.page@bio.gla.ac.uk
Tel: +44 141 330 4778
Fax: +44 141 330 2792
AIM: rodpage1962@aim.com
Facebook: http://www.facebook.com/profile.php?id=1112517192
Twitter: http://twitter.com/rdmpage
Blog: http://iphylo.blogspot.com
Home page: http://taxonomy.zoology.gla.ac.uk/rod/rod.html
-------------------------------------------------------------
Roger Hyam
Roger@BiodiversityCollectionsIndex.org
http://www.BiodiversityCollectionsIndex.org
-------------------------------------------------------------
Royal Botanic Garden Edinburgh
20A Inverleith Row, Edinburgh, EH3 5LR, UK
Tel: +44 131 552 7171 ext 3015
Fax: +44 131 248 2901
http://www.rbge.org.uk/
-------------------------------------------------------------