[tdwg-tag] LSIDs: web based (HTTP) resolvers and web crawlers

Fri May 1 04:32:37 CEST 2009

Kevin

I agree, technically it is very easy. Corporate IT groups in largish government institutions such as ours, however, have a deeply ingrained mistrust of anything "new", especially if it came from an online community such as ours.

I know I could do this in 5 minutes while sipping a coffee, but instead I'd have to generate an overview document explaining why the change is necessary, along with a project plan, and a change control document indicating what is to happen and what happens if the change doesn't work. Sheer tedium.

Cheers,
Ben

________________________________

	From: tdwg-tag-bounces at lists.tdwg.org [mailto:tdwg-tag-bounces at lists.tdwg.org] On Behalf Of Kevin Richards
	Sent: Tuesday, 28 April 2009 5:27
	To: Roger Hyam; Roderic Page
	Cc: tdwg-tag at lists.tdwg.org
	Subject: Re: [tdwg-tag] LSIDs: web based (HTTP) resolvers and web crawlers

	It's not owning the domain that seems to be the problem - it's apparently adding the DNS SRV record that is too technical to cope with (although I didn't think this was much harder than buying a domain name).

	And I still think it is going to be essential to provide LSID / GUID hosting services for those who can't set them up themselves.

	Kevin

	From: tdwg-tag-bounces at lists.tdwg.org [mailto:tdwg-tag-bounces at lists.tdwg.org] On Behalf Of Roger Hyam
	Sent: Tuesday, 28 April 2009 2:17 a.m.
	To: Roderic Page
	Cc: tdwg-tag at lists.tdwg.org
	Subject: Re: [tdwg-tag] LSIDs: web based (HTTP) resolvers and web crawlers

	This confirms my growing prejudice that third-party GUID providers are bad news beyond a simple PURL service. Even then it is better if the owner of the PURL service is the owner of the data, i.e. only use someone else's PURL if you don't/can't own your own domain. Hands up who is incapable of buying a domain name!

	Roger

	On 27 Apr 2009, at 14:30, Roderic Page wrote:

	Dear Nicky,

	Ouch!

	I'm not sure I fully understand how 30* redirects work with respect to web crawlers, but I'm not sure they will help in this case. 

	If the TDWG LSID resolver is a full-blown resolver, then for each request from the crawler it will be doing the full LSID resolution (three calls: one for the authority WSDL, one for the service WSDL, one for the metadata). It may cache the WSDLs, but it will still make at least one call to the service (unless it has cached the metadata as well).

	Is the solution to add TDWG to your robots.txt file, and have the LSID resolver respect the settings in that file? TDWG could also implement metadata caching so it wouldn't need to hammer you so much (i.e., when a crawler hits TDWG, TDWG would reply with the cached metadata).

	Perhaps LSID services such as IPNI's could also implement ETag headers, which would help avoid excessive traffic from TDWG when caching (TDWG could regularly cache metadata from IPNI, respecting the robots.txt files, and first checking whether the metadata had changed using ETag and/or Last-Modified headers).
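A minimal sketch of the ETag / Last-Modified idea, assuming a caching resolver that keeps one entry per LSID and a provider that actually sends those headers. The names `CachedMetadata`, `conditional_headers` and `apply_response` are hypothetical and are not taken from any TDWG or IPNI code; the response headers are assumed to arrive as a plain dict with canonical capitalisation.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class CachedMetadata:
    """One cached metadata document plus the validators the provider sent."""
    body: bytes
    etag: Optional[str] = None
    last_modified: Optional[str] = None


def conditional_headers(entry: Optional[CachedMetadata]) -> Dict[str, str]:
    """Build request headers that let the provider answer 304 Not Modified."""
    headers: Dict[str, str] = {}
    if entry is None:
        return headers  # nothing cached yet, so send an unconditional request
    if entry.etag:
        headers["If-None-Match"] = entry.etag
    if entry.last_modified:
        headers["If-Modified-Since"] = entry.last_modified
    return headers


def apply_response(entry: Optional[CachedMetadata], status: int,
                   headers: Dict[str, str], body: bytes) -> CachedMetadata:
    """Return the cache entry to keep after the provider has replied."""
    if status == 304 and entry is not None:
        return entry  # unchanged: keep serving the cached copy
    if status == 200:
        return CachedMetadata(body=body,
                              etag=headers.get("ETag"),
                              last_modified=headers.get("Last-Modified"))
    raise ValueError(f"unexpected status {status}")
```

With something like this, a revalidation that comes back as 304 costs the provider a header exchange rather than a full metadata response.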

	I assume the DOI resolver has similar issues. Its robots.txt file looks like this:

	Crawl-delay: 5

	Request-rate: 1/5
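For illustration, here is a rough sketch of how a gateway could read directives like these using Python's standard `urllib.robotparser`. The `User-agent: *` line is an assumption added so that the parser has a record to attach the directives to (the excerpt above does not show one), and the function name and the user-agent string `hypothetical-lsid-proxy` are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# The two directives quoted above, under an assumed "User-agent: *" record.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Request-rate: 1/5
"""


def min_interval_seconds(robots_txt: str, user_agent: str) -> float:
    """Smallest pause between requests that satisfies both directives."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent) or 0    # None when not specified
    rate = parser.request_rate(user_agent)         # None when not specified
    per_request = rate.seconds / rate.requests if rate and rate.requests else 0
    return max(delay, per_request)


interval = min_interval_seconds(ROBOTS_TXT, "hypothetical-lsid-proxy")
```

Both directives are non-standard extensions, so not every crawler honours them.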

	Hope this makes sense, my understanding of the HTTP headers/redirects/robots.txt is not particularly deep.

	Regards

	Rod

	On 27 Apr 2009, at 13:54, Nicola Nicolson wrote:

	Hi,

	Further to my last design question re LSID HTTP proxies (thanks for the responses), I wanted to raise the issue of HTTP LSID proxies and crawlers, in particular the crawl delay part of the robots exclusion protocol.

	I'll outline a situation we had recently:

	The GBIF portal and ZipCodeZoo site both include IPNI LSIDs in their pages. These are presented in their proxied form using the TDWG LSID resolver (e.g. http://lsid.tdwg.org/urn:lsid:ipni.org:names:783030-1). Using the TDWG resolver to access the data for an IPNI LSID does not issue any kind of HTTP redirect; instead, the web resolver uses the LSID resolution steps to get the data and presents it in its own response (i.e. returning an HTTP 200 OK response).

	The problem happens when one of these sites that includes proxied IPNI LSIDs is crawled by a search engine. The proxied links appear to belong to tdwg.org, so whatever crawl delay is agreed between TDWG and the crawler in question is used. The crawler has no knowledge that behind the scenes the TDWG resolver is hitting ipni.org. We (ipni.org) have agreed our own crawl limits with Google and the other major search engines using directives in robots.txt and directly agreed limits with Google (who don't use the robots.txt directly).

	On a couple of occasions in the past we have had to deny access to the TDWG LSID resolver, as it has been responsible for far more traffic than we can support (up to 10 times the crawl limits we have agreed with search engine bots). This is due to the pages on the GBIF portal and/or ZipCodeZoo being crawled by a search engine, which in turn triggers a high volume of requests from TDWG to IPNI. The crawler itself has no knowledge that it is in effect accessing data held at ipni.org rather than tdwg.org, as the HTTP response is HTTP 200.

	One of Rod's recent emails mentioned that we need a resolver to act like a tinyurl or bit.ly. I have pasted below the HTTP headers for an HTTP request to the TDWG LSID resolver, and to tinyurl / bit.ly. To the end user it looks as though tdwg.org is the true location of the LSID resource, whereas tinyurl and bit.ly both just redirect traffic.

	I'm just posting this for discussion really - if we are to mandate use of web-based HTTP resolvers/proxies, they should really issue 30* redirects so that the crawl delays established between producer and consumer will be used. The alternative would be for the HTTP resolver to read and process the directives in robots.txt, but this would be difficult to implement as it is not in itself a crawler, just a gateway.
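A minimal sketch of the redirecting behaviour proposed here, using only the Python standard library. Everything in it is an assumption for illustration: the lookup table stands in for real LSID authority resolution, the target URL is simply the one that appears in the tinyurl/bit.ly headers in this thread, and the choice of 302 rather than 301 or 303 is arbitrary. The names `METADATA_URLS`, `redirect_for` and `RedirectingResolver` are hypothetical.

```python
from http import HTTPStatus
from http.server import BaseHTTPRequestHandler
from typing import Dict, Tuple

# Hypothetical table; a real resolver would derive the target from the
# LSID authority's own service description rather than hard-code it.
METADATA_URLS = {
    "urn:lsid:ipni.org:names:783030-1":
        "http://www.ipni.org/ipni/plantNameByVersion.do?id=783030-1"
        "&version=1.4&output_format=lsid-metadata&show_history=true",
}


def redirect_for(path: str,
                 table: Dict[str, str] = METADATA_URLS) -> Tuple[int, Dict[str, str]]:
    """Map a request path such as /urn:lsid:... to a status and headers."""
    lsid = path.lstrip("/")
    target = table.get(lsid)
    if target is None:
        return int(HTTPStatus.NOT_FOUND), {}
    # Send the client to the provider instead of fetching on its behalf.
    return int(HTTPStatus.FOUND), {"Location": target}


class RedirectingResolver(BaseHTTPRequestHandler):
    """Answers GET and HEAD with a redirect; it never proxies the data."""

    def do_GET(self):
        status, headers = redirect_for(self.path)
        self.send_response(status)
        for name, value in headers.items():
            self.send_header(name, value)
        self.end_headers()

    do_HEAD = do_GET  # the same headers are enough for `curl -I`
```

The handler could be served with `http.server.HTTPServer(("", 8000), RedirectingResolver).serve_forever()`. A crawler following the `Location` header then makes its request to ipni.org directly, so it would be expected to apply ipni.org's crawl limits rather than tdwg.org's.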

	I'm sure that if proxied forms of LSIDs become more prevalent this problem will become more widespread, so now - with the on-going attempt to define what services a GUID resolver should provide - might be a good time to plan how to fix this.

	cheers,
	Nicky

	[nn00kg at kvstage01 ~]$ curl -I http://lsid.tdwg.org/urn:lsid:ipni.org:names:783030-1
	HTTP/1.1 200 OK
	Via: 1.1 KISA01
	Connection: close
	Proxy-Connection: close
	Date: Mon, 27 Apr 2009 11:41:55 GMT
	Content-Type: application/xml
	Server: Apache/2.2.3 (CentOS)

	[nn00kg at kvstage01 ~]$ curl -I http://tinyurl.com/czkquy
	HTTP/1.1 301 Moved Permanently
	Via: 1.1 KISA01
	Connection: close
	Proxy-Connection: close
	Date: Mon, 27 Apr 2009 12:16:38 GMT
	Location: http://www.ipni.org/ipni/plantNameByVersion.do?id=783030-1&version=1.4&output_format=lsid-metadata&show_history=true
	Content-type: text/html
	Server: TinyURL/1.6
	X-Powered-By: PHP/5.2.9

	[nn00kg at kvstage01 ~]$ curl -I http://bit.ly/KO1Ko
	HTTP/1.1 301 Moved Permanently
	Via: 1.1 KISA01
	Connection: Keep-Alive
	Proxy-Connection: Keep-Alive
	Content-Length: 287
	Date: Mon, 27 Apr 2009 12:19:48 GMT
	Location: http://www.ipni.org/ipni/plantNameByVersion.do?id=783030-1&version=1.4&output_format=lsid-metadata&show_history=true
	Content-Type: text/html;charset=utf-8
	Server: nginx/0.7.42
	Allow: GET, HEAD, POST

	- Nicola Nicolson
	- Science Applications Development,
	- Royal Botanic Gardens, Kew,
	- Richmond, Surrey, TW9 3AB, UK
	- email: n.nicolson at rbgkew.org.uk
	- phone: 020-8332-5766

	_______________________________________________
	tdwg-tag mailing list
	tdwg-tag at lists.tdwg.org
	http://lists.tdwg.org/mailman/listinfo/tdwg-tag

	---------------------------------------------------------
	Roderic Page
	Professor of Taxonomy
	DEEB, FBLS
	Graham Kerr Building
	University of Glasgow
	Glasgow G12 8QQ, UK
	Email: r.page at bio.gla.ac.uk
	Tel: +44 141 330 4778
	Fax: +44 141 330 2792
	AIM: rodpage1962 at aim.com
	Facebook: http://www.facebook.com/profile.php?id=1112517192
	Twitter: http://twitter.com/rdmpage
	Blog: http://iphylo.blogspot.com
	Home page: http://taxonomy.zoology.gla.ac.uk/rod/rod.html

	-------------------------------------------------------------
	Roger Hyam
	Roger at BiodiversityCollectionsIndex.org
	http://www.BiodiversityCollectionsIndex.org
	-------------------------------------------------------------

	Royal Botanic Garden Edinburgh

	20A Inverleith Row, Edinburgh, EH3 5LR, UK

	Tel: +44 131 552 7171 ext 3015

	Fax: +44 131 248 2901

	http://www.rbge.org.uk/

	-------------------------------------------------------------
