[tdwg-tag] LSIDs: web based (HTTP) resolvers and web crawlers

Fri May 1 08:08:29 CEST 2009

Dear Pete,

On 1 May 2009, at 04:37, Peter DeVries wrote:

> This seems to be another example of how the use of LSIDs creates  
> problems and adds
> costs for data providers.

I'm not sure being hit by the Google bot is due to LSIDs as such. I  
think the problems of LSIDs lie more with the overhead of fussing with  
the DNS SRV (in theory trivial, but in practice not), the need for  
software beyond a web server, and the fact they don't resolve by  
themselves in browsers without proxies (although this hasn't hindered  
DOIs becoming widespread).
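
To make the SRV step concrete, here is a minimal Python sketch of the first thing a resolver has to do: split an LSID into its parts and work out which DNS name to query. The "_lsid._tcp" prefix is my reading of the LSID spec, the class and function names are made up for illustration, and the DNS lookup itself (which needs a resolver library) is left out.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    """Parts of an LSID such as urn:lsid:ipni.org:names:783030-1."""
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(lsid: str) -> Lsid:
    """Split an LSID into authority, namespace, object and optional revision."""
    parts = lsid.split(":")
    if len(parts) not in (5, 6):
        raise ValueError("expected urn:lsid:authority:namespace:object[:revision]")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError("not an LSID: " + lsid)
    if any(not p for p in parts[2:5]):
        raise ValueError("empty component in LSID: " + lsid)
    revision = parts[5] if len(parts) == 6 else None
    return Lsid(parts[2], parts[3], parts[4], revision)


def srv_query_name(lsid: Lsid) -> str:
    """DNS name to ask for an SRV record (assumed form: _lsid._tcp.<authority>)."""
    return "_lsid._tcp." + lsid.authority


if __name__ == "__main__":
    parsed = parse_lsid("urn:lsid:ipni.org:names:783030-1")
    print(parsed)
    print(srv_query_name(parsed))
```

Everything after this (the SRV lookup, fetching the WSDL, calling the data and metadata services) is where the extra software beyond a web server comes in.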

>
>
> It would be much more straightforward to adopt the linked data  
> standards and have this data
> be available in a widely supported standard.
>
> Here is one linked data alternative:
>
> http://lod.ipni.org/names/783030-1         <- the entity or  
> concept ... redirects via 303 to either
> http://lod.ipni.org/names/783030-1.html <- human readable page
> http://lod.ipni.org/names/783030-1.rdf    <- rdf data
>
> See
> http://linkeddata.org/guides-and-tutorials
>
> Test with this service
> http://validator.linkeddata.org/vapour

Playing nice with linked data makes sense, but we can do this with  
appropriate proxies. For example, http://validator.linkeddata.org/vapour?vocabUri=http%3A%2F%2Fbioguid.info%2Furn%3Alsid%3Aipni.org%3Anames%3A138762-3&classUri=http%3A%2F%2F&propertyUri=http%3A%2F%2F&instanceUri=http%3A%2F%2F&defaultResponse=dontmind&userAgent=vapour.sourceforge.net

(if link broken in email try http://tinyurl.com/dkl755 )
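
For anyone who wants to run the same check on other LSIDs, here is a rough Python sketch that builds the proxied URI and the Vapour test link shown above. The parameter names are simply copied from that link; the function names are mine, and I'm assuming the proxy just takes the LSID appended to its base URL.

```python
from urllib.parse import urlencode

# Assumed base URLs, taken from the link above.
PROXY_BASE = "http://bioguid.info/"
VAPOUR_BASE = "http://validator.linkeddata.org/vapour"


def proxied_uri(lsid: str, proxy_base: str = PROXY_BASE) -> str:
    """HTTP URI for an LSID served through a proxy (LSID appended to the base)."""
    return proxy_base + lsid


def vapour_url(uri: str) -> str:
    """Link that asks the Vapour validator to test the given URI."""
    params = [
        ("vocabUri", uri),
        ("classUri", "http://"),
        ("propertyUri", "http://"),
        ("instanceUri", "http://"),
        ("defaultResponse", "dontmind"),
        ("userAgent", "vapour.sourceforge.net"),
    ]
    # urlencode percent-encodes ':' and '/' in the values.
    return VAPOUR_BASE + "?" + urlencode(params)


if __name__ == "__main__":
    print(vapour_url(proxied_uri("urn:lsid:ipni.org:names:138762-3")))
```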

Given that LSIDs are in the wild (including the scientific  
literature), we need to support them (that's the bugger with  
"persistent" identifiers, once you release them you're stuck with them).

That said, I'm guessing that anybody starting a new data providing  
service would be well advised to use HTTP URIs with 303 redirects,  
provided that they've got the memo about cool URIs  
(http://www.w3.org/Provider/Style/URI).
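
As a rough illustration of the 303 pattern, here is a Python sketch of the decision a server makes when asked for the entity URI. It uses the lod.ipni.org layout from Pete's example, which is hypothetical, and it only does a crude substring check on the Accept header (no q-values), so treat it as a sketch rather than a proper content negotiator.

```python
from typing import Tuple

# Hypothetical host, borrowed from the example URIs quoted above.
BASE = "http://lod.ipni.org"


def negotiate(path: str, accept: str) -> Tuple[int, str]:
    """Return (status, Location) for a GET on an entity URI.

    Clients that ask for RDF/XML are sent to the .rdf document;
    everything else, including browsers, is sent to the .html page.
    """
    wants_rdf = "application/rdf+xml" in accept.lower()
    suffix = ".rdf" if wants_rdf else ".html"
    return 303, BASE + path + suffix


if __name__ == "__main__":
    print(negotiate("/names/783030-1", "application/rdf+xml"))
    print(negotiate("/names/783030-1", "text/html,application/xhtml+xml"))
```

Note that a client sending a browser-style Accept header ends up at the HTML page and never sees the RDF.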

>
> There are other ways to avoid service outages and data replication.
> Google and others have to deal with this problem everyday.
>
> If you want to keep the branding on the identifier you could also do  
> something like this.
>
> http://lod.ipni.org/ipni-org_names_783030-1         <- the entity or  
> concept, 303 redirect to either
> http://lod.ipni.org/ipni-org_names_783030-1.html  <- human readable  
> page
> http://lod.ipni.org/ipni-org_names_783030-1.rdf    <- rdf data
>
> Couldn't the free and ubiquitous Google cache provide some caching  
> of these normal URIs?

Firstly, is there any linked data in the Google cache? If the Google  
bot is harvesting as a web browser, it will get 303 redirects to HTML  
and not the RDF. I've had a quick look for DBpedia RDF in the cache  
and haven't found any.

Secondly, how would I get the cached copy? If I'm doing large-scale  
harvesting, I'll need programmatic access to the cache, and that's not  
really possible (especially now that Google's SOAP API is deprecated).

Gel jockeys don't expect to have to get GenBank sequences from  
Google's cache because GenBank keeps falling over, so why do we expect  
to have to do this? OK, our situation is different because we have  
distributed data sources, but I'd prefer something like http://www.fak3r.com/2009/04/29/resolving-lsids-wit-url-resolvers-and-couchdb

Regards

Rod

>
> - Pete
>
> On Mon, Apr 27, 2009 at 7:54 AM, Nicola Nicolson <n.nicolson at rbgkew.org.uk> wrote:
> Hi,
>
>
> Further to my last design question re LSID HTTP proxies (thanks for  
> the responses), I wanted to raise the issue of HTTP LSID proxies and  
> crawlers, in particular the crawl delay part of the robots exclusion  
> protocol.
>
>
> I'll outline a situation we had recently:
>
>
> The GBIF portal and ZipCodeZoo site both include IPNI LSIDs in the  
> pages. These are presented in their proxied form using the TDWG LSID  
> resolver (eg http://lsid.tdwg.org/urn:lsid:ipni.org:names:783030-1).  
> Using the TDWG resolver to access the data for an IPNI LSID does not  
> issue any kind of HTTP redirect, instead the web resolver uses the  
> LSID resolution steps to get the data and presents it in its own  
> response (ie returning a HTTP 200 OK response).
>
>
> The problem happens when one of these sites that includes proxied  
> IPNI LSIDs is crawled by a search engine. The proxied links appear  
> to belong to tdwg.org, so whatever crawl delay is agreed between  
> TDWG and the crawler in question is used. The crawler has no  
> knowledge that behind the scenes the TDWG resolver is hitting  
> ipni.org. We (ipni.org) have agreed our own crawl limits with Google  
> and the other major search engines using directives in robots.txt  
> and directly agreed limits with Google (who don't use the robots.txt  
> directly).
>
>
> On a couple of occasions in the past we have had to deny access to  
> the TDWG LSID resolver as it has been responsible for far more  
> traffic than we can support (up to 10 times the crawl limits we have  
> agreed with search engine bots) - this is due to the pages on the GBIF  
> portal and / or ZipCodeZoo being crawled by a search engine, which  
> in turn triggers a high volume of requests from TDWG to IPNI. The  
> crawler itself has no knowledge that it is in effect accessing data  
> held at ipni.org rather than tdwg.org as the HTTP response is HTTP  
> 200.
>
>
> One of Rod's emails recently mentioned that we need a resolver to  
> act like a tinyurl or bit.ly. I have pasted below the HTTP headers  
> for an HTTP request to the TDWG LSID resolver, and to tinyurl /  
> bit.ly. To the end user it looks as though tdwg.org is the true  
> location of the LSID resource, whereas tinyurl and bit.ly  
> both just redirect traffic.
>
>
> I'm just posting this for discussion really - if we are to mandate  
> use of web-based HTTP resolvers/proxies, they should really issue 30*  
> redirects so that established crawl delays between producer and  
> consumer will be used. The alternative would be for the HTTP  
> resolver to read and process the directives in robots.txt, but this  
> would be difficult to implement as it is not in itself a crawler,  
> just a gateway.
>
>
> I'm sure that if proxied forms of LSIDs become more prevalent this  
> problem will become more widespread, so now - with the on-going  
> attempt to define what services a GUID resolver should provide -  
> might be a good time to plan how to fix this.
>
>
> cheers,
> Nicky
>
>
> [nn00kg at kvstage01 ~]$ curl -I http://lsid.tdwg.org/urn:lsid:ipni.org:names:783030-1
> HTTP/1.1 200 OK
> Via: 1.1 KISA01
> Connection: close
> Proxy-Connection: close
> Date: Mon, 27 Apr 2009 11:41:55 GMT
> Content-Type: application/xml
> Server: Apache/2.2.3 (CentOS)
>
>
> [nn00kg at kvstage01 ~]$ curl -I http://tinyurl.com/czkquy
> HTTP/1.1 301 Moved Permanently
> Via: 1.1 KISA01
> Connection: close
> Proxy-Connection: close
> Date: Mon, 27 Apr 2009 12:16:38 GMT
> Location: http://www.ipni.org/ipni/plantNameByVersion.do?id=783030-1&version=1.4&output_format=lsid-metadata&show_history=true
> Content-type: text/html
> Server: TinyURL/1.6
> X-Powered-By: PHP/5.2.9
>
>
> [nn00kg at kvstage01 ~]$ curl -I http://bit.ly/KO1Ko
> HTTP/1.1 301 Moved Permanently
> Via: 1.1 KISA01
> Connection: Keep-Alive
> Proxy-Connection: Keep-Alive
> Content-Length: 287
> Date: Mon, 27 Apr 2009 12:19:48 GMT
> Location: http://www.ipni.org/ipni/plantNameByVersion.do?id=783030-1&version=1.4&output_format=lsid-metadata&show_history=true
> Content-Type: text/html;charset=utf-8
> Server: nginx/0.7.42
> Allow: GET, HEAD, POST
>
>
>
>
> - Nicola Nicolson
> - Science Applications Development,
> - Royal Botanic Gardens, Kew,
> - Richmond, Surrey, TW9 3AB, UK
> - email: n.nicolson at rbgkew.org.uk
> - phone: 020-8332-5766
>
>
> _______________________________________________
> tdwg-tag mailing list
> tdwg-tag at lists.tdwg.org
> http://lists.tdwg.org/mailman/listinfo/tdwg-tag
>
>
>
>
> -- 
> ---------------------------------------------------------------
> Pete DeVries
> Department of Entomology
> University of Wisconsin - Madison
> 445 Russell Laboratories
> 1630 Linden Drive
> Madison, WI 53706
> ------------------------------------------------------------

---------------------------------------------------------
Roderic Page
Professor of Taxonomy
DEEB, FBLS
Graham Kerr Building
University of Glasgow
Glasgow G12 8QQ, UK

Email: r.page at bio.gla.ac.uk
Tel: +44 141 330 4778
Fax: +44 141 330 2792
AIM: rodpage1962 at aim.com
Facebook: http://www.facebook.com/profile.php?id=1112517192
Twitter: http://twitter.com/rdmpage
Blog: http://iphylo.blogspot.com
Home page: http://taxonomy.zoology.gla.ac.uk/rod/rod.html
