[Tdwg-guid] Throttling searches - Web crawlers

Chuck Miller Chuck.Miller at mobot.org
Mon Jun 19 17:27:44 CEST 2006


Sally,
And don't forget the web crawlers.  Google alone can swamp a site once the site's query URLs start turning up as hyperlinks (CGI calls) on other people's websites.  At one point we were getting 90,000 robotic queries a day before we blocked them.  And Google is far from the only one.
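For the well-behaved crawlers, the usual first step is a robots.txt that keeps them out of the query CGI altogether. A minimal sketch (the Disallow path is purely illustrative, and note that the non-standard Crawl-delay directive is honoured by some crawlers but ignored by Google, which does obey Disallow rules):

    User-agent: *
    Disallow: /ipni/plantsearch
    Crawl-delay: 10

Anything that ignores robots.txt has to be blocked or rate-limited at the server itself.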
 
Chuck

________________________________

From: Sally Hinchcliffe [mailto:S.Hinchcliffe at kew.org]
Sent: Mon 6/19/2006 4:23 AM
To: Roderic Page
Cc: tdwg-guid at mailman.nhm.ku.edu
Subject: Re: [Tdwg-guid] Throttling searches [ Scanned for viruses ]



Hi Rod 
Sadly, not everyone is polite, or asks, or leaves gaps between
queries. We handle 10-15k searches a day, which can peak at 20-30k
when someone is actively crawling the site, running against two servers,
neither of which is in the first flush of youth. That's setting aside
the irritation of having someone scrape and serve your data without
acknowledgement (present company excepted, naturally) - data that we
are assembling, at some cost to the organisations which support IPNI,
out of their core resources.

I will obviously be providing a canned, limited download, but some
people want everything. My current plan is to make the full download
available only on signing a data supply agreement, which will include
terms on rates of further querying; we will then use our logs to check
for compliance.
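Checking compliance from the logs can be fairly mechanical; a rough sketch in Python, assuming a standard combined-format access log and a purely illustrative daily threshold:

    # Count requests per client IP in an Apache-style access log and flag
    # anything over an illustrative daily limit (not a real IPNI policy).
    import sys
    from collections import Counter

    DAILY_LIMIT = 1000                      # illustrative threshold only

    counts = Counter()
    with open(sys.argv[1]) as log:
        for line in log:
            ip = line.split(' ', 1)[0]      # client address is the first field
            counts[ip] += 1

    for ip, n in counts.most_common():
        if n > DAILY_LIMIT:
            print(ip, n)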

This may seem like a petty issue - yes, we do want people to use and
to want our data - but on the other hand I have to make sure that the
service is available to everyone, all the time. I also have to
make sure that the people who fund IPNI - the senior management at
Kew, Harvard and Canberra - are happy that their efforts are not
being abused.

Sally 

> I gotta ask -- what is so bad about making life easy for data scrapers  
> (of which I'm one)? Isn't this rather the point -- we WANT to make it  
> easy :-) 
> 
> But, I do realise that providers may run into a problem of being  
> overwhelmed by requests (though, wouldn't that be nice -- people  
> actually want your data). 
> 
> The NCBI throttles by asking people not to hammer the service, and some
> people leave around half a second between requests to avoid being blocked.
> Connotea is thinking of "making the trigger be >10 requests within the
> last 15 seconds; requests arriving faster than that will be given a 503
> response with a Retry-After header", if that makes any sense.
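A minimal sketch of that kind of trigger, assuming a per-client in-memory window and using the Connotea figures quoted above (illustrative only, not Connotea's actual code):

    # More than LIMIT requests from one client within WINDOW seconds gets a
    # 503 with a Retry-After header; older requests fall out of the window.
    import time
    from collections import defaultdict, deque

    WINDOW = 15                     # seconds
    LIMIT = 10                      # requests allowed per window
    history = defaultdict(deque)    # client -> recent request timestamps

    def check(client):
        """Return (allowed, retry_after_seconds) for one incoming request."""
        now = time.time()
        q = history[client]
        while q and now - q[0] > WINDOW:
            q.popleft()             # discard requests older than the window
        if len(q) >= LIMIT:
            return False, int(WINDOW - (now - q[0])) + 1
        q.append(now)
        return True, 0

    # In the web front end: if not allowed, respond "503 Service Unavailable"
    # and send "Retry-After: <retry_after_seconds>".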
> 
> You could also provide a service for data scrapers where they can get  
> an RDF dump of the IPNI names, rather than have to scrape them. 
> 
> Regards 
> 
> Rod 
> 
> 
> 
> 
> On 19 Jun 2006, at 10:02, Sally Hinchcliffe wrote: 
> 
> > It's not an LSID issue per se, but LSIDs will make it harder to slow
> > searches down. For instance, Google restricts use of its spell
> > checker to 1,000 calls a day by means of a key which is passed in with
> > each request. Obviously this can't be done with LSIDs, as then they
> > wouldn't be the same for each user.
> > The other reason why it's relevant to LSIDs is simply that providing
> > a list of all relevant IPNI LSIDs (not necessary for the LSID
> > implementation, but a nice-to-have for caching / lookups by other
> > systems using our LSIDs) also makes life easier for the data scrapers.
> > 
> > Also I thought ... here's a list full of clever people perhaps they 
> > will have some suggestions 
> > 
> > Sally 
> > 
> >> 
> >> Is this an LSID issue? LSIDs essentially provide a binding service
> >> between a name and one or more web services (we default to HTTP GET
> >> bindings). It isn't really up to the LSID authority to administer any
> >> policies regarding the web service but simply to point at it. It is up
> >> to the web service to do things like throttling, authentication and
> >> authorization.
> >> 
> >> Imagine, for example, that the different services had different 
> >> policies. It may be reasonable not to restrict the getMetadata() calls 
> >> but to restrict the getData() calls. 
> >> 
> >> The use of LSIDs does not create any new problems that weren't there 
> >> with web page scraping - or scraping of any other web service. 
> >> 
> >> Just my thoughts... 
> >> 
> >> Roger 
> >> 
> >> 
> >> Ricardo Scachetti Pereira wrote: 
> >>>     Sally, 
> >>> 
> >>>     You raised a really important issue that we had not really
> >>> addressed at the meeting. Thanks for that.
> >>>
> >>>     I would say that we should not constrain the resolution of LSIDs
> >>> if we expect our LSID infrastructure to work. LSIDs will be the basis
> >>> of our architecture, so we had better have good support for them.
> >>> 
> >>>     However, that is certainly a limiting factor. Server efficiency
> >>> will also likely vary quite a lot, depending on underlying system
> >>> optimizations and so on.
> >>> 
> >>>     So I think the solution to this problem is caching LSID
> >>> responses in the server's LSID stack. Basically, after resolving an
> >>> LSID once, your server should be able to resolve it again and again
> >>> really quickly, until something in the metadata related to that id
> >>> changes.
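A rough illustration of that idea, with resolve_metadata() and last_modified() standing in for the real resolver and a cheap "has this record changed?" lookup (neither is part of any actual LSID stack):

    # Cache resolved metadata per LSID and reuse it until the underlying
    # record changes; placeholder functions stand in for the real calls.
    import time

    def last_modified(lsid):
        return 0                        # placeholder, e.g. a timestamp column

    def resolve_metadata(lsid):
        time.sleep(0.1)                 # placeholder for the expensive query
        return "<rdf:RDF>...</rdf:RDF>"

    cache = {}                          # lsid -> (record_timestamp, metadata)

    def get_metadata(lsid):
        stamp = last_modified(lsid)
        hit = cache.get(lsid)
        if hit and hit[0] == stamp:
            return hit[1]               # record unchanged: serve cached response
        metadata = resolve_metadata(lsid)
        cache[lsid] = (stamp, metadata)
        return metadata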
> >>> 
> >>>     I haven't looked at this aspect of the LSID software stack, but 
> >>> maybe others can say something about it. In any case I'll do some 
> >>> research on it and get back to you. 
> >>> 
> >>>     Again, thanks for bringing it up. 
> >>> 
> >>>     Cheers, 
> >>> 
> >>> Ricardo 
> >>> 
> >>> 
> >>> Sally Hinchcliffe wrote: 
> >>> 
> >>>> There are enough discontinuities in IPNI ids that 1,2,3 would
> >>>> quickly run into the sand. I agree it's not a new problem - I just
> >>>> hate to think I'm making life easier for the data scrapers.
> >>>> Sally
> >>>> 
> >>>> 
> >>>> 
> >>>> 
> >>>>> It can be a problem but I'm not sure if there is a simple solution  
> >>>>> ... and how different is the LSID crawler scenario from an  
> >>>>> http://www.ipni.org/ipni/plantsearch?id=1,2,3,4,5 ... 9999999
> >>>>> scenario? 
> >>>>> 
> >>>>> Paul 
> >>>>> 
> >>>>> -----Original Message----- 
> >>>>> From: tdwg-guid-bounces at mailman.nhm.ku.edu 
> >>>>> [mailto:tdwg-guid-bounces at mailman.nhm.ku.edu]On Behalf Of Sally 
> >>>>> Hinchcliffe 
> >>>>> Sent: 15 June 2006 12:08 
> >>>>> To: tdwg-guid at mailman.nhm.ku.edu 
> >>>>> Subject: [Tdwg-guid] Throttling searches [ Scanned for viruses ] 
> >>>>> 
> >>>>> 
> >>>>> Hi all 
> >>>>> another question that has come up here. 
> >>>>> 
> >>>>> As discussed at the meeting, we're thinking of providing a complete
> >>>>> download of all IPNI LSIDs plus a label (name and author, probably),
> >>>>> which will be produced annually.
> >>>>> 
> >>>>> Most people will play nice and just resolve one or two LSIDs as
> >>>>> required, but by providing a complete list, we're making it very
> >>>>> easy for someone to write a crawler that hits every LSID in turn
> >>>>> and basically brings our server to its knees.
> >>>>> 
> >>>>> Anybody know of a good way of enforcing more polite behaviour? We
> >>>>> can make the download available only under a data supply agreement
> >>>>> that includes a clause limiting hit rates, or we could limit by IP
> >>>>> address (but this would ultimately block out services like Rod's
> >>>>> simple resolver). I believe Google's spell checker uses a key which
> >>>>> has to be passed in as part of the query - obviously we can't do
> >>>>> that with LSIDs.
> >>>>> 
> >>>>> Any thoughts? Anyone think this is a problem? 
> >>>>> 
> >>>>> Sally 
> >>>>> *** Sally Hinchcliffe 
> >>>>> *** Computer section, Royal Botanic Gardens, Kew 
> >>>>> *** tel: +44 (0)20 8332 5708 
> >>>>> *** S.Hinchcliffe at rbgkew.org.uk 
> >>>>> 
> >>>>> 
> >>>>> 
> >>>>> 
> >>>> *** Sally Hinchcliffe 
> >>>> *** Computer section, Royal Botanic Gardens, Kew 
> >>>> *** tel: +44 (0)20 8332 5708 
> >>>> *** S.Hinchcliffe at rbgkew.org.uk 
> >>>> 
> >>>> 
> >>>> 
> >>>> 
> >>>> 
> >>> 
> >>> 
> >>> 
> >>> 
> >> 
> >> 
> >> --  
> >> 
> >> ------------------------------------- 
> >>  Roger Hyam 
> >>  Technical Architect 
> >>  Taxonomic Databases Working Group 
> >> ------------------------------------- 
> >>  http://www.tdwg.org
> >>  roger at tdwg.org 
> >>  +44 1578 722782 
> >> ------------------------------------- 
> >> 
> >> 
> > 
> > *** Sally Hinchcliffe 
> > *** Computer section, Royal Botanic Gardens, Kew 
> > *** tel: +44 (0)20 8332 5708 
> > *** S.Hinchcliffe at rbgkew.org.uk 
> > 
> > 
> > 
> > 
> ----------------------------------------
> Professor Roderic D. M. Page 
> Editor, Systematic Biology 
> DEEB, IBLS 
> Graham Kerr Building 
> University of Glasgow 
> Glasgow G12 8QP 
> United Kingdom 
> 
> Phone:    +44 141 330 4778 
> Fax:      +44 141 330 2792 
> email:    r.page at bio.gla.ac.uk 
> web:      http://taxonomy.zoology.gla.ac.uk/rod/rod.html 
> iChat:    aim://rodpage1962 
> reprints: http://taxonomy.zoology.gla.ac.uk/rod/pubs.html 
> 
> Subscribe to Systematic Biology through the Society of Systematic 
> Biologists Website:  http://systematicbiology.org
> Search for taxon names: http://darwin.zoology.gla.ac.uk/~rpage/portal/
> Find out what we know about a species: http://ispecies.org
> Rod's rants on phyloinformatics: http://iphylo.blogspot.com
> 
> 
> 
>               
> 

*** Sally Hinchcliffe 
*** Computer section, Royal Botanic Gardens, Kew 
*** tel: +44 (0)20 8332 5708 
*** S.Hinchcliffe at rbgkew.org.uk 


_______________________________________________ 
TDWG-GUID mailing list 
TDWG-GUID at mailman.nhm.ku.edu 
http://mailman.nhm.ku.edu/mailman/listinfo/tdwg-guid 


