<HTML dir=ltr><HEAD><TITLE>Re: [Tdwg-guid] Throttling searches [ Scanned for viruses ]</TITLE>
<META http-equiv=Content-Type content="text/html; charset=unicode">
<META content="MSHTML 6.00.2900.2802" name=GENERATOR></HEAD>
<BODY>
<DIV id=idOWAReplyText74552 dir=ltr>
<DIV dir=ltr><FONT color=#000000 size=2>Sally,</FONT></DIV>
<DIV dir=ltr><FONT color=#000000 size=2>And don't forget the web crawlers. Google alone can swamp a site once the site's query URLs (CGI calls) turn up as hyperlinks on other people's websites. We were getting 90,000 robotic queries a day at one point before we blocked it. And Google is far from the only one.</FONT></DIV>
<DIV dir=ltr><FONT size=2></FONT> </DIV>
<DIV dir=ltr><FONT size=2>Chuck</FONT></DIV></DIV>
<DIV dir=ltr><BR>
<HR tabIndex=-1>
<FONT face=Tahoma size=2><B>From:</B> Sally Hinchcliffe [mailto:S.Hinchcliffe@kew.org]<BR><B>Sent:</B> Mon 6/19/2006 4:23 AM<BR><B>To:</B> Roderic Page<BR><B>Cc:</B> tdwg-guid@mailman.nhm.ku.edu<BR><B>Subject:</B> Re: [Tdwg-guid] Throttling searches [ Scanned for viruses ]<BR></FONT><BR></DIV>
<DIV>
<P><FONT size=2>Hi Rod</FONT> <BR><FONT size=2>Sadly, not everyone is polite, or asks, or leaves gaps between </FONT><BR><FONT size=2>queries. We handle 10-15k searches a day, which can peak at 20-30k </FONT><BR><FONT size=2>when someone is actively crawling the site, all running against two servers, </FONT><BR><FONT size=2>neither of which is in the first flush of youth. That's setting aside </FONT><BR><FONT size=2>the irritation of having someone scrape and serve your data without </FONT><BR><FONT size=2>acknowledgement (present company excepted, naturally) - data that we </FONT><BR><FONT size=2>are assembling at some cost to the organisations which support IPNI </FONT><BR><FONT size=2>out of their core resources.</FONT> </P>
<P><FONT size=2>I will obviously be providing a canned, limited download, but some </FONT><BR><FONT size=2>people want everything. My current plan is to make the download </FONT><BR><FONT size=2>available only on signing a data supply agreement, which will include </FONT><BR><FONT size=2>terms on rates of further querying, and to use our logs to check for </FONT><BR><FONT size=2>compliance.</FONT> </P>
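<P><FONT size=2>As a rough illustration of the log check mentioned above, here is a minimal sketch that counts requests per client per day and lists the clients over an agreed limit. It assumes the server writes Common Log Format lines; the function names and the limit value are hypothetical and not taken from any IPNI system.</FONT> </P>
<PRE>
```python
import re
from collections import Counter

# Assumed Common Log Format prefix: host ident user [dd/Mon/yyyy:hh:mm:ss zone] ...
LOG_PREFIX = re.compile(r'^(\S+) \S+ \S+ \[(\d{2}/\w{3}/\d{4}):')


def daily_counts(lines):
    """Count requests per (client host, day) pair; unparseable lines are skipped."""
    counts = Counter()
    for line in lines:
        match = LOG_PREFIX.match(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


def over_limit(lines, daily_limit):
    """Return sorted (host, day, count) tuples for clients above daily_limit."""
    return sorted(
        (host, day, n)
        for (host, day), n in daily_counts(lines).items()
        if n > daily_limit
    )


if __name__ == "__main__":
    sample = [
        '10.0.0.1 - - [19/Jun/2006:04:23:00 +0100] "GET /ipni/plantsearch?id=1 HTTP/1.0" 200 512',
        '10.0.0.1 - - [19/Jun/2006:04:23:01 +0100] "GET /ipni/plantsearch?id=2 HTTP/1.0" 200 498',
        '10.0.0.2 - - [19/Jun/2006:04:25:10 +0100] "GET / HTTP/1.0" 200 100',
    ]
    # The limit of 1 is chosen only to make the tiny sample trip the check.
    print(over_limit(sample, daily_limit=1))
```
</PRE>
<P><FONT size=2>Counting by host is only a first approximation: a crawler spread over several addresses, or many users behind one proxy, would need a different key, such as one issued with the agreement.</FONT> </P>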
<P><FONT size=2>This may seem like a petty issue - yes we do want people to use and </FONT><BR><FONT size=2>want our data - but on the other hand I have to make sure that the </FONT><BR><FONT size=2>service is available to everyone, all the time. And I also have to </FONT><BR><FONT size=2>make sure that the people who fund IPNI - the senior management at </FONT><BR><FONT size=2>Kew, Harvard and Canberra - are happy that their efforts are not </FONT><BR><FONT size=2>being abused.</FONT> </P>
<P><FONT size=2>Sally</FONT> </P>
<P><FONT size=2>> I gotta ask -- what is so bad about making life easy for data scrapers </FONT><BR><FONT size=2>> (of which I'm one)? Isn't this rather the point -- we WANT to make it </FONT><BR><FONT size=2>> easy :-)</FONT> <BR><FONT size=2>> </FONT><BR><FONT size=2>> But, I do realise that providers may run into a problem of being </FONT><BR><FONT size=2>> overwhelmed by requests (though, wouldn't that be nice -- people </FONT><BR><FONT size=2>> actually want your data).</FONT> <BR><FONT size=2>> </FONT><BR><FONT size=2>> The NCBI throttles by asking people not to hammer the service, and some </FONT><BR><FONT size=2>> people leave around half a sec between requests to avoid being blocked. </FONT><BR><FONT size=2>> Connotea is thinking of "making the trigger be >10 requests within the </FONT><BR><FONT size=2>> last 15 seconds; requests arriving faster than that will be give a 503 </FONT><BR><FONT size=2>> response with a Retry-After header.", if that makes any sense.</FONT> <BR><FONT size=2>> </FONT><BR><FONT size=2>> You could also provide a service for data scrapers where they can get </FONT><BR><FONT size=2>> an RDF dump of the IPNI names, rather than have to scrape them.</FONT> <BR><FONT size=2>> </FONT><BR><FONT size=2>> Regards</FONT> <BR><FONT size=2>> </FONT><BR><FONT size=2>> Rod</FONT> <BR><FONT size=2>> </FONT><BR><FONT size=2>> </FONT><BR><FONT size=2>> </FONT><BR><FONT size=2>> </FONT><BR><FONT size=2>> On 19 Jun 2006, at 10:02, Sally Hinchcliffe wrote:</FONT> <BR><FONT size=2>> </FONT><BR><FONT size=2>> > It's not an LSID issue per se, but LSIDs will make it harder to slow</FONT> <BR><FONT size=2>> > searches down. For instance, Google restricts use of its spell</FONT> <BR><FONT size=2>> > checker to 1000 a day by use of a key which is passed in with each</FONT> <BR><FONT size=2>> > request. 
Obviously this can't be done with LSIDs as then they</FONT> <BR><FONT size=2>> > wouldn't be the same for each user.</FONT> <BR><FONT size=2>> > The other reason why it's relevant to LSIDs is simply that providing</FONT> <BR><FONT size=2>> > a list of all relevant IPNI LSIDs (not necessary to the LSID</FONT> <BR><FONT size=2>> > implementation but a nice-to-have for caching / lookups for other</FONT> <BR><FONT size=2>> > systems using our LSIDs) also makes life easier for the datascrapers</FONT> <BR><FONT size=2>> > to operate</FONT> <BR><FONT size=2>> ></FONT> <BR><FONT size=2>> > Also I thought ... here's a list full of clever people, perhaps they</FONT> <BR><FONT size=2>> > will have some suggestions</FONT> <BR><FONT size=2>> ></FONT> <BR><FONT size=2>> > Sally</FONT> <BR><FONT size=2>> ></FONT> <BR><FONT size=2>> >></FONT> <BR><FONT size=2>> >> Is this an LSID issue? LSIDs essentially provide a binding service </FONT><BR><FONT size=2>> >> between</FONT> <BR><FONT size=2>> >> a name and one or more web services (we default to HTTP GET </FONT><BR><FONT size=2>> >> bindings).</FONT> <BR><FONT size=2>> >> It isn't really up to the LSID authority to administer any policies</FONT> <BR><FONT size=2>> >> regarding the web service but simply to point at it. It is up to the </FONT><BR><FONT size=2>> >> web</FONT> <BR><FONT size=2>> >> service to do things like throttling, authentication and </FONT><BR><FONT size=2>> >> authorization.</FONT> <BR><FONT size=2>> >></FONT> <BR><FONT size=2>> >> Imagine, for example, that the different services had different</FONT> <BR><FONT size=2>> >> policies. 
It may be reasonable not to restrict the getMetadata() calls</FONT> <BR><FONT size=2>> >> but to restrict the getData() calls.</FONT> <BR><FONT size=2>> >></FONT> <BR><FONT size=2>> >> The use of LSIDs does not create any new problems that weren't there</FONT> <BR><FONT size=2>> >> with web page scraping - or scraping of any other web service.</FONT> <BR><FONT size=2>> >></FONT> <BR><FONT size=2>> >> Just my thoughts...</FONT> <BR><FONT size=2>> >></FONT> <BR><FONT size=2>> >> Roger</FONT> <BR><FONT size=2>> >></FONT> <BR><FONT size=2>> >></FONT> <BR><FONT size=2>> >> Ricardo Scachetti Pereira wrote:</FONT> <BR><FONT size=2>> >>> Sally,</FONT> <BR><FONT size=2>> >>></FONT> <BR><FONT size=2>> >>> You raised a really important issue that we had not really </FONT><BR><FONT size=2>> >>> addressed</FONT> <BR><FONT size=2>> >>> at the meeting. Thanks for that.</FONT> <BR><FONT size=2>> >>></FONT> <BR><FONT size=2>> >>> I would say that we should not constrain the resolution of LSIDs </FONT><BR><FONT size=2>> >>> if</FONT> <BR><FONT size=2>> >>> we expect our LSID infrastructure to work. LSIDs will be the basis of</FONT> <BR><FONT size=2>> >>> our architecture so we better have good support for that.</FONT> <BR><FONT size=2>> >>></FONT> <BR><FONT size=2>> >>> However, that is sure a limiting factor. Also server efficiency </FONT><BR><FONT size=2>> >>> will</FONT> <BR><FONT size=2>> >>> likely vary quite a lot, depending on underlying system optimizations</FONT> <BR><FONT size=2>> >>> and all.</FONT> <BR><FONT size=2>> >>></FONT> <BR><FONT size=2>> >>> So I think that the solution for this problem is in caching LSID</FONT> <BR><FONT size=2>> >>> responses on the server LSID stack. 
Basically, after resolving an </FONT><BR><FONT size=2>> >>> LSID</FONT> <BR><FONT size=2>> >>> once, your server should be able to resolve it again and again really</FONT> <BR><FONT size=2>> >>> quickly, until something on the metadata that is related to that id </FONT><BR><FONT size=2>> >>> changes.</FONT> <BR><FONT size=2>> >>></FONT> <BR><FONT size=2>> >>> I haven't looked at this aspect of the LSID software stack, but</FONT> <BR><FONT size=2>> >>> maybe others can say something about it. In any case I'll do some</FONT> <BR><FONT size=2>> >>> research on it and get back to you.</FONT> <BR><FONT size=2>> >>></FONT> <BR><FONT size=2>> >>> Again, thanks for bringing it up.</FONT> <BR><FONT size=2>> >>></FONT> <BR><FONT size=2>> >>> Cheers,</FONT> <BR><FONT size=2>> >>></FONT> <BR><FONT size=2>> >>> Ricardo</FONT> <BR><FONT size=2>> >>></FONT> <BR><FONT size=2>> >>></FONT> <BR><FONT size=2>> >>> Sally Hinchcliffe wrote:</FONT> <BR><FONT size=2>> >>></FONT> <BR><FONT size=2>> >>>> There are enough discontinuities in IPNI ids that 1,2,3 would </FONT><BR><FONT size=2>> >>>> quickly</FONT> <BR><FONT size=2>> >>>> run into the sand. I agree it's not a new problem - I just hate to</FONT> <BR><FONT size=2>> >>>> think I'm making life easier for the data scrapers</FONT> <BR><FONT size=2>> >>>> Sally</FONT> <BR><FONT size=2>> >>>></FONT> <BR><FONT size=2>> >>>></FONT> <BR><FONT size=2>> >>>></FONT> <BR><FONT size=2>> >>>></FONT> <BR><FONT size=2>> >>>>> It can be a problem but I'm not sure if there is a simple solution </FONT><BR><FONT size=2>> >>>>> ... and how different is the LSID crawler scenario from an </FONT><BR><FONT size=2>> >>>>> <A href="http://www.ipni.org/ipni/plantsearch?id=">http://www.ipni.org/ipni/plantsearch?id=</A> 1,2,3,4,5 ... 
9999999 </FONT><BR><FONT size=2>> >>>>> scenario?</FONT> <BR><FONT size=2>> >>>>></FONT> <BR><FONT size=2>> >>>>> Paul</FONT> <BR><FONT size=2>> >>>>></FONT> <BR><FONT size=2>> >>>>> -----Original Message-----</FONT> <BR><FONT size=2>> >>>>> From: tdwg-guid-bounces@mailman.nhm.ku.edu</FONT> <BR><FONT size=2>> >>>>> [<A href="mailto:tdwg-guid-bounces@mailman.nhm.ku.edu">mailto:tdwg-guid-bounces@mailman.nhm.ku.edu</A>]On Behalf Of Sally</FONT> <BR><FONT size=2>> >>>>> Hinchcliffe</FONT> <BR><FONT size=2>> >>>>> Sent: 15 June 2006 12:08</FONT> <BR><FONT size=2>> >>>>> To: tdwg-guid@mailman.nhm.ku.edu</FONT> <BR><FONT size=2>> >>>>> Subject: [Tdwg-guid] Throttling searches [ Scanned for viruses ]</FONT> <BR><FONT size=2>> >>>>></FONT> <BR><FONT size=2>> >>>>></FONT> <BR><FONT size=2>> >>>>> Hi all</FONT> <BR><FONT size=2>> >>>>> another question that has come up here.</FONT> <BR><FONT size=2>> >>>>></FONT> <BR><FONT size=2>> >>>>> As discussed at the meeting, we're thinking of providing a complete</FONT> <BR><FONT size=2>> >>>>> download of all IPNI LSIDs plus a label (name and author, probably)</FONT> <BR><FONT size=2>> >>>>> which will be available as an annually produced download</FONT> <BR><FONT size=2>> >>>>></FONT> <BR><FONT size=2>> >>>>> Most people will play nice and just resolve one or two LSIDs as</FONT> <BR><FONT size=2>> >>>>> required, but by providing a complete list, we're making it very </FONT><BR><FONT size=2>> >>>>> easy</FONT> <BR><FONT size=2>> >>>>> for someone to write a crawler that hits every LSID in turn and</FONT> <BR><FONT size=2>> >>>>> basically brings our server to its knees</FONT> <BR><FONT size=2>> >>>>></FONT> <BR><FONT size=2>> >>>>> Anybody know of a good way of enforcing more polite behaviour? 
We </FONT><BR><FONT size=2>> >>>>> can</FONT> <BR><FONT size=2>> >>>>> make the download only available under a data supply agreement that</FONT> <BR><FONT size=2>> >>>>> includes a clause limiting hit rates, or we could limit by IP </FONT><BR><FONT size=2>> >>>>> address</FONT> <BR><FONT size=2>> >>>>> (but this would ultimately block out services like Rod's simple</FONT> <BR><FONT size=2>> >>>>> resolver). I believe Google's spell checker uses a key which has to</FONT> <BR><FONT size=2>> >>>>> be passed in as part of the query - obviously we can't do that with</FONT> <BR><FONT size=2>> >>>>> LSIDs</FONT> <BR><FONT size=2>> >>>>></FONT> <BR><FONT size=2>> >>>>> Any thoughts? Anyone think this is a problem?</FONT> <BR><FONT size=2>> >>>>></FONT> <BR><FONT size=2>> >>>>> Sally</FONT> <BR><FONT size=2>> >>>>> *** Sally Hinchcliffe</FONT> <BR><FONT size=2>> >>>>> *** Computer section, Royal Botanic Gardens, Kew</FONT> <BR><FONT size=2>> >>>>> *** tel: +44 (0)20 8332 5708</FONT> <BR><FONT size=2>> >>>>> *** S.Hinchcliffe@rbgkew.org.uk</FONT> <BR><FONT size=2>> >>>>></FONT> <BR><FONT size=2>> >>>>></FONT> <BR><FONT size=2>> >>>>> _______________________________________________</FONT> <BR><FONT size=2>> >>>>> TDWG-GUID mailing list</FONT> <BR><FONT size=2>> >>>>> TDWG-GUID@mailman.nhm.ku.edu</FONT> <BR><FONT size=2>> >>>>> <A href="http://mailman.nhm.ku.edu/mailman/listinfo/tdwg-guid">http://mailman.nhm.ku.edu/mailman/listinfo/tdwg-guid</A></FONT> <BR><FONT size=2>> >>>>></FONT> <BR><FONT size=2>> >>>>></FONT> <BR><FONT size=2>> >>>> *** Sally Hinchcliffe</FONT> <BR><FONT size=2>> >>>> *** Computer 
section, Royal Botanic Gardens, Kew</FONT> <BR><FONT size=2>> >>>> *** tel: +44 (0)20 8332 5708</FONT> <BR><FONT size=2>> >>>> *** S.Hinchcliffe@rbgkew.org.uk</FONT> <BR><FONT size=2>> >>>></FONT> <BR><FONT size=2>> >>>></FONT> <BR><FONT size=2>> >>>> _______________________________________________</FONT> <BR><FONT size=2>> >>>> TDWG-GUID mailing list</FONT> <BR><FONT size=2>> >>>> TDWG-GUID@mailman.nhm.ku.edu</FONT> <BR><FONT size=2>> >>>> <A href="http://mailman.nhm.ku.edu/mailman/listinfo/tdwg-guid">http://mailman.nhm.ku.edu/mailman/listinfo/tdwg-guid</A></FONT> <BR><FONT size=2>> >>>></FONT> <BR><FONT size=2>> >>>></FONT> <BR><FONT size=2>> >>>></FONT> <BR><FONT size=2>> >>></FONT> <BR><FONT size=2>> >>></FONT> <BR><FONT size=2>> >>> _______________________________________________</FONT> <BR><FONT size=2>> >>> TDWG-GUID mailing list</FONT> <BR><FONT size=2>> >>> TDWG-GUID@mailman.nhm.ku.edu</FONT> <BR><FONT size=2>> >>> <A href="http://mailman.nhm.ku.edu/mailman/listinfo/tdwg-guid">http://mailman.nhm.ku.edu/mailman/listinfo/tdwg-guid</A></FONT> <BR><FONT size=2>> >>></FONT> <BR><FONT size=2>> >>></FONT> <BR><FONT size=2>> >></FONT> <BR><FONT size=2>> >></FONT> <BR><FONT size=2>> >> -- </FONT><BR><FONT size=2>> >></FONT> <BR><FONT size=2>> >> -------------------------------------</FONT> <BR><FONT size=2>> >> Roger Hyam</FONT> <BR><FONT size=2>> >> Technical Architect</FONT> <BR><FONT size=2>> >> Taxonomic Databases Working Group</FONT> <BR><FONT size=2>> >> -------------------------------------</FONT> <BR><FONT size=2>> >> <A href="http://www.tdwg.org/">http://www.tdwg.org</A></FONT> <BR><FONT size=2>> >> roger@tdwg.org</FONT> <BR><FONT size=2>> >> +44 1578 722782</FONT> <BR><FONT size=2>> >> -------------------------------------</FONT> <BR><FONT size=2>> >></FONT> <BR><FONT size=2>> >></FONT> <BR><FONT size=2>> ></FONT> <BR><FONT size=2>> > *** Sally Hinchcliffe</FONT> <BR><FONT size=2>> > *** Computer section, Royal Botanic Gardens, Kew</FONT> <BR><FONT 
size=2>> > *** tel: +44 (0)20 8332 5708</FONT> <BR><FONT size=2>> > *** S.Hinchcliffe@rbgkew.org.uk</FONT> <BR><FONT size=2>> ></FONT> <BR><FONT size=2>> ></FONT> <BR><FONT size=2>> > _______________________________________________</FONT> <BR><FONT size=2>> > TDWG-GUID mailing list</FONT> <BR><FONT size=2>> > TDWG-GUID@mailman.nhm.ku.edu</FONT> <BR><FONT size=2>> > <A href="http://mailman.nhm.ku.edu/mailman/listinfo/tdwg-guid">http://mailman.nhm.ku.edu/mailman/listinfo/tdwg-guid</A></FONT> <BR><FONT size=2>> ></FONT> <BR><FONT size=2>> ></FONT> <BR><FONT size=2>> ------------------------------------------------------------------------ </FONT><BR><FONT size=2>> ----------------------------------------</FONT> <BR><FONT size=2>> Professor Roderic D. M. Page</FONT> <BR><FONT size=2>> Editor, Systematic Biology</FONT> <BR><FONT size=2>> DEEB, IBLS</FONT> <BR><FONT size=2>> Graham Kerr Building</FONT> <BR><FONT size=2>> University of Glasgow</FONT> <BR><FONT size=2>> Glasgow G12 8QP</FONT> <BR><FONT size=2>> United Kingdom</FONT> <BR><FONT size=2>> </FONT><BR><FONT size=2>> Phone: +44 141 330 4778</FONT> <BR><FONT size=2>> Fax: +44 141 330 2792</FONT> <BR><FONT size=2>> email: r.page@bio.gla.ac.uk</FONT> <BR><FONT size=2>> web: <A href="http://taxonomy.zoology.gla.ac.uk/rod/rod.html">http://taxonomy.zoology.gla.ac.uk/rod/rod.html</A></FONT> <BR><FONT size=2>> iChat: aim://rodpage1962</FONT> <BR><FONT size=2>> reprints: <A href="http://taxonomy.zoology.gla.ac.uk/rod/pubs.html">http://taxonomy.zoology.gla.ac.uk/rod/pubs.html</A></FONT> <BR><FONT size=2>> </FONT><BR><FONT size=2>> Subscribe to Systematic Biology through the Society of Systematic</FONT> <BR><FONT size=2>> Biologists Website: <A href="http://systematicbiology.org/">http://systematicbiology.org</A></FONT> <BR><FONT size=2>> Search for taxon names: <A href="http://darwin.zoology.gla.ac.uk/~rpage/portal/">http://darwin.zoology.gla.ac.uk/~rpage/portal/</A></FONT> <BR><FONT size=2>> Find out what we know about a 
species: <A href="http://ispecies.org/">http://ispecies.org</A></FONT> <BR><FONT size=2>> Rod's rants on phyloinformatics: <A href="http://iphylo.blogspot.com/">http://iphylo.blogspot.com</A></FONT> <BR><FONT size=2>> </FONT></P>
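<P><FONT size=2>The Connotea rule quoted in Rod's message (more than 10 requests within the last 15 seconds earns a 503 response with a Retry-After header) can be sketched as a per-client sliding window. This is only an illustration under stated assumptions, not Connotea's or anyone else's actual code: the class and method names are hypothetical, the client key is assumed to be an IP address or similar, and the caller supplies the clock so the logic is easy to test.</FONT> </P>
<PRE>
```python
import math
from collections import defaultdict, deque


class SlidingWindowThrottle:
    """Allow at most `limit` accepted requests per client in any `window` seconds."""

    def __init__(self, limit=10, window=15.0):
        self.limit = limit
        self.window = window
        # client key -> timestamps of that client's accepted requests, oldest first
        self._hits = defaultdict(deque)

    def check(self, client, now):
        """Return (status, retry_after): (200, None) or (503, whole seconds to wait)."""
        hits = self._hits[client]
        # Forget requests that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            # The oldest accepted request leaves the window at hits[0] + window.
            wait = math.ceil(hits[0] + self.window - now)
            return 503, max(1, wait)
        # Only accepted requests are recorded; rejected ones do not extend the wait.
        hits.append(now)
        return 200, None


if __name__ == "__main__":
    import time

    throttle = SlidingWindowThrottle()
    for _ in range(12):
        status, retry_after = throttle.check("192.0.2.1", time.monotonic())
        print(status, retry_after)
```
</PRE>
<P><FONT size=2>A web service would send the second value as the Retry-After header of the 503 response. Because the state is keyed per client, a polite resolver that leaves gaps between requests is not affected by a crawler coming from a different address.</FONT> </P>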
<P><FONT size=2>*** Sally Hinchcliffe</FONT> <BR><FONT size=2>*** Computer section, Royal Botanic Gardens, Kew</FONT> <BR><FONT size=2>*** tel: +44 (0)20 8332 5708</FONT> <BR><FONT size=2>*** S.Hinchcliffe@rbgkew.org.uk</FONT> </P><BR>
<P><FONT size=2>_______________________________________________</FONT> <BR><FONT size=2>TDWG-GUID mailing list</FONT> <BR><FONT size=2>TDWG-GUID@mailman.nhm.ku.edu</FONT> <BR><FONT size=2><A href="http://mailman.nhm.ku.edu/mailman/listinfo/tdwg-guid">http://mailman.nhm.ku.edu/mailman/listinfo/tdwg-guid</A></FONT> </P></DIV></BODY></HTML>