[Tdwg-guid] Throttling searches [ Scanned for viruses ]

Tue Jun 20 10:06:36 CEST 2006

That's actually quite easy to deal with - just truncate the response 
at however many records. What's harder to do is where you've got lots 
of little queries, each one innocuous in itself, but coming at you as 
though out of a fire hose. 
Google and the other search engines (which all seem to use 
Inktomi search) do have services whereby you can request that they 
slow down the rate at which they crawl your data. But identifying and 
contacting each crawler individually is inefficient ...
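
For the fire-hose case, one common approach (not something anyone in this thread has built) is a token bucket per client: each client may make a short burst of calls, after which requests are refused until the bucket refills. The sketch below is only illustrative. The class names, the rate and capacity figures, and the choice of what identifies a client (a registration token, or an IP address with the NAT caveats raised further down the thread) are all assumptions.

```python
import time


class TokenBucket:
    """Allow short bursts but cap the sustained request rate."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec      # tokens added per second
        self.capacity = capacity      # maximum burst size
        self.clock = clock            # injectable so the logic can be tested
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self):
        """Return True if one more request may be served right now."""
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PerClientThrottle:
    """Keep one bucket per client key (hypothetical: token or IP address)."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self._buckets = {}

    def allow(self, client_key):
        if client_key not in self._buckets:
            self._buckets[client_key] = TokenBucket(
                self.rate, self.capacity, self.clock)
        return self._buckets[client_key].allow()
```

A refused request could be answered with an error or simply delayed; which is better depends on how much load a waiting connection puts on the server.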

Thanks for all your suggestions. We'll try to build something into 
our service and will report back when we've done so.

Sally

> This is probably a dumb question and exposes my ignorance, but what if the originating query is actually "Get all LSIDs where Family = Orchidaceae".  That seems the more likely scenario to me rather than get one LSID.  And that's the one that needs a throttle.  
>  
> Chuck  
> 
> ________________________________
> 
> From: Sally Hinchcliffe [mailto:S.Hinchcliffe at kew.org]
> Sent: Mon 6/19/2006 7:01 AM
> To: roger at tdwg.org
> Cc: tdwg-guid at mailman.nhm.ku.edu
> Subject: Re: [Tdwg-guid] Throttling searches [ Scanned for viruses ]
> 
> 
> 
> Hi Roger 
> Thanks for this ... I _think_ I understand it but Nicky is on leave 
> this week so I won't know if I do or not till after she returns 
> 
> The system doesn't have to be completely villain proof, just slow 
> down most of the villains so everyone else can get a look in 
> Sally 
> 
> > 
> > You don't! The LSID resolves to the binding to the getMetadata() method 
> > - which is a plain old fashioned URL. At this point the LSID authority 
> > has done its duty and we are just on a plain HTTP GET call so you can do 
> > whatever you can do with any regular HTTP GET. You could stipulate 
> > another header field or (more simply) give priority service for those 
> > who append a valid user id to the URL (&user_id=12345) 
> > 
> > So there is no throttle on resolving the LSID to the getMetadata binding 
> > (which is cheap) but there is a throttle on the actual call to get the 
> > metadata method. Really you need to do this because bad people may be 
> > able to tell from the URL how to scrape the source and bypass the LSID 
> > resolver after the first call anyhow. This is especially true if the URL 
> > contains the IPNI record ID which is likely. 
> > 
> > Here is an example using Rod's tester. 
> > 
> > http://linnaeus.zoology.gla.ac.uk/~rpage/lsid/tester/?q=urn:lsid:ubio.org:namebank:11815 
> > 
> > The getMetadata() method for this LSID: 
> > 
> >  urn:lsid:ubio.org:namebank:11815 
> > 
> > Is bound to this URL: 
> > 
> > http://names.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank:11815 
> > 
> > So ubio would just have to give preferential service to calls like this: 
> > 
> > http://names.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank:11815&user_id=rogerhyam1392918790 
> > 
> > If rogerhyam had paid his membership fees this year. 
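
A minimal sketch of the check described above, assuming the metadata endpoint can see the full request URL. The function name, the two service levels and the in-memory set of paid-up members are hypothetical; a real service would presumably look the user up in its registration database.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical registry of members whose fees are paid this year.
PAID_UP_USERS = {"rogerhyam1392918790"}


def service_level(url):
    """Return 'priority' if the URL carries a valid user_id, else 'standard'."""
    query = parse_qs(urlsplit(url).query)
    user_ids = query.get("user_id", [])
    if user_ids and user_ids[0] in PAID_UP_USERS:
        return "priority"
    return "standard"


base = ("http://names.ubio.org/authority/metadata.php"
        "?lsid=urn:lsid:ubio.org:namebank:11815")
print(service_level(base))
print(service_level(base + "&user_id=rogerhyam1392918790"))
```

The LSID itself is untouched, so it stays the same for every user; only the HTTP GET binding carries the extra parameter.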
> > 
> > Does this make sense? 
> > 
> > Roger 
> > p.s. You could do this on the web pages as well with a clever little 
> > thing to write dynamic tokens into the links so that it doesn't degrade 
> > the regular browsing experience and only stops scrapers - but that is 
> > beyond my remit at the moment ;) 
> > 
> > p.p.s. You could wrap this in https if you were paranoid about people 
> > stealing tokens - but this is highly unlikely I believe. 
> > 
> > Sally Hinchcliffe wrote: 
> > > How can we pass a token with an LSID? 
> > > 
> > > 
> > >   
> > >> I think the only way to throttle in these situations is to have some 
> > >> notion of who the client is and the only way to do that is to have some 
> > >> kind of token exchange over HTTP saying who they are. Basically you have 
> > >> to have some kind of client registration system or you can never 
> > >> distinguish between a call from a new client and a repeat call. The use 
> > >> of IP address is a pain because so many people are now behind some kind 
> > >> of NAT gateway. 
> > >> 
> > >> How about this for a plan: 
> > >> 
> > >> You could give a degraded service to people who don't pass a token (a 5 
> > >> second delay perhaps) and offer a quicker service to registered users 
> > >> who pass a token (but then perhaps limit the number of calls they make). 
> > >> This would mean you could offer a universal service even to those with 
> > >> naive client software but a better service to those who play nicely. You 
> > >> could also get better stats on who is using the service. 
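
The plan above could be sketched roughly as follows. The 5 second delay comes from the message; the daily quota figure, the class and method names, and the decision to refuse (rather than slow down) a registered user who is over quota are all assumptions made for illustration. Resetting the counters each day is left out.

```python
from collections import Counter

ANONYMOUS_DELAY_SECONDS = 5.0   # "a 5 second delay perhaps"
DEFAULT_DAILY_QUOTA = 1000      # assumed figure, not from the thread


class TieredPolicy:
    """Slow lane for callers without a token, fast lane with a quota for the rest."""

    def __init__(self, registered_tokens, daily_quota=DEFAULT_DAILY_QUOTA):
        self.registered = set(registered_tokens)
        self.daily_quota = daily_quota
        self.calls_today = Counter()   # also gives usage stats per token

    def decide(self, token=None):
        """Return (allowed, delay_seconds) for a single request."""
        if token not in self.registered:
            # Unknown or missing token: still served, but after a delay.
            return True, ANONYMOUS_DELAY_SECONDS
        if self.calls_today[token] >= self.daily_quota:
            # Registered but over quota: refused in this sketch.
            return False, 0.0
        self.calls_today[token] += 1
        return True, 0.0
```

The caller would sleep for the returned delay before answering, so naive clients still get a response while registered ones are served immediately.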
> > >> 
> > >> So there are ways that this could be done. I expect people will come up 
> > >> with a host of different ways. It is outside LSIDs though. 
> > >> 
> > >> Roger 
> > >> 
> > >> Sally Hinchcliffe wrote: 
> > >>     
> > >>> It's not an LSID issue per se, but LSIDs will make it harder to slow 
> > >>> searches down. For instance, Google restricts use of its spell 
> > >>> checker to 1000 a day by use of a key which is passed in with each 
> > >>> request. Obviously this can't be done with LSIDs as then they 
> > >>> wouldn't be the same for each user. 
> > >>> The other reason why it's relevant to LSIDs is simply that providing 
> > >>> a list of all relevant IPNI LSIDs (not necessary to the LSID 
> > >>> implementation but a nice to have for caching / lookups for other 
> > >>> systems using our LSIDs) also makes life easier for the datascrapers 
> > >>> to operate 
> > >>> 
> > >>> Also I thought ... here's a list full of clever people perhaps they 
> > >>> will have some suggestions 
> > >>> 
> > >>> Sally 
> > >>> 
> > >>>   
> > >>>       
> > >>>> Is this an LSID issue? LSIDs essentially provide a binding service between 
> > >>>> a name and one or more web services (we default to HTTP GET bindings). 
> > >>>> It isn't really up to the LSID authority to administer any policies 
> > >>>> regarding the web service but simply to point at it. It is up to the web 
> > >>>> service to do things like throttling, authentication and authorization. 
> > >>>> 
> > >>>> Imagine, for example, that the different services had different 
> > >>>> policies. It may be reasonable not to restrict the getMetadata() calls 
> > >>>> but to restrict the getData() calls. 
> > >>>> 
> > >>>> The use of LSIDs does not create any new problems that weren't there 
> > >>>> with web page scraping - or scraping of any other web service. 
> > >>>> 
> > >>>> Just my thoughts... 
> > >>>> 
> > >>>> Roger 
> > >>>> 
> > >>>> 
> > >>>> Ricardo Scachetti Pereira wrote: 
> > >>>>     
> > >>>>         
> > >>>>>     Sally, 
> > >>>>> 
> > >>>>>     You raised a really important issue that we had not really addressed 
> > >>>>> at the meeting. Thanks for that. 
> > >>>>> 
> > >>>>>     I would say that we should not constrain the resolution of LSIDs if 
> > >>>>> we expect our LSID infrastructure to work. LSIDs will be the basis of 
> > >>>>> our architecture so we better have good support for that. 
> > >>>>> 
> > >>>>>     However, that is surely a limiting factor. Also server efficiency will 
> > >>>>> likely vary quite a lot, depending on underlying system optimizations 
> > >>>>> and all. 
> > >>>>> 
> > >>>>>     So I think that the solution for this problem is in caching LSID 
> > >>>>> responses on the server LSID stack. Basically, after resolving an LSID 
> > >>>>> once, your server should be able to resolve it again and again really 
> > >>>>> quickly, until something on the metadata that is related to that id changes. 
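
The caching idea above might look something like this. It is a sketch only and does not describe how any existing LSID software stack works: the class name, the `fetch` callback standing in for the expensive database lookup, and the explicit `invalidate` call made when a record changes are all assumptions.

```python
class MetadataCache:
    """Remember metadata per LSID until the underlying record changes."""

    def __init__(self, fetch):
        self._fetch = fetch    # hypothetical slow lookup: lsid -> metadata
        self._store = {}

    def get(self, lsid):
        # Only the first request for an LSID reaches the slow lookup.
        if lsid not in self._store:
            self._store[lsid] = self._fetch(lsid)
        return self._store[lsid]

    def invalidate(self, lsid):
        # Call this when the metadata related to the id changes.
        self._store.pop(lsid, None)
```

Caching makes repeat lookups cheap but does nothing about a crawler that requests every LSID exactly once, so it complements throttling rather than replacing it.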
> > >>>>> 
> > >>>>>     I haven't looked at this aspect of the LSID software stack, but 
> > >>>>> maybe others can say something about it. In any case I'll do some 
> > >>>>> research on it and get back to you. 
> > >>>>> 
> > >>>>>     Again, thanks for bringing it up. 
> > >>>>> 
> > >>>>>     Cheers, 
> > >>>>> 
> > >>>>> Ricardo 
> > >>>>> 
> > >>>>> 
> > >>>>> Sally Hinchcliffe wrote: 
> > >>>>>   
> > >>>>>       
> > >>>>>           
> > >>>>>> There are enough discontinuities in IPNI ids that 1,2,3 would quickly 
> > >>>>>> run into the sand. I agree it's not a new problem - I just hate to 
> > >>>>>> think I'm making life easier for the data scrapers 
> > >>>>>> Sally 
> > >>>>>> 
> > >>>>>> 
> > >>>>>>   
> > >>>>>>     
> > >>>>>>         
> > >>>>>>             
> > >>>>>>> It can be a problem but I'm not sure if there is a simple solution ... 
> > >>>>>>> and how different is the LSID crawler scenario from an 
> > >>>>>>> http://www.ipni.org/ipni/plantsearch?id= 1,2,3,4,5 ... 9999999 scenario?
> > >>>>>>> 
> > >>>>>>> Paul 
> > >>>>>>> 
> > >>>>>>> -----Original Message----- 
> > >>>>>>> From: tdwg-guid-bounces at mailman.nhm.ku.edu 
> > >>>>>>> [mailto:tdwg-guid-bounces at mailman.nhm.ku.edu]On Behalf Of Sally 
> > >>>>>>> Hinchcliffe 
> > >>>>>>> Sent: 15 June 2006 12:08 
> > >>>>>>> To: tdwg-guid at mailman.nhm.ku.edu 
> > >>>>>>> Subject: [Tdwg-guid] Throttling searches [ Scanned for viruses ] 
> > >>>>>>> 
> > >>>>>>> 
> > >>>>>>> Hi all 
> > >>>>>>> another question that has come up here. 
> > >>>>>>> 
> > >>>>>>> As discussed at the meeting, we're thinking of providing a complete 
> > >>>>>>> download of all IPNI LSIDs plus a label (name and author, probably) 
> > >>>>>>> which will be available as an annually produced download 
> > >>>>>>> 
> > >>>>>>> Most people will play nice and just resolve one or two LSIDs as 
> > >>>>>>> required, but by providing a complete list, we're making it very easy 
> > >>>>>>> for someone to write a crawler that hits every LSID in turn and 
> > >>>>>>> basically brings our server to its knees 
> > >>>>>>> 
> > >>>>>>> Anybody know of a good way of enforcing more polite behaviour? We can 
> > >>>>>>> make the download only available under a data supply agreement that 
> > >>>>>>> includes a clause limiting hit rates, or we could limit by IP address 
> > >>>>>>> (but this would ultimately block out services like Rod's simple 
> > >>>>>>> resolver). I believe Google's spell checker uses a key which has to 
> > >>>>>>> be passed in as part of the query - obviously we can't do that with 
> > >>>>>>> LSIDs 
> > >>>>>>> 
> > >>>>>>> Any thoughts? Anyone think this is a problem? 
> > >>>>>>> 
> > >>>>>>> Sally 
> > >>>>>>> *** Sally Hinchcliffe 
> > >>>>>>> *** Computer section, Royal Botanic Gardens, Kew 
> > >>>>>>> *** tel: +44 (0)20 8332 5708 
> > >>>>>>> *** S.Hinchcliffe at rbgkew.org.uk 
> > >>>>>>> 
> > >>>>>>> 
> > >>>>>>> _______________________________________________ 
> > >>>>>>> TDWG-GUID mailing list 
> > >>>>>>> TDWG-GUID at mailman.nhm.ku.edu 
> > >>>>>>> http://mailman.nhm.ku.edu/mailman/listinfo/tdwg-guid 
> > >>>> -- 
> > >>>> 
> > >>>> ------------------------------------- 
> > >>>>  Roger Hyam 
> > >>>>  Technical Architect 
> > >>>>  Taxonomic Databases Working Group 
> > >>>> ------------------------------------- 
> > >>>>  http://www.tdwg.org <http://www.tdwg.org/>  
> > >>>>  roger at tdwg.org 
> > >>>>  +44 1578 722782 
> > >>>> ------------------------------------- 

*** Sally Hinchcliffe
*** Computer section, Royal Botanic Gardens, Kew
*** tel: +44 (0)20 8332 5708
*** S.Hinchcliffe at rbgkew.org.uk