[Tdwg-guid] Throttling searches [ Scanned for viruses ]

Chuck Miller Chuck.Miller at mobot.org
Tue Jun 20 04:26:35 CEST 2006


This is probably a dumb question and exposes my ignorance, but what if the originating query is actually "Get all LSIDs where Family = Orchidaceae"?  That seems a more likely scenario to me than getting a single LSID, and it is the one that needs a throttle.
 
Chuck  

________________________________

From: Sally Hinchcliffe [mailto:S.Hinchcliffe at kew.org]
Sent: Mon 6/19/2006 7:01 AM
To: roger at tdwg.org
Cc: tdwg-guid at mailman.nhm.ku.edu
Subject: Re: [Tdwg-guid] Throttling searches [ Scanned for viruses ]



Hi Roger 
Thanks for this ... I _think_ I understand it but Nicky is on leave 
this week so I won't know if I do or not till after she returns 

The system doesn't have to be completely villain proof, just slow 
down most of the villains so everyone else can get a look in 
Sally 

> 
> You don't! The LSID resolves to the binding to the getMetadata() method 
> - which is a plain old fashioned URL. At this point the LSID authority 
> has done its duty and we are just on a plain HTTP GET call so you can do 
> whatever you can do with any regular HTTP GET. You could stipulate 
> another header field or (more simply) give priority service for those 
> who append a valid user id to the URL (&user_id=12345) 
> 
> So there is no throttle on resolving the LSID to the getMetadata binding 
> (which is cheap) but there is a throttle on the actual call to get the 
> metadata method. Really you need to do this because bad people may be 
> able to tell from the URL how to scrape the source and bypass the LSID 
> resolver after the first call anyhow. This is especially true if the URL 
> contains the IPNI record ID which is likely. 
> 
> Here is an example using Rod's tester. 
> 
> http://linnaeus.zoology.gla.ac.uk/~rpage/lsid/tester/?q=urn:lsid:ubio.org:namebank:11815 
> 
> The getMetadata() method for this LSID: 
> 
>  urn:lsid:ubio.org:namebank:11815 
> 
> Is bound to this URL: 
> 
> http://names.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank:11815 
> 
> So ubio would just have to give preferential service to calls like this: 
> 
> http://names.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank:11815&user_id=rogerhyam1392918790 
> 
> If rogerhyam had paid his membership fees this year. 
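
A minimal sketch of the check described above, assuming the service keeps some list of paid-up user ids. The names PAID_UP_USERS and classify_request are made up for illustration and are not part of uBio's service or of any LSID software; only the URL shape comes from the example.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical list of members whose fees are paid; a real service would
# look this up in its own registration database.
PAID_UP_USERS = {"rogerhyam1392918790"}


def classify_request(url):
    """Return (lsid, tier) for a getMetadata URL; tier is 'priority' or 'standard'."""
    query = parse_qs(urlparse(url).query)
    lsid = query.get("lsid", [None])[0]
    user_id = query.get("user_id", [None])[0]
    # A missing or unknown user_id simply gets the standard (throttled) tier.
    tier = "priority" if user_id in PAID_UP_USERS else "standard"
    return lsid, tier
```

The LSID itself is untouched, so it stays the same for every user; only the HTTP GET binding carries the extra parameter.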
> 
> Does this make sense? 
> 
> Roger 
> p.s. You could do this on the web pages as well with a clever little 
> thing to write dynamic tokens into the links so that it doesn't degrade 
> the regular browsing experience and only stops scrapers - but that is 
> beyond my remit at the moment ;) 
> 
> p.p.s. You could wrap this in https if you were paranoid about people 
> stealing tokens - but this is highly unlikely I believe. 
> 
> Sally Hinchcliffe wrote: 
> > How can we pass a token with an LSID? 
> > 
> > 
> >   
> >> I think the only way to throttle in these situations is to have some 
> >> notion of who the client is and the only way to do that is to have some 
> >> kind of token exchange over HTTP saying who they are. Basically you have 
> >> to have some kind of client registration system or you can never 
> >> distinguish between a call from a new client and a repeat call. The use 
> >> of IP address is a pain because so many people are now behind some kind 
> >> of NAT gateway. 
> >> 
> >> How about this for a plan: 
> >> 
> >> You could give a degraded service to people who don't pass a token (a 5 
> >> second delay perhaps) and offer a quicker service to registered users 
> >> who pass a token (but then perhaps limit the number of calls they make). 
> >> This would mean you could offer a universal service even to those with 
> >> naive client software but a better service to those who play nicely. You 
> >> could also get better stats on who is using the service. 
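
The plan above could be sketched roughly as follows. This is only an illustration under stated assumptions: the Throttle class, the default limit of 1000 calls a day, and the choice to drop over-quota users back to the slow tier (rather than refuse them) are all invented here, not something agreed on the list.

```python
import time

ANON_DELAY_SECONDS = 5.0  # the "5 second delay perhaps" for calls without a token


class Throttle:
    """Decide how long to hold a call before serving it."""

    def __init__(self, known_tokens, daily_limit=1000, clock=time.time):
        self.known_tokens = set(known_tokens)  # tokens issued at registration
        self.daily_limit = daily_limit         # assumed per-token cap per day
        self.clock = clock                     # injectable for testing
        self.calls = {}                        # (token, day number) -> calls so far

    def delay_for(self, token=None):
        """Return the delay in seconds to apply to one incoming call."""
        if token not in self.known_tokens:
            return ANON_DELAY_SECONDS          # degraded but still universal service
        key = (token, int(self.clock() // 86400))
        used = self.calls.get(key, 0)
        if used >= self.daily_limit:
            return ANON_DELAY_SECONDS          # over quota: fall back to the slow tier
        self.calls[key] = used + 1
        return 0.0                             # registered and within quota
```

The web service would call delay_for() with whatever token the client passed and sleep for that long before answering. The per-day counters are never pruned in this sketch, and they double as the usage stats mentioned above.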
> >> 
> >> So there are ways that this could be done. I expect people will come up 
> >> with a host of different ways. It is outside LSIDs though. 
> >> 
> >> Roger 
> >> 
> >> Sally Hinchcliffe wrote: 
> >>     
> >>> It's not an LSID issue per se, but LSIDs will make it harder to slow 
> >>> searches down. For instance, Google restricts use of its spell 
> >>> checker to 1000 a day by use of a key which is passed in with each 
> >>> request. Obviously this can't be done with LSIDs as then they 
> >>> wouldn't be the same for each user. 
> >>> The other reason why it's relevant to LSIDs is simply that providing 
> >>> a list of all relevant IPNI LSIDs (not necessary to the LSID 
> >>> implementation but a nice to have for caching / lookups for other 
> >>> systems using our LSIDs) also makes life easier for the datascrapers 
> >>> to operate 
> >>> 
> >>> Also I thought ... here's a list full of clever people perhaps they 
> >>> will have some suggestions 
> >>> 
> >>> Sally 
> >>> 
> >>>   
> >>>       
> >>>> Is this an LSID issue? LSIDs essentially provide a binding service between 
> >>>> a name and one or more web services (we default to HTTP GET bindings). 
> >>>> It isn't really up to the LSID authority to administer any policies 
> >>>> regarding the web service but simply to point at it. It is up to the web 
> >>>> service to do things like throttling, authentication and authorization. 
> >>>> 
> >>>> Imagine, for example, that the different services had different 
> >>>> policies. It may be reasonable not to restrict the getMetadata() calls 
> >>>> but to restrict the getData() calls. 
> >>>> 
> >>>> The use of LSIDs does not create any new problems that weren't there 
> >>>> with web page scraping - or scraping of any other web service. 
> >>>> 
> >>>> Just my thoughts... 
> >>>> 
> >>>> Roger 
> >>>> 
> >>>> 
> >>>> Ricardo Scachetti Pereira wrote: 
> >>>>     
> >>>>         
> >>>>>     Sally, 
> >>>>> 
> >>>>>     You raised a really important issue that we had not really addressed 
> >>>>> at the meeting. Thanks for that. 
> >>>>> 
> >>>>>     I would say that we should not constrain the resolution of LSIDs if 
> >>>>> we expect our LSID infrastructure to work. LSIDs will be the basis of 
> >>>>> our architecture so we better have good support for that. 
> >>>>> 
> >>>>>     However, that is surely a limiting factor. Also server efficiency will 
> >>>>> likely vary quite a lot, depending on underlying system optimizations 
> >>>>> and all. 
> >>>>> 
> >>>>>     So I think that the solution for this problem is in caching LSID 
> >>>>> responses on the server LSID stack. Basically, after resolving an LSID 
> >>>>> once, your server should be able to resolve it again and again really 
> >>>>> quickly, until something on the metadata that is related to that id changes. 
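
A rough sketch of that caching idea, assuming the expensive lookup can be wrapped in a single function and that the underlying database can signal when a record changes. MetadataCache and its method names are hypothetical and do not come from the LSID software stack.

```python
class MetadataCache:
    """Keep resolved metadata in memory until the record behind an LSID changes."""

    def __init__(self, fetch):
        self.fetch = fetch  # callable: lsid -> metadata (the expensive lookup)
        self.store = {}     # lsid -> cached metadata

    def get(self, lsid):
        """Return metadata for lsid, hitting the backend only on a cache miss."""
        if lsid not in self.store:
            self.store[lsid] = self.fetch(lsid)
        return self.store[lsid]

    def invalidate(self, lsid):
        """Forget one LSID; call this when its metadata changes at the source."""
        self.store.pop(lsid, None)
```

This makes repeat resolutions cheap, but a crawler walking the full list touches each LSID only once, so caching alone would not slow it down.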
> >>>>> 
> >>>>>     I haven't looked at this aspect of the LSID software stack, but 
> >>>>> maybe others can say something about it. In any case I'll do some 
> >>>>> research on it and get back to you. 
> >>>>> 
> >>>>>     Again, thanks for bringing it up. 
> >>>>> 
> >>>>>     Cheers, 
> >>>>> 
> >>>>> Ricardo 
> >>>>> 
> >>>>> 
> >>>>> Sally Hinchcliffe wrote: 
> >>>>>   
> >>>>>       
> >>>>>           
> >>>>>> There are enough discontinuities in IPNI ids that 1,2,3 would quickly 
> >>>>>> run into the sand. I agree it's not a new problem - I just hate to 
> >>>>>> think I'm making life easier for the data scrapers 
> >>>>>> Sally 
> >>>>>> 
> >>>>>> 
> >>>>>>   
> >>>>>>     
> >>>>>>         
> >>>>>>             
> >>>>>>> It can be a problem but I'm not sure if there is a simple solution ... 
> >>>>>>> and how different is the LSID crawler scenario from an 
> >>>>>>> http://www.ipni.org/ipni/plantsearch?id= 1,2,3,4,5 ... 9999999 scenario?
> >>>>>>> 
> >>>>>>> Paul 
> >>>>>>> 
> >>>>>>> -----Original Message----- 
> >>>>>>> From: tdwg-guid-bounces at mailman.nhm.ku.edu 
> >>>>>>> [mailto:tdwg-guid-bounces at mailman.nhm.ku.edu]On Behalf Of Sally 
> >>>>>>> Hinchcliffe 
> >>>>>>> Sent: 15 June 2006 12:08 
> >>>>>>> To: tdwg-guid at mailman.nhm.ku.edu 
> >>>>>>> Subject: [Tdwg-guid] Throttling searches [ Scanned for viruses ] 
> >>>>>>> 
> >>>>>>> 
> >>>>>>> Hi all 
> >>>>>>> another question that has come up here. 
> >>>>>>> 
> >>>>>>> As discussed at the meeting, we're thinking of providing a complete 
> >>>>>>> download of all IPNI LSIDs plus a label (name and author, probably) 
> >>>>>>> which will be available as an annually produced download 
> >>>>>>> 
> >>>>>>> Most people will play nice and just resolve one or two LSIDs as 
> >>>>>>> required, but by providing a complete list, we're making it very easy 
> >>>>>>> for someone to write a crawler that hits every LSID in turn and 
> >>>>>>> basically brings our server to its knees 
> >>>>>>> 
> >>>>>>> Anybody know of a good way of enforcing more polite behaviour? We can 
> >>>>>>> make the download only available under a data supply agreement that 
> >>>>>>> includes a clause limiting hit rates, or we could limit by IP address 
> >>>>>>> (but this would ultimately block out services like Rod's simple 
> >>>>>>> resolver). I believe Google's spell checker uses a key which has to 
> >>>>>>> be passed in as part of the query - obviously we can't do that with 
> >>>>>>> LSIDs 
> >>>>>>> 
> >>>>>>> Any thoughts? Anyone think this is a problem? 
> >>>>>>> 
> >>>>>>> Sally 
> >>>>>>> *** Sally Hinchcliffe 
> >>>>>>> *** Computer section, Royal Botanic Gardens, Kew 
> >>>>>>> *** tel: +44 (0)20 8332 5708 
> >>>>>>> *** S.Hinchcliffe at rbgkew.org.uk 
> >>>>>>> 
> >>>>>>> 
> >>>>>>> _______________________________________________ 
> >>>>>>> TDWG-GUID mailing list 
> >>>>>>> TDWG-GUID at mailman.nhm.ku.edu 
> >>>>>>> http://mailman.nhm.ku.edu/mailman/listinfo/tdwg-guid 
> >>>>>>> 
> >>>> -- 
> >>>> 
> >>>> ------------------------------------- 
> >>>>  Roger Hyam 
> >>>>  Technical Architect 
> >>>>  Taxonomic Databases Working Group 
> >>>> ------------------------------------- 
> >>>>  http://www.tdwg.org <http://www.tdwg.org/>  
> >>>>  roger at tdwg.org 
> >>>>  +44 1578 722782 
> >>>> ------------------------------------- 

