[Tdwg-guid] Throttling searches

Sally Hinchcliffe S.Hinchcliffe at kew.org
Mon Jun 19 16:32:09 CEST 2006


We'll be using the version number for keeping track of versions, so 
that's out. Also, I'm a bit reluctant to start overloading the LSID 
itself with what is purely an admin function...

Sally


> 
> We do some of this already with our web services.  SOAP methods 
> require a keycode.  We use the code so we have a contact in case we 
> need to send a message out, as well as to provide better accounting 
> to sources of how we pass on their content.  Patrick (uBio programmer 
> and nice guy) asked why not use the LSID version number as a way to 
> pass a token.  If it's not passed, you fall back to one level of 
> processing; otherwise you give the request the extra special 
> treatment tied to the user ID.  Or is this violating something sacred in the LSID ethos?
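> 
> For illustration only (the token value below is made up): the optional 
> fifth field of an LSID is the revision, so a token-carrying identifier 
> would look something like
> 
>     urn:lsid:ubio.org:namebank:11815:sometoken123
> 
> while a bare urn:lsid:ubio.org:namebank:11815 would fall back to the 
> basic level of processing.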
> 
> David Remsen
> 
> 
> On Jun 19, 2006, at 6:07 AM, Roger Hyam wrote:
> 
> >
> > You don't! The LSID resolves to the binding for the getMetadata() 
> > method - which is a plain old-fashioned URL. At this point the LSID 
> > authority has done its duty and we are just on a plain HTTP GET 
> > call, so you can do whatever you can do with any regular HTTP GET. 
> > You could stipulate another header field or (more simply) give 
> > priority service to those who append a valid user ID to the URL 
> > (&user_id=12345).
> >
> > So there is no throttle on resolving the LSID to the getMetadata 
> > binding (which is cheap), but there is a throttle on the actual call 
> > to the getMetadata method. Really you need to do this anyway, because 
> > bad people may be able to tell from the URL how to scrape the source 
> > and bypass the LSID resolver after the first call. This is 
> > especially true if the URL contains the IPNI record ID, which is 
> > likely.
> >
> > Here is an example using Rod's tester:
> >
> > http://linnaeus.zoology.gla.ac.uk/~rpage/lsid/tester/?q=urn:lsid:ubio.org:namebank:11815
> >
> > The getMetadata() method for this LSID:
> >
> >  urn:lsid:ubio.org:namebank:11815
> >
> > is bound to this URL:
> >
> > http://names.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank:11815
> >
> > So uBio would just have to give preferential service to calls like 
> > this:
> >
> > http://names.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank:11815&user_id=rogerhyam1392918790
> >
> > ... if rogerhyam had paid his membership fees this year.
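> >
> > As a rough sketch of how that preferential handling might look on the 
> > server side (purely illustrative - the token set and fetch_metadata() 
> > are stand-ins, not uBio's actual code):
> >
> >   import time
> >   from urllib.parse import urlparse, parse_qs
> >
> >   REGISTERED_TOKENS = {"rogerhyam1392918790"}  # hypothetical token register
> >
> >   def handle_metadata_request(url):
> >       params = parse_qs(urlparse(url).query)
> >       lsid = params.get("lsid", [None])[0]
> >       user = params.get("user_id", [None])[0]
> >       if user not in REGISTERED_TOKENS:
> >           time.sleep(5)            # anonymous callers still get served, just slowly
> >       return fetch_metadata(lsid)  # stand-in for whatever builds the RDF response
> >
> >   def fetch_metadata(lsid):
> >       return "<rdf:RDF/>"          # placeholder metadata document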
> >
> > Does this make sense?
> >
> > Roger
> > p.s. You could do this on the web pages as well with a clever  
> > little thing to write dynamic tokens into the links so that it  
> > doesn't degrade the regular browsing experience and only stops  
> > scrapers - but that is beyond my remit at the moment ;)
> >
> > p.p.s. You could wrap this in HTTPS if you were paranoid about 
> > people stealing tokens - but that is highly unlikely, I believe.
> >
> > Sally Hinchcliffe wrote:
> >> How can we pass a token with an LSID?
> >>
> >>
> >>
> >>> I think the only way to throttle in these situations is to have 
> >>> some notion of who the client is, and the only way to do that is 
> >>> to have some kind of token exchange over HTTP saying who they are. 
> >>> Basically you have to have some kind of client registration system 
> >>> or you can never distinguish between a call from a new client and 
> >>> a repeat call. The use of IP addresses is a pain because so many 
> >>> people are now behind some kind of NAT gateway.
> >>>
> >>> How about this for a plan:
> >>>
> >>> You could give a degraded service to people who don't pass a 
> >>> token (a 5-second delay perhaps) and offer a quicker service to 
> >>> registered users who pass a token (but then perhaps limit the 
> >>> number of calls they make). This would mean you could offer a 
> >>> universal service even to those with naive client software but a 
> >>> better service to those who play nicely. You could also get 
> >>> better stats on who is using the service.
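> >>>
> >>> Something along these lines, perhaps (numbers and names are only 
> >>> illustrative; a real service would persist the counters somewhere):
> >>>
> >>>   import time
> >>>   from collections import defaultdict
> >>>
> >>>   DAILY_LIMIT = 10000             # made-up quota for registered clients
> >>>   calls_today = defaultdict(int)  # token -> calls made so far today
> >>>
> >>>   def admit(token=None):
> >>>       if token is None:
> >>>           time.sleep(5)           # universal but degraded service
> >>>           return True
> >>>       calls_today[token] += 1     # doubles as per-client usage stats
> >>>       return calls_today[token] <= DAILY_LIMIT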
> >>>
> >>> So there are ways that this could be done. I expect people will  
> >>> come up
> >>> with a host of different ways. It is outside LSIDs though.
> >>>
> >>> Roger
> >>>
> >>> Sally Hinchcliffe wrote:
> >>>
> >>>> It's not an LSID issue per se, but LSIDs will make it harder to 
> >>>> slow searches down. For instance, Google restricts use of its 
> >>>> spell checker to 1,000 requests a day by means of a key which is 
> >>>> passed in with each request. Obviously this can't be done with 
> >>>> LSIDs, as then they wouldn't be the same for each user.
> >>>> The other reason why it's relevant to LSIDs is simply that 
> >>>> providing a list of all relevant IPNI LSIDs (not necessary for 
> >>>> the LSID implementation, but a nice-to-have for caching / lookups 
> >>>> for other systems using our LSIDs) also makes life easier for the 
> >>>> data scrapers to operate.
> >>>>
> >>>> Also I thought ... here's a list full of clever people; perhaps 
> >>>> they will have some suggestions.
> >>>>
> >>>> Sally
> >>>>
> >>>>
> >>>>
> >>>>> Is this an LSID issue? LSIDs essentially provide a binding 
> >>>>> service between a name and one or more web services (we default 
> >>>>> to HTTP GET bindings). It isn't really up to the LSID authority 
> >>>>> to administer any policies regarding the web service, but simply 
> >>>>> to point at it. It is up to the web service to do things like 
> >>>>> throttling, authentication and authorization.
> >>>>>
> >>>>> Imagine, for example, that the different services had different 
> >>>>> policies. It may be reasonable not to restrict the getMetadata() 
> >>>>> calls but to restrict the getData() calls.
> >>>>>
> >>>>> The use of LSIDs does not create any new problems that weren't  
> >>>>> there
> >>>>> with web page scraping - or scraping of any other web service.
> >>>>>
> >>>>> Just my thoughts...
> >>>>>
> >>>>> Roger
> >>>>>
> >>>>>
> >>>>> Ricardo Scachetti Pereira wrote:
> >>>>>
> >>>>>
> >>>>>>     Sally,
> >>>>>>
> >>>>>>     You raised a really important issue that we had not really  
> >>>>>> addressed
> >>>>>> at the meeting. Thanks for that.
> >>>>>>
> >>>>>>     I would say that we should not constrain the resolution of 
> >>>>>> LSIDs if we expect our LSID infrastructure to work. LSIDs will 
> >>>>>> be the basis of our architecture, so we had better have good 
> >>>>>> support for them.
> >>>>>>
> >>>>>>     However, that is surely a limiting factor. Also, server 
> >>>>>> efficiency will likely vary quite a lot, depending on underlying 
> >>>>>> system optimizations and so on.
> >>>>>>
> >>>>>>     So I think that the solution to this problem is to cache 
> >>>>>> LSID responses in the server LSID stack. Basically, after 
> >>>>>> resolving an LSID once, your server should be able to resolve 
> >>>>>> it again and again really quickly, until something in the 
> >>>>>> metadata related to that ID changes.
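> >>>>>>
> >>>>>>     A bare-bones sketch of that caching idea (illustrative only, 
> >>>>>> not the actual LSID stack code; the render function and the 
> >>>>>> last_modified value stand in for whatever the source database 
> >>>>>> provides):
> >>>>>>
> >>>>>>   metadata_cache = {}  # lsid -> (last_modified, rdf_document)
> >>>>>>
> >>>>>>   def get_metadata_cached(lsid, last_modified, render):
> >>>>>>       cached = metadata_cache.get(lsid)
> >>>>>>       if cached and cached[0] == last_modified:
> >>>>>>           return cached[1]    # record unchanged: serve from cache
> >>>>>>       rdf = render(lsid)      # the expensive part, done only when needed
> >>>>>>       metadata_cache[lsid] = (last_modified, rdf)
> >>>>>>       return rdf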
> >>>>>>
> >>>>>>     I haven't looked at this aspect of the LSID software  
> >>>>>> stack, but
> >>>>>> maybe others can say something about it. In any case I'll do some
> >>>>>> research on it and get back to you.
> >>>>>>
> >>>>>>     Again, thanks for bringing it up.
> >>>>>>
> >>>>>>     Cheers,
> >>>>>>
> >>>>>> Ricardo
> >>>>>>
> >>>>>>
> >>>>>> Sally Hinchcliffe wrote:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>> There are enough discontinuities in IPNI IDs that 1,2,3 would 
> >>>>>>> quickly run into the sand. I agree it's not a new problem - I 
> >>>>>>> just hate to think I'm making life easier for the data scrapers.
> >>>>>>> Sally
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>> It can be a problem, but I'm not sure if there is a simple 
> >>>>>>>> solution ... and how different is the LSID crawler scenario 
> >>>>>>>> from an http://www.ipni.org/ipni/plantsearch?id=1,2,3,4,5 ... 9999999 
> >>>>>>>> scenario?
> >>>>>>>>
> >>>>>>>> Paul
> >>>>>>>>
> >>>>>>>> -----Original Message-----
> >>>>>>>> From: tdwg-guid-bounces at mailman.nhm.ku.edu
> >>>>>>>> [mailto:tdwg-guid-bounces at mailman.nhm.ku.edu]On Behalf Of Sally
> >>>>>>>> Hinchcliffe
> >>>>>>>> Sent: 15 June 2006 12:08
> >>>>>>>> To: tdwg-guid at mailman.nhm.ku.edu
> >>>>>>>> Subject: [Tdwg-guid] Throttling searches [ Scanned for  
> >>>>>>>> viruses ]
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Hi all
> >>>>>>>> another question that has come up here.
> >>>>>>>>
> >>>>>>>> As discussed at the meeting, we're thinking of providing a 
> >>>>>>>> complete list of all IPNI LSIDs plus a label (name and author, 
> >>>>>>>> probably), which will be available as an annually produced 
> >>>>>>>> download.
> >>>>>>>>
> >>>>>>>> Most people will play nice and just resolve one or two LSIDs 
> >>>>>>>> as required, but by providing a complete list, we're making it 
> >>>>>>>> very easy for someone to write a crawler that hits every LSID 
> >>>>>>>> in turn and basically brings our server to its knees.
> >>>>>>>>
> >>>>>>>> Anybody know of a good way of enforcing more polite behaviour? 
> >>>>>>>> We can make the download available only under a data supply 
> >>>>>>>> agreement that includes a clause limiting hit rates, or we 
> >>>>>>>> could limit by IP address (but this would ultimately block out 
> >>>>>>>> services like Rod's simple resolver). I believe Google's spell 
> >>>>>>>> checker uses a key which has to be passed in as part of the 
> >>>>>>>> query - obviously we can't do that with LSIDs.
> >>>>>>>>
> >>>>>>>> Any thoughts? Anyone think this is a problem?
> >>>>>>>>
> >>>>>>>> Sally
> >>>>>>>> *** Sally Hinchcliffe
> >>>>>>>> *** Computer section, Royal Botanic Gardens, Kew
> >>>>>>>> *** tel: +44 (0)20 8332 5708
> >>>>>>>> *** S.Hinchcliffe at rbgkew.org.uk
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> _______________________________________________
> >>>>>>>> TDWG-GUID mailing list
> >>>>>>>> TDWG-GUID at mailman.nhm.ku.edu
> >>>>>>>> http://mailman.nhm.ku.edu/mailman/listinfo/tdwg-guid
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>> *** Sally Hinchcliffe
> >>>>>>> *** Computer section, Royal Botanic Gardens, Kew
> >>>>>>> *** tel: +44 (0)20 8332 5708
> >>>>>>> *** S.Hinchcliffe at rbgkew.org.uk
> >>>>>>>
> >>>>>>>
> >>>>>>> _______________________________________________
> >>>>>>> TDWG-GUID mailing list
> >>>>>>> TDWG-GUID at mailman.nhm.ku.edu
> >>>>>>> http://mailman.nhm.ku.edu/mailman/listinfo/tdwg-guid
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>> _______________________________________________
> >>>>>> TDWG-GUID mailing list
> >>>>>> TDWG-GUID at mailman.nhm.ku.edu
> >>>>>> http://mailman.nhm.ku.edu/mailman/listinfo/tdwg-guid
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>> -- 
> >>>>>
> >>>>> -------------------------------------
> >>>>>  Roger Hyam
> >>>>>  Technical Architect
> >>>>>  Taxonomic Databases Working Group
> >>>>> -------------------------------------
> >>>>>  http://www.tdwg.org
> >>>>>  roger at tdwg.org
> >>>>>  +44 1578 722782
> >>>>> -------------------------------------
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>> *** Sally Hinchcliffe
> >>>> *** Computer section, Royal Botanic Gardens, Kew
> >>>> *** tel: +44 (0)20 8332 5708
> >>>> *** S.Hinchcliffe at rbgkew.org.uk
> >>>>
> >>>>
> >>>>
> >>>>
> >>> -- 
> >>>
> >>> -------------------------------------
> >>>  Roger Hyam
> >>>  Technical Architect
> >>>  Taxonomic Databases Working Group
> >>> -------------------------------------
> >>>  http://www.tdwg.org
> >>>  roger at tdwg.org
> >>>  +44 1578 722782
> >>> -------------------------------------
> >>>
> >>>
> >>>
> >> *** Sally Hinchcliffe
> >> *** Computer section, Royal Botanic Gardens, Kew
> >> *** tel: +44 (0)20 8332 5708
> >> *** S.Hinchcliffe at rbgkew.org.uk
> >>
> >>
> >>
> >
> >
> > -- 
> >
> > -------------------------------------
> >  Roger Hyam
> >  Technical Architect
> >  Taxonomic Databases Working Group
> > -------------------------------------
> >  http://www.tdwg.org
> >  roger at tdwg.org
> >  +44 1578 722782
> > -------------------------------------
> > _______________________________________________
> > TDWG-GUID mailing list
> > TDWG-GUID at mailman.nhm.ku.edu
> > http://mailman.nhm.ku.edu/mailman/listinfo/tdwg-guid
> 
> _______________________________________________
> David Remsen
> uBio Project Manager
> Marine Biological Laboratory
> Woods Hole, MA 02543
> 508-289-7632
> 
> 
> 

*** Sally Hinchcliffe
*** Computer section, Royal Botanic Gardens, Kew
*** tel: +44 (0)20 8332 5708
*** S.Hinchcliffe at rbgkew.org.uk




