[Tdwg-guid] Throttling searches
Sally Hinchcliffe
S.Hinchcliffe at kew.org
Mon Jun 19 14:01:26 CEST 2006
Hi Roger
Thanks for this ... I _think_ I understand it, but Nicky is on leave
this week so I won't know whether I do until she returns.
The system doesn't have to be completely villain-proof; it just has to slow
down most of the villains so everyone else can get a look in.
Sally
>
> You don't! The LSID resolves to the binding for the getMetadata() method,
> which is a plain old-fashioned URL. At this point the LSID authority
> has done its duty and we are just making a plain HTTP GET call, so you can do
> whatever you can do with any regular HTTP GET. You could stipulate
> another header field or (more simply) give priority service to those
> who append a valid user id to the URL (&user_id=12345).
>
> So there is no throttle on resolving the LSID to the getMetadata binding
> (which is cheap) but there is a throttle on the actual getMetadata call
> itself. Really you need to do this because bad people may be
> able to tell from the URL how to scrape the source and bypass the LSID
> resolver after the first call anyhow. This is especially true if the URL
> contains the IPNI record ID which is likely.
>
> Here is an example using Rod's tester.
>
> http://linnaeus.zoology.gla.ac.uk/~rpage/lsid/tester/?q=urn:lsid:ubio.org:namebank:11815
>
> The getMetadata() method for this LSID:
>
> urn:lsid:ubio.org:namebank:11815
>
> Is bound to this URL:
>
> http://names.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank:11815
>
> So ubio would just have to give preferential service to calls like this:
>
> http://names.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank:11815&user_id=rogerhyam1392918790
>
> If rogerhyam had paid his membership fees this year.
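
A minimal sketch of the preferential-service check described above, assuming the service keeps some registry of paid-up user ids. The PAID_UP_USERS set and the service_tier function are hypothetical names for illustration, not part of the ubio service or the LSID stack.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical registry of user ids whose membership is current.
PAID_UP_USERS = {"rogerhyam1392918790"}


def service_tier(url: str) -> str:
    """Return 'priority' if the URL carries a valid user_id, else 'standard'."""
    query = parse_qs(urlsplit(url).query)
    user_ids = query.get("user_id", [])
    if user_ids and user_ids[0] in PAID_UP_USERS:
        return "priority"
    return "standard"


base = ("http://names.ubio.org/authority/metadata.php"
        "?lsid=urn:lsid:ubio.org:namebank:11815")
print(service_tier(base))
print(service_tier(base + "&user_id=rogerhyam1392918790"))
```

The LSID itself is untouched in both calls; only the HTTP GET binding carries the extra parameter.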
>
> Does this make sense?
>
> Roger
> p.s. You could do this on the web pages as well with a clever little
> thing to write dynamic tokens into the links so that it doesn't degrade
> the regular browsing experience and only stops scrapers - but that is
> beyond my remit at the moment ;)
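
One possible shape for such dynamic tokens, offered only as a sketch: the server signs each link with a short-lived HMAC and rejects links whose token is missing, stale, or made for another record. The secret, the lifetime, and the function names are all assumptions made for illustration.

```python
import hashlib
import hmac
import time

SECRET_KEY = b"change-me"        # hypothetical server-side secret
TOKEN_LIFETIME_SECONDS = 600     # assumed validity window


def _sign(record_id: str, issued_text: str) -> str:
    message = f"{record_id}:{issued_text}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def make_token(record_id: str, now=None) -> str:
    """Build a token to append to a link for one record."""
    issued = int(time.time() if now is None else now)
    return f"{issued}-{_sign(record_id, str(issued))}"


def check_token(record_id: str, token: str, now=None) -> bool:
    """Accept the token only if it is fresh and was signed for this record."""
    current = time.time() if now is None else now
    issued_text, _, digest = token.partition("-")
    if not (issued_text.isascii() and issued_text.isdigit()):
        return False
    fresh = 0 <= current - int(issued_text) <= TOKEN_LIFETIME_SECONDS
    expected = _sign(record_id, issued_text)
    return fresh and hmac.compare_digest(digest.encode(), expected.encode())
```

A browser following links from a freshly rendered page always has a valid token, while a scraper replaying stored URLs does not.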
>
> p.p.s. You could wrap this in https if you were paranoid about people
> stealing tokens - but this is highly unlikely I believe.
>
> Sally Hinchcliffe wrote:
> > How can we pass a token with an LSID?
> >
> >
> >
> >> I think the only way to throttle in these situations is to have some
> >> notion of who the client is and the only way to do that is to have some
> >> kind of token exchange over HTTP saying who they are. Basically you have
> >> to have some kind of client registration system or you can never
> >> distinguish between a call from a new client and a repeat call. The use
> >> of IP address is a pain because so many people are now behind some kind
> >> of NAT gateway.
> >>
> >> How about this for a plan:
> >>
> >> You could give a degraded service to people who don't pass a token (a 5
> >> second delay perhaps) and offer a quicker service to registered users
> >> who pass a token (but then perhaps limit the number of calls they make).
> >> This would mean you could offer a universal service even to those with
> >> naive client software but a better service to those who play nicely. You
> >> could also get better stats on who is using the service.
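
The plan above could be sketched roughly as follows. The 5 second delay comes from the message; the daily limit of 1000 is an assumed figure, and the Throttle class is a hypothetical illustration rather than anything in the LSID software.

```python
from collections import defaultdict

ANONYMOUS_DELAY_SECONDS = 5.0   # the "5 second delay perhaps" from the plan
REGISTERED_DAILY_LIMIT = 1000   # assumed quota; the plan gives no number


class Throttle:
    def __init__(self, registered_tokens):
        self.registered_tokens = set(registered_tokens)
        # token -> calls made today; the caller is expected to reset this daily
        self.calls_today = defaultdict(int)

    def decide(self, token=None):
        """Return (allowed, delay_seconds) for one incoming request."""
        if token not in self.registered_tokens:
            # No token, or an unknown one: still served, but slowly.
            return True, ANONYMOUS_DELAY_SECONDS
        if self.calls_today[token] >= REGISTERED_DAILY_LIMIT:
            # Registered client that has used up its quota for the day.
            return False, 0.0
        self.calls_today[token] += 1
        return True, 0.0
```

The web service would sleep for the returned delay before answering, or refuse the call when allowed is False. Counting calls per token also gives the usage stats mentioned above.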
> >>
> >> So there are ways that this could be done. I expect people will come up
> >> with a host of different ways. It is outside LSIDs though.
> >>
> >> Roger
> >>
> >> Sally Hinchcliffe wrote:
> >>
> >>> It's not an LSID issue per se, but LSIDs will make it harder to slow
> >>> searches down. For instance, Google restricts use of its spell
> >>> checker to 1000 a day by use of a key which is passed in with each
> >>> request. Obviously this can't be done with LSIDs as then they
> >>> wouldn't be the same for each user.
> >>> The other reason why it's relevant to LSIDs is simply that providing
> >>> a list of all relevant IPNI LSIDs (not necessary to the LSID
> >>> implementation but a nice-to-have for caching / lookups for other
> >>> systems using our LSIDs) also makes life easier for the data scrapers.
> >>>
> >>> Also I thought ... here's a list full of clever people; perhaps they
> >>> will have some suggestions.
> >>>
> >>> Sally
> >>>
> >>>
> >>>
> >>>> Is this an LSID issue? LSIDs essentially provide a binding service between
> >>>> a name and one or more web services (we default to HTTP GET bindings).
> >>>> It isn't really up to the LSID authority to administer any policies
> >>>> regarding the web service but simply to point at it. It is up to the web
> >>>> service to do things like throttling, authentication and authorization.
> >>>>
> >>>> Imagine, for example, that the different services had different
> >>>> policies. It may be reasonable not to restrict the getMetadata() calls
> >>>> but to restrict the getData() calls.
> >>>>
> >>>> The use of LSIDs does not create any new problems that weren't there
> >>>> with web page scraping - or scraping of any other web service.
> >>>>
> >>>> Just my thoughts...
> >>>>
> >>>> Roger
> >>>>
> >>>>
> >>>> Ricardo Scachetti Pereira wrote:
> >>>>
> >>>>
> >>>>> Sally,
> >>>>>
> >>>>> You raised a really important issue that we had not really addressed
> >>>>> at the meeting. Thanks for that.
> >>>>>
> >>>>> I would say that we should not constrain the resolution of LSIDs if
> >>>>> we expect our LSID infrastructure to work. LSIDs will be the basis of
> >>>>> our architecture so we had better have good support for that.
> >>>>>
> >>>>> However, that is surely a limiting factor. Also, server efficiency will
> >>>>> likely vary quite a lot, depending on underlying system optimizations
> >>>>> and so on.
> >>>>>
> >>>>> So I think that the solution for this problem is in caching LSID
> >>>>> responses on the server LSID stack. Basically, after resolving an LSID
> >>>>> once, your server should be able to resolve it again and again really
> >>>>> quickly, until something on the metadata that is related to that id changes.
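
A minimal sketch of that caching idea, assuming the server can be told when a record changes. MetadataCache and slow_lookup are hypothetical names; this does not describe how the existing LSID software stack behaves.

```python
class MetadataCache:
    def __init__(self, load_metadata):
        self._load = load_metadata   # the slow lookup, e.g. a database query
        self._cache = {}

    def get(self, lsid):
        """Return cached metadata, loading it only on the first request."""
        if lsid not in self._cache:
            self._cache[lsid] = self._load(lsid)
        return self._cache[lsid]

    def invalidate(self, lsid):
        """Call this when the metadata behind the LSID changes."""
        self._cache.pop(lsid, None)


calls = []


def slow_lookup(lsid):
    # Stand-in for the expensive query; records each time it is really run.
    calls.append(lsid)
    return {"lsid": lsid, "label": "placeholder"}


cache = MetadataCache(slow_lookup)
cache.get("urn:lsid:ubio.org:namebank:11815")
cache.get("urn:lsid:ubio.org:namebank:11815")
print(len(calls))
```

Repeat requests for the same LSID then cost a dictionary lookup, although a crawler walking every LSID once would still miss the cache on each call.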
> >>>>>
> >>>>> I haven't looked at this aspect of the LSID software stack, but
> >>>>> maybe others can say something about it. In any case I'll do some
> >>>>> research on it and get back to you.
> >>>>>
> >>>>> Again, thanks for bringing it up.
> >>>>>
> >>>>> Cheers,
> >>>>>
> >>>>> Ricardo
> >>>>>
> >>>>>
> >>>>> Sally Hinchcliffe wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>>> There are enough discontinuities in IPNI ids that 1,2,3 would quickly
> >>>>>> run into the sand. I agree it's not a new problem - I just hate to
> >>>>>> think I'm making life easier for the data scrapers
> >>>>>> Sally
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>> It can be a problem but I'm not sure if there is a simple solution ... and how different is the LSID crawler scenario from an http://www.ipni.org/ipni/plantsearch?id= 1,2,3,4,5 ... 9999999 scenario?
> >>>>>>>
> >>>>>>> Paul
> >>>>>>>
> >>>>>>> -----Original Message-----
> >>>>>>> From: tdwg-guid-bounces at mailman.nhm.ku.edu
> >>>>>>> [mailto:tdwg-guid-bounces at mailman.nhm.ku.edu]On Behalf Of Sally
> >>>>>>> Hinchcliffe
> >>>>>>> Sent: 15 June 2006 12:08
> >>>>>>> To: tdwg-guid at mailman.nhm.ku.edu
> >>>>>>> Subject: [Tdwg-guid] Throttling searches [ Scanned for viruses ]
> >>>>>>>
> >>>>>>>
> >>>>>>> Hi all
> >>>>>>> another question that has come up here.
> >>>>>>>
> >>>>>>> As discussed at the meeting, we're thinking of providing a complete
> >>>>>>> download of all IPNI LSIDs plus a label (name and author, probably)
> >>>>>>> which will be available as an annually produced download
> >>>>>>>
> >>>>>>> Most people will play nice and just resolve one or two LSIDs as
> >>>>>>> required, but by providing a complete list, we're making it very easy
> >>>>>>> for someone to write a crawler that hits every LSID in turn and
> >>>>>>> basically brings our server to its knees
> >>>>>>>
> >>>>>>> Anybody know of a good way of enforcing more polite behaviour? We can
> >>>>>>> make the download only available under a data supply agreement that
> >>>>>>> includes a clause limiting hit rates, or we could limit by IP address
> >>>>>>> (but this would ultimately block out services like Rod's simple
> >>>>>>> resolver). I believe Google's spell checker uses a key which has to
> >>>>>>> be passed in as part of the query - obviously we can't do that with
> >>>>>>> LSIDs
> >>>>>>>
> >>>>>>> Any thoughts? Anyone think this is a problem?
> >>>>>>>
> >>>>>>> Sally
> >>>>>>> *** Sally Hinchcliffe
> >>>>>>> *** Computer section, Royal Botanic Gardens, Kew
> >>>>>>> *** tel: +44 (0)20 8332 5708
> >>>>>>> *** S.Hinchcliffe at rbgkew.org.uk
> >>>>>>>
> >>>>>>>
> >>>>>>> _______________________________________________
> >>>>>>> TDWG-GUID mailing list
> >>>>>>> TDWG-GUID at mailman.nhm.ku.edu
> >>>>>>> http://mailman.nhm.ku.edu/mailman/listinfo/tdwg-guid
> >>>>>>>
> >>>> --
> >>>>
> >>>> -------------------------------------
> >>>> Roger Hyam
> >>>> Technical Architect
> >>>> Taxonomic Databases Working Group
> >>>> -------------------------------------
> >>>> http://www.tdwg.org
> >>>> roger at tdwg.org
> >>>> +44 1578 722782
> >>>> -------------------------------------
> >>>>
*** Sally Hinchcliffe
*** Computer section, Royal Botanic Gardens, Kew
*** tel: +44 (0)20 8332 5708
*** S.Hinchcliffe at rbgkew.org.uk