[Tdwg-guid] Throttling searches [ Scanned for viruses ]

Mon Jun 19 11:14:33 CEST 2006

I gotta ask -- what is so bad about making life easy for data scrapers  
(of which I'm one)? Isn't this rather the point -- we WANT to make it  
easy :-)

But, I do realise that providers may run into a problem of being  
overwhelmed by requests (though, wouldn't that be nice -- people  
actually want your data).

The NCBI throttles by asking people not to hammer the service, and some
people leave around half a second between requests to avoid being blocked.
Connotea is thinking of "making the trigger be >10 requests within the
last 15 seconds; requests arriving faster than that will be given a 503
response with a Retry-After header", if that makes any sense.
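
To make the Connotea-style rule concrete, here is a rough Python sketch of a
per-client sliding window: up to 10 requests in any 15 seconds get through,
anything faster gets a 503 with a Retry-After header. The class and method
names are made up for illustration, clients are assumed to be identified by
IP address, and only accepted requests are counted against the window. This
is not Connotea's or NCBI's actual code, just one way it could be done.

```python
import math
import time
from collections import defaultdict, deque


class SlidingWindowThrottle:
    """Hypothetical helper: allow at most `limit` requests per client
    within any `window`-second span, otherwise answer 503."""

    def __init__(self, limit=10, window=15.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock  # injectable so the logic can be tested without sleeping
        self.hits = defaultdict(deque)  # client id -> timestamps of accepted requests

    def check(self, client):
        """Return (status, headers) for one incoming request from `client`."""
        now = self.clock()
        recent = self.hits[client]
        # Forget accepted requests that have fallen out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.limit:
            # Seconds until the oldest remembered request leaves the window.
            wait = self.window - (now - recent[0])
            return 503, {"Retry-After": str(max(1, math.ceil(wait)))}
        recent.append(now)
        return 200, {}
```

A polite client would read the Retry-After value and sleep that long before
trying again. Keying on IP address has the drawback Sally mentions (a shared
resolver looks like one greedy client), so the key could just as well be
something else.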

You could also provide a service for data scrapers where they can get  
an RDF dump of the IPNI names, rather than have to scrape them.

Regards

Rod

On 19 Jun 2006, at 10:02, Sally Hinchcliffe wrote:

> It's not an LSID issue per se, but LSIDs will make it harder to slow
> searches down. For instance, Google restricts use of its spell
> checker to 1000 a day by use of a key which is passed in with each
> request. Obviously this can't be done with LSIDs as then they
> wouldn't be the same for each user.
> The other reason why it's relevant to LSIDs is simply that providing
> a list of all relevant IPNI LSIDs (not necessary to the LSID
> implementation but a nice-to-have for caching / lookups for other
> systems using our LSIDs) also makes life easier for the data scrapers.
>
> Also I thought ... here's a list full of clever people perhaps they
> will have some suggestions
>
> Sally
>
>>
>> Is this an LSID issue? LSIDs essentially provide a binding service
>> between a name and one or more web services (we default to HTTP GET
>> bindings). It isn't really up to the LSID authority to administer any
>> policies regarding the web service but simply to point at it. It is up
>> to the web service to do things like throttling, authentication and
>> authorization.
>>
>> Imagine, for example, that the different services had different
>> policies. It may be reasonable not to restrict the getMetadata() calls
>> but to restrict the getData() calls.
>>
>> The use of LSIDs does not create any new problems that weren't there
>> with web page scraping - or scraping of any other web service.
>>
>> Just my thoughts...
>>
>> Roger
>>
>>
>> Ricardo Scachetti Pereira wrote:
>>>     Sally,
>>>
>>>     You raised a really important issue that we had not really
>>> addressed at the meeting. Thanks for that.
>>>
>>>     I would say that we should not constrain the resolution of LSIDs
>>> if we expect our LSID infrastructure to work. LSIDs will be the basis
>>> of our architecture so we better have good support for that.
>>>
>>>     However, that is sure a limiting factor. Also server efficiency
>>> will likely vary quite a lot, depending on underlying system
>>> optimizations and all.
>>>
>>>     So I think that the solution for this problem is in caching LSID
>>> responses on the server LSID stack. Basically, after resolving an
>>> LSID once, your server should be able to resolve it again and again
>>> really quickly, until something on the metadata that is related to
>>> that id changes.
>>>
>>>     I haven't looked at this aspect of the LSID software stack, but
>>> maybe others can say something about it. In any case I'll do some
>>> research on it and get back to you.
>>>
>>>     Again, thanks for bringing it up.
>>>
>>>     Cheers,
>>>
>>> Ricardo
>>>
>>>
>>> Sally Hinchcliffe wrote:
>>>
>>>> There are enough discontinuities in IPNI ids that 1,2,3 would
>>>> quickly run into the sand. I agree it's not a new problem - I just
>>>> hate to think I'm making life easier for the data scrapers
>>>> Sally
>>>>
>>>>
>>>>
>>>>
>>>>> It can be a problem but I'm not sure if there is a simple solution  
>>>>> ... and how different is the LSID crawler scenario from an  
>>>>> http://www.ipni.org/ipni/plantsearch?id= 1,2,3,4,5 ... 9999999  
>>>>> scenario?
>>>>>
>>>>> Paul
>>>>>
>>>>> -----Original Message-----
>>>>> From: tdwg-guid-bounces at mailman.nhm.ku.edu
>>>>> [mailto:tdwg-guid-bounces at mailman.nhm.ku.edu]On Behalf Of Sally
>>>>> Hinchcliffe
>>>>> Sent: 15 June 2006 12:08
>>>>> To: tdwg-guid at mailman.nhm.ku.edu
>>>>> Subject: [Tdwg-guid] Throttling searches [ Scanned for viruses ]
>>>>>
>>>>>
>>>>> Hi all
>>>>> another question that has come up here.
>>>>>
>>>>> As discussed at the meeting, we're thinking of providing a complete
>>>>> download of all IPNI LSIDs plus a label (name and author, probably)
>>>>> which will be available as an annually produced download
>>>>>
>>>>> Most people will play nice and just resolve one or two LSIDs as
>>>>> required, but by providing a complete list, we're making it very
>>>>> easy for someone to write a crawler that hits every LSID in turn
>>>>> and basically brings our server to its knees
>>>>>
>>>>> Anybody know of a good way of enforcing more polite behaviour? We
>>>>> can make the download only available under a data supply agreement
>>>>> that includes a clause limiting hit rates, or we could limit by IP
>>>>> address (but this would ultimately block out services like Rod's
>>>>> simple resolver). I believe Google's spell checker uses a key which
>>>>> has to be passed in as part of the query - obviously we can't do
>>>>> that with LSIDs
>>>>>
>>>>> Any thoughts? Anyone think this is a problem?
>>>>>
>>>>> Sally
>>>>> *** Sally Hinchcliffe
>>>>> *** Computer section, Royal Botanic Gardens, Kew
>>>>> *** tel: +44 (0)20 8332 5708
>>>>> *** S.Hinchcliffe at rbgkew.org.uk
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> TDWG-GUID mailing list
>>>>> TDWG-GUID at mailman.nhm.ku.edu
>>>>> http://mailman.nhm.ku.edu/mailman/listinfo/tdwg-guid
>>
>>
>> --  
>>
>> -------------------------------------
>>  Roger Hyam
>>  Technical Architect
>>  Taxonomic Databases Working Group
>> -------------------------------------
>>  http://www.tdwg.org
>>  roger at tdwg.org
>>  +44 1578 722782
>> -------------------------------------
>>
>>
------------------------------------------------------------------------
Professor Roderic D. M. Page
Editor, Systematic Biology
DEEB, IBLS
Graham Kerr Building
University of Glasgow
Glasgow G12 8QP
United Kingdom

Phone:    +44 141 330 4778
Fax:      +44 141 330 2792
email:    r.page at bio.gla.ac.uk
web:      http://taxonomy.zoology.gla.ac.uk/rod/rod.html
iChat:    aim://rodpage1962
reprints: http://taxonomy.zoology.gla.ac.uk/rod/pubs.html

Subscribe to Systematic Biology through the Society of Systematic
Biologists Website:  http://systematicbiology.org
Search for taxon names: http://darwin.zoology.gla.ac.uk/~rpage/portal/
Find out what we know about a species: http://ispecies.org
Rod's rants on phyloinformatics: http://iphylo.blogspot.com
