[Tdwg-guid] Throttling searches [ Scanned for viruses ]

Paul Kirk p.kirk at cabi.org
Mon Jun 19 11:29:33 CEST 2006


Right behind Sally on this one.

There is a potentially bigger problem here than just the unwanted
attention of data scrapers - I prefer 'harvesters', like army ants ...
;-) ... because the effect of harvesting could be the same.

All Species did it to Index Fungorum. If they had 'cornered the
market', serving their rapidly-past-its-sell-by-date harvest (with an
apologetic acknowledgement of source), it could have killed Index
Fungorum, and then where would we all be? Picking up the pieces and
trying to fill the gap between the death of IF and the birth of its
replacement ...

Paul

-----Original Message-----
From: tdwg-guid-bounces at mailman.nhm.ku.edu
[mailto:tdwg-guid-bounces at mailman.nhm.ku.edu]On Behalf Of Roderic Page
Sent: 19 June 2006 10:15
To: S.Hinchcliffe at kew.org
Cc: tdwg-guid at mailman.nhm.ku.edu
Subject: Re: [Tdwg-guid] Throttling searches [ Scanned for viruses ]


I gotta ask -- what is so bad about making life easy for data scrapers  
(of which I'm one)? Isn't this rather the point -- we WANT to make it  
easy :-)

But I do realise that providers may run into a problem of being
overwhelmed by requests (though wouldn't that be nice -- people
actually want your data).

The NCBI throttles by asking people not to hammer the service, and
some people leave around half a second between requests to avoid being
blocked. Connotea is thinking of "making the trigger be >10 requests
within the last 15 seconds; requests arriving faster than that will be
given a 503 response with a Retry-After header", if that makes any
sense.
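
Something like the following would implement that rule (just a sketch,
obviously -- the 10-requests-in-15-seconds figures are Connotea's, the
rest is made up):

import time
from collections import defaultdict, deque

WINDOW_SECONDS = 15   # the "last 15 seconds" window quoted above
MAX_REQUESTS = 10     # the ">10 requests" trigger quoted above

_recent = defaultdict(deque)   # client address -> recent request times

def check_rate(client_addr):
    """Return (allowed, retry_after_seconds) for one incoming request."""
    now = time.time()
    window = _recent[client_addr]
    # forget requests that have fallen out of the 15-second window
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    if len(window) >= MAX_REQUESTS:
        # tell the client when the oldest request drops out of the window
        retry_after = int(WINDOW_SECONDS - (now - window[0])) + 1
        return False, retry_after
    window.append(now)
    return True, 0

# In the request handler: if not allowed, answer with status 503 and a
# "Retry-After: <seconds>" header instead of doing the work.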

You could also provide a service for data scrapers where they can get
an RDF dump of the IPNI names, rather than having to scrape them.
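
And being a polite harvester isn't hard either -- something along
these lines, leaving the half second between requests mentioned above
and backing off when a server answers 503 with Retry-After (the
resolver URL in the usage comment is made up):

import time
import urllib.request
from urllib.error import HTTPError

PAUSE_SECONDS = 0.5   # roughly the gap NCBI asks people to leave

def fetch_politely(urls):
    """Yield (url, body) pairs, pausing between requests and honouring
    a 503 + Retry-After answer by sleeping and trying again."""
    for url in urls:
        while True:
            try:
                with urllib.request.urlopen(url) as response:
                    yield url, response.read()
                break
            except HTTPError as err:
                if err.code == 503:
                    wait = int(err.headers.get("Retry-After", "15"))
                    time.sleep(wait)        # server asked us to back off
                else:
                    raise
        time.sleep(PAUSE_SECONDS)           # be nice between requests

# Usage (made-up resolver URL and identifiers):
#   for url, body in fetch_politely(
#           "http://example.org/authority/metadata?lsid=%d" % n
#           for n in range(1, 100)):
#       ...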

Regards

Rod




On 19 Jun 2006, at 10:02, Sally Hinchcliffe wrote:

> It's not an LSID issue per se, but LSIDs will make it harder to slow
> searches down. For instance, Google restricts use of its spell
> checker to 1,000 requests a day by means of a key which is passed in
> with each request. Obviously this can't be done with LSIDs, as then
> they wouldn't be the same for each user.
> The other reason why it's relevant to LSIDs is simply that providing
> a list of all relevant IPNI LSIDs (not necessary to the LSID
> implementation, but a nice-to-have for caching / lookups by other
> systems using our LSIDs) also makes life easier for the data
> scrapers.
>
> Also, I thought ... here's a list full of clever people; perhaps
> they will have some suggestions.
>
> Sally
>
>>
>> Is this an LSID issue? LSIDs essentially provide a binding service
>> between a name and one or more web services (we default to HTTP GET
>> bindings). It isn't really up to the LSID authority to administer
>> any policies regarding the web service; it simply points at it. It
>> is up to the web service to do things like throttling,
>> authentication and authorization.
>>
>> Imagine, for example, that the different services had different
>> policies. It may be reasonable not to restrict the getMetadata() calls
>> but to restrict the getData() calls.
>>
>> The use of LSIDs does not create any new problems that weren't there
>> with web page scraping - or scraping of any other web service.
>>
>> Just my thoughts...
>>
>> Roger
>>
>>
>> Ricardo Scachetti Pereira wrote:
>>>     Sally,
>>>
>>>     You raised a really important issue that we had not really
>>> addressed at the meeting. Thanks for that.
>>>
>>>     I would say that we should not constrain the resolution of
>>> LSIDs if we expect our LSID infrastructure to work. LSIDs will be
>>> the basis of our architecture, so we had better have good support
>>> for them.
>>>
>>>     However, that is surely a limiting factor. Also, server
>>> efficiency will likely vary quite a lot, depending on underlying
>>> system optimizations and so on.
>>>
>>>     So I think that the solution to this problem is caching LSID
>>> responses in the server LSID stack. Basically, after resolving an
>>> LSID once, your server should be able to resolve it again and
>>> again really quickly, until something in the metadata related to
>>> that id changes.
>>>
>>>     I haven't looked at this aspect of the LSID software stack, but
>>> maybe others can say something about it. In any case I'll do some
>>> research on it and get back to you.
>>>
>>>     Again, thanks for bringing it up.
>>>
>>>     Cheers,
>>>
>>> Ricardo
>>>
>>>
>>> Sally Hinchcliffe wrote:
>>>
>>>> There are enough discontinuities in IPNI ids that 1, 2, 3 would
>>>> quickly run into the sand. I agree it's not a new problem - I
>>>> just hate to think I'm making life easier for the data scrapers.
>>>> Sally
>>>>
>>>>
>>>>
>>>>
>>>>> It can be a problem but I'm not sure if there is a simple solution  
>>>>> ... and how different is the LSID crawler scenario from an  
>>>>> http://www.ipni.org/ipni/plantsearch?id= 1,2,3,4,5 ... 9999999  
>>>>> scenario?
>>>>>
>>>>> Paul
>>>>>
>>>>> -----Original Message-----
>>>>> From: tdwg-guid-bounces at mailman.nhm.ku.edu
>>>>> [mailto:tdwg-guid-bounces at mailman.nhm.ku.edu]On Behalf Of Sally
>>>>> Hinchcliffe
>>>>> Sent: 15 June 2006 12:08
>>>>> To: tdwg-guid at mailman.nhm.ku.edu
>>>>> Subject: [Tdwg-guid] Throttling searches [ Scanned for viruses ]
>>>>>
>>>>>
>>>>> Hi all,
>>>>> Another question that has come up here.
>>>>>
>>>>> As discussed at the meeting, we're thinking of providing a
>>>>> complete download of all IPNI LSIDs plus a label (name and
>>>>> author, probably), which will be produced annually.
>>>>>
>>>>> Most people will play nice and just resolve one or two LSIDs as
>>>>> required, but by providing a complete list we're making it very
>>>>> easy for someone to write a crawler that hits every LSID in turn
>>>>> and basically brings our server to its knees.
>>>>>
>>>>> Anybody know of a good way of enforcing more polite behaviour?
>>>>> We could make the download available only under a data supply
>>>>> agreement that includes a clause limiting hit rates, or we could
>>>>> limit by IP address (but this would ultimately block out
>>>>> services like Rod's simple resolver). I believe Google's spell
>>>>> checker uses a key which has to be passed in as part of the
>>>>> query - obviously we can't do that with LSIDs.
>>>>>
>>>>> Any thoughts? Anyone think this is a problem?
>>>>>
>>>>> Sally
>>>>> *** Sally Hinchcliffe
>>>>> *** Computer section, Royal Botanic Gardens, Kew
>>>>> *** tel: +44 (0)20 8332 5708
>>>>> *** S.Hinchcliffe at rbgkew.org.uk
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> TDWG-GUID mailing list
>>>>> TDWG-GUID at mailman.nhm.ku.edu
>>>>> http://mailman.nhm.ku.edu/mailman/listinfo/tdwg-guid
>>>>>
>>>>>
>>>> *** Sally Hinchcliffe
>>>> *** Computer section, Royal Botanic Gardens, Kew
>>>> *** tel: +44 (0)20 8332 5708
>>>> *** S.Hinchcliffe at rbgkew.org.uk
>>>>
>>>>
>>>> _______________________________________________
>>>> TDWG-GUID mailing list
>>>> TDWG-GUID at mailman.nhm.ku.edu
>>>> http://mailman.nhm.ku.edu/mailman/listinfo/tdwg-guid
>>>>
>>>>
>>>>
>>>
>>>
>>> _______________________________________________
>>> TDWG-GUID mailing list
>>> TDWG-GUID at mailman.nhm.ku.edu
>>> http://mailman.nhm.ku.edu/mailman/listinfo/tdwg-guid
>>>
>>>
>>
>>
>> --  
>>
>> -------------------------------------
>>  Roger Hyam
>>  Technical Architect
>>  Taxonomic Databases Working Group
>> -------------------------------------
>>  http://www.tdwg.org
>>  roger at tdwg.org
>>  +44 1578 722782
>> -------------------------------------
>>
>>
>
> *** Sally Hinchcliffe
> *** Computer section, Royal Botanic Gardens, Kew
> *** tel: +44 (0)20 8332 5708
> *** S.Hinchcliffe at rbgkew.org.uk
>
>
> _______________________________________________
> TDWG-GUID mailing list
> TDWG-GUID at mailman.nhm.ku.edu
> http://mailman.nhm.ku.edu/mailman/listinfo/tdwg-guid
>
>
----------------------------------------------------------------------
Professor Roderic D. M. Page
Editor, Systematic Biology
DEEB, IBLS
Graham Kerr Building
University of Glasgow
Glasgow G12 8QP
United Kingdom

Phone:    +44 141 330 4778
Fax:      +44 141 330 2792
email:    r.page at bio.gla.ac.uk
web:      http://taxonomy.zoology.gla.ac.uk/rod/rod.html
iChat:    aim://rodpage1962
reprints: http://taxonomy.zoology.gla.ac.uk/rod/pubs.html

Subscribe to Systematic Biology through the Society of Systematic
Biologists Website:  http://systematicbiology.org
Search for taxon names: http://darwin.zoology.gla.ac.uk/~rpage/portal/
Find out what we know about a species: http://ispecies.org
Rod's rants on phyloinformatics: http://iphylo.blogspot.com



		


_______________________________________________
TDWG-GUID mailing list
TDWG-GUID at mailman.nhm.ku.edu
http://mailman.nhm.ku.edu/mailman/listinfo/tdwg-guid



