Right behind Sally on this one,
for there is potentially a bigger problem here than just the unwanted attention of data scrapers (I prefer 'harvesters', like army ants ... ;-) ...) because the effect of harvesting could be the same.
All Species did it to Index Fungorum. If they had 'cornered the market' by serving their harvest, rapidly going past its sell-by date (with an apologetic acknowledgement of source), it could have killed Index Fungorum, and then where would we all be? Picking up the pieces and trying to fill the gap between the death of IF and the birth of its replacement ...
Paul
-----Original Message-----
From: tdwg-guid-bounces@mailman.nhm.ku.edu [mailto:tdwg-guid-bounces@mailman.nhm.ku.edu] On Behalf Of Roderic Page
Sent: 19 June 2006 10:15
To: S.Hinchcliffe@kew.org
Cc: tdwg-guid@mailman.nhm.ku.edu
Subject: Re: [Tdwg-guid] Throttling searches
I gotta ask -- what is so bad about making life easy for data scrapers (of which I'm one)? Isn't this rather the point -- we WANT to make it easy :-)
But, I do realise that providers may run into a problem of being overwhelmed by requests (though, wouldn't that be nice -- people actually want your data).
The NCBI throttles by asking people not to hammer the service, and some people leave around half a sec between requests to avoid being blocked. Connotea is thinking of "making the trigger be >10 requests within the last 15 seconds; requests arriving faster than that will be give a 503 response with a Retry-After header.", if that makes any sense.
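For illustration only, here is a minimal sketch of the kind of trigger Connotea describes: more than 10 requests from one client within the last 15 seconds gets a 503 with a Retry-After header. The class name `SlidingWindowThrottle`, the `check` method and the way clients are identified are all assumptions made for this example, not Connotea's actual code.

```python
import math
from collections import defaultdict, deque


class SlidingWindowThrottle:
    """Per-client sliding-window limiter (hypothetical helper for illustration)."""

    def __init__(self, max_requests=10, window_seconds=15.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # client id -> timestamps (seconds) of recently accepted requests
        self._hits = defaultdict(deque)

    def check(self, client_id, now):
        """Return (status, headers) for a request arriving at time `now`."""
        hits = self._hits[client_id]
        # Forget accepted requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            # Time until the oldest remembered request leaves the window.
            wait = hits[0] + self.window_seconds - now
            return 503, {"Retry-After": str(max(1, math.ceil(wait)))}
        hits.append(now)
        return 200, {}
```

A web service would call `check()` with something like the client's IP address and the current time, and send back the returned status and headers. A polite client, in turn, would sleep for the Retry-After value before trying again.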
You could also provide a service for data scrapers where they can get an RDF dump of the IPNI names, rather than have to scrape them.
Regards
Rod
On 19 Jun 2006, at 10:02, Sally Hinchcliffe wrote:
It's not an LSID issue per se, but LSIDs will make it harder to slow searches down. For instance, Google restricts use of its spell checker to 1000 requests a day by use of a key which is passed in with each request. Obviously this can't be done with LSIDs, as then they wouldn't be the same for each user. The other reason why it's relevant to LSIDs is simply that providing a list of all relevant IPNI LSIDs (not necessary to the LSID implementation, but a nice-to-have for caching / lookups for other systems using our LSIDs) also makes it easier for the data scrapers to operate.
Also I thought ... here's a list full of clever people, perhaps they will have some suggestions.
Sally
Is this an LSID issue? LSIDs essentially provide a binding service between a name and one or more web services (we default to HTTP GET bindings). It isn't really up to the LSID authority to administer any policies regarding the web service but simply to point at it. It is up to the web service to do things like throttling, authentication and authorization.
Imagine, for example, that the different services had different policies. It may be reasonable not to restrict the getMetadata() calls but to restrict the getData() calls.
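As a rough sketch of what per-service policies could look like, the table below leaves getMetadata unrestricted and caps getData. The `Policy` class, the `is_allowed` helper and the figure of 30 calls per minute are invented for this example and are not part of any LSID software.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Policy:
    # Maximum calls per client per minute; None means unrestricted.
    max_per_minute: Optional[int]


# Hypothetical policy table: the limit for getData is a made-up number.
POLICIES = {
    "getMetadata": Policy(max_per_minute=None),
    "getData": Policy(max_per_minute=30),
}


def is_allowed(method, calls_in_last_minute):
    """Decide whether one more call to `method` is within that method's policy."""
    policy = POLICIES[method]
    if policy.max_per_minute is None:
        return True
    return calls_in_last_minute < policy.max_per_minute
```

The point of the sketch is only that the decision lives in the web service behind each binding, not in the LSID authority that points at it.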
The use of LSIDs does not create any new problems that weren't there with web page scraping - or scraping of any other web service.
Just my thoughts...
Roger
Ricardo Scachetti Pereira wrote:
Sally, you raised a really important issue that we had not really addressed at the meeting. Thanks for that.
I would say that we should not constrain the resolution of LSIDs if we expect our LSID infrastructure to work. LSIDs will be the basis of our architecture, so we had better have good support for that.
However, that is surely a limiting factor. Also, server efficiency will likely vary quite a lot, depending on underlying system optimizations and all.
So I think that the solution for this problem is in caching LSID responses on the server LSID stack. Basically, after resolving an LSID once, your server should be able to resolve it again and again really quickly, until something in the metadata that is related to that id changes.
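A minimal sketch of the caching idea above, assuming the slow lookup is wrapped in a function that takes an LSID and returns its metadata. `LsidCache`, `get` and `invalidate` are names made up for this example and do not come from the LSID software stack.

```python
class LsidCache:
    """In-memory cache in front of a slow resolver (hypothetical helper)."""

    def __init__(self, resolve):
        self._resolve = resolve  # slow function: lsid -> metadata
        self._cache = {}

    def get(self, lsid):
        """Resolve once, then answer repeat requests from memory."""
        if lsid not in self._cache:
            self._cache[lsid] = self._resolve(lsid)
        return self._cache[lsid]

    def invalidate(self, lsid):
        """Call this when the metadata behind `lsid` changes."""
        self._cache.pop(lsid, None)
```

Whatever code updates a record would need to call `invalidate()` for the affected LSID, so that the next request goes back to the underlying database.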
I haven't looked at this aspect of the LSID software stack, but maybe others can say something about it. In any case I'll do some research on it and get back to you.
Again, thanks for bringing it up. Cheers,
Ricardo
Sally Hinchcliffe wrote:
There are enough discontinuities in IPNI ids that 1,2,3 would quickly run into the sand. I agree it's not a new problem - I just hate to think I'm making life easier for the data scrapers.
Sally
It can be a problem but I'm not sure if there is a simple solution ... and how different is the LSID crawler scenario from an http://www.ipni.org/ipni/plantsearch?id= 1,2,3,4,5 ... 9999999 scenario?
Paul
-----Original Message-----
From: tdwg-guid-bounces@mailman.nhm.ku.edu [mailto:tdwg-guid-bounces@mailman.nhm.ku.edu] On Behalf Of Sally Hinchcliffe
Sent: 15 June 2006 12:08
To: tdwg-guid@mailman.nhm.ku.edu
Subject: [Tdwg-guid] Throttling searches
Hi all, another question that has come up here.
As discussed at the meeting, we're thinking of providing a complete list of all IPNI LSIDs plus a label (name and author, probably), which will be available as an annually produced download.
Most people will play nice and just resolve one or two LSIDs as required, but by providing a complete list, we're making it very easy for someone to write a crawler that hits every LSID in turn and basically brings our server to its knees.
Anybody know of a good way of enforcing more polite behaviour? We can make the download only available under a data supply agreement that includes a clause limiting hit rates, or we could limit by IP address (but this would ultimately block out services like Rod's simple resolver). I believe Google's spell checker uses a key which has to be passed in as part of the query - obviously we can't do that with LSIDs.
Any thoughts? Anyone think this is a problem?
Sally *** Sally Hinchcliffe *** Computer section, Royal Botanic Gardens, Kew *** tel: +44 (0)20 8332 5708 *** S.Hinchcliffe@rbgkew.org.uk
TDWG-GUID mailing list TDWG-GUID@mailman.nhm.ku.edu http://mailman.nhm.ku.edu/mailman/listinfo/tdwg-guid
--
Roger Hyam Technical Architect Taxonomic Databases Working Group
http://www.tdwg.org roger@tdwg.org
+44 1578 722782
------------------------------------------------------------------------ ---------------------------------------- Professor Roderic D. M. Page Editor, Systematic Biology DEEB, IBLS Graham Kerr Building University of Glasgow Glasgow G12 8QP United Kingdom
Phone: +44 141 330 4778 Fax: +44 141 330 2792 email: r.page@bio.gla.ac.uk web: http://taxonomy.zoology.gla.ac.uk/rod/rod.html iChat: aim://rodpage1962 reprints: http://taxonomy.zoology.gla.ac.uk/rod/pubs.html
Subscribe to Systematic Biology through the Society of Systematic Biologists Website: http://systematicbiology.org Search for taxon names: http://darwin.zoology.gla.ac.uk/~rpage/portal/ Find out what we know about a species: http://ispecies.org Rod's rants on phyloinformatics: http://iphylo.blogspot.com