Re: [Tdwg-guid] Throttling searches

19 Jun 2006


      We do some of this already with our web services.  SOAP methods
require a keycode.  We use the code so that we have a contact in case we
need to send a message out, and so that we can give sources a better
accounting of how we pass on their content.  Patrick (uBio programmer
and nice guy) asked why not use the LSID version number as a way to
pass a token.  If no token is passed you fall back to the basic level of
processing; otherwise you give the request the extra special treatment
tied to the userID.  Or is this violating something sacred in the LSID ethos?
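
A minimal sketch of Patrick's idea, assuming the optional revision field of the LSID (urn:lsid:authority:namespace:object[:revision]) is used to carry the keycode. The registry REGISTERED_KEYCODES, the keycode "abc123" and the tier names are hypothetical, made up for illustration only.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]  # optional version field; here it carries a token


def parse_lsid(lsid: str) -> Lsid:
    """Split urn:lsid:authority:namespace:object[:revision] into its parts."""
    parts = lsid.split(":")
    if len(parts) not in (5, 6) or [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
        raise ValueError(f"not an LSID: {lsid!r}")
    revision = parts[5] if len(parts) == 6 and parts[5] else None
    return Lsid(parts[2], parts[3], parts[4], revision)


# Hypothetical registry of issued keycodes; a real service would use a database.
REGISTERED_KEYCODES = {"abc123"}


def service_level(lsid: str) -> str:
    """Return 'priority' for a known keycode in the revision slot, else 'basic'."""
    parsed = parse_lsid(lsid)
    if parsed.revision in REGISTERED_KEYCODES:
        return "priority"
    return "basic"
```

Under these assumptions, service_level("urn:lsid:ubio.org:namebank:11815") falls back to the basic tier, while the same LSID with ":abc123" appended gets priority. Note that this makes the identifier differ from user to user, which is the objection raised further down the thread.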

David Remsen


On Jun 19, 2006, at 6:07 AM, Roger Hyam wrote:
...
You don't! The LSID resolves to the binding to the getMetadata()  
method - which is a plain old fashioned URL. At this point the LSID  
authority has done its duty and we are just on a plain HTTP GET  
call so you can do whatever you can do with any regular HTTP GET.  
You could stipulate another header field or (more simply) give  
priority service for those who append a valid user id to the URL  
(&user_id=12345)
So there is no throttle on resolving the LSID to the getMetadata  
binding (which is cheap) but there is a throttle on the actual call  
to get the metadata method. Really you need to do this because bad  
people may be able to tell from the URL how to scrape the source  
and bypass the LSID resolver after the first call anyhow. This is  
especially true if the URL contains the IPNI record ID which is  
likely.
Here is an example using Rod's tester.
http://linnaeus.zoology.gla.ac.uk/~rpage/lsid/tester/?q=urn:lsid:ubio.org:namebank:11815
The getMetadata() method for this LSID:
urn:lsid:ubio.org:namebank:11815
Is bound to this URL:
http://names.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank:11815
So ubio would just have to give preferential services to calls like  
this:
http://names.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank:11815&user_id=rogerhyam1392918790
If rogerhyam had paid his membership fees this year.
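
As a rough sketch of how the service behind that URL might decide on preferential treatment: the function below pulls the lsid and user_id parameters out of the query string and looks the user up. The PAID_UP_USERS set and the tier names are assumptions for illustration, not anything uBio actually runs.

```python
from typing import Tuple
from urllib.parse import parse_qs, urlsplit

# Hypothetical set of user ids whose membership is paid up.
PAID_UP_USERS = {"rogerhyam1392918790"}


def classify_request(url: str) -> Tuple[str, str]:
    """Return (lsid, tier) for a metadata URL; tier is 'priority' or 'standard'."""
    query = parse_qs(urlsplit(url).query)
    lsid = query.get("lsid", [""])[0]
    user_id = query.get("user_id", [""])[0]
    # A missing or unknown user_id still gets served, just without priority.
    tier = "priority" if user_id in PAID_UP_USERS else "standard"
    return lsid, tier
```

The plain getMetadata URL is classified as standard, and the same URL with a recognised user_id appended is classified as priority; the LSID itself is untouched in both cases.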
Does this make sense?
Roger
p.s. You could do this on the web pages as well with a clever  
little thing to write dynamic tokens into the links so that it  
doesn't degrade the regular browsing experience and only stops  
scrapers - but that is beyond my remit at the moment ;)
p.p.s. You could wrap this in https if you were paranoid about  
people stealing tokens - but this is highly unlikely I believe.
Sally Hinchcliffe wrote:
...
How can we pass a token with an LSID?
...
I think the only way to throttle in these situations is to have some
notion of who the client is and the only way to do that is to have some
kind of token exchange over HTTP saying who they are. Basically you have
to have some kind of client registration system or you can never
distinguish between a call from a new client and a repeat call. The use
of IP address is a pain because so many people are now behind some kind
of NAT gateway.
How about this for a plan:
You could give a degraded service to people who don't pass a token (a 5
second delay perhaps) and offer a quicker service to registered users
who pass a token (but then perhaps limit the number of calls they make).
This would mean you could offer a universal service even to those with
naive client software but a better service to those who play nicely. You
could also get better stats on who is using the service.
So there are ways that this could be done. I expect people will come up
with a host of different ways. It is outside LSIDs though.
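
The plan above could be sketched roughly as follows. The 5 second delay comes from the message; the quota of 1000 calls, the one-day window and the Throttle class itself are assumptions chosen for illustration.

```python
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Iterable, Optional

ANONYMOUS_DELAY_SECONDS = 5.0   # the "5 second delay perhaps" from the plan
MAX_CALLS_PER_WINDOW = 1000     # assumed quota for registered clients
WINDOW_SECONDS = 24 * 60 * 60   # assumed window: one day


class Throttle:
    def __init__(self, registered_tokens: Iterable[str]):
        self.registered = set(registered_tokens)
        # Timestamps of recent calls, kept per registered token.
        self.calls: Dict[str, Deque[float]] = defaultdict(deque)

    def delay_for(self, token: Optional[str], now: Optional[float] = None) -> Optional[float]:
        """Seconds to wait before serving, or None if the call should be refused."""
        now = time.time() if now is None else now
        if token not in self.registered:
            # No token or an unknown one: still served, but slowly.
            return ANONYMOUS_DELAY_SECONDS
        recent = self.calls[token]
        # Forget calls that have fallen out of the window.
        while recent and now - recent[0] >= WINDOW_SECONDS:
            recent.popleft()
        if len(recent) >= MAX_CALLS_PER_WINDOW:
            return None
        recent.append(now)
        return 0.0
```

A caller would sleep for the returned number of seconds before answering, and turn a None into whatever refusal the service prefers. The per-token call log also gives the usage stats mentioned above more or less for free.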
Roger
Sally Hinchcliffe wrote:
...
It's not an LSID issue per se, but LSIDs will make it harder to slow
searches down. For instance, Google restricts use of its spell
checker to 1000 a day by use of a key which is passed in with each
request. Obviously this can't be done with LSIDs as then they
wouldn't be the same for each user.
The other reason why it's relevant to LSIDs is simply that providing
a list of all relevant IPNI LSIDs (not necessary to the LSID
implementation but a nice to have for caching / lookups for other
systems using our LSIDs) also makes life easier for the datascrapers
to operate.
Also I thought ... here's a list full of clever people perhaps they
will have some suggestions
Sally
...
Is this an LSID issue? LSIDs essentially provide a binding service between
a name and one or more web services (we default to HTTP GET bindings).
It isn't really up to the LSID authority to administer any policies
regarding the web service but simply to point at it. It is up to the web
service to do things like throttling, authentication and authorization.
Imagine, for example, that the different services had different
policies. It may be reasonable not to restrict the getMetadata() calls
but to restrict the getData() calls.
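
One way to picture per-service policies is a small table keyed by method name, consulted by the web service rather than by the LSID authority. This is only a sketch: the Policy fields, the daily limit of 1000 and the is_allowed helper are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Policy:
    requires_token: bool
    max_calls_per_day: Optional[int]  # None means unlimited


# Assumed policies: open getMetadata(), restricted getData().
POLICIES = {
    "getMetadata": Policy(requires_token=False, max_calls_per_day=None),
    "getData": Policy(requires_token=True, max_calls_per_day=1000),
}


def is_allowed(method: str, has_token: bool, calls_today: int) -> bool:
    """Apply the policy for one method to a single incoming call."""
    policy = POLICIES[method]
    if policy.requires_token and not has_token:
        return False
    if policy.max_calls_per_day is not None and calls_today >= policy.max_calls_per_day:
        return False
    return True
```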
The use of LSIDs does not create any new problems that weren't there
with web page scraping - or scraping of any other web service.
Just my thoughts...
Roger
Ricardo Scachetti Pereira wrote:
...
Sally,
You raised a really important issue that we had not really addressed
at the meeting. Thanks for that.
I would say that we should not constrain the resolution of LSIDs if
we expect our LSID infrastructure to work. LSIDs will be the basis of
our architecture so we had better have good support for that.
However, that is surely a limiting factor. Also server efficiency will
likely vary quite a lot, depending on underlying system optimizations
and all.
So I think that the solution for this problem is in caching LSID
responses on the server LSID stack. Basically, after resolving an LSID
once, your server should be able to resolve it again and again really
quickly, until something in the metadata that is related to that id changes.
I haven't looked at this aspect of the LSID software stack, but
maybe others can say something about it. In any case I'll do some
research on it and get back to you.
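
A minimal sketch of that caching idea, not taken from the actual LSID software stack: the cache keeps the last metadata document per LSID together with a cheap change marker (a last-modified timestamp or record version, say) and only rebuilds when the marker moves. The MetadataCache class and both callables are hypothetical.

```python
from typing import Callable, Dict, Tuple


class MetadataCache:
    """Cache metadata per LSID, keyed on a change marker for the underlying record."""

    def __init__(self, build: Callable[[str], str], marker: Callable[[str], object]):
        self._build = build      # expensive: assembles the metadata document
        self._marker = marker    # cheap: e.g. a last-modified timestamp (assumed)
        self._store: Dict[str, Tuple[object, str]] = {}

    def get(self, lsid: str) -> str:
        current = self._marker(lsid)
        cached = self._store.get(lsid)
        if cached is not None and cached[0] == current:
            return cached[1]     # record unchanged: serve the stored document
        document = self._build(lsid)
        self._store[lsid] = (current, document)
        return document
```

This only helps if checking the marker is much cheaper than building the response; if the marker lookup hits the same database as the full query, the saving is smaller.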
Again, thanks for bringing it up.
Cheers,
Ricardo
Sally Hinchcliffe wrote:
> There are enough discontinuities in IPNI ids that 1,2,3 would quickly
> run into the sand. I agree it's not a new problem - I just hate to
> think I'm making life easier for the data scrapers
> Sally
>
>> It can be a problem but I'm not sure if there is a simple  
>> solution ... and how different is the LSID crawler scenario  
>> from an http://www.ipni.org/ipni/plantsearch?id=  
>> 1,2,3,4,5 ... 9999999 scenario?
>>
>> Paul
>>
>> -----Original Message-----
>> From: tdwg-guid-bounces@mailman.nhm.ku.edu
>> [mailto:tdwg-guid-bounces@mailman.nhm.ku.edu]On Behalf Of Sally
>> Hinchcliffe
>> Sent: 15 June 2006 12:08
>> To: tdwg-guid@mailman.nhm.ku.edu
>> Subject: [Tdwg-guid] Throttling searches [ Scanned for viruses ]
>>
>>
>> Hi all
>> another question that has come up here.
>>
>> As discussed at the meeting, we're thinking of providing a complete
>> download of all IPNI LSIDs plus a label (name and author, probably)
>> which will be available as an annually produced download
>>
>> Most people will play nice and just resolve one or two LSIDs as
>> required, but by providing a complete list, we're making it very easy
>> for someone to write a crawler that hits every LSID in turn and
>> basically brings our server to its knees
>>
>> Anybody know of a good way of enforcing more polite behaviour? We can
>> make the download only available under a data supply agreement that
>> includes a clause limiting hit rates, or we could limit by IP address
>> (but this would ultimately block out services like Rod's simple
>> resolver). I believe Google's spell checker uses a key which has to
>> be passed in as part of the query - obviously we can't do that with
>> LSIDs
>>
>> Any thoughts? Anyone think this is a problem?
>>
>> Sally
>> *** Sally Hinchcliffe
>> *** Computer section, Royal Botanic Gardens, Kew
>> *** tel: +44 (0)20 8332 5708
>> *** S.Hinchcliffe@rbgkew.org.uk
>>
>>
>> _______________________________________________
>> TDWG-GUID mailing list
>> TDWG-GUID@mailman.nhm.ku.edu
>> http://mailman.nhm.ku.edu/mailman/listinfo/tdwg-guid
--
-------------------------------------
 Roger Hyam
 Technical Architect
 Taxonomic Databases Working Group
-------------------------------------
 http://www.tdwg.org
 roger@tdwg.org
 +44 1578 722782
-------------------------------------
_______________________________________________
David Remsen
uBio Project Manager
Marine Biological Laboratory
Woods Hole, MA 02543
508-289-7632