Re: [Tdwg-guid] Throttling searches [ Scanned for viruses ]

20 Jun 2006

Steve,
OK thanks, I get that now: the LSID is an element of a query response.  I guess that leads to more questions about where the LSID goes in which response formats.  Do we just add an LSID concept to Darwin Core, for instance?

But, if SPARQL is a protocol, how does it layer on top of or parallel to all the other protocols that TDWG is attempting to standardize?  Or are we proposing it as the "protocol to rule them all"?

At least with DIGIR, imperfect as it is, it was clear that it was both query/response and protocol. (The complete flow diagram fits on one screen.)  As we break this all apart and grow it, I think it is becoming difficult for those outside to follow the model, which makes it more important to describe the complete query, protocol and response stack for the TDWG membership that will be called on to recommend and vote on it.

Which reminds me that we still need the Rosetta Stone that resolves all these things that are on the table:  DIGIR, BioCASE, TAPIR, Wasabi, SPARQL, LSID, PURL, RDF, OWL, OWL-DL, WSDL, SOAP, HTTP GET plus XML Schema - Darwin Core (base plus extensions like GML), ABCD, TCS, SDD and more.  And it needs to resolve them in a way that the general TDWG membership can fully grasp during the upcoming TDWG meeting.

Chuck

________________________________

From: Steven Perry [mailto:smperry@ku.edu]
Sent: Mon 6/19/2006 10:22 PM
To: Chuck Miller
Cc: S.Hinchcliffe@kew.org; roger@tdwg.org; tdwg-guid@mailman.nhm.ku.edu
Subject: Re: [Tdwg-guid] Throttling searches [ Scanned for viruses ]

Hi Chuck, 

I've been thinking of the case you describe as a query operation.  A
query operation would take match conditions as input and, when applied
to a set of RDF metadata, return either an RDF graph or values bound to
variables (analogous to an SQL select statement).  Either type of output
may contain references to other data objects by LSID, which would have to
be resolved by clients.

This query operation is not supported by the LSID spec and requires a 
distinct service.  We've implemented SPARQL as the query service for 
DiGIR2 (now called Wasabi).  SPARQL is a W3C candidate recommendation 
and is both a query language and a protocol. 

See the following for more information: 

http://www.w3.org/TR/rdf-sparql-query/ 
http://www.w3.org/TR/rdf-sparql-protocol/ 

-Steve 
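
As a rough illustration of the query operation described above, here is a minimal Python sketch that builds a paged SPARQL SELECT for the "Get all LSIDs where Family = Orchidaceae" case. The ex: namespace, the ex:family predicate and the default page size are assumptions made up for this example; they are not part of any TDWG vocabulary or of Wasabi. LIMIT and OFFSET are standard SPARQL solution modifiers, and paging with them is one simple way a service could cap the size of each response.

```python
# Hypothetical vocabulary: the ex: namespace and ex:family predicate are
# placeholders, not a real TDWG or Wasabi schema.
PREFIX = "PREFIX ex: <http://example.org/terms#>"


def family_query(family: str, page: int, page_size: int = 100) -> str:
    """Return a SPARQL SELECT for one page of subjects in a family."""
    if page < 0 or page_size <= 0:
        raise ValueError("page must be >= 0 and page_size must be > 0")
    # Escape backslashes and double quotes so the literal stays well-formed.
    literal = family.replace("\\", "\\\\").replace('"', '\\"')
    return "\n".join([
        PREFIX,
        "SELECT ?lsid",
        f'WHERE {{ ?lsid ex:family "{literal}" }}',
        "ORDER BY ?lsid",           # stable order so pages do not overlap
        f"LIMIT {page_size}",       # cap on results per request
        f"OFFSET {page * page_size}",
    ])


if __name__ == "__main__":
    print(family_query("Orchidaceae", page=2))
```

The client still has to resolve each returned LSID separately, so a cap on the query does not replace a throttle on the getMetadata calls.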

Chuck Miller wrote:
...
This is probably a dumb question and exposes my ignorance, but what if
the originating query is actually "Get all LSIDs where Family =
Orchidaceae"?  That seems the more likely scenario to me, rather than
getting one LSID.  And that's the one that needs a throttle.
Chuck
------------------------------------------------------------------------ 
From: Sally Hinchcliffe [mailto:S.Hinchcliffe@kew.org]
Sent: Mon 6/19/2006 7:01 AM
To: roger@tdwg.org
Cc: tdwg-guid@mailman.nhm.ku.edu
Subject: Re: [Tdwg-guid] Throttling searches [ Scanned for viruses ]
Hi Roger 
Thanks for this ... I _think_ I understand it but Nicky is on leave 
this week so I won't know if I do or not till after she returns
The system doesn't have to be completely villain proof, just slow 
down most of the villains so everyone else can get a look in 
Sally
...
You don't! The LSID resolves to the binding to the getMetadata() method,
which is a plain old-fashioned URL. At this point the LSID authority
has done its duty and we are just on a plain HTTP GET call, so you can do
whatever you can do with any regular HTTP GET. You could stipulate
another header field or (more simply) give priority service to those
who append a valid user id to the URL (&user_id=12345).
So there is no throttle on resolving the LSID to the getMetadata binding
(which is cheap) but there is a throttle on the actual call to the
getMetadata method. Really you need to do this because bad people may be
able to tell from the URL how to scrape the source and bypass the LSID
resolver after the first call anyhow. This is especially true if the URL
contains the IPNI record ID, which is likely.
Here is an example using Rod's tester.
http://linnaeus.zoology.gla.ac.uk/~rpage/lsid/tester/?q=urn:lsid:ubio.org:na... 
<http://linnaeus.zoology.gla.ac.uk/%7Erpage/lsid/tester/?q=urn:lsid:ubio.org:namebank:11815>
...
The getMetadata() method for this LSID:
urn:lsid:ubio.org:namebank:11815
Is bound to this URL:
http://names.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank...
...
So ubio would just have to give preferential service to calls like this:
...
http://names.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank:11815&user_id=rogerhyam1392918790 
...
If rogerhyam had paid his membership fees this year.
Does this make sense?
Roger 
p.s. You could do this on the web pages as well with a clever little 
thing to write dynamic tokens into the links so that it doesn't degrade 
the regular browsing experience and only stops scrapers - but that is 
beyond my remit at the moment ;)
p.p.s. You could wrap this in https if you were paranoid about people 
stealing tokens - but this is highly unlikely I believe.
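
A minimal Python sketch of the check described above: the service behind the getMetadata URL looks for a user_id query parameter and decides whether the caller gets priority. The PAID_UP_USERS set and the has_priority name are hypothetical; a real service would look the id up in its own membership records.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical registry of paid-up members; stands in for a real lookup.
PAID_UP_USERS = {"rogerhyam1392918790"}


def has_priority(url: str) -> bool:
    """True if the getMetadata URL carries the user_id of a paid-up member."""
    query = parse_qs(urlsplit(url).query)
    user_ids = query.get("user_id", [])   # absent parameter -> no priority
    return any(uid in PAID_UP_USERS for uid in user_ids)


if __name__ == "__main__":
    base = ("http://names.ubio.org/authority/metadata.php"
            "?lsid=urn:lsid:ubio.org:namebank:11815")
    print(has_priority(base))
    print(has_priority(base + "&user_id=rogerhyam1392918790"))
```

The LSID itself stays the same for every user; only the URL it is bound to carries the extra parameter.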
Sally Hinchcliffe wrote:
...
How can we pass a token with an LSID?
...
I think the only way to throttle in these situations is to have some
notion of who the client is and the only way to do that is to have some
kind of token exchange over HTTP saying who they are. Basically you have
to have some kind of client registration system or you can never
distinguish between a call from a new client and a repeat call. The use
of IP address is a pain because so many people are now behind some kind
of NAT gateway.
How about this for a plan:
You could give a degraded service to people who don't pass a token (a 5
second delay perhaps) and offer a quicker service to registered users
who pass a token (but then perhaps limit the number of calls they make).
This would mean you could offer a universal service even to those with
naive client software but a better service to those who play nicely. You
could also get better stats on who is using the service.
So there are ways that this could be done. I expect people will come up
with a host of different ways. It is outside LSIDs though.
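
The plan above could be sketched roughly as follows in Python. The 5 second delay comes from the message; the daily quota of 1000, the Throttle class and its method names are assumptions for illustration only.

```python
import time
from collections import defaultdict
from typing import Optional

ANONYMOUS_DELAY_S = 5.0   # the "5 second delay perhaps" from the plan
DAILY_QUOTA = 1000        # assumed cap on calls per registered token per day


class Throttle:
    """Slow service for unknown callers, quota-limited fast service for tokens."""

    def __init__(self, registered_tokens, clock=time.time):
        self._registered = set(registered_tokens)
        self._clock = clock
        self._calls = defaultdict(int)   # (token, day number) -> calls so far

    def delay_for(self, token: Optional[str]) -> Optional[float]:
        """Seconds to wait before serving, or None to refuse the call."""
        if token not in self._registered:
            return ANONYMOUS_DELAY_S     # degraded but universal service
        day = int(self._clock() // 86400)
        if self._calls[(token, day)] >= DAILY_QUOTA:
            return None                  # registered, but over today's quota
        self._calls[(token, day)] += 1
        return 0.0


if __name__ == "__main__":
    throttle = Throttle({"abc123"})
    print(throttle.delay_for(None), throttle.delay_for("abc123"))
```

The per-token counters also give the usage statistics mentioned above. This sketch never discards old counters, so a long-running service would need to prune them.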
Roger
Sally Hinchcliffe wrote:
...
It's not an LSID issue per se, but LSIDs will make it harder to slow
searches down. For instance, Google restricts use of its spell
checker to 1000 a day by use of a key which is passed in with each
request. Obviously this can't be done with LSIDs as then they
wouldn't be the same for each user.
The other reason why it's relevant to LSIDs is simply that providing
a list of all relevant IPNI LSIDs (not necessary to the LSID
implementation but a nice to have for caching / lookups for other
systems using our LSIDs) also makes life easier for the datascrapers
to operate
Also I thought ... here's a list full of clever people perhaps they
will have some suggestions
Sally
...
Is this an LSID issue? LSIDs essentially provide a binding service
between a name and one or more web services (we default to HTTP GET
bindings). It isn't really up to the LSID authority to administer any
policies regarding the web service but simply to point at it. It is up
to the web service to do things like throttling, authentication and
authorization.
Imagine, for example, that the different services had different
policies. It may be reasonable not to restrict the getMetadata() calls
but to restrict the getData() calls.
The use of LSIDs does not create any new problems that weren't there
with web page scraping - or scraping of any other web service.
Just my thoughts...
Roger
Ricardo Scachetti Pereira wrote:
>     Sally, 
> 
>     You raised a really important issue that we had not really
> addressed at the meeting. Thanks for that.
>
>     I would say that we should not constrain the resolution of LSIDs
> if we expect our LSID infrastructure to work. LSIDs will be the basis
> of our architecture so we better have good support for that.
>
>     However, that is sure a limiting factor. Also server efficiency
> will likely vary quite a lot, depending on underlying system
> optimizations and all.
> 
>     So I think that the solution for this problem is in caching LSID
> responses on the server LSID stack. Basically, after resolving an LSID
> once, your server should be able to resolve it again and again really
> quickly, until something on the metadata that is related to that id
> changes.
> 
>     I haven't looked at this aspect of the LSID software stack, but
> maybe others can say something about it. In any case I'll do some
> research on it and get back to you.
> 
>     Again, thanks for bringing it up. 
> 
>     Cheers, 
> 
> Ricardo 
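
The caching idea in the message above can be sketched in Python as follows. This is only an illustration under assumptions: the MetadataCache class, the fetch callback and the one-hour time limit are hypothetical and do not describe how any existing LSID software stack behaves.

```python
import time
from typing import Callable, Dict, Tuple


class MetadataCache:
    """Caches metadata per LSID until invalidated or older than ttl seconds."""

    def __init__(self, fetch: Callable[[str], str], ttl: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch              # expensive lookup against the backend
        self._ttl = ttl
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, lsid: str) -> str:
        entry = self._entries.get(lsid)
        now = self._clock()
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]              # cheap repeat resolution
        value = self._fetch(lsid)
        self._entries[lsid] = (now, value)
        return value

    def invalidate(self, lsid: str) -> None:
        """Call when metadata related to this LSID changes."""
        self._entries.pop(lsid, None)
```

A cache like this makes repeat resolutions cheap, but a crawler that visits every LSID once still reaches the backend once per LSID.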
> 
> 
> Sally Hinchcliffe wrote: 
>> There are enough discontinuities in IPNI ids that 1,2,3 would quickly
>> run into the sand. I agree it's not a new problem - I just hate to
>> think I'm making life easier for the data scrapers
>> Sally
>> 
>> 
>>> It can be a problem but I'm not sure if there is a simple solution
>>> ... and how different is the LSID crawler scenario from an
>>> http://www.ipni.org/ipni/plantsearch?id= 1,2,3,4,5 ... 9999999 scenario?
...
...
...
...
...
>>> 
>>> Paul 
>>> 
>>> -----Original Message----- 
>>> From: tdwg-guid-bounces@mailman.nhm.ku.edu 
>>> [mailto:tdwg-guid-bounces@mailman.nhm.ku.edu]On Behalf Of Sally 
>>> Hinchcliffe 
>>> Sent: 15 June 2006 12:08 
>>> To: tdwg-guid@mailman.nhm.ku.edu 
>>> Subject: [Tdwg-guid] Throttling searches [ Scanned for viruses ]
>>> 
>>> 
>>> Hi all 
>>> another question that has come up here. 
>>> 
>>> As discussed at the meeting, we're thinking of providing a complete
>>> download of all IPNI LSIDs plus a label (name and author, probably)
>>> which will be available as an annually produced download
>>>
>>> Most people will play nice and just resolve one or two LSIDs as
>>> required, but by providing a complete list, we're making it very easy
>>> for someone to write a crawler that hits every LSID in turn and
>>> basically brings our server to its knees
>>>
>>> Anybody know of a good way of enforcing more polite behaviour? We can
>>> make the download only available under a data supply agreement that
>>> includes a clause limiting hit rates, or we could limit by IP address
>>> (but this would ultimately block out services like Rod's simple
>>> resolver). I believe Google's spell checker uses a key which has to
>>> be passed in as part of the query - obviously we can't do that with
>>> LSIDs
>>> 
>>> Any thoughts? Anyone think this is a problem? 
>>> 
>>> Sally 
>>> *** Sally Hinchcliffe 
>>> *** Computer section, Royal Botanic Gardens, Kew 
>>> *** tel: +44 (0)20 8332 5708 
>>> *** S.Hinchcliffe@rbgkew.org.uk 
>>> 
>>> 
>>> _______________________________________________ 
>>> TDWG-GUID mailing list 
>>> TDWG-GUID@mailman.nhm.ku.edu 
>>> http://mailman.nhm.ku.edu/mailman/listinfo/tdwg-guid 
>>> 
--
------------------------------------- 
 Roger Hyam 
 Technical Architect 
 Taxonomic Databases Working Group 
------------------------------------- 
 http://www.tdwg.org <http://www.tdwg.org/>  <http://www.tdwg.org/> 
 roger@tdwg.org 
 +44 1578 722782 
-------------------------------------