Re: [Tdwg-guid] Throttling searches [ Scanned for viruses ]
It can be a problem, but I'm not sure if there is a simple solution ... and how different is the LSID crawler scenario from an http://www.ipni.org/ipni/plantsearch?id=1,2,3,4,5 ... 9999999 scenario?
Paul
-----Original Message-----
From: tdwg-guid-bounces@mailman.nhm.ku.edu [mailto:tdwg-guid-bounces@mailman.nhm.ku.edu] On Behalf Of Sally Hinchcliffe
Sent: 15 June 2006 12:08
To: tdwg-guid@mailman.nhm.ku.edu
Subject: [Tdwg-guid] Throttling searches [ Scanned for viruses ]
Hi all, another question that has come up here.

As discussed at the meeting, we're thinking of providing a complete download of all IPNI LSIDs plus a label (name and author, probably), produced annually.

Most people will play nice and just resolve one or two LSIDs as required, but by providing a complete list, we're making it very easy for someone to write a crawler that hits every LSID in turn and basically brings our server to its knees.

Anybody know of a good way of enforcing more polite behaviour? We could make the download available only under a data supply agreement that includes a clause limiting hit rates, or we could limit by IP address (but this would ultimately block out services like Rod's simple resolver). I believe Google's spell checker uses a key which has to be passed in as part of the query - obviously we can't do that with LSIDs.

Any thoughts? Anyone think this is a problem?
Sally

*** Sally Hinchcliffe
*** Computer section, Royal Botanic Gardens, Kew
*** tel: +44 (0)20 8332 5708
*** S.Hinchcliffe@rbgkew.org.uk
_______________________________________________
TDWG-GUID mailing list
TDWG-GUID@mailman.nhm.ku.edu
http://mailman.nhm.ku.edu/mailman/listinfo/tdwg-guid
There are enough discontinuities in IPNI ids that 1, 2, 3 ... would quickly run into the sand. I agree it's not a new problem - I just hate to think I'm making life easier for the data scrapers.

Sally
Sally,
You raised a really important issue that we had not really addressed at the meeting. Thanks for that.
I would say that we should not constrain the resolution of LSIDs if we expect our LSID infrastructure to work. LSIDs will be the basis of our architecture, so we had better have good support for them.
However, that is surely a limiting factor, and server efficiency will likely vary quite a lot depending on underlying system optimizations.
So I think that the solution to this problem is caching LSID responses on the server LSID stack. Basically, after resolving an LSID once, your server should be able to resolve it again and again really quickly, until something in the metadata related to that id changes.
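A minimal sketch of that kind of server-side cache, where `fetch_metadata` (the expensive database lookup) and `record_version` (a cheap "has this record changed?" check) are hypothetical stand-ins for the real back end, not part of any actual LSID stack:

```python
class MetadataCache:
    """Cache resolved LSID metadata until the underlying record changes."""

    def __init__(self, fetch_metadata, record_version):
        self._fetch = fetch_metadata    # expensive: full metadata lookup
        self._version = record_version  # cheap: current version of a record
        self._cache = {}                # lsid -> (version, metadata)

    def get(self, lsid):
        current = self._version(lsid)
        hit = self._cache.get(lsid)
        if hit is not None and hit[0] == current:
            return hit[1]               # served from cache, no expensive lookup
        metadata = self._fetch(lsid)    # only on a miss or a stale entry
        self._cache[lsid] = (current, metadata)
        return metadata
```

The cheap version check is what lets repeat resolutions come back quickly while still noticing when the metadata behind an id changes.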
I haven't looked at this aspect of the LSID software stack, but maybe others can say something about it. In any case I'll do some research on it and get back to you.
Again, thanks for bringing it up.
Cheers,
Ricardo
Is this an LSID issue? LSIDs essentially provide a binding service between a name and one or more web services (we default to HTTP GET bindings). It isn't really up to the LSID authority to administer any policies regarding the web service but simply to point at it. It is up to the web service to do things like throttling, authentication and authorization.
Imagine, for example, that the different services had different policies. It may be reasonable not to restrict the getMetadata() calls but to restrict the getData() calls.
The use of LSIDs does not create any new problems that weren't there with web page scraping - or scraping of any other web service.
Just my thoughts...
Roger
It's not an LSID issue per se, but LSIDs will make it harder to slow searches down. For instance, Google restricts use of its spell checker to 1000 queries a day by means of a key which is passed in with each request. Obviously this can't be done with LSIDs, as then they wouldn't be the same for each user. The other reason why it's relevant to LSIDs is simply that providing a list of all relevant IPNI LSIDs (not necessary to the LSID implementation, but a nice-to-have for caching / lookups for other systems using our LSIDs) also makes life easier for the data scrapers.
Also I thought ... here's a list full of clever people; perhaps they will have some suggestions.

Sally
--
Roger Hyam, Technical Architect, Taxonomic Databases Working Group
http://www.tdwg.org roger@tdwg.org
+44 1578 722782
I gotta ask -- what is so bad about making life easy for data scrapers (of which I'm one)? Isn't this rather the point -- we WANT to make it easy :-)
But, I do realise that providers may run into a problem of being overwhelmed by requests (though, wouldn't that be nice -- people actually want your data).
The NCBI throttles by asking people not to hammer the service, and some people leave around half a second between requests to avoid being blocked. Connotea is thinking of "making the trigger be >10 requests within the last 15 seconds; requests arriving faster than that will be given a 503 response with a Retry-After header", if that makes any sense.
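That Connotea-style policy is easy to sketch: a sliding window of timestamps per client, and a 503 with a Retry-After header once the window overflows. This is illustrative code, not Connotea's actual implementation:

```python
import time
from collections import defaultdict, deque

WINDOW = 15.0      # seconds
MAX_REQUESTS = 10  # more than this within the window triggers a 503

_history = defaultdict(deque)  # client id -> timestamps of recent requests

def check_rate(client, now=None):
    """Return (status, headers) for a request from `client`.

    If the client has already made MAX_REQUESTS within the last WINDOW
    seconds, answer 503 with a Retry-After header saying when to come back.
    """
    now = time.time() if now is None else now
    q = _history[client]
    # Drop timestamps that have fallen out of the sliding window.
    while q and now - q[0] > WINDOW:
        q.popleft()
    if len(q) >= MAX_REQUESTS:
        # Seconds until the oldest request ages out of the window.
        retry_after = int(WINDOW - (now - q[0])) + 1
        return 503, {"Retry-After": str(retry_after)}
    q.append(now)
    return 200, {}
```

In a real deployment the same check would sit in front of the search (or getData) endpoint, keyed by IP address or API key.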
You could also provide a service for data scrapers where they can get an RDF dump of the IPNI names, rather than have to scrape them.
Regards
Rod
----------------------------------------
Professor Roderic D. M. Page
Editor, Systematic Biology
DEEB, IBLS, Graham Kerr Building
University of Glasgow
Glasgow G12 8QP, United Kingdom

Phone: +44 141 330 4778
Fax: +44 141 330 2792
email: r.page@bio.gla.ac.uk
web: http://taxonomy.zoology.gla.ac.uk/rod/rod.html
iChat: aim://rodpage1962
reprints: http://taxonomy.zoology.gla.ac.uk/rod/pubs.html

Subscribe to Systematic Biology through the Society of Systematic Biologists Website: http://systematicbiology.org
Search for taxon names: http://darwin.zoology.gla.ac.uk/~rpage/portal/
Find out what we know about a species: http://ispecies.org
Rod's rants on phyloinformatics: http://iphylo.blogspot.com
Hi Rod

Sadly not everyone is polite, or asks, or leaves gaps between queries. We handle 10-15k searches a day, which can peak to 20-30k when someone is actively crawling, running against two servers, neither of which is in the first flush of youth. That's setting aside the irritation of having someone scrape and serve your data without acknowledgement (present company excepted, naturally) - data that we are assembling at some cost to the organisations which support IPNI out of their core resources.

I will obviously be providing a canned, limited download, but some people want everything. My current plan is to make the full download available only on signing a data supply agreement, which will include terms on rates of further querying; we'll use our logs to check for compliance.

This may seem like a petty issue - yes, we do want people to use and want our data - but on the other hand I have to make sure that the service is available to everyone, all the time. And I also have to make sure that the people who fund IPNI - the senior management at Kew, Harvard and Canberra - are happy that their efforts are not being abused.
Sally
Sally,

And don't forget the web crawlers. Google alone can swamp a site when the site's queries end up as hyperlinked URL CGI calls on other people's websites. We were getting 90,000 robotic queries a day at one point before we blocked it. And Google is far from the only one.
Chuck
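For the well-behaved crawlers at least, a robots.txt can keep them off the query endpoint. A sketch along these lines (the path is illustrative, not IPNI's actual layout, and Crawl-delay is a non-standard extension that only some crawlers honour):

```
User-agent: *
Disallow: /ipni/plantsearch
Crawl-delay: 10
```

Robots that ignore robots.txt, of course, still need server-side blocking or throttling.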
________________________________
From: Sally Hinchcliffe [mailto:S.Hinchcliffe@kew.org] Sent: Mon 6/19/2006 4:23 AM To: Roderic Page Cc: tdwg-guid@mailman.nhm.ku.edu Subject: Re: [Tdwg-guid] Throttling searches [ Scanned for viruses ]
Hi Rod Sadly not everyone is polite, or asks, or leaves gaps between queries. We handle 10 - 15k searches a day which can peak to 20-30k when someone is actively crawling it, running against two servers, neither of which is in the first flush of youth. That's setting aside the irritation of having someone scrape and serve your data without acknowledgement (present company excepted, naturally) - data that we are assembling at some cost to the organisations which support ipni out of their core resources
I will obviously be providing a canned, limited download, but some people want everything. My current plan is to make the download only available on signing a data supply agreement, which will include terms on rates of further querying and use our logs to check for compliance
This may seem like a petty issue - yes we do want people to use and want our data - but on the other hand I have to make sure that the service is available to everyone, all the time. And I also have to make sure that the people who fund IPNI - the senior management at Kew, Harvard and Canberra - are happy that their efforts are not being abused.
Sally
I gotta ask -- what is so bad about making life easy for data scrapers (of which I'm one)? Isn't this rather the point -- we WANT to make it easy :-)
But, I do realise that providers may run into a problem of being overwhelmed by requests (though, wouldn't that be nice -- people actually want your data).
The NCBI throttles by asking people not to hammer the service, and some people leave around half a second between requests to avoid being blocked. Connotea is thinking of "making the trigger be >10 requests within the last 15 seconds; requests arriving faster than that will be given a 503 response with a Retry-After header.", if that makes any sense.
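The Connotea-style rule quoted above could be sketched like this - more than 10 requests from one client in a 15-second window draws a 503 with a Retry-After header. This is an illustrative toy, not Connotea's actual code; the class and names are invented.

```python
# Sliding-window throttle: allow at most MAX_REQUESTS per client in any
# WINDOW_SECONDS span; over-limit requests get 503 plus Retry-After.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 15
MAX_REQUESTS = 10

class Throttle:
    def __init__(self):
        self.hits = defaultdict(deque)  # client id -> recent request timestamps

    def check(self, client, now=None):
        """Return (status, headers) for a request from `client`."""
        now = time.time() if now is None else now
        q = self.hits[client]
        # Drop timestamps that have fallen out of the window.
        while q and now - q[0] > WINDOW_SECONDS:
            q.popleft()
        if len(q) >= MAX_REQUESTS:
            # Tell the client when the oldest hit will leave the window.
            retry = int(WINDOW_SECONDS - (now - q[0])) + 1
            return 503, {"Retry-After": str(retry)}
        q.append(now)
        return 200, {}
```

The eleventh request inside the window is refused but told how long to wait, so a polite client can back off rather than being blocked outright.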
You could also provide a service for data scrapers where they can get an RDF dump of the IPNI names, rather than have to scrape them.
Regards
Rod
On 19 Jun 2006, at 10:02, Sally Hinchcliffe wrote:
It's not an LSID issue per se, but LSIDs will make it harder to slow searches down. For instance, Google restricts use of its spell checker to 1000 queries a day by use of a key which is passed in with each request. Obviously this can't be done with LSIDs, as then they wouldn't be the same for each user. The other reason why it's relevant to LSIDs is simply that providing a list of all relevant IPNI LSIDs (not necessary to the LSID implementation, but a nice-to-have for caching / lookups for other systems using our LSIDs) also makes it easier for the data scrapers to operate.
Also I thought: here's a list full of clever people - perhaps they will have some suggestions.
Sally
Is this an LSID issue? LSIDs essentially provide a binding service between a name and one or more web services (we default to HTTP GET bindings). It isn't really up to the LSID authority to administer any policies regarding the web service, but simply to point at it. It is up to the web service to do things like throttling, authentication and authorization.
Imagine, for example, that the different services had different policies. It may be reasonable not to restrict the getMetadata() calls but to restrict the getData() calls.
The use of LSIDs does not create any new problems that weren't there with web page scraping - or scraping of any other web service.
Just my thoughts...
Roger
Ricardo Scachetti Pereira wrote:
Sally, You raised a really important issue that we had not really addressed at the meeting. Thanks for that.
I would say that we should not constrain the resolution of LSIDs if we expect our LSID infrastructure to work. LSIDs will be the basis of our architecture, so we had better have good support for that.
However, that is surely a limiting factor. Also, server efficiency will likely vary quite a lot, depending on underlying system optimizations.
So I think that the solution for this problem is in caching LSID responses on the server LSID stack. Basically, after resolving an LSID once, your server should be able to resolve it again and again really quickly, until something in the metadata related to that id changes.
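The caching idea above can be sketched like this - an illustrative toy, not the actual LSID software stack: the first resolution pays the full cost, repeats are served from memory until the record behind the id changes.

```python
# Cache resolved metadata keyed by LSID; invalidate when the record changes.
class MetadataCache:
    def __init__(self, fetch):
        self.fetch = fetch   # the expensive call into the database
        self.cache = {}      # lsid -> metadata

    def get_metadata(self, lsid):
        if lsid not in self.cache:
            self.cache[lsid] = self.fetch(lsid)  # only on a cache miss
        return self.cache[lsid]

    def invalidate(self, lsid):
        # Call this when the underlying record for `lsid` is edited.
        self.cache.pop(lsid, None)
```

This works well for LSIDs precisely because the id names a fixed piece of data: repeated crawls of the same id then cost almost nothing.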
I haven't looked at this aspect of the LSID software stack, but maybe others can say something about it. In any case I'll do some research on it and get back to you.
Again, thanks for bringing it up. Cheers,
Ricardo
Sally Hinchcliffe wrote:
There are enough discontinuities in IPNI ids that 1,2,3 would quickly run into the sand. I agree it's not a new problem - I just hate to think I'm making life easier for the data scrapers. Sally
It can be a problem but I'm not sure if there is a simple solution ... and how different is the LSID crawler scenario from an http://www.ipni.org/ipni/plantsearch?id= 1,2,3,4,5 ... 9999999 scenario?
Paul
-----Original Message----- From: tdwg-guid-bounces@mailman.nhm.ku.edu [mailto:tdwg-guid-bounces@mailman.nhm.ku.edu]On Behalf Of Sally Hinchcliffe Sent: 15 June 2006 12:08 To: tdwg-guid@mailman.nhm.ku.edu Subject: [Tdwg-guid] Throttling searches [ Scanned for viruses ]
Hi all another question that has come up here.
As discussed at the meeting, we're thinking of providing a complete download of all IPNI LSIDs plus a label (name and author, probably) which will be available as an annually produced download
Most people will play nice and just resolve one or two LSIDs as required, but by providing a complete list, we're making it very easy for someone to write a crawler that hits every LSID in turn and basically brings our server to its knees
Anybody know of a good way of enforcing more polite behaviour? We can make the download only available under a data supply agreement that includes a clause limiting hit rates, or we could limit by IP address (but this would ultimately block out services like Rod's simple resolver). I believe Google's spell checker uses a key which has to be passed in as part of the query - obviously we can't do that with LSIDs
Any thoughts? Anyone think this is a problem?
Sally *** Sally Hinchcliffe *** Computer section, Royal Botanic Gardens, Kew *** tel: +44 (0)20 8332 5708 *** S.Hinchcliffe@rbgkew.org.uk
TDWG-GUID mailing list TDWG-GUID@mailman.nhm.ku.edu http://mailman.nhm.ku.edu/mailman/listinfo/tdwg-guid
--
Roger Hyam Technical Architect Taxonomic Databases Working Group
http://www.tdwg.org http://www.tdwg.org/ roger@tdwg.org
+44 1578 722782
I think the only way to throttle in these situations is to have some notion of who the client is and the only way to do that is to have some kind of token exchange over HTTP saying who they are. Basically you have to have some kind of client registration system or you can never distinguish between a call from a new client and a repeat call. The use of IP address is a pain because so many people are now behind some kind of NAT gateway.
How about this for a plan:
You could give a degraded service to people who don't pass a token (a 5 second delay perhaps) and offer a quicker service to registered users who pass a token (but then perhaps limit the number of calls they make). This would mean you could offer a universal service even to those with naive client software, but a better service to those who play nicely. You could also get better stats on who is using the service.
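A hypothetical sketch of that two-tier plan - the token store, quota and delay figures are all invented for illustration:

```python
# Two-tier service: anonymous callers are delayed but never refused;
# registered token holders are served at once, counted against a quota.
import time

REGISTERED = {"token-abc": {"limit": 1000, "used": 0}}  # invented token table
ANON_DELAY = 5  # seconds, the delay suggested above

def handle(token=None, sleep=time.sleep):
    if token is None or token not in REGISTERED:
        sleep(ANON_DELAY)           # degraded but universal service
        return "ok (slow lane)"
    acct = REGISTERED[token]
    if acct["used"] >= acct["limit"]:
        return "quota exceeded"
    acct["used"] += 1               # doubles as per-client usage stats
    return "ok (fast lane)"
```

Naive clients still work, just slowly; well-behaved registered clients get speed, and the accounting gives the stats on who is using the service.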
So there are ways that this could be done. I expect people will come up with a host of different ways. It is outside LSIDs though.
Roger
How can we pass a token with an LSID?
You don't! The LSID resolves to the binding of the getMetadata() method - which is a plain old-fashioned URL. At this point the LSID authority has done its duty and we are on a plain HTTP GET call, so you can do whatever you can do with any regular HTTP GET. You could stipulate another header field or (more simply) give priority service to those who append a valid user id to the URL (&user_id=12345)
So there is no throttle on resolving the LSID to the getMetadata binding (which is cheap), but there is a throttle on the actual call to the getMetadata method. Really you need to do this because bad people may be able to tell from the URL how to scrape the source and bypass the LSID resolver after the first call anyhow. This is especially true if the URL contains the IPNI record ID, which is likely.
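Since the getMetadata binding is an ordinary URL, the server-side check could be as simple as the following sketch (the KNOWN_USERS table and priority function are invented for illustration):

```python
# Look for a user_id query parameter on the incoming getMetadata URL and
# give registered callers the fast lane; everyone else gets throttled.
from urllib.parse import urlparse, parse_qs

KNOWN_USERS = {"12345"}  # hypothetical registered-user table

def priority(url):
    """Return 'fast' for registered users, 'throttled' otherwise."""
    params = parse_qs(urlparse(url).query)
    user = params.get("user_id", [None])[0]
    return "fast" if user in KNOWN_USERS else "throttled"
```

Unregistered calls still succeed, they just queue behind the token holders, so LSID resolution itself is never broken.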
Here is an example using Rod's tester.
http://linnaeus.zoology.gla.ac.uk/~rpage/lsid/tester/?q=urn:lsid:ubio.org:na...
The getMetadata() method for this LSID:
urn:lsid:ubio.org:namebank:11815
Is bound to this URL:
http://names.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank...
So ubio would just have to give preferential service to calls like this:
http://names.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank...
If rogerhyam had paid his membership fees this year.
Does this make sense?
Roger p.s. You could do this on the web pages as well with a clever little thing to write dynamic tokens into the links so that it doesn't degrade the regular browsing experience and only stops scrapers - but that is beyond my remit at the moment ;)
p.p.s. You could wrap this in https if you were paranoid about people stealing tokens - but this is highly unlikely I believe.
Hi Roger, Thanks for this ... I _think_ I understand it, but Nicky is on leave this week so I won't know if I do or not till after she returns.
The system doesn't have to be completely villain-proof, just slow down most of the villains so everyone else can get a look in. Sally
You don't! The LSID resolves to the binding to the getMetadata() method
- which is a plain old fashioned URL. At this point the LSID authority
has done its duty and we are just on a plain HTTP GET call so you can do whatever you can do with any regular HTTP GET. You could stipulate another header field or (more simply) give priority service for those who append a valid user id to the URL (&user_id=12345)
So there is no throttle on resolving the LSID to the getMetadata binding (which is cheap) but there is a throttle on the actual call to get the metadata method. Really you need to do this because bad people may be able to tell from the URL how to scrape the source and bypass the LSID resolver after the first call anyhow. This is especially true if the URL contains the IPNI record ID which is likely.
Here is an example using Rod's tester.
http://linnaeus.zoology.gla.ac.uk/~rpage/lsid/tester/?q=urn:lsid:ubio.org:na...
The getMetadata() method for this LSID:
urn:lsid:ubio.org:namebank:11815
Is bound to this URL:
http://names.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank...
So ubio would just have to give preferential services to calls like this:
http://names.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank...
If rogerhyam had paid his membership fees this year.
Does this make sense?
Roger p.s. You could do this on the web pages as well with a clever little thing to write dynamic tokens into the links so that it doesn't degrade the regular browsing experience and only stops scrapers - but that is beyond my remit at the moment ;)
p.p.s. You could wrap this in https if you were paranoid about people stealing tokens - but this is highly unlikely I believe.
Sally Hinchcliffe wrote:
How can we pass a token with an LSID?
I think the only way to throttle in these situations is to have some notion of who the client is and the only way to do that is to have some kind of token exchange over HTTP saying who they are. Basically you have to have some kind of client registration system or you can never distinguish between a call from a new client and a repeat call. The use of IP address is a pain because so many people are now behind some kind of NAT gateway.
How about this for a plan:
You could give a degraded services to people who don't pass a token (a 5 second delay perhaps) and offer a quicker service to registered users who pass a token (but then perhaps limit the number of calls they make). This would mean you could offer a universal service even to those with naive client software but a better service to those who play nicely. You could also get better stats on who is using the service.
So there are ways that this could be done. I expect people will come up with a host of different ways. It is outside LSIDs though.
Roger
Sally Hinchcliffe wrote:
It's not an LSID issue per se, but LSIDs will make it harder to slow searches down. For instance, Google restricts use of its spell checker to 1000 a day by use of a key which is passed in with each request. Obviously this can't be done with LSIDs as then they wouldn't be the same for each user. The other reason why it's relevant to LSIDs is simply that providing a list of all relevant IPNI LSIDs (not necessary to the LSID implementation but a nice to have for caching / lookups for other systems using our LSIDs) also makes life easier for the datascrapers to operate
Also I thought ... here's a list full of clever people perhaps they will have some suggestions
Sally
Is this an LSID issue? LSIDs essential provide a binding service between an name and one or more web services (we default to HTTP GET bindings). It isn't really up to the LSID authority to administer any policies regarding the web service but simply to point at it. It is up to the web service to do things like throttling, authentication and authorization.
Imagine, for example, that the different services had different policies. It may be reasonable not to restrict the getMetadata() calls but to restrict the getData() calls.
The use of LSIDs does not create any new problems that weren't there with web page scraping - or scraping of any other web service.
Just my thoughts...
Roger
Ricardo Scachetti Pereira wrote:
Sally,
You raised a really important issue that we had not really addressed at the meeting. Thanks for that.
I would say that we should not constrain the resolution of LSIDs if we expect our LSID infrastructure to work. LSIDs will be the basis of our architecture, so we had better have good support for that.
However, that is surely a limiting factor. Also, server efficiency will likely vary quite a lot, depending on underlying system optimizations and so on.
So I think that the solution to this problem is in caching LSID responses on the server LSID stack. Basically, after resolving an LSID once, your server should be able to resolve it again and again really quickly, until something in the metadata related to that id changes.
I haven't looked at this aspect of the LSID software stack, but maybe others can say something about it. In any case I'll do some research on it and get back to you.
Again, thanks for bringing it up. Cheers,
Ricardo
--
Roger Hyam Technical Architect Taxonomic Databases Working Group
http://www.tdwg.org roger@tdwg.org
+44 1578 722782
*** Sally Hinchcliffe *** Computer section, Royal Botanic Gardens, Kew *** tel: +44 (0)20 8332 5708 *** S.Hinchcliffe@rbgkew.org.uk
This is probably a dumb question and exposes my ignorance, but what if the originating query is actually "Get all LSIDs where Family = Orchidaceae"? That seems the more likely scenario to me, rather than getting one LSID. And that's the one that needs a throttle.
Chuck
________________________________
From: Sally Hinchcliffe [mailto:S.Hinchcliffe@kew.org] Sent: Mon 6/19/2006 7:01 AM To: roger@tdwg.org Cc: tdwg-guid@mailman.nhm.ku.edu Subject: Re: [Tdwg-guid] Throttling searches [ Scanned for viruses ]
Hi Roger
Thanks for this ... I _think_ I understand it, but Nicky is on leave this week so I won't know if I do or not till after she returns
The system doesn't have to be completely villain-proof, just slow down most of the villains so everyone else can get a look in
Sally
You don't! The LSID resolves to the binding for the getMetadata() method - which is a plain old-fashioned URL. At this point the LSID authority has done its duty and we are on a plain HTTP GET call, so you can do whatever you can do with any regular HTTP GET. You could stipulate another header field or (more simply) give priority service to those who append a valid user id to the URL (&user_id=12345).
So there is no throttle on resolving the LSID to the getMetadata binding (which is cheap), but there is a throttle on the actual call to the getMetadata method. Really you need to do this because bad people may be able to tell from the URL how to scrape the source and bypass the LSID resolver after the first call anyhow. This is especially true if the URL contains the IPNI record ID, which is likely.
Here is an example using Rod's tester.
http://linnaeus.zoology.gla.ac.uk/~rpage/lsid/tester/?q=urn:lsid:ubio.org:na...
The getMetadata() method for this LSID:
urn:lsid:ubio.org:namebank:11815
is bound to this URL:
http://names.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank...
So ubio would just have to give preferential service to calls like this:
http://names.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank...
If rogerhyam had paid his membership fees this year.
Does this make sense?
Roger
p.s. You could do this on the web pages as well with a clever little thing to write dynamic tokens into the links so that it doesn't degrade the regular browsing experience and only stops scrapers - but that is beyond my remit at the moment ;)
p.p.s. You could wrap this in https if you were paranoid about people stealing tokens - but this is highly unlikely I believe.
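On the server side, the user_id scheme comes down to parsing the query string of the getMetadata() binding URL and checking the token against a registry. A rough sketch, with the registry reduced to a hard-coded set (the token value is taken from the example URL above; a real authority would look it up in its client registration database):

```python
from urllib.parse import urlparse, parse_qs

# Stand-in for a real client registration database.
VALID_USER_IDS = {"rogerhyam1392918790"}

def service_tier(url):
    """Return 'priority' if the URL carries a recognised user_id, else 'degraded'."""
    params = parse_qs(urlparse(url).query)
    user_ids = params.get("user_id", [])
    return "priority" if user_ids and user_ids[0] in VALID_USER_IDS else "degraded"
```

A call to the bare metadata URL gets the degraded tier; the same URL with &user_id=rogerhyam1392918790 appended gets priority.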
Sally Hinchcliffe wrote:
How can we pass a token with an LSID?
Hi Chuck,
I've been thinking of the case you describe as a query operation. A query operation would take match conditions as input and, when applied to a set of RDF metadata, return either an RDF graph or values bound to variables (analogous to an SQL select statement). Either type of output may contain references to other data objects by LSID, which would have to be resolved by clients.
This query operation is not supported by the LSID spec and requires a distinct service. We've implemented SPARQL as the query service for DiGIR2 (now called Wasabi). SPARQL is a W3C candidate recommendation and is both a query language and a protocol.
See the following for more information:
http://www.w3.org/TR/rdf-sparql-query/ http://www.w3.org/TR/rdf-sparql-protocol/
-Steve
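For the "all LSIDs where Family = Orchidaceae" case, a SPARQL protocol query is just an HTTP GET with the query text as a parameter. Here is a sketch of building such a request; the endpoint URL and the ex: vocabulary are hypothetical, since each SPARQL service publishes its own endpoint and RDF terms.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; a real Wasabi/SPARQL service would publish its own.
SPARQL_ENDPOINT = "http://example.org/wasabi/sparql"

# Hypothetical vocabulary: ex:family and ex:lsid are invented for illustration.
QUERY = """
PREFIX ex: <http://example.org/terms#>
SELECT ?lsid WHERE {
  ?name ex:family "Orchidaceae" ;
        ex:lsid ?lsid .
}
"""

def sparql_request_url(endpoint, query):
    """Encode a query into the HTTP GET form of the SPARQL protocol."""
    return endpoint + "?" + urlencode({"query": query})

url = sparql_request_url(SPARQL_ENDPOINT, QUERY)
```

The result of such a query is a set of LSID bindings, each of which a client would then resolve individually, which is where the throttling discussed above comes back in.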
Steve,
OK thanks, I get that now: LSID is an element of a query response. I guess that leads to more questions about where the LSID goes in which response formats. Just add an LSID concept to Darwin Core, for instance?
But, if SPARQL is a protocol, how does it layer on top of or parallel to all the other protocols that TDWG is attempting to standardize? Or are we proposing it as the "protocol to rule them all"?
At least with DIGIR, imperfect as it is, it was clear that it was both query/response and protocol. (The complete flow diagram fits on one screen.) As we break this all apart and grow it, I think it is becoming difficult for those outside to follow the model, which makes it more important to describe the complete query, protocol and response stack for the TDWG membership that will be called on to recommend and vote on it.
Which reminds me that we still need the Rosetta Stone that resolves all these things that are on the table: DIGIR, BioCASE, TAPIR, Wasabi, SPARQL, LSID, PURL, RDF, OWL, OWL-DL, WSDL, SOAP, HTTP GET, plus the XML schemas - Darwin Core (Base plus extensions like GML), ABCD, TCS, SDD, etc. - and more. And resolves them in a way that the general TDWG membership can fully grasp during the upcoming TDWG meeting.
Chuck
________________________________
From: Steven Perry [mailto:smperry@ku.edu] Sent: Mon 6/19/2006 10:22 PM To: Chuck Miller Cc: S.Hinchcliffe@kew.org; roger@tdwg.org; tdwg-guid@mailman.nhm.ku.edu Subject: Re: [Tdwg-guid] Throttling searches [ Scanned for viruses ]
Hi Chuck,
I've been thinking of the case you describe as a query operation. A query operation would take match conditions as input and, when applied to a set of RDF metadata, returns either an RDF graph or values bound to variables (analogous to an SQL select statement). Either type of output may contain references to other data objects by LSID which would have to be resolved by clients.
This query operation is not supported by the LSID spec and requires a distinct service. We've implemented SPARQL as the query service for DiGIR2 (now called Wasabi). SPARQL is a W3C candidate recommendation and is both a query language and a protocol.
See the following for more information:
http://www.w3.org/TR/rdf-sparql-query/ http://www.w3.org/TR/rdf-sparql-protocol/
-Steve
Chuck Miller wrote:
This is probably a dumb question and exposes my ignorance, but what if the originating query is actually "Get all LSIDs where Family = Orchidaceae". That seems the more likely scenario to me rather than get one LSID. And that's the one that needs a throttle.
Chuck
*From:* Sally Hinchcliffe [mailto:S.Hinchcliffe@kew.org] *Sent:* Mon 6/19/2006 7:01 AM *To:* roger@tdwg.org *Cc:* tdwg-guid@mailman.nhm.ku.edu *Subject:* Re: [Tdwg-guid] Throttling searches [ Scanned for viruses ]
Hi Roger Thanks for this ... I _think_ I understand it but Nicky is on leave this week so I won't know if I do or not till after she returns
The system doesn't have to be completely villain proof, just slow down most of the villains so everyone else can get a look in Sally
You don't! The LSID resolves to the binding to the getMetadata() method
- which is a plain old fashioned URL. At this point the LSID authority
has done its duty and we are just on a plain HTTP GET call so you
can do
whatever you can do with any regular HTTP GET. You could stipulate another header field or (more simply) give priority service for those who append a valid user id to the URL (&user_id=12345)
So there is no throttle on resolving the LSID to the getMetadata
binding
(which is cheap) but there is a throttle on the actual call to get the metadata method. Really you need to do this because bad people may be able to tell from the URL how to scrape the source and bypass the LSID resolver after the first call anyhow. This is especially true if the
URL
contains the IPNI record ID which is likely.
Here is an example using Rod's tester.
http://linnaeus.zoology.gla.ac.uk/~rpage/lsid/tester/?q=urn:lsid:ubio.org:na... http://linnaeus.zoology.gla.ac.uk/%7Erpage/lsid/tester/?q=urn:lsid:ubio.org:namebank:11815
The getMetadata() method for this LSID:
urn:lsid:ubio.org:namebank:11815
Is bound to this URL:
http://names.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank...
So ubio would just have to give preferential services to calls like
this:
http://names.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank... http://names.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank:11815&user_id=rogerhyam1392918790
If rogerhyam had paid his membership fees this year.
Does this make sense?
Roger p.s. You could do this on the web pages as well with a clever little thing to write dynamic tokens into the links so that it doesn't degrade the regular browsing experience and only stops scrapers - but that is beyond my remit at the moment ;)
p.p.s. You could wrap this in https if you were paranoid about people stealing tokens - but this is highly unlikely I believe.
Sally Hinchcliffe wrote:
How can we pass a token with an LSID?
I think the only way to throttle in these situations is to have some notion of who the client is and the only way to do that is to
have some
kind of token exchange over HTTP saying who they are. Basically
you have
to have some kind of client registration system or you can never distinguish between a call from a new client and a repeat call.
The use
of IP address is a pain because so many people are now behind
some kind
of NAT gateway.
How about this for a plan:
You could give a degraded services to people who don't pass a
token (a 5
second delay perhaps) and offer a quicker service to registered
users
who pass a token (but then perhaps limit the number of calls they
make).
This would mean you could offer a universal service even to those
with
naive client software but a better service to those who play
nicely. You
could also get better stats on who is using the service.
So there are ways that this could be done. I expect people will
come up
with a host of different ways. It is outside LSIDs though.
Roger
Sally Hinchcliffe wrote:

It's not an LSID issue per se, but LSIDs will make it harder to slow searches down. For instance, Google restricts use of its spell checker to 1000 calls a day by use of a key which is passed in with each request. Obviously this can't be done with LSIDs, as then they wouldn't be the same for each user. The other reason why it's relevant to LSIDs is simply that providing a list of all relevant IPNI LSIDs (not necessary to the LSID implementation, but a nice-to-have for caching / lookups for other systems using our LSIDs) also makes life easier for the data scrapers to operate.

Also I thought ... here's a list full of clever people; perhaps they will have some suggestions.

Sally

Is this an LSID issue? LSIDs essentially provide a binding service between a name and one or more web services (we default to HTTP GET bindings). It isn't really up to the LSID authority to administer any policies regarding the web service but simply to point at it. It is up to the web service to do things like throttling, authentication and authorization.

Imagine, for example, that the different services had different policies. It may be reasonable not to restrict the getMetadata() calls but to restrict the getData() calls.

The use of LSIDs does not create any new problems that weren't there with web page scraping - or scraping of any other web service.

Just my thoughts...

Roger
Ricardo Scachetti Pereira wrote:

> Sally,
>
> You raised a really important issue that we had not really addressed at the meeting. Thanks for that.
>
> I would say that we should not constrain the resolution of LSIDs if we expect our LSID infrastructure to work. LSIDs will be the basis of our architecture, so we had better have good support for that.
>
> However, that is surely a limiting factor. Also, server efficiency will likely vary quite a lot, depending on underlying system optimizations and all.
>
> So I think that the solution to this problem is in caching LSID responses on the server LSID stack. Basically, after resolving an LSID once, your server should be able to resolve it again and again really quickly, until something in the metadata that is related to that id changes.
>
> I haven't looked at this aspect of the LSID software stack, but maybe others can say something about it. In any case I'll do some research on it and get back to you.
>
> Again, thanks for bringing it up.
>
> Cheers,
>
> Ricardo
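Ricardo's caching suggestion could be sketched roughly as below. This is only an illustration, not part of any actual LSID stack: the `CACHE_TTL` figure, the `cached_metadata` name and the injected `fetch` callable are all hypothetical.

```python
import time

CACHE_TTL = 3600.0  # seconds to keep a metadata response (assumed figure)
_cache = {}         # lsid -> (timestamp, response)

def cached_metadata(lsid, fetch, now=time.monotonic):
    """Resolve an LSID, serving repeat requests from an in-memory cache."""
    entry = _cache.get(lsid)
    if entry and now() - entry[0] < CACHE_TTL:
        return entry[1]                # cache hit: no work for the database
    response = fetch(lsid)             # expensive lookup, done at most once per TTL
    _cache[lsid] = (now(), response)
    return response
```

With something like this in front of the metadata store, a crawler hammering the same LSIDs repeatedly costs little more than a dictionary lookup; invalidation on metadata change would still need to be handled separately.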
--
Roger Hyam
Technical Architect
Taxonomic Databases Working Group
http://www.tdwg.org/
roger@tdwg.org
+44 1578 722782

*** Sally Hinchcliffe
*** Computer section, Royal Botanic Gardens, Kew
*** tel: +44 (0)20 8332 5708
*** S.Hinchcliffe@rbgkew.org.uk

_______________________________________________
TDWG-GUID mailing list
TDWG-GUID@mailman.nhm.ku.edu
http://mailman.nhm.ku.edu/mailman/listinfo/tdwg-guid
That's actually quite easy to deal with - just truncate the response at however many records. What's harder to deal with is where you've got lots of little queries, each one innocuous in itself, but coming at you as though out of a fire hose. Google and the other search engines (which all seem to use Inktomi search) do have services whereby you can request that they slow down the rate at which they crawl your data, but identifying and contacting each crawler individually is inefficient ...

Thanks for all your suggestions. We'll try to build something into our service and will report back when we've done so.

Sally
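The "fire hose of little queries" problem is usually handled with per-client rate limiting. A minimal sketch (not from any IPNI code; the class, rates and `throttled` helper are all hypothetical) is a token bucket keyed by whatever identifies the client - an IP address or a registered token:

```python
import time

class TokenBucket:
    """Allow `rate` requests per second, with bursts up to `capacity`."""
    def __init__(self, rate=1.0, capacity=5):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill tokens in proportion to the time elapsed since the last check.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# One bucket per client key (IP address or registered user token).
buckets = {}

def throttled(client_key):
    bucket = buckets.setdefault(client_key, TokenBucket())
    return not bucket.allow()
```

A request that comes back `throttled` could be delayed or refused; well-behaved clients resolving one or two LSIDs never notice, while a crawler hitting every LSID in turn is slowed to the configured rate.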
This is probably a dumb question and exposes my ignorance, but what if the originating query is actually "Get all LSIDs where Family = Orchidaceae"? That seems the more likely scenario to me rather than getting one LSID. And that's the one that needs a throttle.
Chuck
From: Sally Hinchcliffe [mailto:S.Hinchcliffe@kew.org]
Sent: Mon 6/19/2006 7:01 AM
To: roger@tdwg.org
Cc: tdwg-guid@mailman.nhm.ku.edu
Subject: Re: [Tdwg-guid] Throttling searches [ Scanned for viruses ]
Hi Roger

Thanks for this ... I _think_ I understand it, but Nicky is on leave this week so I won't know whether I do or not till after she returns.

The system doesn't have to be completely villain-proof - it just has to slow down most of the villains so everyone else can get a look in.

Sally
You don't! The LSID resolves to the binding to the getMetadata() method - which is a plain old-fashioned URL. At this point the LSID authority has done its duty and we are just on a plain HTTP GET call, so you can do whatever you can do with any regular HTTP GET. You could stipulate another header field or (more simply) give priority service to those who append a valid user id to the URL (&user_id=12345).

So there is no throttle on resolving the LSID to the getMetadata binding (which is cheap), but there is a throttle on the actual call to the getMetadata method. Really you need to do this because bad people may be able to tell from the URL how to scrape the source and bypass the LSID resolver after the first call anyhow. This is especially true if the URL contains the IPNI record ID, which is likely.
Here is an example using Rod's tester.
http://linnaeus.zoology.gla.ac.uk/~rpage/lsid/tester/?q=urn:lsid:ubio.org:namebank:11815
The getMetadata() method for this LSID:
urn:lsid:ubio.org:namebank:11815
Is bound to this URL:
http://names.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank:11815
So ubio would just have to give preferential services to calls like this:
http://names.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank:11815&user_id=rogerhyam1392918790
If rogerhyam had paid his membership fees this year.
Does this make sense?
Roger

p.s. You could do this on the web pages as well with a clever little thing to write dynamic tokens into the links, so that it doesn't degrade the regular browsing experience and only stops scrapers - but that is beyond my remit at the moment ;)
p.p.s. You could wrap this in https if you were paranoid about people stealing tokens - but this is highly unlikely I believe.
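Roger's two-tier scheme - fast lane for calls carrying a valid &user_id, slow lane for everyone else - could be sketched server-side like this. Everything here is illustrative: the `REGISTERED_USERS` store, the five-second penalty and the function names are assumptions, not anything IPNI or ubio actually runs.

```python
import time

REGISTERED_USERS = {"rogerhyam1392918790"}  # hypothetical registered-token store
ANON_DELAY = 5.0  # seconds of "degraded service" for tokenless callers (assumed)

def serve_metadata(lsid, user_id=None, sleep=time.sleep):
    """Serve a getMetadata call; anonymous callers get a fixed delay."""
    if user_id not in REGISTERED_USERS:
        sleep(ANON_DELAY)            # slow lane: no valid token supplied
    # Stand-in for the real metadata lookup behind the getMetadata binding.
    return {"lsid": lsid, "registered": user_id in REGISTERED_USERS}
```

Naive clients still get an answer, just slowly; registered clients get full speed (and could additionally be capped with a per-token call limit).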
We do some of this already with our web services. SOAP methods require a keycode. We use the code so we have a contact in case we need to send a message out, as well as to provide better accounting to sources of how we pass on their content. Patrick (uBio programmer and nice guy) asked: why not use the LSID version number as a way to pass a token? If it's not passed you can fall back to one level of processing, else give it the extra special treatment with the userID. Or is this violating something sacred in the LSID ethos?

David Remsen
On Jun 19, 2006, at 6:07 AM, Roger Hyam wrote:
--
David Remsen
uBio Project Manager
Marine Biological Laboratory
Woods Hole, MA 02543
508-289-7632
We'll be using the version number for keeping track of versions, so that's out. Also I'm a bit reluctant to start overloading the LSID itself for what is purely a piece of admin function...

Sally
We do some of this already with our web services. SOAP methods required a keycode. We use the code so we have a contact in case we need to send a message out as well as to provide a better accounting to sources of how we pass on their content. Patrick (uBio programmer and nice guy) asked why not use the LSID version number as a way to pass a token. If it's not passed you can fall back to one level of processing else give it the extra special treatment with the userID. Or is this violating something sacred in the LSID ethos?
David Remsen
On Jun 19, 2006, at 6:07 AM, Roger Hyam wrote:
You don't! The LSID resolves to the binding to the getMetadata() method - which is a plain old fashioned URL. At this point the LSID authority has done its duty and we are just on a plain HTTP GET call so you can do whatever you can do with any regular HTTP GET. You could stipulate another header field or (more simply) give priority service for those who append a valid user id to the URL (&user_id=12345)
So there is no throttle on resolving the LSID to the getMetadata binding (which is cheap) but there is a throttle on the actual call to get the metadata method. Really you need to do this because bad people may be able to tell from the URL how to scrape the source and bypass the LSID resolver after the first call anyhow. This is especially true if the URL contains the IPNI record ID which is likely.
Here is an example using Rod's tester.
http://linnaeus.zoology.gla.ac.uk/~rpage/lsid/tester/? q=urn:lsid:ubio.org:namebank:11815
The getMetadata() method for this LSID:
urn:lsid:ubio.org:namebank:11815
Is bound to this URL:
http://names.ubio.org/authority/metadata.php? lsid=urn:lsid:ubio.org:namebank:11815
So ubio would just have to give preferential services to calls like this:
http://names.ubio.org/authority/metadata.php? lsid=urn:lsid:ubio.org:namebank:11815&user_id=rogerhyam1392918790
If rogerhyam had paid his membership fees this year.
Does this make sense?
Roger p.s. You could do this on the web pages as well with a clever little thing to write dynamic tokens into the links so that it doesn't degrade the regular browsing experience and only stops scrapers - but that is beyond my remit at the moment ;)
p.p.s. You could wrap this in https if you were paranoid about people stealing tokens - but this is highly unlikely I believe.
Sally Hinchcliffe wrote:
How can we pass a token with an LSID?
I think the only way to throttle in these situations is to have some notion of who the client is and the only way to do that is to have some kind of token exchange over HTTP saying who they are. Basically you have to have some kind of client registration system or you can never distinguish between a call from a new client and a repeat call. The use of IP address is a pain because so many people are now behind some kind of NAT gateway.
How about this for a plan:
You could give a degraded services to people who don't pass a token (a 5 second delay perhaps) and offer a quicker service to registered users who pass a token (but then perhaps limit the number of calls they make). This would mean you could offer a universal service even to those with naive client software but a better service to those who play nicely. You could also get better stats on who is using the service.
So there are ways that this could be done. I expect people will come up with a host of different ways. It is outside LSIDs though.
Roger
Sally Hinchcliffe wrote:
It's not an LSID issue per se, but LSIDs will make it harder to slow searches down. For instance, Google restricts use of its spell checker to 1000 a day by use of a key which is passed in with each request. Obviously this can't be done with LSIDs as then they wouldn't be the same for each user. The other reason why it's relevant to LSIDs is simply that providing a list of all relevant IPNI LSIDs (not necessary to the LSID implementation but a nice to have for caching / lookups for other systems using our LSIDs) also makes life easier for the datascrapers to operate
Also I thought ... here's a list full of clever people perhaps they will have some suggestions
Sally
Is this an LSID issue? LSIDs essential provide a binding service between an name and one or more web services (we default to HTTP GET bindings). It isn't really up to the LSID authority to administer any policies regarding the web service but simply to point at it. It is up to the web service to do things like throttling, authentication and authorization.
Imagine, for example, that the different services had different policies. It may be reasonable not to restrict the getMetadata () calls but to restrict the getData() calls.
The use of LSIDs does not create any new problems that weren't there with web page scraping - or scraping of any other web service.
Just my thoughts...
Roger
Ricardo Scachetti Pereira wrote:
> Sally,
>
> You raised a really important issue that we had not really addressed at the meeting. Thanks for that.
>
> I would say that we should not constrain the resolution of LSIDs if we expect our LSID infrastructure to work. LSIDs will be the basis of our architecture, so we had better have good support for that.
>
> However, that is surely a limiting factor. Also, server efficiency will likely vary quite a lot, depending on underlying system optimizations and all.
>
> So I think that the solution for this problem is in caching LSID responses on the server LSID stack. Basically, after resolving an LSID once, your server should be able to resolve it again and again really quickly, until something in the metadata that is related to that id changes.
>
> I haven't looked at this aspect of the LSID software stack, but maybe others can say something about it. In any case I'll do some research on it and get back to you.
>
> Again, thanks for bringing it up.
>
> Cheers,
>
> Ricardo
>
> Sally Hinchcliffe wrote:
>
>> There are enough discontinuities in IPNI ids that 1,2,3 would quickly run into the sand. I agree it's not a new problem - I just hate to think I'm making life easier for the data scrapers
>> Sally
>>
>>> It can be a problem but I'm not sure if there is a simple solution ... and how different is the LSID crawler scenario from an http://www.ipni.org/ipni/plantsearch?id= 1,2,3,4,5 ... 9999999 scenario?
>>>
>>> Paul
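Ricardo's caching suggestion in the message quoted above can be sketched as a small memoization layer sitting in front of the resolver. This is a hypothetical illustration (the class name, the TTL figure, and the fetch callback are all invented for the example), not part of any actual LSID server stack:

```python
import time

class MetadataCache:
    """Toy in-memory cache for resolved LSID metadata (hypothetical sketch).

    Entries are served from memory until they expire or are explicitly
    invalidated when the underlying record changes."""

    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self._store = {}  # lsid -> (metadata, cached_at)

    def get(self, lsid, fetch):
        """Return cached metadata, or call fetch(lsid) and cache the result."""
        entry = self._store.get(lsid)
        if entry is not None:
            metadata, cached_at = entry
            if time.time() - cached_at < self.ttl:
                return metadata
        metadata = fetch(lsid)
        self._store[lsid] = (metadata, time.time())
        return metadata

    def invalidate(self, lsid):
        """Drop an entry when the record it describes has changed."""
        self._store.pop(lsid, None)
```

The important property is the invalidation hook: cached responses stay valid "until something in the metadata that is related to that id changes", as Ricardo puts it, so repeat resolutions from a crawler stop hitting the database.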
Roger Hyam Technical Architect Taxonomic Databases Working Group
http://www.tdwg.org roger@tdwg.org
+44 1578 722782
*** Sally Hinchcliffe *** Computer section, Royal Botanic Gardens, Kew *** tel: +44 (0)20 8332 5708 *** S.Hinchcliffe@rbgkew.org.uk
David Remsen uBio Project Manager Marine Biological Laboratory Woods Hole, MA 02543 508-289-7632
Yes, it would be violating the LSID ethos to use the version number, as a different version number means a different LSID - also, what would happen if the LSID already had a version number? Really this stuff is not to do with the LSID 'layer' at all - it is the web services the LSIDs resolve to. There may be all sorts of authentication and authorization wrapped round the web services, and we don't want to go trying to lever that into the GUID technology - in my opinion.
Roger
David Remsen wrote:
We do some of this already with our web services. Our SOAP methods require a keycode. We use the code so we have a contact in case we need to send a message out, as well as to provide better accounting to sources of how we pass on their content. Patrick (uBio programmer and nice guy) asked: why not use the LSID version number as a way to pass a token? If it's not passed you can fall back to one level of processing, else give it the extra special treatment with the user ID. Or is this violating something sacred in the LSID ethos?
David Remsen
On Jun 19, 2006, at 6:07 AM, Roger Hyam wrote:
You don't! The LSID resolves to the binding for the getMetadata() method - which is a plain old-fashioned URL. At this point the LSID authority has done its duty and we are just on a plain HTTP GET call, so you can do whatever you can do with any regular HTTP GET. You could stipulate another header field or (more simply) give priority service to those who append a valid user id to the URL (&user_id=12345)
So there is no throttle on resolving the LSID to the getMetadata binding (which is cheap), but there is a throttle on the actual call to the getMetadata method. Really you need to do this anyway, because bad people may be able to tell from the URL how to scrape the source and bypass the LSID resolver after the first call. This is especially true if the URL contains the IPNI record ID, which is likely.
Here is an example using Rod's tester.
http://linnaeus.zoology.gla.ac.uk/~rpage/lsid/tester/?q=urn:lsid:ubio.org:na...
The getMetadata() method for this LSID:
urn:lsid:ubio.org:namebank:11815
Is bound to this URL:
http://names.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank...
So ubio would just have to give preferential services to calls like this:
http://names.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank...
If rogerhyam had paid his membership fees this year.
Does this make sense?
Roger

p.s. You could do this on the web pages as well with a clever little thing to write dynamic tokens into the links so that it doesn't degrade the regular browsing experience and only stops scrapers - but that is beyond my remit at the moment ;)
p.p.s. You could wrap this in https if you were paranoid about people stealing tokens - but this is highly unlikely I believe.
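Roger's scheme - a plain HTTP GET on the getMetadata URL, with priority for callers who append a valid user id - might be sketched like this. The handler name, the registered-user set, and the 5-second penalty are assumptions invented for the illustration; note the token check sits entirely in the web service, outside the LSID layer:

```python
import time
from urllib.parse import urlparse, parse_qs

# Hypothetical set of registered client ids ("rogerhyam" in the example above).
REGISTERED_USERS = {"rogerhyam"}

UNREGISTERED_DELAY_SECONDS = 5  # degraded service for anonymous callers

def handle_get_metadata(url, lookup_metadata):
    """Serve a getMetadata call, giving priority to registered users.

    lookup_metadata is whatever function actually builds the RDF response;
    the throttle sits in front of it and knows nothing about LSIDs as such."""
    query = parse_qs(urlparse(url).query)
    lsid = query.get("lsid", [None])[0]
    user_id = query.get("user_id", [None])[0]
    if user_id not in REGISTERED_USERS:
        time.sleep(UNREGISTERED_DELAY_SECONDS)  # anonymous callers wait
    return lookup_metadata(lsid)
```

Calls without a recognised &user_id still succeed, just slowly, so naive LSID clients keep working while registered ones get the fast path.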
Sally Hinchcliffe wrote:
How can we pass a token with an LSID?
I think the only way to throttle in these situations is to have some notion of who the client is, and the only way to do that is to have some kind of token exchange over HTTP saying who they are. Basically you have to have some kind of client registration system, or you can never distinguish between a call from a new client and a repeat call. Using IP addresses is a pain because so many people are now behind some kind of NAT gateway.
How about this for a plan:
You could give a degraded service to people who don't pass a token (a 5 second delay perhaps) and offer a quicker service to registered users who pass a token (but then perhaps limit the number of calls they make). This would mean you could offer a universal service even to those with naive client software, but a better service to those who play nicely. You could also get better stats on who is using the service.
So there are ways that this could be done. I expect people will come up with a host of different ways. It is outside LSIDs though.
Roger
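A hypothetical sketch of the registration plan above: callers without a token get a universal but degraded service, while registered tokens get fast service up to a quota. The 1000-calls-a-day figure echoes Sally's Google spell-checker example; the class and return values are invented for the illustration:

```python
import time
from collections import defaultdict

DAILY_QUOTA = 1000      # calls per registered token per day (cf. Google's spell checker)
WINDOW_SECONDS = 86400

class TokenThrottle:
    """Hypothetical per-client quota: registered tokens get fast service up
    to a daily limit; callers without a token still get an answer, slowly."""

    def __init__(self, quota=DAILY_QUOTA, window=WINDOW_SECONDS):
        self.quota = quota
        self.window = window
        self._calls = defaultdict(list)  # token -> timestamps of recent calls

    def check(self, token, now=None):
        """Return 'fast', 'slow' (no token), or 'refused' (quota exhausted)."""
        if token is None:
            return "slow"  # universal but degraded service
        now = time.time() if now is None else now
        recent = [t for t in self._calls[token] if now - t < self.window]
        if len(recent) >= self.quota:
            return "refused"
        recent.append(now)
        self._calls[token] = recent
        return "fast"
```

As a side effect the per-token call log gives exactly the usage stats Roger mentions.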
Sally Hinchcliffe wrote:
It's not an LSID issue per se, but LSIDs will make it harder to slow searches down. For instance, Google restricts use of its spell checker to 1000 calls a day by use of a key which is passed in with each request. Obviously this can't be done with LSIDs, as then they wouldn't be the same for each user. The other reason why it's relevant to LSIDs is simply that providing a list of all relevant IPNI LSIDs (not necessary to the LSID implementation, but a nice-to-have for caching / lookups for other systems using our LSIDs) also makes life easier for the data scrapers.
Also I thought ... here's a list full of clever people perhaps they will have some suggestions
Sally
Is this an LSID issue? LSIDs essentially provide a binding service between a name and one or more web services (we default to HTTP GET bindings). It isn't really up to the LSID authority to administer any policies regarding the web service but simply to point at it. It is up to the web service to do things like throttling, authentication and authorization.
Imagine, for example, that the different services had different policies. It may be reasonable not to restrict the getMetadata() calls but to restrict the getData() calls.
Sounds to me that we have a multi-layer communications protocol stack in development here, but we aren't spelling out the layers very well. Discussing LSID in the context of biodiversity systems/databases without a clear definition of the necessary underlying layers is confusing me.
Can someone do a more expanded elucidation of the complete LSID/RDF protocol stack? What exactly are we proposing to standardize on besides just the syntax of an LSID?
Chuck
Hi Chuck,
The 'stack' I have in my mind (which isn't really a stack but a 'slider' because they don't wrap each other) goes like this:
1. Resolution = Have identifier, get object = LSID (and to a lesser extent URL for things we don't care about so much, like logos).
2. Harvest = Give me what has changed since... = OAI? (This is still to be fully investigated but is a half-way house to a full-blown query language.)
3. Query = Ask whatever you like = BioCASe, TAPIR, DiGIR or SPARQL.
We have unified on 1. We don't disagree on 2 but it may not be necessary. We are moving towards unifying 3 by unifying the vocabulary used by all the protocols. Different query protocols will probably always be needed for different purposes but the query terms should map to the same place. RDF will figure large going forward as it is the default return type for LSID and so we need to be able to express all our objects in it. We may also need to express them in other ways such as GML.
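For the resolution layer, everything starts from nothing but the identifier. A minimal, hypothetical sketch of pulling apart an LSID of the form used in this thread (urn:lsid:authority:namespace:object, with an optional trailing revision):

```python
def parse_lsid(lsid):
    """Split an LSID into its parts: urn:lsid:authority:namespace:object[:revision].

    Returns a dict; revision is None when absent, as in the examples in
    this thread, e.g. urn:lsid:ubio.org:namebank:11815."""
    parts = lsid.split(":")
    if len(parts) < 5 or parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError("not an LSID: %r" % lsid)
    return {
        "authority": parts[2],  # who to ask, e.g. ubio.org
        "namespace": parts[3],  # e.g. namebank
        "object": parts[4],     # record id within the namespace
        "revision": parts[5] if len(parts) > 5 else None,
    }
```

This also makes concrete why the version-number-as-token idea breaks identity: appending a revision part yields a different LSID.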
Appendix B of the TAG-1 report summarizes the current technology.
http://wiki.tdwg.org/twiki/pub/TAG/TagMeeting1Report/TAG-1_Report_Final.pdf
Hope this helps,
Roger
Chuck Miller wrote:
Sounds to me that we have a multi-layer communications protocol stack in development here, but we aren't spelling out the layers very well. Discussing LSID in the context of biodiversity systems/databases without a clear definition of the necessary underlying layers is confusing me.
Can someone do a more expanded elucidation of the complete LSID/RDF protocol stack? What exactly are we proposing to standardize on besides just the syntax of an LSID.
Chuck
From: Roger Hyam [mailto:roger@tdwg.org] Sent: Mon 6/19/2006 9:37 AM To: David Remsen Cc: tdwg-guid@mailman.nhm.ku.edu Subject: Re: [Tdwg-guid] Throttling searches
Yes it would be violating the LSID ethos to use the version number as a different version number means a different LSID - also what would happen if the LSID already had a version number? Really this stuff is not to do with the LSID 'layer' at all - it is the web services the LSIDs resolve to. There may be all sorts of authentication and authorization wrapped round the web services and we don't want to go trying to leaver that into the GUID technology - in my opinion.
Roger
David Remsen wrote:
We do some of this already with our web services. SOAP methods required a keycode. We use the code so we have a contact in case we need to send a message out as well as to provide a better accounting to sources of how we pass on their content. Patrick (uBio programmer and nice guy) asked why not use the LSID version number as a way to pass a token. If it's not passed you can fall back to one level of processing else give it the extra special treatment with the userID. Or is this violating something sacred in the LSID ethos?
David Remsen
On Jun 19, 2006, at 6:07 AM, Roger Hyam wrote:
You don't! The LSID resolves to the binding to the getMetadata() method - which is a plain old fashioned URL. At this point the LSID authority has done its duty and we are just on a plain HTTP GET call so you can do whatever you can do with any regular HTTP GET. You could stipulate another header field or (more simply) give priority service for those who append a valid user id to the URL (&user_id=12345)
So there is no throttle on resolving the LSID to the getMetadata binding (which is cheap) but there is a throttle on the actual call to get the metadata method. Really you need to do this because bad people may be able to tell from the URL how to scrape the source and bypass the LSID resolver after the first call anyhow. This is especially true if the URL contains the IPNI record ID which is likely. Here is an example using Rod's tester. http://linnaeus.zoology.gla.ac.uk/~rpage/lsid/tester/?q=urn:lsid:ubio.org:na... http://linnaeus.zoology.gla.ac.uk/%7Erpage/lsid/tester/?q=urn:lsid:ubio.org:namebank:11815 The getMetadata() method for this LSID: urn:lsid:ubio.org:namebank:11815 Is bound to this URL: http://names.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank... So ubio would just have to give preferential services to calls like this: http://names.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank... If rogerhyam had paid his membership fees this year. Does this make sense? Roger p.s. You could do this on the web pages as well with a clever little thing to write dynamic tokens into the links so that it doesn't degrade the regular browsing experience and only stops scrapers - but that is beyond my remit at the moment ;) p.p.s. You could wrap this in https if you were paranoid about people stealing tokens - but this is highly unlikely I believe. Sally Hinchcliffe wrote:
How can we pass a token with an LSID? I think the only way to throttle in these situations is to have some notion of who the client is, and the only way to do that is to have some kind of token exchange over HTTP saying who they are. Basically you have to have some kind of client registration system, or you can never distinguish between a call from a new client and a repeat call. The use of IP addresses is a pain because so many people are now behind some kind of NAT gateway.

How about this for a plan: you could give a degraded service to people who don't pass a token (a 5-second delay, perhaps) and offer a quicker service to registered users who pass a token (but then perhaps limit the number of calls they make). This would mean you could offer a universal service even to those with naive client software, but a better service to those who play nicely. You could also get better stats on who is using the service.

So there are ways that this could be done. I expect people will come up with a host of different ways. It is outside LSIDs though.

Roger

Sally Hinchcliffe wrote:

It's not an LSID issue per se, but LSIDs will make it harder to slow searches down. For instance, Google restricts use of its spell checker to 1000 calls a day by means of a key which is passed in with each request. Obviously this can't be done with LSIDs, as then they wouldn't be the same for each user. The other reason why it's relevant to LSIDs is simply that providing a list of all relevant IPNI LSIDs (not necessary to the LSID implementation, but a nice-to-have for caching / lookups by other systems using our LSIDs) also makes life easier for the data scrapers. Also I thought ... here's a list full of clever people; perhaps they will have some suggestions.

Sally
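A rough sketch of the registration-and-token scheme Roger describes above, assuming the resolver can see a per-request token; the token values, quota, and function names are invented for illustration:

```python
import time

# Sketch of the plan above: anonymous callers get a fixed delay (the
# "degraded service"), registered callers who pass a token get a fast
# response up to a daily call limit. All names/values are made up.

REGISTERED_TOKENS = {"token-abc": 1000}   # token -> daily call limit
ANON_DELAY_SECONDS = 5                    # slow lane for anonymous calls

calls_today = {}                          # token -> calls used so far today

def handle_request(token=None):
    if token not in REGISTERED_TOKENS:
        time.sleep(ANON_DELAY_SECONDS)    # degraded service, no registration
        return "resolved (slow lane)"
    used = calls_today.get(token, 0)
    if used >= REGISTERED_TOKENS[token]:
        return "quota exceeded"           # registered but over the daily limit
    calls_today[token] = used + 1
    return "resolved (fast lane)"
```

As Roger notes, a scheme like this sits outside the LSID protocol itself: the token would have to travel in the HTTP layer (for example as a header or query parameter), not in the LSID.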
David Remsen
uBio Project Manager
Marine Biological Laboratory
Woods Hole, MA 02543
508-289-7632
A somewhat related issue: does the LSID spec provide guidelines for when a resolver is not accessible, such as when it is overloaded (I would read the spec myself, but I can't seem to access the OMG site this morning)?
Dave V.
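Whatever the spec says, over the default HTTP GET binding an overloaded resolver can fall back on plain HTTP conventions: a 503 Service Unavailable status with a Retry-After header. A minimal sketch of that response, offered as generic HTTP practice rather than anything the LSID spec mandates:

```python
# Sketch: an overloaded HTTP-bound resolver can signal "try again later"
# with a standard 503 status plus a Retry-After header. This is ordinary
# HTTP behaviour, not an LSID-specific mechanism.

def overload_response(retry_after_seconds=60):
    status = "503 Service Unavailable"
    headers = {"Retry-After": str(retry_after_seconds)}
    return status, headers
```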
Roger Hyam said the following on 6/19/2006 8:33 PM:
Is this an LSID issue? LSIDs essentially provide a binding service between a name and one or more web services (we default to HTTP GET bindings). It isn't really up to the LSID authority to administer any policies regarding the web service, but simply to point at it. It is up to the web service to do things like throttling, authentication and authorization.
Imagine, for example, that the different services had different policies. It may be reasonable not to restrict the getMetadata() calls but to restrict the getData() calls.
The use of LSIDs does not create any new problems that weren't there with web page scraping - or scraping of any other web service.
Just my thoughts...
Roger
Ricardo Scachetti Pereira wrote:
Sally,

You raised a really important issue that we had not really addressed at the meeting. Thanks for that.

I would say that we should not constrain the resolution of LSIDs if we expect our LSID infrastructure to work. LSIDs will be the basis of our architecture, so we had better have good support for them. However, that is surely a limiting factor. Also, server efficiency will likely vary quite a lot, depending on underlying system optimizations and all.

So I think that the solution to this problem is in caching LSID responses on the server LSID stack. Basically, after resolving an LSID once, your server should be able to resolve it again and again really quickly, until something in the metadata that is related to that id changes.

I haven't looked at this aspect of the LSID software stack, but maybe others can say something about it. In any case I'll do some research on it and get back to you.

Again, thanks for bringing it up. Cheers,

Ricardo
--
Roger Hyam Technical Architect Taxonomic Databases Working Group
http://www.tdwg.org roger@tdwg.org
+44 1578 722782
participants (9)
- Chuck Miller
- Dave Vieglais
- David Remsen
- Paul Kirk
- Ricardo Scachetti Pereira
- Roderic Page
- Roger Hyam
- Sally Hinchcliffe
- Steven Perry