LSIDs: web based (HTTP) resolvers and web crawlers
Hi,
Further to my last design question about LSID HTTP proxies (thanks for the responses), I wanted to raise the issue of HTTP LSID proxies and crawlers, in particular the crawl-delay part of the robots exclusion protocol.
I'll outline a situation we had recently:
The GBIF portal and the ZipCodeZoo site both include IPNI LSIDs in their pages. These are presented in their proxied form using the TDWG LSID resolver (e.g. http://lsid.tdwg.org/urn:lsid:ipni.org:names:783030-1). Using the TDWG resolver to access the data for an IPNI LSID does not issue any kind of HTTP redirect; instead the web resolver performs the LSID resolution steps itself to get the data and presents it in its own response (i.e. returning an HTTP 200 OK).
The problem happens when one of these sites that includes proxied IPNI LSIDs is crawled by a search engine. The proxied links appear to belong to tdwg.org, so whatever crawl delay is agreed between TDWG and the crawler in question is used. The crawler has no knowledge that behind the scenes the TDWG resolver is hitting ipni.org. We (ipni.org) have agreed our own crawl limits with the major search engines using directives in robots.txt, and have agreed limits directly with Google (who don't use the robots.txt directives directly).
On a couple of occasions in the past we have had to deny access to the TDWG LSID resolver, as it has been responsible for far more traffic than we can support (up to 10 times the crawl limits we have agreed with search engine bots). This is due to the pages on the GBIF portal and/or ZipCodeZoo being crawled by a search engine, which in turn triggers a high volume of requests from TDWG to IPNI. The crawler itself has no knowledge that it is in effect accessing data held at ipni.org rather than tdwg.org, as the HTTP response is a plain 200.
One of Rod's emails recently mentioned that we need a resolver that acts like TinyURL or bit.ly. I have pasted below the HTTP headers for a request to the TDWG LSID resolver, and to TinyURL and bit.ly. To the end user it looks as though tdwg.org is the true location of the LSID resource, whereas TinyURL and bit.ly both just redirect traffic.
I'm just posting this for discussion really - if we are to mandate use of a web-based HTTP resolver/proxy, it should really issue 30* redirects so that established crawl delays between producer and consumer are honoured. The alternative would be for the HTTP resolver to read and process the directives in robots.txt, but this would be difficult to implement as it is not in itself a crawler, just a gateway.
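For illustration, a minimal sketch of what a redirecting proxy could look like as an Apache mod_rewrite rule (the pattern and target URL scheme here are hypothetical - a real rule would need a lookup table mapping each LSID authority to its preferred HTTP endpoint):

# Hypothetical: redirect proxied LSID requests to the authority's own
# HTTP gateway, instead of answering with a 200 on the proxy's behalf.
RewriteEngine On
RewriteRule ^/urn:lsid:([^:]+):(.+)$ http://$1/lsid/$2 [R=302,L]

With something like this in place, a crawler would follow the redirect to ipni.org and apply whatever crawl limits have been agreed with ipni.org rather than with tdwg.org.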
I'm sure that if proxied forms of LSIDs become more prevalent this problem will become more widespread, so now - with the ongoing attempt to define what services a GUID resolver should provide - might be a good time to plan how to fix this.
cheers, Nicky
[nn00kg@kvstage01 ~]$ curl -I http://lsid.tdwg.org/urn:lsid:ipni.org:names:783030-1
HTTP/1.1 200 OK
Via: 1.1 KISA01
Connection: close
Proxy-Connection: close
Date: Mon, 27 Apr 2009 11:41:55 GMT
Content-Type: application/xml
Server: Apache/2.2.3 (CentOS)

[nn00kg@kvstage01 ~]$ curl -I http://tinyurl.com/czkquy
HTTP/1.1 301 Moved Permanently
Via: 1.1 KISA01
Connection: close
Proxy-Connection: close
Date: Mon, 27 Apr 2009 12:16:38 GMT
Location: http://www.ipni.org/ipni/plantNameByVersion.do?id=783030-1&version=1.4&a...
Content-type: text/html
Server: TinyURL/1.6
X-Powered-By: PHP/5.2.9

[nn00kg@kvstage01 ~]$ curl -I http://bit.ly/KO1Ko
HTTP/1.1 301 Moved Permanently
Via: 1.1 KISA01
Connection: Keep-Alive
Proxy-Connection: Keep-Alive
Content-Length: 287
Date: Mon, 27 Apr 2009 12:19:48 GMT
Location: http://www.ipni.org/ipni/plantNameByVersion.do?id=783030-1&version=1.4&a...
Content-Type: text/html;charset=utf-8
Server: nginx/0.7.42
Allow: GET, HEAD, POST
- Nicola Nicolson
- Science Applications Development,
- Royal Botanic Gardens, Kew,
- Richmond, Surrey, TW9 3AB, UK
- email: n.nicolson@rbgkew.org.uk
- phone: 020-8332-5766
Dear Nicky,
Ouch!
I'm not sure I fully understand how 30* redirects work with respect to web crawlers, but I'm not sure they will help in this case.
If the TDWG LSID resolver is a full-blown resolver, then for each request from the crawler it will be doing the full LSID resolution (three calls: one for the authority WSDL, one for the service WSDL, one for the metadata). It may cache the WSDLs, but it will still make at least one call to the service (unless it has cached the metadata as well).
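Roughly, in curl terms, the three calls look something like this (the endpoint URLs are hypothetical; the HTTP binding passes the LSID as a query parameter):

# 1. find the authority service via a DNS SRV lookup
dig +short _lsid._tcp.ipni.org SRV
# 2. fetch the authority WSDL, which points at the metadata service
curl 'http://lsid-authority.example.org/authority/?lsid=urn:lsid:ipni.org:names:783030-1'
# 3. ask the metadata service for the record itself (getMetadata)
curl 'http://lsid-authority.example.org/authority/metadata/?lsid=urn:lsid:ipni.org:names:783030-1'

So even a single crawler hit on the proxy can fan out into several requests against the authority.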
Is the solution to add TDWG to your robots.txt file, and have the LSID resolver respect the settings in that file? TDWG could also implement metadata caching so it wouldn't need to hammer you so much (i.e., when a crawler hit TDWG, TDWG would reply with the cached metadata).
Perhaps LSID services such as IPNI's could also implement ETag headers, which would help avoid excessive traffic from TDWG when caching (TDWG could regularly cache metadata from IPNI, respecting the robots.txt files, and first checking whether the metadata had changed using ETag and/or Last-Modified headers).
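A conditional request is just an extra header; for example (the ETag value here is made up, and it is shown against the TDWG proxy URL for illustration - in practice TDWG would send it to IPNI's metadata endpoint):

curl -I -H 'If-None-Match: "abc123"' http://lsid.tdwg.org/urn:lsid:ipni.org:names:783030-1

A 304 Not Modified response would tell the cache that its stored copy is still current, at the cost of only a header exchange.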
I assume the DOI resolver has similar issues. Its robots.txt file looks like this:
Crawl-delay: 5
Request-rate: 1/5
Hope this makes sense; my understanding of HTTP headers/redirects/robots.txt is not particularly deep.
Regards
Rod
---------------------------------------------------------
Roderic Page
Professor of Taxonomy
DEEB, FBLS, Graham Kerr Building
University of Glasgow
Glasgow G12 8QQ, UK
Email: r.page@bio.gla.ac.uk
Tel: +44 141 330 4778
Fax: +44 141 330 2792
AIM: rodpage1962@aim.com
Facebook: http://www.facebook.com/profile.php?id=1112517192
Twitter: http://twitter.com/rdmpage
Blog: http://iphylo.blogspot.com
Home page: http://taxonomy.zoology.gla.ac.uk/rod/rod.html
This confirms my growing prejudice that third-party GUID providers are bad news beyond a simple PURL service. Even then it is better if the owner of the PURL service is the owner of the data, i.e. only use someone else's PURL if you don't/can't own your own domain. Hands up who is incapable of buying a domain name!
Roger
-------------------------------------------------------------
Roger Hyam
Roger@BiodiversityCollectionsIndex.org
http://www.BiodiversityCollectionsIndex.org
-------------------------------------------------------------
Royal Botanic Garden Edinburgh
20A Inverleith Row, Edinburgh, EH3 5LR, UK
Tel: +44 131 552 7171 ext 3015
Fax: +44 131 248 2901
http://www.rbge.org.uk/
-------------------------------------------------------------
It's not owning the domain that seems to be the problem - it's apparently adding the DNS SRV record that is too technical to cope with (although I didn't think this was much harder than buying a domain name).
And I still think it is going to be essential to provide LSID / GUID hosting services for those who can't set them up themselves.
Kevin
I'm not sure if anyone has suggested this strategy (I'll be surprised if not):
TDWG seems determined to use LSIDs for GUIDs, yet the technical issues for implementation are discouraging enough for some to defer deployment. Perhaps TDWG could offer as a bonus for membership (or perhaps a small additional charge) the provision of some elements of the LSID infrastructure stack, overloading the tdwg.org domain?
Then, instead of having each institution create DNS entries such as "mydepartment.institution.org" and deal with the SRV details, use TDWG as a kind of registrar and do something like "mycollectionid.tdwg.org". TDWG would then be responsible for the appropriate DNS SRV registration, and could even operate a resolver and/or redirection service for that domain.
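As a sketch, the DNS side of this is a single record per hosted authority; the LSID spec's convention is an SRV record under the _lsid._tcp prefix (the names and port below are hypothetical):

; BIND zone fragment for a hosted LSID authority
_lsid._tcp.mycollectionid.tdwg.org. IN SRV 1 0 80 lsid-authority.tdwg.org.

TDWG would maintain these records centrally, and the target host could be either a TDWG-run resolver or a redirect service.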
The income would not be very much (say $10/year per org * 100 participants = $1k), but it should be a lot less expensive for the entire community than the total cost for each organization operating their own infrastructure (perhaps $10/year DNS + $1000/year operations * 100 participants = $101k).
So as not to overload the TDWG infrastructure, it would be sensible to encourage technically astute groups (e.g. GBIF, EoL, NCEAS) to contribute computing cycles and fallback DNS services to ensure reliability of the entire system.
The end result could be a reliable, distributed infrastructure for LSID (or whatever GUID scheme is decided upon) resolution that conforms to the requirements / specifications derived from the TDWG procedures at a small cost for participation. The low cost and deferral of technical overhead to a knowledgeable group would hopefully encourage participation by a broader audience in this piece of fundamental architecture.
(It may also help reduce the endless cycles of discussion about GUIDs and LSIDs)
Dave V.
I like the idea in principle, but I would do it directly with the "technically astute" organizations and let them keep the modest revenue. More precisely, I think managing something that is deemed a critical resource should be done by an organization that has professional staff to do it, if possible. GBIF seems the most broadly connected to TDWG. That's where I would do it. In fact, aren't they already managing some TDWG services?
Bob
Dear Dave,
Good to read something about this issue on this list! I like the idea; it reminds me a bit of how Handle and/or DOI manage this. The DOI system allows registrants to act as DOI resellers, and this works. Prices vary: packages of several thousand DOIs are sold for an annual fee (http://www.medra.org/en/terms.htm) or given away for free (http://www.std-doi.de). Unless someone does it for free (GBIF?), selling LSIDs could be part of the business model for TDWG. And in analogy to the Handle system, a fee for registering some blabla.tdwg.org subdomain (was it authority?) as you mentioned would surely help to make the whole LSID system more persistent.
Well, and I hope the LSID registrant(s) would manage the metadata issue better than the DOI system does. Most people overlook that while DOI registrants do have to register DOIs plus metadata, there is no common way to retrieve this metadata. CrossRef, which is mentioned frequently here, does not hold the metadata for every DOI, only for those DOIs it registered. Try to get metadata for some DOIs from the other registrants, such as doi:10.1594/PANGAEA.339110.
best regards, Robert
Probably no one knew about it, but TDWG did offer to help with this (for free) for a long time (since Aug 2007), though I believe no one took it up:
http://www.tdwg.org/activities/online-services/lsid-authority-ids/
I think the idea is worth exploring further. Perhaps a quick "hands up who wants this?" vote to canvass interest?
Crazy idea: I would propose taking it further and "centralising" a cache of the LSID responses as well (with a configurable expiry date on items in the cache), to alleviate load on source servers (thinking of IPNI, for example). It would not be drastically expensive to offer a billion-LSID cache with high availability using cloud computing ("centralised" in the sense that it sits on a distributed cloud such as Amazon S3 + CloudFront). We could share Amazon S3 as a data store and everyone just pays their own PUT cost ($0.01 per 1,000 records) and shares the bandwidth cost. I think this is worth exploring regardless of the GUID technology used... What was it Rod said - "distributed begets centralized".
Tim
I'm with Tim on this one, and taking one of Rod's other posts ("LSIDs, disaster or opportunity", http://iphylo.blogspot.com/2009/04/lsids-disaster-or-opportunity.html) a bit further, I think coming up with a simple, extendable URL resolver would give us many benefits and allow LSIDs with extra, added information around them for all to use. Looking at his example, a URL would get permanent tracking that would also record referrers, location and traffic.

A summary of the link could even be a page in itself: a cached version, a screenshot, or just a scrape of the code - pulling out the HTML tags - for future reference in case the real link goes down. We could use the ability to create a customizable prefix (e.g. http://someresolvr.com/bhl/SDFoijF) to roughly follow DOI conventions, but could even save old DOIs or handles for historical purposes in a field attached to the new URL, or for reuse, making the new URL resolve to a current DOI with a simple suffix on the end (e.g. http://someresolvr.com/bhl/SDFoijF/DOI).

In the same way we could use user input, plus data pulled about the URL semantically, to generate RDFa (using pyRdfa, http://www.w3.org/2007/08/pyRdfa/), then expose that for all newly created URLs and agree a standard to make it predictable (e.g. http://someresolvr.com/bhl/SDFoijF/RDF). The example at bit.ly shows the use of Open Calais (http://opencalais.com/) to get more background information on the original link; the resolver could also be pointed at other services we provide/use in biodiversity to give a snapshot of more context/content across the board. Users of the service could log in to examine/add/edit the data by hand if desired, so they would still retain ultimate control over how their record is presented. Thus, from a simple URL, we could build a complete summary that builds on what we're given while sharing it all back out.
Then the architecture (aka the fun part) would be simple and distributed. A web server able to process PHP, running CouchDB (http://couchdb.apache.org/), would be all that is needed to run the resolver. CouchDB is schema-less, so the way it handles replication is very simple; it is built to be distributed, only handing out the bits that have changed during replication, and to scale in this manner.

Having a batch of main servers behind a URL in a pooled setup (think of a simplified, smaller version of the pool of networked Unix time servers, http://www.pool.ntp.org/) would allow round-robin DNS, or a ucarp setup ("ucarp allows a couple of hosts to share common virtual IP addresses in order to provide automatic failover", http://www.ucarp.org/project/ucarp), so if one main server went down, another would automatically take over without the user needing to change the URL. To handle heavy usage of the main servers we could use the idea of primary and secondary servers as outlined in the pool.ntp.org model, so an institution with heavy usage could become a secondary host and run their own resolver simply, with almost no maintenance. They would just need the PHP files, which would be a versioned project, plus a cron task to replicate the database from a pool of the main servers. The institution's resolver could be customized to appear as their own (e.g. http://someresolvr.bhl.org/bhl/SDFoijF) and, for simplicity, could be read-only. This way a link like http://someresolvr.com/bhl/SDFoijF could be resolvable against any institution's server, like http://someresolvr.bhl.org/bhl/SDFoijF or http://someresolvr.ebio.org/bhl/SDFoijF, as all of the databases would be the same, although maybe a day behind depending on the replication schedule. New entries would only be entered on a main server, or in "the pool" (e.g. http://pool.someresolvr.com/); those changes would then be in the database to be handed out to all on the next replication (I won't add my P2P ideas in this email - they may not be needed for the deltas that would have to be transferred daily or weekly).

Add to all of this that CouchDB is designed as "a distributed, fault-tolerant and schema-free document-oriented database", which fits what we want to do: build a store of documents (data) about a URL that we can serve, while being a permanent, sustainable resolver to the original document. If the service ever died, it could be resurrected from anyone's copy of the database (think LOCKSS - Lots of Copies Keep Stuff Safe, http://www.lockss.org/lockss/Home), so that no data (original or accumulated) would be lost. The data could be exported from the database as XML, and then migrated from that to any desired platform.
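As a sketch of how little machinery this needs, CouchDB is driven entirely over plain HTTP (the database and document names below are made up):

# create the database once
curl -X PUT http://localhost:5984/resolver
# store a short-URL record as a JSON document
curl -X PUT http://localhost:5984/resolver/bhl-SDFoijF -d '{"target": "http://www.example.org/original-link"}'
# resolution = fetch the document, then redirect to its "target"
curl http://localhost:5984/resolver/bhl-SDFoijF

The PHP layer would do little more than that last step plus issue the 30x redirect.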
I have not been dealing with LSIDs as long as most on this list, so I expect I'm glossing over (or missing) some of the concepts - please let me know what I am lacking. This is a needed service, and a project I'd like to be involved in building.
Thanks
P
I have been playing around with a prototype for such a system in my "spare" time. See it at http://lsid.orpington.co.nz/ (log in with user "test", password "test").
Kevin
I have no real understanding of business (as will become obvious), but $10 a year strikes me as a bad idea:
1. It's a tiny amount of money, which implies the service itself isn't worth much. Organisations are used to paying big license fees, and individuals will happily stump up US$99 per annum for services such as Dropbox. Asking $10 makes it seem trivial (especially as, unlike domain name registration, there's no economy of scale for TDWG).
2. It closes the possibility of a market for these services, in the sense that $10 isn't an incentive to get others involved (which I think is ultimately what you want). If people's commercial livelihoods depended on this stuff, we'd stop faffing around and get solutions up and running.
3. Any exchange of money implies a commitment to service, which requires resources, which TDWG doesn't seem to have, so the cost of the service will have to cover that commitment.
Perhaps I've misunderstood, and the service is just the DNS SRV record, but I think if you're going to talk about money you want to think about providing more services, to take the hassle away from providers. There should be the option of on-site, drop-in installation and help for big players, and simple, easy-to-use tools for smaller providers.
Regards
Rod
I apologize for not having more time to keep up with the discussions on this forum, but they are very interesting, relevant, and timely. This thread in particular has caught my attention, because it touches on things I've been thinking about a lot since around the turn of the century.
Tim already pointed out that TDWG has been offering the sort of service people are suggesting here. In fact, it was suggested and discussed during at least one of the GBIF/TDWG GUID workshops, and most people then thought it was a good idea, but it doesn't seem to have gained much traction. At least not until now (maybe). Anyway, consider me a strong (and long-standing) supporter of an LSID hosting service, and TDWG seems like the perfect organization for this task, provided that the resources and technical expertise are available to support it.
There is so much I want to say about this issue, but I neither want to bore everyone nor stay awake for yet another hour tonight writing a long diatribe.
So I'll cut to the chase.
The most frustrating thing to me about all of these GUID discussions is the perpetual conflation of the needs for identification, and the needs for metadata and data resolution. We data nerds think mostly about identification, while the app developers are primarily focused on resolution. I think if we recognize these two (very different) needs, and are careful to sharpen the focus of our discussions accordingly, we'll make a lot more progress with much better efficiency.
At the pure identity end of the spectrum, most databases go with integers as local identifiers. Unfortunately, integers (especially the ones that are sequential and start with "1") are useless without some sort of context. We also have UUIDs. Wonderful, ubiquitous, locally-generated, adequately unique on a global scale, supported by most major DBMS apps, etc. And, again: by themselves, they are utterly unresolvable.
At the resolution end of the spectrum, we see a lot of appeal for PURLs. No need for special programming to parse and resolve, no special SRV records on DNS, etc. Unfortunately -- whether deserved or not -- PURLs are tarnished by the historical impermanence of their unqualified URL brethren. Even if we assume a strong commitment to the social contract of permanency, who's to say that 50 years from now any domain name will still be intact (indeed, whether http is even relevant anymore)? Even if we can commit to the "P" of PURLs for the short term, they're a bit of a gamble for the long term.
In between these two, we have LSIDs, DOIs and Handles (of course, DOIs are Handles). These all incorporate aspects of both identity and resolution, but do neither perfectly. If I type any of the following into a web browser:
doi:10.1594/PANGAEA.339110
urn:lsid:zoobank.org:act:8BDC0735-FEA4-4298-83FA-D04F67C3FBEC
10199/15417
...I get bupkis.
So, in my mind, they are *almost* self-resolving, and *not quite* identifiers. The reason I think of them as "not quite" identifiers is because the first two embed some syntax-dependent resolving information within them (leading to problems with opacity and potentially with permanence); and the third one, though lacking any resolution baggage, is also approximately 0.66154245313614840760199779464228.
I have said this before, and I will say it again: I think it would be WONDERFUL if the biodiversity informatics community all agreed to incorporate UUIDs as the standard GUID for all data objects that are shared or exposed outside of a local database. The fact that they look ugly, are difficult to type, and are impossible to memorize is a red herring. If you've ever used a PC, Mac, or Linux-based computer, you have used UUIDs. Probably hundreds, or even thousands of them. You just never knew it. And that, in my mind, is the hallmark of an effectively used GUID -- i.e., one that the end-user doesn't even know exists.
Following up on Peter DeVries' post:

If you want to keep the branding on the identifier you could also do something like this.

http://lod.ipni.org/ipni-org_names_783030-1 <- the entity or concept, 303 redirect to either
http://lod.ipni.org/ipni-org_names_783030-1.html <- human readable page
http://lod.ipni.org/ipni-org_names_783030-1.rdf <- rdf data
Why not take it a step further and go with UUIDs in place of "ipni-org_names_783030-1"?
Couldn't the free and ubiquitous Google cache provide some caching of these normal URIs?
Well...sort of. The problem is, you don't always know what you'll get back from Google. For example, I get 43 links from a Google search on "8BDC0735-FEA4-4298-83FA-D04F67C3FBEC". Which one do I go to to get my metadata? One of the links Google provided was particularly interesting:
http://lists.tdwg.org/pipermail/tdwg-tag/2009-March/000393.html
If I'd had time to follow this forum since March, I would have seen that Roger has already made some of the points I was about to make in this post. Since they've already been made, I will just reiterate:
In my personal biodiversity informatics utopia, this would be the identifier: 8BDC0735-FEA4-4298-83FA-D04F67C3FBEC
...and these would all be legitimate ways of resolving the exact same information, formatted according to some standard set of indicators along the lines of what Peter was suggesting:
urn:lsid:zoobank.org:act:8BDC0735-FEA4-4298-83FA-D04F67C3FBEC (just because)
http://zoobank.org/urn:lsid:zoobank.org:act:8BDC0735-FEA4-4298-83FA-D04F67C3FBEC (Human readable)
http://purl.zoobank.org/urn:lsid:zoobank.org:act:8BDC0735-FEA4-4298-83FA-D04F67C3FBEC (to make Roger happy)
http://zoobank.org/authority/?lsid=urn:lsid:zoobank.org:act:8BDC0735-FEA4-4298-83FA-D04F67C3FBEC (to make supporters of LSIDs happy)
http://zoobank.org/8BDC0735-FEA4-4298-83FA-D04F67C3FBEC (to make me happy)
http://uuid.tdwg.org/8BDC0735-FEA4-4298-83FA-D04F67C3FBEC (to make Dave Vieglais happy)
http://lsid.tdwg.org/urn:lsid:zoobank.org:act:8BDC0735-FEA4-4298-83FA-D04F67C3FBEC (to make Lee Belbin happy)
http://cache.gbif.org/?uuid=8BDC0735-FEA4-4298-83FA-D04F67C3FBEC (to make Tim happy)
...and so on, and so on, and so on.
No, we shouldn't do all of them. Neither should we do only one of them. Let's figure out a few alternatives that give LSIDs a fair shake (before we abandon them altogether), and most importantly, dissociate the identifier from the resolution protocol.
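As a minimal Python sketch of that dissociation (the wrapper URLs are just the hypothetical resolvers listed above):

import uuid

# The identifier itself: opaque, locally generated, globally unique,
# and entirely silent about how (or whether) it resolves.
guid = str(uuid.uuid4()).upper()

# Resolution is a separate, swappable concern: the same UUID can be
# wrapped by any number of services without ever changing the identifier.
for template in (
    'urn:lsid:zoobank.org:act:{0}',
    'http://zoobank.org/{0}',
    'http://uuid.tdwg.org/{0}',
    'http://cache.gbif.org/?uuid={0}',
):
    print(template.format(guid))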
On the topic of centralized GUID resolution.... Over a decade ago, before I knew what a UUID was, I tried to lobby TDWG to create a GUID issuing service that our community could adopt. Back then I was thinking in terms of simply issuing integers that individual organizations could reserve in blocks of a million at a time (or whatever large number). Now that UUIDs have attained such ubiquity, I no longer think such a service is necessary. However, I have been a strong supporter of distributed content -- not in the sense of DiGIR, where there are (theoretically) non-overlapping blocks of content that get assembled and stacked at query time; but rather in the sense of mirroring or replication of ALL content, on ALL servers.
I first started pushing for this during the days of the All-Species Foundation. I didn't (and still don't) think that this concept will go over very well for proprietary content, such as specimen data, images, and other similar sorts of content (perhaps someday, but I don't think our community is ready for that just yet). But certainly for things we all share -- taxonomy, literature, agent data, geography, etc. -- there was (and still is) a great deal of potential value in the notion of "once digitized, always available". Redundancy of effort is useful to an extent, but I think we have wasted a lot of time populating different databases belonging to different organizations with records representing the exact same objects (e.g., the citation record for Linnaeus, 1758), and even more time (specifically, *my* time) trying to cross-link different databases with overlapping content.
So....I'm going to "see" Tim's crazy idea, and "raise" him an even crazier one: rather than "a" centralized cache of LSID response content (with expiration), why don't we have dozens or hundreds of mirror copies of *all* the content? We don't even have to confine it to LSIDs -- make it compatible with several of the most commonly used GUID protocols. I'm assuming that technology allows for maintaining this via replication, etc. The only issues that need to be worked out are a security/authorization mechanism somewhere between the Wikipedia model and the ITIS model (first one that came to my head), and/or a robust audit system for tracking (and rolling back) content edits.
The GNA is already heading this way for both GNI and GNUB; and there is talk of something similar for literature citations (which would include agents), as well as for shape files for species distributions. What I'd like to see is a "Global Shared Biodiversity Data Repository" as more than just a centralized cache of metadata, but a common infrastructure that supports broad global replication (and associated automated synchronization) of anything and everything that our community is willing to share. I would propose that each object be identified by a UUID, and that any one of the dozens/hundreds of replicates could establish whatever services they want on top of that content. Different hosts might create different services for content resolution catered to different community needs. Some would wrap the UUIDs in LSID syntax, some might convert them to Handles, some would represent them as PURLs, etc., etc. The important thing is that we have a common standard for identifiers (UUIDs), built within an architecture that can support multiple and evolving resolution protocols, mechanisms, and services.
Well, dang! Not only did I just blow an hour, but I suspect I bored most of you to tears.
Sorry about that...
Aloha, Rich
Richard L. Pyle, PhD Database Coordinator for Natural Sciences and Associate Zoologist in Ichthyology Department of Natural Sciences, Bishop Museum 1525 Bernice St., Honolulu, HI 96817 Ph: (808)848-4115, Fax: (808)847-8252 email: deepreef@bishopmuseum.org http://hbs.bishopmuseum.org/staff/pylerichard.html
In re-reading my post, I am reminded of one other (quite legitimate) objection to UUIDs as identifiers. When embedded within an HTTP proxy, we get the dreaded wrapping on certain email forums; to wit:
==============================================
urn:lsid:zoobank.org:act:8BDC0735-FEA4-4298-83FA-D04F67C3FBEC (just because) http://zoobank.org/urn:lsid:zoobank.org:act:8BDC0735-FEA4-4298-83FA-D04F67C3 FBEC (Human readable) http://purl.zoobank.org/urn:lsid:zoobank.org:act:8BDC0735-FEA4-4298-83FA-D04 F67C3FBEC (to make Roger happy) http://zoobank.org/authority/?lsid=urn:lsid:zoobank.org:act:8BDC0735-FEA4-42 98-83FA-D04F67C3FBEC (to make supporters of LSIDs happy) http://zoobank.org/8BDC0735-FEA4-4298-83FA-D04F67C3FBEC (to make me happy) http://uuid.tdwg.org/8BDC0735-FEA4-4298-83FA-D04F67C3FBEC (to make Dave Vieglias happy) http://lsid.tdwg.org/urn:lsid:zoobank.org:act:8BDC0735-FEA4-4298-83FA-D04F67 C3FBEC (to make Lee Belbin happy) http://cache.gbif.org/?uuid=8BDC0735-FEA4-4298-83FA-D04F67C3FBEC (to make Tim happy) ...and so on, and so on, and so on.
==============================================
Ouch.
Oh, well....perhaps we need a tinyurl service to allow us to embed example UUID resolvers within force-wrapped email-message forums.
Aloha, Rich
Kevin,
I agree, technically it is very easy. Corporate IT groups in largish government institutions such as ours, however, have a deeply ingrained mistrust of anything "new", especially if it comes from an online community such as ours.
I know I could do this in 5 minutes while sipping a coffee, but instead I'd have to generate an overview document explaining why the change is necessary, along with a project plan, and a change control document indicating what is to happen and what happens if the change doesn't work. Sheer tedium.
Cheers, Ben
________________________________
From: tdwg-tag-bounces@lists.tdwg.org [mailto:tdwg-tag-bounces@lists.tdwg.org] On Behalf Of Kevin Richards Sent: Tuesday, 28 April 2009 5:27 To: Roger Hyam; Roderic Page Cc: tdwg-tag@lists.tdwg.org Subject: Re: [tdwg-tag] LSIDs: web based (HTTP) resolvers and web crawlers
It's not owning the domain that seems to be the problem - it's apparently adding the DNS SRV record that is too technical to cope with (although I didn't think this was much harder than buying a domain name).
And I still think it is going to be essential to provide LSID / GUID hosting services for those who can't set them up themselves.
Kevin
From: tdwg-tag-bounces@lists.tdwg.org [mailto:tdwg-tag-bounces@lists.tdwg.org] On Behalf Of Roger Hyam Sent: Tuesday, 28 April 2009 2:17 a.m. To: Roderic Page Cc: tdwg-tag@lists.tdwg.org Subject: Re: [tdwg-tag] LSIDs: web based (HTTP) resolvers and web crawlers
This confirms my growing prejudice that third-party GUID providers are bad news beyond a simple PURL service. Even then it is better if the owner of the PURL service is the owner of the data, i.e. only use someone else's PURL if you don't/can't own your own domain. Hands up who is incapable of buying a domain name!
Roger
On 27 Apr 2009, at 14:30, Roderic Page wrote:
Dear Nicky,
Ouch!
I'm not sure I fully understand how 30* redirects work with respect to web crawlers, but I'm not sure they will help in this case.
If the TDWG LSID resolver is a full blown resolver, then for each request from the crawler it will be doing the full LSID resolution (three calls, one for authority WSDL, one for service WSDL, one for metadata). It may cache the WSDLs, but it will still do at least one call to the service (unless it has cached the metadata as well).
Is the solution to add TDWG to your robots.txt file, and have the LSID resolver respect the settings in that file? TDWG could also implement metadata caching so it wouldn't need to hammer you so much (i.e., when a crawler hit TDWG, TDWG would reply with the cached metadata).
Perhaps LSID services such as IPNI's could also implement ETag headers, which would help avoid excessive traffic from TDWG when caching (TDWG could regularly cache metadata from IPNI, respecting the robots.txt files, and first checking whether the metadata had changed using ETag and/or Last-Modified headers).
I assume the DOI resolver has similar issues. Its robots.txt file looks like this:
Crawl-delay: 5
Request-rate: 1/5
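For what it's worth, a rough Python sketch of what a resolver-side cache along these lines might do, assuming the provider publishes a robots.txt and sends ETag headers (the URLs, user-agent string, and stored ETag are illustrative, not real endpoints):

import time
import urllib.error
import urllib.request
import urllib.robotparser

BASE = 'http://www.ipni.org'  # illustrative provider

# Honour the provider's robots.txt crawl delay...
rp = urllib.robotparser.RobotFileParser(BASE + '/robots.txt')
rp.read()
delay = rp.crawl_delay('tdwg-lsid-resolver') or 5

# ...and revalidate cached metadata with a conditional GET, so unchanged
# records cost the provider a 304 rather than a full response.
url = BASE + '/ipni/plantNameByVersion.do?id=783030-1'
cached_etag = '"etag-from-previous-response"'  # would come from the cache
request = urllib.request.Request(url, headers={'If-None-Match': cached_etag})
try:
    response = urllib.request.urlopen(request)
    body = response.read()  # 200: metadata changed, refresh the cache
except urllib.error.HTTPError as err:
    if err.code == 304:
        body = None  # not modified: serve the cached copy instead
    else:
        raise
time.sleep(delay)  # respect the agreed delay before the next request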
Hope this makes sense, my understanding of the HTTP headers/redirects/robots.txt is not particularly deep.
Regards
Rod
------------------------------------------------------------- Roger Hyam Roger@BiodiversityCollectionsIndex.org http://www.BiodiversityCollectionsIndex.org -------------------------------------------------------------
Royal Botanic Garden Edinburgh
20A Inverleith Row, Edinburgh, EH3 5LR, UK
Tel: +44 131 552 7171 ext 3015
Fax: +44 131 248 2901
-------------------------------------------------------------
Hi Rod,
I'm fairly sure an HTTP 30* redirect would help - if you think about what a web crawler is doing, it's just processing the contents of a page and, whilst doing that, building a list of links for further processing. If referencing one of those links returns a redirect response with another URL to try, the returned URL is pushed onto the queue of links to be processed.
The TDWG resolver could fairly easily return a 301 (or some other variant of 30* redirect if more appropriate) as it's not embellishing the IPNI data at all; it is presented "as is", ie compare: http://lsid.tdwg.org/urn:lsid:ipni.org:names:30000959-2 and http://www.ipni.org/ipni/plantNameByVersion.do?id=30000959-2&version=1.1... (the latter being the end address used to access LSID metadata at ipni.org).
Only the "summary" page adds anything to the metadata - reformatted into a more user friendly layout: http://lsid.tdwg.org/summary/urn:lsid:ipni.org:names:30000959-2
As you point out, the TDWG LSID resolver is indeed a full-blown LSID resolver, and hence also generates calls to access the WSDL(s) for the LSID authority, in addition to the call required to get the metadata. The authority WSDL is the same every time, so could well be cached. According to the spec, the service WSDL must indicate whether the requested LSID does in fact exist. But slowing the traffic down from the current 3 calls per request with no crawl delay to:
1 x potentially cached request for the authority WSDL
1 x request for the service WSDL
1 x HTTP 30* redirected, and hence crawl-delayed, request for the actual metadata address
...should improve our situation a bit.
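To illustrate how little the redirecting version of the resolver would need to do, here is a minimal Python sketch; the LSID-to-IPNI URL mapping and the fixed version number are assumptions based on the addresses above, not how the TDWG resolver is actually implemented.

from http.server import BaseHTTPRequestHandler, HTTPServer

METADATA_URL = ('http://www.ipni.org/ipni/plantNameByVersion.do'
                '?id={id}&version=1.1&output_format=lsid-metadata')

class RedirectingResolver(BaseHTTPRequestHandler):
    def do_GET(self):
        # Requests arrive as e.g. /urn:lsid:ipni.org:names:30000959-2
        lsid = self.path.lstrip('/')
        if lsid.startswith('urn:lsid:ipni.org:names:'):
            name_id = lsid.rsplit(':', 1)[-1]
            # Redirect rather than fetch on the client's behalf, so a
            # crawler sees ipni.org and applies its agreed crawl delay.
            self.send_response(301)
            self.send_header('Location', METADATA_URL.format(id=name_id))
            self.end_headers()
        else:
            self.send_error(404, 'unknown LSID authority')

if __name__ == '__main__':
    HTTPServer(('', 8080), RedirectingResolver).serve_forever()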
I'm not sure how one would go about making the TDWG resolver implement the robots exclusion protocol, as the resolver is not itself a crawler.
cheers, Nicky
- Nicola Nicolson - Science Applications Development, - Royal Botanic Gardens, Kew, - Richmond, Surrey, TW9 3AB, UK - email: n.nicolson@rbgkew.org.uk - phone: 020-8332-5766
This seems to be another example of how the use of LSIDs creates problems and adds costs for data providers.
It would be much more straightforward to adopt the linked data standards and have this data available in a widely supported standard.
Here is one linked data alternative:
http://lod.ipni.org/names/783030-1 <- the entity or concept ... redirects via 303 to either
http://lod.ipni.org/names/783030-1.html <- human readable page
http://lod.ipni.org/names/783030-1.rdf <- rdf data
See http://linkeddata.org/guides-and-tutorials
Test with this service http://validator.linkeddata.org/vapour
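A minimal Python sketch of that 303 pattern (the host and paths are the hypothetical lod.ipni.org examples above; serving the actual representations is left out):

from http.server import BaseHTTPRequestHandler, HTTPServer

class LinkedDataRedirector(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path.endswith(('.html', '.rdf')):
            # The concrete representations would be served here.
            self.send_error(404, 'representation serving not sketched')
            return
        # An extensionless URI names the concept itself; 303-redirect
        # the client to whichever representation its Accept header wants.
        accept = self.headers.get('Accept', '')
        suffix = '.rdf' if 'application/rdf+xml' in accept else '.html'
        self.send_response(303)
        self.send_header('Location', self.path + suffix)
        self.end_headers()

if __name__ == '__main__':
    HTTPServer(('', 8080), LinkedDataRedirector).serve_forever()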
There are other ways to avoid service outages and data replication. Google and others have to deal with this problem everyday.
If you want to keep the branding on the identifier you could also do something like this.
http://lod.ipni.org/ipni-org_names_783030-1 <- the entity or concept, 303 redirect to either
http://lod.ipni.org/ipni-org_names_783030-1.html <- human readable page
http://lod.ipni.org/ipni-org_names_783030-1.rdf <- rdf data
Couldn't the free and ubiquitous Google cache provide some caching of these normal URIs?
- Pete
Dear Pete,
On 1 May 2009, at 04:37, Peter DeVries wrote:
This seems to be another example of how the use of LSIDs creates problems and adds costs for data providers.
I'm not sure being hit by the Google bot is due to LSIDs as such. I think the problems of LSIDs lie more with the overhead of fussing with the DNS SRV (in theory trivial, but in practice not), the need for software beyond a web server, and the fact they don't resolve by themselves in browsers without proxies (although this hasn't hindered DOIs becoming widespread).
It would be much more straightforward to adopt the linked data standards and have this data available in a widely supported standard.
Here is one linked data alternative:
http://lod.ipni.org/names/783030-1 <- the entity or concept ... redirects via 303 to either http://lod.ipni.org/names/783030-1.html <- human readable page http://lod.ipni.org/names/783030-1.rdf <- rdf data
See http://linkeddata.org/guides-and-tutorials
Test with this service http://validator.linkeddata.org/vapour
Playing nice with linked data makes sense, but we can do this with appropriate proxies. For example, http://validator.linkeddata.org/vapour?vocabUri=http%3A%2F%2Fbioguid.info%2F...
(if link broken in email try http://tinyurl.com/dkl755 )
Given that LSIDs are in the wild (including the scientific literature), we need to support them (that's the bugger with "persistent" identifiers, once you release them you're stuck with them).
That said, I'm guessing that anybody starting a new data providing service would be well advised to use HTTP URIs with 303 redirects, providing that they got the memo about cool URIs (http://www.w3.org/Provider/Style/URI ).
There are other ways to avoid service outages and data replication. Google and others have to deal with this problem everyday.
If you want to keep the branding on the identifier you could also do something like this.
http://lod.ipni.org/ipni-org_names_783030-1 <- the entity or concept, 303 redirect to either http://lod.ipni.org/ipni-org_names_783030-1.html <- human readable page http://lod.ipni.org/ipni-org_names_783030-1.rdf <- rdf data
Couldn't the free and ubiquitous Google cache provide some caching of these normal URIs?
Firstly, is there any linked data in the Google cache? If the Google bot is harvesting as a web browser, it will get 303 redirects to HTML and not the RDF. I've had a quick look for DBPedia RDF in the cache and haven't found any.
Secondly, how would I get the cached copy? If I'm doing large-scale harvesting, I'll need programmatic access to the cache, and that's not really possible (especially now that Google's SOAP API is deprecated).
Gel jockeys don't expect to have to get GenBank sequences from Google's cache because GenBank keeps falling over, so why do we expect to have to do this? OK, our situation is different because we have distributed data sources, but I'd prefer something like http://www.fak3r.com/2009/04/29/resolving-lsids-wit-url-resolvers-and-couchd...
Regards
Rod
2009/5/1 Roderic Page r.page@bio.gla.ac.uk:
Firstly, is there any linked data in the Google cache? If the Google bot is harvesting as a web browser, it will get 303 redirects to HTML and not the RDF. I've had a quick look for DBPedia RDF in the cache and haven't found any.
The Bio2RDF RDF is harvested and crawled by the Google bot. It seems to understand RDF URI links at least at an elementary stage.
Cheers,
Peter
Hi Rod, I am in favor of CouchDB-based distributed solutions. I just don't see how LSIDs can be justified based on their cost/benefits.
The current LSIDs can still be used, but if any group can easily make the transition to linked data it would be those groups that have already successfully implemented LSIDs.
Without the proxy, the providers can work out a caching solution that works well for them. The TDWG proxy has to cache all LSID requests, not just those for IPNI. It probably caches less of the IPNI data than IPNI would.
Also, a lot of people use simpler crawlers that may not know how to correctly follow LSID proxies.
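Since CouchDB keeps coming up: a rough sketch of how an LSID metadata cache might live in CouchDB (via the couchdb-python package; the database name, document layout, and mirror URL are all made up), with CouchDB's built-in replication doing the mirroring that a single central proxy can't:

import couchdb  # couchdb-python (pip install couchdb)

server = couchdb.Server('http://localhost:5984/')
db = server.create('lsid_cache')  # one document per cached LSID

# Store the fetched metadata keyed by the LSID itself.
db['urn:lsid:ipni.org:names:783030-1'] = {
    'metadata': '<rdf:RDF>...</rdf:RDF>',  # placeholder for the real RDF
    'fetched': '2009-05-01T08:41:00Z',
}

# Any number of mirrors can then pull the whole cache continuously:
server.replicate('lsid_cache',
                 'http://mirror.example.org:5984/lsid_cache')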
My .rdf files are cached by Google
Do a Google Search on:
http://species.geospecies.org/specs/Ochlerotatus_triseriatus.rdf
or
http://species.geospecies.org/specs/Culex_pipiens.rdf
The Google cache is not ideal, but it is an accessible alternative version. They may be open to making it work as a real alternative cache for linked data.
- Pete
I wish I could keep out of this debate but...
The linked data approach is an order of magnitude simpler than LSIDs and very easy to layer on top of an existing LSID authority: you already have an RDF metadata response, so you just need the redirect URL, which can be implemented in the Apache or IIS config or a very simple script.
It doesn't have to be only Google that caches the metadata; it could be GBIF/EoL or some other party interested in caching metadata from biodiversity suppliers. They could even have a submission mechanism. So the whole architecture would go:
1) work out how to get your data into RDF (the tricky bit we should be working on, as Markus points out - this could even be RDFa in a web page - anyone for Dreamweaver templates!!)
2) set up a 303 redirect to the RDF metadata (very easy even on an ISP-hosted domain or corporate intranet - unlike messing with SRV records)
3) tell the world about it (GBIF/EoL can then scrape it and cache it if the license permits - and the license is in the data)
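For step 1, a rough sketch with the Python rdflib library; the Darwin Core terms and the placeholder values are assumptions for illustration, the real vocabulary choices being exactly the tricky bit referred to above. Step 2 then just 303-redirects the extensionless URI to this output.

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, RDFS

DWC = Namespace('http://rs.tdwg.org/dwc/terms/')  # Darwin Core terms

g = Graph()
g.bind('dwc', DWC)
name = URIRef('http://lod.ipni.org/names/783030-1')
g.add((name, RDF.type, DWC.Taxon))
g.add((name, DWC.scientificNameID,
       Literal('urn:lsid:ipni.org:names:783030-1')))
g.add((name, RDFS.label, Literal('placeholder for the published name')))

# Serialise as RDF/XML - this is what the .rdf representation returns.
print(g.serialize(format='xml'))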
This approach is totally modular, distributed, loosely coupled and robust. The data supplier doesn't even need to have a search/browse function themselves; they could just have a submission tool (SiteMap or RSS feed) and allow GBIF or whoever to supply those services on top.
We handle the social side of "URLs just break" by having recommendations for how URLs are designed. How about this one:
10.682772.info/specimen/E002719
Does that look enough like a DOI to keep people happy? I could secure the 682772.info domain for £12.50/year (£125 secures it for at least the next 10 years). This includes free hosting of scripts to do my redirection etc. This is a cheeky example, but I hope it illustrates the point that a well-designed string can also be a URL. I don't include the transport protocol, just as many quotes of DOIs don't include the doi: prefix, and all those adverts on bus stops just have nike.com written on them, not http://www.nike.com
There is plenty of room for innovation around this simple model. This is the most important thing. No strict protocols, just enough to let people add their value. People can develop data hosting and other tools and packages just as GBIF do today.
Now it is a long weekend for nearly everyone I guess. I must stop thinking about identifiers!
All the best,
Roger
On 1 May 2009, at 08:41, Peter DeVries wrote:
Hi Rod,
I am in favor of couchDB based distributed solutions. I just don't see how LSID's can be justified base on their cost/benefits.
The current LSID's can still be used, but if any group can easily make the transition to linked data it would be those groups that have already successfully implemented LSID's.
Without the proxy, the providers can work out a caching solution that works well for them. The TDWG proxy has to cache all lsid requests, not just those for ipini. It probably caches less of the ipini data than ipini would.
Also a lot of people use simpler crawlers that may not know how to correctly follow LSID proxies.
My .rdf files are cached by Google
Do a Google Search on:
http://species.geospecies.org/specs/Ochlerotatus_triseriatus.rdf
or
http://species.geospecies.org/specs/Culex_pipiens.rdf
The Google cache is not ideal, but it is an accessible alternative version. They may be open to making it work as a real alternative cache for linked data.
- Pete
On Fri, May 1, 2009 at 1:08 AM, Roderic Page r.page@bio.gla.ac.uk wrote: Dear Pete,
On 1 May 2009, at 04:37, Peter DeVries wrote:
This seems to be another example of how the use of LSID's creates problems and adds costs for data providers.
I'm not sure being hit by the Google bot is due to LSIDs as such. I think the problems of LSIDs lie more with the overhead of fussing with the DNS SRV (in theory trivial, but in practice not), the need for software beyond a web server, and the fact they don't resolve by themselves in browsers without proxies (although this hasn't hindered DOIs becoming widespread).
It would be much more straight forward to adopt the linked data standards and have this data be available in a widely supported standard.
Here is one linked data alternative:
http://lod.ipni.org/names/783030-1 <- the entity or concept ... redirects via 303 to either http://lod.ipni.org/names/783030-1.html <- human readable page http://lod.ipni.org/names/783030-1.rdf <- rdf data
See http://linkeddata.org/guides-and-tutorials
Test with this service http://validator.linkeddata.org/vapour
Playing nice with linked data makes sense, but we can do this with appropriate proxies. For example, http://validator.linkeddata.org/vapour?vocabUri=http%3A%2F%2Fbioguid.info%2F...
(if link broken in email try http://tinyurl.com/dkl755 )
Given that LSIDs are in the wild (including the scientific literature), we need to support them (that's the bugger with "persistent" identifiers, once you release them you're stuck with them).
That said, I'm guessing that anybody starting a new data providing service would be well advised to use HTTP URIs with 303 redirects, providing that they got the memo about cool URIs (http://www.w3.org/Provider/Style/URI ).
There are other ways to avoid service outages and data replication. Google and others have to deal with this problem everyday.
If you want to keep the branding on the identifier you could also do something like this.
http://lod.ipni.org/ipni-org_names_783030-1 <- the entity or concept, 303 redirect to either http://lod.ipni.org/ipni-org_names_783030-1.html <- human readable page http://lod.ipni.org/ipni-org_names_783030-1.rdf <- rdf data
Couldn't the free and ubiquitous Google cache provide some caching of these normal uri's
Firstly, is there any linked data in the Google cache? If the Google bot is harvesting as a web browser, it will get 303 redirects to HTML and not the RDF. I've had a quick look for DBPedia RDF in the cache and haven't found any.
Secondly, how would I get the cached copy? If I'm doing large-scale harvesting, I'll need programatic access to the cache, and that's not really possible (especially now that Google's SOAP API is deprecated).
Gel jockeys don't expect to have to get GenBank sequences from Google's cache because GenBank keeps falling over, so why do we expect to have to do this? OK, our situation is different because we have distributed data sources, but I'd prefer something like http://www.fak3r.com/2009/04/29/resolving-lsids-wit-url-resolvers-and-couchd...
Regards
Rod
So, can we save TDWG, GBIF, etc. the hassle of meetings and working groups, and just do this!?
TDWG/GBIF/etc. can then focus on providing assistance to make this happen, and offer tools and services to help, such as:
1. A metadata validator to check the vocabulary is OK (the BigDig did this sort of thing for Darwin Core); a rough sketch follows this list.
2. A service to suggest additional identifiers so that we can link stuff together (breaking the silos), e.g., telling the provider that "Edinburgh J. Bot. 66(1): 110 (-113; fig. 2, map). 2009 [Mar 2009]" has the identifier doi:10.1017/S0960428609005320.
3. A service to "augment" provider data (a bit like 2, see http://n2.talis.com/wiki/Augment_Service).
4. A service to monitor availability, providing feedback and assistance to providers that are struggling.
5. A linked data-compliant proxy for existing non-HTTP URI GUIDs that we will always have to deal with (e.g., DOIs for literature, Handles for DSpace repositories, LSIDs that we already have).
6. A caching service (on top of which cool applications can be built to sell the approach and answer the "remind me again, why are we doing this?" question).
7. Other cool stuff...
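As a hedged illustration of item 1, the vocabulary check could start out as small as this (Python with the rdflib parser; the allowed namespace list is purely illustrative, not normative):

    from rdflib import Graph  # third-party RDF parser

    ALLOWED = (
        "http://rs.tdwg.org/ontology/voc/",              # TDWG vocabularies
        "http://purl.org/dc/",                           # Dublin Core
        "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
        "http://www.w3.org/2000/01/rdf-schema#",
    )

    def unknown_predicates(rdf_xml):
        """Return predicates in the record that fall outside the agreed namespaces."""
        g = Graph()
        g.parse(data=rdf_xml, format="xml")
        return sorted({str(p) for _, p, _ in g if not str(p).startswith(ALLOWED)})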
Regards
Rod
On 1 May 2009, at 09:39, Roger Hyam wrote:
I wish I could keep out of this debate but...
The linked data approach is an order of magnitude simpler than LSIDs and very easy to layer on top of an existing LSID authority: you already have an RDF metadata response, so you just need the redirect, which can be implemented in the Apache or IIS config or a very simple script (see the sketch below).
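To make "a very simple script" concrete, a minimal sketch in Python; the lod.example.org URL pattern is invented, and the real template would be whatever the authority already serves:

    from http.server import BaseHTTPRequestHandler, HTTPServer

    class SeeOtherHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            record_id = self.path.rsplit("/", 1)[-1]
            # RDF-aware clients get the data document, browsers get the HTML page
            wants_rdf = "application/rdf+xml" in self.headers.get("Accept", "")
            suffix = ".rdf" if wants_rdf else ".html"
            self.send_response(303)  # See Other: the URI names a thing, not a page
            self.send_header("Location",
                             "http://lod.example.org/names/%s%s" % (record_id, suffix))
            self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("", 8080), SeeOtherHandler).serve_forever()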
It doesn't have to be only Google that caches the metadata; it could be GBIF/EoL or some other party interested in caching metadata from biodiversity suppliers. They could even have a submission mechanism. So the whole architecture would go:
1) Work out how to get your data into RDF (the tricky bit we should be working on, as Markus points out - this could even be RDFa in a web page - anyone for Dreamweaver templates!!)
2) Set up a 303 redirect to the RDF metadata (very easy, even on an ISP-hosted domain or corporate intranet - unlike messing with SRV records)
3) Tell the world about it (GBIF/EoL can then scrape it and cache it if the license permits - and the license is in the data; a harvesting sketch follows below)
This approach is totally modular, distributed, loosely coupled and robust. The data supplier doesn't even need to have a search/browse function themselves; they could just have a submission tool (SiteMap or RSS feed) and allow GBIF or whoever to supply those services on top.
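The aggregator's side of step 3 might look something like this sketch (the URL is invented, and the licence check assumes the data uses dcterms:license):

    import urllib.request
    from rdflib import Graph, URIRef

    LICENSE = URIRef("http://purl.org/dc/terms/license")

    def harvest(uri):
        req = urllib.request.Request(uri, headers={"Accept": "application/rdf+xml"})
        with urllib.request.urlopen(req) as resp:  # the 303 is followed automatically
            g = Graph()
            g.parse(data=resp.read().decode("utf-8"), format="xml")
        # only cache records whose license is stated in the data itself
        return g if list(g.objects(predicate=LICENSE)) else None

    # graph = harvest("http://lod.example.org/names/783030-1")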
We handle the social side of "URLs just break" by having recommendations for how URLs are designed. How about this one:
10.682772.info/specimen/E002719
Does that look enough like a DOI to keep people happy? I could secure the 682772.info domain for £12.50/year (£125 secures it for at least the next 10 years). This includes free hosting of scripts to do my redirection etc. This is a cheeky example, but I hope it illustrates the point that a well designed string can also be a URL. I don't include the transport protocol, just as many quotes of DOIs don't include the doi: prefix, and all those adverts on bus stops just have nike.com written on them, not http://www.nike.com.
There is plenty of room for innovation around this simple model. This is the most important thing: no strict protocols, just enough to let people add their value. People can develop data hosting and other tools and packages, just as GBIF do today.
Now it is a long weekend for nearly everyone I guess. I must stop thinking about identifiers!
All the best,
Roger
Roderic Page
Professor of Taxonomy
DEEB, FBLS, Graham Kerr Building
University of Glasgow
Glasgow G12 8QQ, UK
Email: r.page@bio.gla.ac.uk
Tel: +44 141 330 4778
Fax: +44 141 330 2792
AIM: rodpage1962@aim.com
Facebook: http://www.facebook.com/profile.php?id=1112517192
Twitter: http://twitter.com/rdmpage
Blog: http://iphylo.blogspot.com
Home page: http://taxonomy.zoology.gla.ac.uk/rod/rod.html
Yay!
On 1 May 2009, at 10:15, Roger Hyam wrote:
Yes.
On 1 May 2009, at 10:07, Roderic Page wrote:
So, can we save TDWG, GBIF, etc. the hassle of meetings and working groups, etc., and just do this!?
Dear Pete,
On 1 May 2009, at 08:41, Peter DeVries wrote:
Hi Rod,
I am in favor of CouchDB-based distributed solutions. I just don't see how LSIDs can be justified based on their cost/benefits.
I think it is fair to say that some of LSIDs' benefits are historical, in that as a result of their adoption, most if not all major nomenclators and name databases are serving RDF, many using the same vocabulary. We could have the same result with HTTP URIs, but I suspect we wouldn't have. LSIDs pretty much killed off big XML schemas for transferring name information (a good thing). They also forced the providers to have clean URIs (in the sense that we can access an IPNI record with urn:lsid:ipni.org:names:77096980-1, instead of http://www.ipni.org:80/ipni/idPlantNameSearch.do?id=77096980-1&back_page...). Again, I doubt this would have happened without LSIDs.
So I think they should be viewed as one route to the Semantic Web. Now that we are pretty much there, the key question is whether we should mint new ones. If people had clean HTTP URIs that conformed to linked data standards, then it's hard to see the point...
The current LSIDs can still be used, but if any group can easily make the transition to linked data, it would be those groups that have already successfully implemented LSIDs.
Yes, although they might well say "but last time you said LSIDs were the new hotness, and we did that, and now we've other things to do -- make up your minds, please".
Given that LSIDs need an HTTP proxy, if we have a linked-data-compliant proxy, the LSIDs are pretty much covered (a sketch of what that might look like follows). What we really want is to get new providers on board.
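A sketch of the core of such a proxy: the point is that it answers 303 and hands the crawler back to the provider, instead of fetching the metadata itself and answering 200. The endpoint table is a made-up stand-in for real authority registration:

    from http.server import BaseHTTPRequestHandler, HTTPServer

    ENDPOINTS = {  # authority -> URL template, illustrative values only
        "ipni.org": "http://lod.example.org/names/%s",
    }

    class LsidProxyHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            parts = self.path.lstrip("/").split(":")  # urn:lsid:authority:namespace:object
            if len(parts) < 5 or parts[:2] != ["urn", "lsid"]:
                self.send_error(400, "not an LSID")
                return
            template = ENDPOINTS.get(parts[2])
            if template is None:
                self.send_error(404, "unknown authority")
                return
            self.send_response(303)  # redirect: the provider's own crawl limits now apply
            self.send_header("Location", template % parts[4])
            self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("", 8080), LsidProxyHandler).serve_forever()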
Without the proxy, the providers can work out a caching solution that works well for them. The TDWG proxy has to cache all LSID requests, not just those for IPNI, so it probably caches less of the IPNI data than IPNI itself would.
Also, a lot of people use simpler crawlers that may not know how to follow LSID proxies correctly.
Yes, but the linked data crowd are aware of them, and there are tools that can handle LSIDs.
My .rdf files are cached by Google.
Do a Google search on:
http://species.geospecies.org/specs/Ochlerotatus_triseriatus.rdf
or
http://species.geospecies.org/specs/Culex_pipiens.rdf
The Google cache is not ideal, but it is an accessible alternative version. Google may be open to making it work as a real alternative cache for linked data.
That's great, and will be useful, although it won't help crawlers unless there's programmatic access to the Google cache. If your site goes down for, say, a day, and I want 10,000 records, I won't be clicking through Google's web site.
But I take your point. Let's assume that one or more global linked data caches will emerge, as well as data dumps in various places. I still think a service that provided reasonable guarantees of availability would be an asset, plus we'd get discovery thrown in for free.
Regards
Rod
participants (13)
- Bob Morris
- Dave Vieglais
- Kevin Richards
- Nicola Nicolson
- Peter Ansell
- Peter DeVries
- phil.cryer@mobot.org
- Richard Pyle
- Richardson, Ben
- Robert Huber
- Roderic Page
- Roger Hyam
- Tim Robertson