Dear Pete,



On 1 May 2009, at 08:41, Peter DeVries wrote:

Hi Rod,

I am in favor of CouchDB-based distributed solutions. I just don't see how LSIDs can
be justified based on their cost/benefits.

I think it is fair to say that some of LSIDs' benefits are historical, in that as a result of their adoption, most if not all major nomenclators and name databases are serving RDF, many using the same vocabulary. We could have had the same result with HTTP URIs, but I suspect we wouldn't have. LSIDs pretty much killed off big XML schemas for transferring name information (a good thing). They also forced the providers to have clean URIs (in the sense that we can access an IPNI record with urn:lsid:ipni.org:names:77096980-1, instead of http://www.ipni.org:80/ipni/idPlantNameSearch.do?id=77096980-1&back_page=%2Fipni%2FeditSimplePlantNameSearch.do%3Ffind_wholeName%3DBegonia%2Bhekensis%26output_format%3Dnormal). Again, I doubt this would have happened without LSIDs.

So I think they should be viewed as one route to the Semantic Web. Now that we are pretty much there, the key question is whether we should mint new ones. If people had clean HTTP URIs that conformed to linked data standards, then it's hard to see the point...


The current LSIDs can still be used, but if any group can easily make the transition to linked data it would be
those groups that have already successfully implemented LSIDs.

Yes, although they might well say "but last time you said LSIDs were the new hotness, and we did that, and now we've other things to do -- make up your minds, please". 

Given that LSIDs need an HTTP proxy, if we have a linked-data-compliant proxy, the LSIDs are pretty much covered. What we really want is to get new providers on board.


Without the proxy, the providers can work out a caching solution that works well for them. The TDWG proxy has
to cache all LSID requests, not just those for IPNI. It probably caches less of the IPNI data than IPNI would.

Also a lot of people use simpler crawlers that may not know how to correctly follow LSID proxies.

Yes, but the linked data crowd are aware of them, and there are tools that can handle LSIDs.



My .rdf files are cached by Google

Do a Google Search on:

http://species.geospecies.org/specs/Ochlerotatus_triseriatus.rdf

or

http://species.geospecies.org/specs/Culex_pipiens.rdf

The Google cache is not ideal, but it is an accessible alternative version. They may be open to making it work
as a real alternative cache for linked data.

That's great, and will be useful, although it won't help crawlers unless there's programmatic access to the Google cache. If your site goes down for, say, a day, and I want 10,000 records, I won't be clicking through Google's web site.

But I take your point. Let's assume that one or more global linked data caches will emerge, as well as data dumps in various places. I still think a service that provided reasonable guarantees of availability would be an asset, plus we'd get discovery thrown in for free.

Regards

Rod



- Pete


On Fri, May 1, 2009 at 1:08 AM, Roderic Page <r.page@bio.gla.ac.uk> wrote:
Dear Pete,

On 1 May 2009, at 04:37, Peter DeVries wrote:

This seems to be another example of how the use of LSIDs creates problems and adds
costs for data providers.

I'm not sure being hit by the Google bot is due to LSIDs as such. I think the problems of LSIDs lie more with the overhead of fussing with the DNS SRV (in theory trivial, but in practice not), the need for software beyond a web server, and the fact they don't resolve by themselves in browsers without proxies (although this hasn't hindered DOIs becoming widespread).




It would be much more straightforward to adopt the linked data standards and have this data
be available in a widely supported standard.

Here is one linked data alternative:

http://lod.ipni.org/names/783030-1         <- the entity or concept; redirects via 303 to either a human-readable page or the RDF data

See 
http://linkeddata.org/guides-and-tutorials

Test with this service



(if the link is broken in the email, try http://tinyurl.com/dkl755)

Given that LSIDs are in the wild (including the scientific literature), we need to support them (that's the bugger with "persistent" identifiers, once you release them you're stuck with them).

That said, I'm guessing that anybody starting a new data providing service would be well advised to use HTTP URIs with 303 redirects, providing that they got the memo about cool URIs (http://www.w3.org/Provider/Style/URI).



There are other ways to avoid service outages and data replication.
Google and others have to deal with this problem everyday.

If you want to keep the branding on the identifier you could also do something like this.

http://lod.ipni.org/ipni-org_names_783030-1         <- the entity or concept, 303 redirect to either
http://lod.ipni.org/ipni-org_names_783030-1.html  <- human readable page
http://lod.ipni.org/ipni-org_names_783030-1.rdf    <- RDF data
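As a rough sketch of the 303 pattern being described (the lod.ipni.org paths are just the examples above; the function itself is hypothetical, not a real IPNI service), the resolver's content negotiation might look like:

```python
# Hypothetical sketch of 303-based content negotiation for the
# identifier pattern above. Not a real IPNI/lod.ipni.org API.

def resolve(identifier, accept):
    """Return (status, Location) for a request to the bare identifier URI.

    A client asking for RDF is 303-redirected to the .rdf document;
    anything else gets the human-readable .html page.
    """
    base = "http://lod.ipni.org/" + identifier
    if "application/rdf+xml" in accept:
        return 303, base + ".rdf"
    return 303, base + ".html"

# A semantic web client and a browser get different representations:
status, location = resolve("ipni-org_names_783030-1", "application/rdf+xml")
print(status, location)
```

Either way the redirect target stays on the provider's own domain, so there is no proxy sitting between the client and the data.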

Couldn't the free and ubiquitous Google cache provide some caching of these normal URIs?

Firstly, is there any linked data in the Google cache? If the Google bot is harvesting as a web browser, it will get 303 redirects to HTML and not the RDF. I've had a quick look for DBpedia RDF in the cache and haven't found any.

Secondly, how would I get the cached copy? If I'm doing large-scale harvesting, I'll need programmatic access to the cache, and that's not really possible (especially now that Google's SOAP API is deprecated).

Gel jockeys don't expect to have to get GenBank sequences from Google's cache because GenBank keeps falling over, so why do we expect to have to do this? OK, our situation is different because we have distributed data sources, but I'd prefer something like http://www.fak3r.com/2009/04/29/resolving-lsids-wit-url-resolvers-and-couchdb

Regards

Rod





- Pete

On Mon, Apr 27, 2009 at 7:54 AM, Nicola Nicolson <n.nicolson@rbgkew.org.uk> wrote:

Hi,

 

Further to my last design question re LSID HTTP proxies (thanks for the responses), I wanted to raise the issue of HTTP LSID proxies and crawlers, in particular the crawl delay part of the robots exclusion protocol.

 

I'll outline a situation we had recently:

 

The GBIF portal and ZipCodeZoo site both include IPNI LSIDs in their pages. These are presented in their proxied form using the TDWG LSID resolver (e.g. http://lsid.tdwg.org/urn:lsid:ipni.org:names:783030-1). Using the TDWG resolver to access the data for an IPNI LSID does not issue any kind of HTTP redirect; instead the web resolver uses the LSID resolution steps to get the data and presents it in its own response (i.e. returning an HTTP 200 OK response).

 

The problem happens when one of these sites that includes proxied IPNI LSIDs is crawled by a search engine. The proxied links appear to belong to tdwg.org, so whatever crawl delay is agreed between TDWG and the crawler in question is used. The crawler has no knowledge that behind the scenes the TDWG resolver is hitting ipni.org. We (ipni.org) have agreed our own crawl limits with the major search engines using directives in robots.txt, and have directly agreed limits with Google (who don't use the robots.txt directly).

 

On a couple of occasions in the past we have had to deny access to the TDWG LSID resolver, as it has been responsible for far more traffic than we can support (up to 10 times the crawl limits we have agreed with search engine bots). This is due to the pages on the GBIF portal and/or ZipCodeZoo being crawled by a search engine, which in turn triggers a high volume of requests from TDWG to IPNI. The crawler itself has no knowledge that it is in effect accessing data held at ipni.org rather than tdwg.org, as the HTTP response is HTTP 200.
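Incidentally, a gateway could at least read a provider's declared crawl limits before relaying traffic; Python's standard library exposes the Crawl-delay directive directly (the robots.txt content below is invented for illustration):

```python
# Sketch: reading a provider's robots.txt crawl limits, so a gateway
# could throttle itself instead of relaying crawler traffic blindly.
# The robots.txt content here is invented for illustration.
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse("""\
User-agent: *
Crawl-delay: 10
Disallow: /admin/
""".splitlines())

delay = rp.crawl_delay("*")   # seconds between requests for all agents
ok = rp.can_fetch("*", "http://www.ipni.org/ipni/idPlantNameSearch.do?id=783030-1")
print(delay, ok)
```

This doesn't solve the underlying attribution problem (the crawler still thinks it is talking to tdwg.org), but it would stop the proxy from multiplying traffic beyond what the provider has agreed to.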

 

One of Rod's emails recently mentioned that we need a resolver to act like a tinyurl or bit.ly. I have pasted below the HTTP headers for an HTTP request to the TDWG LSID resolver, and to tinyurl/bit.ly. To the end user it looks as though tdwg.org is the true location of the LSID resource, whereas tinyurl and bit.ly both just redirect traffic.

 

I'm just posting this for discussion really: if we are to mandate use of a web-based HTTP resolver/proxy, it should really issue 30x redirects so that established crawl delays between producer and consumer will be respected. The alternative would be for the HTTP resolver to read and process the directives in robots.txt, but this would be difficult to implement as it is not in itself a crawler, just a gateway.
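To make the suggested redirecting behaviour concrete, here is a minimal sketch of a resolver that answers with a redirect instead of fetching the metadata itself. The authority-to-URL template table is hypothetical (a real deployment would need an entry per registered authority, and the IPNI URL is modelled on the tinyurl/bit.ly targets shown below):

```python
# Sketch of a redirect-issuing LSID resolver: instead of fetching the
# metadata and returning 200 (as the TDWG proxy does), it maps an LSID
# to a provider URL and answers with a redirect, so crawlers see the
# provider's own domain and its crawl limits apply.
# The template table below is hypothetical.

LSID_TEMPLATES = {
    "ipni.org": ("http://www.ipni.org/ipni/plantNameByVersion.do"
                 "?id={id}&output_format=lsid-metadata"),
}

def redirect_for(lsid):
    """Return (status, Location) for a proxied LSID, or None if unknown."""
    parts = lsid.split(":")  # urn:lsid:authority:namespace:object[:revision]
    if len(parts) < 5 or parts[0] != "urn" or parts[1] != "lsid":
        return None
    authority, object_id = parts[2], parts[4]
    template = LSID_TEMPLATES.get(authority)
    if template is None:
        return None
    return 303, template.format(id=object_id)

print(redirect_for("urn:lsid:ipni.org:names:783030-1"))
```

Whether the redirect should be 301, 302, or 303 is a separate question; the point is simply that any 30x response hands the request back to the crawler, so the established producer/consumer crawl delays take effect.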

 

I'm sure that if proxied forms of LSIDs become more prevalent this problem will become more widespread, so now - with the on-going attempt to define what services a GUID resolver should provide - might be a good time to plan how to fix this.

 

cheers,
Nicky


[nn00kg@kvstage01 ~]$ curl -I http://lsid.tdwg.org/urn:lsid:ipni.org:names:783030-1
HTTP/1.1 200 OK
Via: 1.1 KISA01
Connection: close
Proxy-Connection: close
Date: Mon, 27 Apr 2009 11:41:55 GMT
Content-Type: application/xml
Server: Apache/2.2.3 (CentOS)

 

[nn00kg@kvstage01 ~]$ curl -I http://tinyurl.com/czkquy
HTTP/1.1 301 Moved Permanently
Via: 1.1 KISA01
Connection: close
Proxy-Connection: close
Date: Mon, 27 Apr 2009 12:16:38 GMT
Location:
http://www.ipni.org/ipni/plantNameByVersion.do?id=783030-1&version=1.4&output_format=lsid-metadata&show_history=true
Content-type: text/html
Server: TinyURL/1.6
X-Powered-By: PHP/5.2.9

 

[nn00kg@kvstage01 ~]$ curl -I http://bit.ly/KO1Ko
HTTP/1.1 301 Moved Permanently
Via: 1.1 KISA01
Connection: Keep-Alive
Proxy-Connection: Keep-Alive
Content-Length: 287
Date: Mon, 27 Apr 2009 12:19:48 GMT
Location:
http://www.ipni.org/ipni/plantNameByVersion.do?id=783030-1&version=1.4&output_format=lsid-metadata&show_history=true
Content-Type: text/html;charset=utf-8
Server: nginx/0.7.42
Allow: GET, HEAD, POST




- Nicola Nicolson
- Science Applications Development,
- Royal Botanic Gardens, Kew,
- Richmond, Surrey, TW9 3AB, UK
- email: n.nicolson@rbgkew.org.uk
- phone: 020-8332-5766


_______________________________________________
tdwg-tag mailing list
tdwg-tag@lists.tdwg.org
http://lists.tdwg.org/mailman/listinfo/tdwg-tag




--
---------------------------------------------------------------
Pete DeVries
Department of Entomology
University of Wisconsin - Madison
445 Russell Laboratories
1630 Linden Drive
Madison, WI 53706
------------------------------------------------------------

---------------------------------------------------------
Roderic Page
Professor of Taxonomy
DEEB, FBLS
Graham Kerr Building
University of Glasgow
Glasgow G12 8QQ, UK

Tel: +44 141 330 4778
Fax: +44 141 330 2792









--
---------------------------------------------------------------
Pete DeVries
Department of Entomology
University of Wisconsin - Madison
445 Russell Laboratories
1630 Linden Drive
Madison, WI 53706
------------------------------------------------------------

---------------------------------------------------------
Roderic Page
Professor of Taxonomy
DEEB, FBLS
Graham Kerr Building
University of Glasgow
Glasgow G12 8QQ, UK

Tel: +44 141 330 4778
Fax: +44 141 330 2792