<html><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; "><div><br></div><div>I wish I could keep out of this debate but...</div><div><br></div><div>The linked data approach is an order of magnitude simpler than LSID and very easy to layer on top of an existing LSID authority: you already have an RDF metadata response, so you just need the redirect URL, which can be implemented in the Apache or IIS config or with a very simple script.</div><div><br></div><div>It doesn't have to be only Google that caches the metadata; it could be GBIF/EoL or some other party that is interested in caching metadata from biodiversity suppliers. They could even have a submission mechanism. So the whole architecture would go:</div><div><br></div><div>1) work out how to get your data into RDF (the tricky bit, which we should be working on, as Markus points out - this could even be RDFa in a web page - anyone for Dreamweaver templates!!)</div><div>2) set up a 303 redirect to the RDF metadata (very easy even on an ISP-hosted domain or corporate internet - unlike messing with SRV records)</div><div>3) tell the world about it (GBIF/EoL can then scrape it and cache it if the license permits - and the license is in the data)</div><div><br></div><div>This approach is totally modular, distributed, loosely coupled and robust. The data supplier doesn't even need to have a search/browse function themselves; they could just have a submission tool (SiteMap or RSS feed) and allow GBIF or whoever to supply those services on top.</div><div><br></div><div>We handle the social side of "URLs just break" by having recommendations for how URLs are designed. How about this one:</div><div><br></div><div>10.682772.info/specimen/E002719&nbsp;</div><div><br></div><div>Does that look enough like a DOI to keep people happy? I could secure the 682772.info domain for &pound;12.50/year (&pound;125 secures it for at least the next 10 years). This includes free hosting of scripts to do my redirection etc. 
This is a&nbsp;cheeky example but I hope it illustrates the point that a well-designed string can also be a URL.&nbsp;&nbsp;I don't include the transport protocol, just as many citations of DOIs don't include the doi: prefix and all those adverts on the bus stops just have nike.com written on them, not <a href="http://www.nike.com">http://www.nike.com</a>&nbsp;</div><div><br></div><div>There is plenty of room for innovation around this simple model. This is the most important thing. No strict protocols, just enough to let people add their value. People can develop data hosting and other tools and packages just as GBIF do today.</div><div><br></div><div>Now it is a long weekend for nearly everyone I guess. I must stop thinking about identifiers!</div><div><br></div><div>All the best,</div><div><br></div><div>Roger</div><div><br></div><div><br></div><div><div><br><div><div>On 1 May 2009, at 08:41, Peter DeVries wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite">Hi Rod,<div><br></div><div>I am in favor of CouchDB-based distributed solutions. I just don't see how LSIDs can</div><div>be justified based on their cost/benefits.</div><div><br></div><div>The current LSIDs can still be used, but&nbsp;if any group can easily make the transition to linked data it would be</div> <div>those groups that have already successfully implemented LSIDs.</div><div><br></div><div>Without the proxy, the providers can work out a caching solution that works well for them. The TDWG proxy has</div><div>to cache all LSID requests, not just those for IPNI. 
It probably caches less of the IPNI data than IPNI would.</div> <div><br></div><div>Also, a lot of people use simpler crawlers that may not know how to correctly follow LSID proxies.</div><div><br></div><div>My .rdf files are cached by Google.</div><div><br></div><div>Do a Google Search on:</div> <div><br></div><div><a href="http://species.geospecies.org/specs/Ochlerotatus_triseriatus.rdf">http://species.geospecies.org/specs/Ochlerotatus_triseriatus.rdf</a></div><div><br></div><div>or</div><div><br></div><div><a href="http://species.geospecies.org/specs/Culex_pipiens.rdf">http://species.geospecies.org/specs/Culex_pipiens.rdf</a><br> </div><div><br></div><div>The Google cache is not ideal, but it is an accessible alternative version. They may be open to making it work</div><div>as a real alternative cache for linked data.</div><div><br></div><div>- Pete</div> <div><br><br><div class="gmail_quote">On Fri, May 1, 2009 at 1:08 AM, Roderic Page <span dir="ltr">&lt;<a href="mailto:r.page@bio.gla.ac.uk">r.page@bio.gla.ac.uk</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"> <div style="word-wrap:break-word">Dear Pete,<div><br><div><div class="im"><div>On 1 May 2009, at 04:37, Peter DeVries wrote:</div><br><blockquote type="cite">This seems to be another example of how the use of LSIDs creates problems and adds<div> costs for data providers.</div></blockquote><div><br></div></div><div>I'm not sure being hit by the Google bot is due to LSIDs as such. 
I think the problems of LSIDs lie more with the overhead of fussing with the DNS SRV records (in theory trivial, but in practice not), the need for software beyond a web server, and the fact that they don't resolve by themselves in browsers without proxies (although this hasn't hindered DOIs becoming widespread).</div> <div class="im"><div><br></div><br><blockquote type="cite"><div><br><div><br></div><div>It would be much more straightforward to adopt the linked data standards and have this data</div> <div>be available in a widely supported standard.</div> <div><br></div><div>Here is one linked data alternative:</div><div><br></div><div><div><a href="http://lod.ipni.org/names/783030-1" target="_blank">http://lod.ipni.org/names/783030-1</a> &nbsp; &nbsp; &nbsp; &nbsp; &lt;- the entity or concept ... redirects via 303 to either</div> <div><a href="http://lod.ipni.org/names/783030-1.html" target="_blank">http://lod.ipni.org/names/783030-1.html</a> &lt;- human-readable page</div><div><a href="http://lod.ipni.org/names/783030-1.rdf" target="_blank">http://lod.ipni.org/names/783030-1.rdf</a> &nbsp; &nbsp;&lt;- RDF data</div> <div><br></div><div>See&nbsp;</div><a href="http://linkeddata.org/guides-and-tutorials" target="_blank">http://linkeddata.org/guides-and-tutorials</a></div><div><br></div><div>Test with this service</div><div><a href="http://validator.linkeddata.org/vapour" target="_blank">http://validator.linkeddata.org/vapour</a></div> </div></blockquote><div><br></div><div><br></div></div><div>Playing nice with linked data makes sense, but we can do this with appropriate proxies. 
For example,&nbsp;<a href="http://validator.linkeddata.org/vapour?vocabUri=http%3A%2F%2Fbioguid.info%2Furn%3Alsid%3Aipni.org%3Anames%3A138762-3&amp;classUri=http%3A%2F%2F&amp;propertyUri=http%3A%2F%2F&amp;instanceUri=http%3A%2F%2F&amp;defaultResponse=dontmind&amp;userAgent=vapour.sourceforge.net" target="_blank">http://validator.linkeddata.org/vapour?vocabUri=http%3A%2F%2Fbioguid.info%2Furn%3Alsid%3Aipni.org%3Anames%3A138762-3&amp;classUri=http%3A%2F%2F&amp;propertyUri=http%3A%2F%2F&amp;instanceUri=http%3A%2F%2F&amp;defaultResponse=dontmind&amp;userAgent=vapour.sourceforge.net</a></div> <div><br></div><div>(if the link is broken in the email, try&nbsp;<a href="http://tinyurl.com/dkl755" target="_blank">http://tinyurl.com/dkl755</a> )</div><div><br></div><div>Given that LSIDs are in the wild (including the scientific literature), we need to support them (that's the bugger with "persistent" identifiers: once you release them you're stuck with them).</div> <div><br></div><div>That said, I'm guessing that anybody starting a new data-providing service would be well advised to use HTTP URIs with 303 redirects, providing that they got the memo about cool URIs (<a href="http://www.w3.org/Provider/Style/URI" target="_blank">http://www.w3.org/Provider/Style/URI</a>).</div> <div class="im"><div><br></div><br><blockquote type="cite"><div> <div><br></div><div>There are other ways to avoid service outages and data replication.</div><div>Google and others have to deal with this&nbsp;problem every day.</div> <div><br></div><div>If you want to keep the branding on the identifier you could also do something like this.</div> <div><br></div><div><div><a href="http://lod.ipni.org/" target="_blank">http://lod.ipni.org/</a><font color="#330099">ipni-org_names_783030-1</font>&nbsp;&nbsp; &nbsp; &nbsp; &nbsp; &lt;- the entity or concept, 303 redirect to either</div> <div><a href="http://lod.ipni.org/" target="_blank">http://lod.ipni.org/</a><font color="#330099">ipni-org_names_783030-1</font>.html 
&nbsp;&lt;- human-readable page</div><div><a href="http://lod.ipni.org/" target="_blank">http://lod.ipni.org/</a><font color="#330099">ipni-org_names_783030-1</font>.rdf &nbsp; &nbsp;&lt;- RDF data&nbsp;</div> <div><br></div><div>Couldn't the free and ubiquitous Google cache provide some caching of these normal URIs?</div></div></div></blockquote><div><br></div></div><div>Firstly, is there any linked data in the Google cache? If the Google bot is harvesting as a web browser, it will get 303 redirects to HTML and not the RDF. I've had a quick look for DBpedia RDF in the cache and haven't found any.</div> <div><br></div><div>Secondly, how would I get the cached copy? If I'm doing large-scale harvesting, I'll need programmatic access to the cache, and that's not really possible (especially now that Google's SOAP API is deprecated).</div> <div><br></div><div>Gel jockeys don't expect to have to get GenBank sequences from Google's cache because GenBank keeps falling over, so why do we expect to have to do this? 
OK, our situation is different because we have distributed data sources, but I'd prefer something like&nbsp;<a href="http://www.fak3r.com/2009/04/29/resolving-lsids-wit-url-resolvers-and-couchdb" target="_blank">http://www.fak3r.com/2009/04/29/resolving-lsids-wit-url-resolvers-and-couchdb</a></div> <div><br></div><div>Regards</div><div><br></div><div>Rod</div><div><div></div><div class="h5"><div><br></div><div><br></div><div><br></div><br><blockquote type="cite"><div><div><div><br></div><div>- Pete</div><div><br></div> <div class="gmail_quote">On Mon, Apr 27, 2009 at 7:54 AM, Nicola Nicolson <span dir="ltr">&lt;<a href="mailto:n.nicolson@rbgkew.org.uk" target="_blank">n.nicolson@rbgkew.org.uk</a>></span> wrote:<br> <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> <div><p><font face="Courier New" size="2">Hi,</font></p><div><font face="Courier New" size="2"></font>&nbsp;<br></div><p><font face="Courier New" size="2">Further to my last design question re LSID HTTP proxies (thanks for the responses), I wanted to raise the issue of HTTP LSID proxies and crawlers, in particular the crawl delay part of the robots exclusion protocol.</font></p> <div><font face="Courier New" size="2"></font>&nbsp;<br></div><p><font face="Courier New" size="2">I'll outline a situation we had recently:</font></p><div><font face="Courier New" size="2"></font>&nbsp;<br></div><p><font face="Courier New" size="2">The GBIF portal and ZipCodeZoo site both include IPNI LSIDs in their pages. These are presented in their proxied form using the TDWG LSID resolver (e.g. </font><a href="http://lsid.tdwg.org/urn:lsid:ipni.org:names:783030-1" target="_blank"><font face="Courier New" size="2">http://lsid.tdwg.org/urn:lsid:ipni.org:names:783030-1</font></a><font face="Courier New" size="2">). 
The TDWG resolver does not issue any kind of HTTP redirect when it is used to access the data for an IPNI LSID; instead, the web resolver uses the LSID resolution steps to get the data and presents it in its own response (i.e. returning an HTTP 200 OK response).</font></p> <div><font face="Courier New" size="2"></font>&nbsp;<br></div><p><font face="Courier New" size="2">The problem happens when one of these sites that includes proxied IPNI LSIDs is crawled by a search engine. The proxied links appear to belong to <a href="http://tdwg.org" target="_blank">tdwg.org</a>, so whatever crawl delay is agreed between TDWG and the crawler in question is used. The crawler has no knowledge that behind the scenes the TDWG resolver is hitting <a href="http://ipni.org" target="_blank">ipni.org</a>. We (<a href="http://ipni.org" target="_blank">ipni.org</a>) have agreed our own crawl limits with the major search engines using directives in robots.txt, and have agreed limits directly with Google (who don't use the robots.txt directly).</font></p> <div><font face="Courier New" size="2"></font>&nbsp;<br></div><p><font face="Courier New" size="2">On a couple of occasions in the past we have had to deny access to the TDWG LSID resolver as it has been responsible for far more traffic than we can support (up to&nbsp;10 times the crawl limits we have agreed with search engine bots) - this is due to the pages on the GBIF portal and/or ZipCodeZoo being crawled by a search engine, which in turn triggers a high volume of requests from TDWG to IPNI. 
The crawler itself has no knowledge that it is in effect accessing data held at <a href="http://ipni.org" target="_blank">ipni.org</a> rather than <a href="http://tdwg.org" target="_blank">tdwg.org</a>, as the HTTP response is HTTP 200.</font></p> <div><font face="Courier New" size="2"></font>&nbsp;<br></div><p><font face="Courier New" size="2">One of Rod's emails recently mentioned that we need a resolver to act like tinyurl or <a href="http://bit.ly" target="_blank">bit.ly</a>. I have pasted below the HTTP headers for an HTTP request to the TDWG LSID resolver, and to tinyurl / <a href="http://bit.ly" target="_blank">bit.ly</a>. To the end user it looks as though <a href="http://tdwg.org" target="_blank">tdwg.org</a> is the true location of the LSID resource, whereas tinyurl and bit.ly both just redirect traffic.</font></p> <div><font face="Courier New" size="2"></font>&nbsp;<br></div><p><font face="Courier New" size="2">I'm just posting this for discussion really - if we are to mandate use of a web-based HTTP resolver/proxy, it should really issue 30* redirects so that established crawl delays between producer and consumer will be used. 
The alternative would be for the HTTP resolver to read and process the directives in robots.txt, but this would be difficult to implement as it is not in itself a crawler, just a gateway.</font></p> <div><font face="Courier New" size="2"></font>&nbsp;<br></div><p><font face="Courier New" size="2">I'm sure that if proxied forms of LSIDs become more prevalent this problem will become more widespread, so now - with the on-going attempt to define what services a GUID resolver should provide -&nbsp;might be a good time to plan how to fix this.</font></p> <div><font face="Courier New" size="2"></font>&nbsp;<br></div><p><font face="Courier New" size="2">cheers,<br> Nicky</font></p><p><br> <font face="Courier New" size="2">[nn00kg@kvstage01 ~]$ curl -I </font><a href="http://lsid.tdwg.org/urn:lsid:ipni.org:names:783030-1" target="_blank"><font face="Courier New" size="2">http://lsid.tdwg.org/urn:lsid:ipni.org:names:783030-1</font></a><br> <font face="Courier New" size="2">HTTP/1.1 200 OK<br> Via: 1.1 KISA01<br> Connection: close<br> Proxy-Connection: close<br> Date: Mon, 27 Apr 2009 11:41:55 GMT<br> Content-Type: application/xml<br> Server: Apache/2.2.3 (CentOS)</font></p> <div><font face="Courier New" size="2"></font>&nbsp;<br></div><p><font face="Courier New" size="2">[nn00kg@kvstage01 ~]$ curl -I </font><a href="http://tinyurl.com/czkquy" target="_blank"><font face="Courier New" size="2">http://tinyurl.com/czkquy</font></a><br> <font face="Courier New" size="2">HTTP/1.1 301 Moved Permanently<br> Via: 1.1 KISA01<br> Connection: close<br> Proxy-Connection: close<br> Date: Mon, 27 Apr 2009 12:16:38 GMT<br> Location: </font><a href="http://www.ipni.org/ipni/plantNameByVersion.do?id=783030-1&amp;version=1.4&amp;output_format=lsid-metadata&amp;show_history=true" target="_blank"><font face="Courier New" size="2">http://www.ipni.org/ipni/plantNameByVersion.do?id=783030-1&amp;version=1.4&amp;output_format=lsid-metadata&amp;show_history=true</font></a><br> <font face="Courier 
New" size="2">Content-type: text/html<br> Server: TinyURL/1.6<br> X-Powered-By: PHP/5.2.9</font></p><div><font face="Courier New" size="2"></font>&nbsp;<br></div><p><font face="Courier New" size="2">[nn00kg@kvstage01 ~]$ curl -I </font><a href="http://bit.ly/KO1Ko" target="_blank"><font face="Courier New" size="2">http://bit.ly/KO1Ko</font></a><br> <font face="Courier New" size="2">HTTP/1.1 301 Moved Permanently<br> Via: 1.1 KISA01<br> Connection: Keep-Alive<br> Proxy-Connection: Keep-Alive<br> Content-Length: 287<br> Date: Mon, 27 Apr 2009 12:19:48 GMT<br> Location: </font><a href="http://www.ipni.org/ipni/plantNameByVersion.do?id=783030-1&amp;version=1.4&amp;output_format=lsid-metadata&amp;show_history=true" target="_blank"><font face="Courier New" size="2">http://www.ipni.org/ipni/plantNameByVersion.do?id=783030-1&amp;version=1.4&amp;output_format=lsid-metadata&amp;show_history=true</font></a><br> <font face="Courier New" size="2">Content-Type: text/html;charset=utf-8<br> Server: nginx/0.7.42<br> Allow: GET, HEAD, POST</font></p><p><br><font color="#888888"> <br> <br> <font face="Courier New" size="2">- Nicola Nicolson<br> - Science Applications Development,<br> - Royal Botanic Gardens, Kew,<br> - Richmond, Surrey, TW9 3AB, UK<br> - email: <a href="mailto:n.nicolson@rbgkew.org.uk" target="_blank">n.nicolson@rbgkew.org.uk</a><br> - phone: 020-8332-5766</font></font></p> </div> <br>_______________________________________________<br> tdwg-tag mailing list<br> <a href="mailto:tdwg-tag@lists.tdwg.org" target="_blank">tdwg-tag@lists.tdwg.org</a><br> <a href="http://lists.tdwg.org/mailman/listinfo/tdwg-tag" target="_blank">http://lists.tdwg.org/mailman/listinfo/tdwg-tag</a><br> <br></blockquote></div><br><br clear="all"><br>-- <br>---------------------------------------------------------------<br>Pete DeVries<br>Department of Entomology<br>University of Wisconsin - Madison<br>445 Russell Laboratories<br> 1630 Linden Drive<br>Madison, WI 
53706<br>------------------------------------------------------------<br> </div></div> _______________________________________________<br>tdwg-tag mailing list<br><a href="mailto:tdwg-tag@lists.tdwg.org" target="_blank">tdwg-tag@lists.tdwg.org</a><br> <a href="http://lists.tdwg.org/mailman/listinfo/tdwg-tag" target="_blank">http://lists.tdwg.org/mailman/listinfo/tdwg-tag</a><br></blockquote></div></div></div><br><div> <span style="border-collapse:separate;color:rgb(0, 0, 0);font-family:Helvetica;font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-align:auto;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px"><div style="word-wrap:break-word"> <span style="border-collapse:separate;color:rgb(0, 0, 0);font-family:Helvetica;font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px"><div style="word-wrap:break-word"> <span style="border-collapse:separate;color:rgb(0, 0, 0);font-family:Helvetica;font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px"><div style="word-wrap:break-word"> <span style="border-collapse:separate;color:rgb(0, 0, 0);font-family:Helvetica;font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px"><div style="word-wrap:break-word"> <span style="border-collapse:separate;color:rgb(0, 0, 0);font-family:Helvetica;font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px"><div style="word-wrap:break-word"> 
<div>---------------------------------------------------------</div><div class="im"><div>Roderic Page</div><div>Professor of Taxonomy</div><div>DEEB, FBLS</div><div>Graham Kerr Building</div><div>University of Glasgow</div> <div>Glasgow G12 8QQ, UK</div><div><br></div><div>Email: <a href="mailto:r.page@bio.gla.ac.uk" target="_blank">r.page@bio.gla.ac.uk</a></div><div>Tel: +44 141 330 4778</div><div>Fax: +44 141 330 2792</div><div>AIM: <a href="mailto:rodpage1962@aim.com" target="_blank">rodpage1962@aim.com</a></div> <div>Facebook:&nbsp;<a href="http://www.facebook.com/profile.php?id=1112517192" target="_blank">http://www.facebook.com/profile.php?id=1112517192</a></div><div>Twitter:&nbsp;<a href="http://twitter.com/rdmpage" target="_blank">http://twitter.com/rdmpage</a></div> <div>Blog: <a href="http://iphylo.blogspot.com" target="_blank">http://iphylo.blogspot.com</a></div><div>Home page: <a href="http://taxonomy.zoology.gla.ac.uk/rod/rod.html" target="_blank">http://taxonomy.zoology.gla.ac.uk/rod/rod.html</a></div> <div><br></div><div><br></div></div></div></span></div></span></div></span><br></div></span><br></div></span><br> </div><br></div></div></blockquote></div><br><br clear="all"><br>-- <br>---------------------------------------------------------------<br> Pete DeVries<br>Department of Entomology<br>University of Wisconsin - Madison<br>445 Russell Laboratories<br>1630 Linden Drive<br>Madison, WI 53706<br>------------------------------------------------------------<br> </div> _______________________________________________<br>tdwg-tag mailing list<br><a href="mailto:tdwg-tag@lists.tdwg.org">tdwg-tag@lists.tdwg.org</a><br>http://lists.tdwg.org/mailman/listinfo/tdwg-tag<br></blockquote></div><br></div></div></body></html>