[tdwg-content] GUIDs for publications (usages and names) [SEC=UNCLASSIFIED]
Paul Murray
pmurray at anbg.gov.au
Thu Jan 6 01:54:28 CET 2011
On 06/01/2011, at 7:48 AM, Peter DeVries wrote:
> Also, although I like a lot of what Steve says, I think that most existing crawlers expect that a seeAlso link is to some html, xml, rdf type thing and will
> not be able to handle a multi-megabyte PDF.
>
> This is why I reluctantly minted the predicate "hasPDF"
Hmm. This is an issue with linked data: when you fetch a URI while crawling the semantic web, a redirect tells you it's an "other resource", and following it gets you RDF. If it doesn't redirect, you are potentially pulling a multi-megabyte "information resource" across the wire.
A solution is to use an HTTP "HEAD" request for the initial URI fetch. If it's an "other resource", the HEAD response will be a 303 with the redirect target in the "Location" header, and that's all you need. If not, the 200 response will carry the content type and possibly even the size (Content-Length), which is what you need to know before you GET it.
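For what it's worth, here is a rough Python sketch of that HEAD-first check. It is only an illustration under some assumptions: the names classify and probe are made up for this example, the Accept header value is just one plausible choice, and it uses the standard library's http.client because that does not follow redirects on its own, so the 303 stays visible.

```python
import http.client
from urllib.parse import urlsplit


def classify(status, headers):
    """Decide what a HEAD response tells us (hypothetical helper).

    status  -- HTTP status code of the HEAD response
    headers -- mapping of header name to value
    """
    # Header names are case-insensitive, so normalise them first.
    headers = {k.lower(): v for k, v in headers.items()}
    if status == 303:
        # "Other resource": the Location header says where the description lives.
        return {"kind": "other-resource", "location": headers.get("location")}
    if status == 200:
        # "Information resource": report type and size before any GET.
        length = headers.get("content-length")
        size = int(length) if length is not None and length.isdigit() else None
        return {
            "kind": "information-resource",
            "content_type": headers.get("content-type"),
            "size": size,
        }
    # Anything else (301/302, 404, ...) is left for the caller to handle.
    return {"kind": "unknown", "status": status}


def probe(uri, timeout=10):
    """Send a HEAD request to uri without following redirects (sketch)."""
    parts = urlsplit(uri)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        # Assumption: the crawler prefers RDF/XML; adjust to taste.
        conn.request("HEAD", path, headers={"Accept": "application/rdf+xml"})
        response = conn.getresponse()
        return classify(response.status, dict(response.getheaders()))
    finally:
        conn.close()
```

A crawler could call probe(uri) first, follow "location" when the kind is "other-resource", and only issue a GET for an "information-resource" whose content type and size it is prepared to handle. Note that servers are not obliged to send Content-Length, so the size may come back as None.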
So the problem that "hasPDF" is meant to address might be addressable by crawlers just being a bit smarter about how they browse the semweb.