[tdwg-content] GUIDs for publications (usages and names) [SEC=UNCLASSIFIED]
Peter DeVries
pete.devries at gmail.com
Thu Jan 6 03:36:01 CET 2011
I suspect that the major crawlers have better error handling, but I have
used Elmo from OpenRDF.org.
It does not have very robust error handling: it will try to pull in anything
on its whitelist that is linked via seeAlso, and will fail if the target is a PDF.
I have not tried Virtuoso for data crawling since I have worked out other
ways to get RDF, but I suspect that it does a much better job.
Most groups now make their data available as an RDF dump, which eliminates
the need to crawl if you want to pull in a lot of data.
I guess the question is whether you want to use a generic seeAlso, which most
crawlers follow, or some more specific predicate that says "here is the PDF".
My reluctance was more about minting my own vs. finding some other
vocabulary which has a similar predicate.
With the *hasPDF* predicate it would be pretty easy to query for all species
concepts that have a linked original description PDF etc.
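For example, such a query might look like this (an illustrative sketch only:
the prefix URI and the SpeciesConcept class name are assumptions for the sake
of the example, not a reference to any published vocabulary):

```sparql
# Illustrative: find all species concepts that link an
# original-description PDF via a hypothetical hasPDF predicate.
PREFIX txn: <http://example.org/txn#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT ?concept ?pdf
WHERE {
  ?concept rdf:type   txn:SpeciesConcept ;
           txn:hasPDF ?pdf .
}
```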
I suspect that some standard predicate will eventually become accepted since
it is very useful to have something more specific than foaf:Document.
Respectfully,
- Pete
On Wed, Jan 5, 2011 at 6:54 PM, Paul Murray <pmurray at anbg.gov.au> wrote:
>
> On 06/01/2011, at 7:48 AM, Peter DeVries wrote:
>
> Also, although I like a lot of what Steve says, I think that most existing
> crawlers expect that a seeAlso link is to some html, xml, rdf type thing and
> will
> not be able to handle a multi-megabyte PDF.
>
> This is why I reluctantly minted the predicate "hasPDF"
>
>
> Hmm. This is an issue with linked data: when you fetch a URI while crawling
> the semantic web, if it redirects, then it's an "other resource" and you get
> RDF. If not, then you are potentially pulling a multi-megabyte "information
> resource" across the wire.
>
> A solution is to use an HTTP "HEAD" request when you do the initial URI
> fetch. If it's an "other resource", the HEAD return will be a 303 and
> contain the redirect you want in the "Location" header, and that's all you
> need. If not, the 200 result will contain the content type and possibly even
> the size, which is what you need to know before you GET it.
>
> So .. the problem that "hasPDF" is meant to address might be addressable by
> the crawlers just being a bit smarter about how they browse the semweb.
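The HEAD-before-GET strategy Paul describes can be sketched in Python. This is
a minimal illustration using only the standard library; the size threshold and
the decision rules (skip PDFs, skip oversized bodies) are my own assumptions,
not the behavior of any crawler mentioned above:

```python
# Sketch: decide whether to GET a URI based on a prior HEAD response.
import http.client
from urllib.parse import urlparse

MAX_BYTES = 1_000_000  # arbitrary cap on "information resource" size


def head(uri):
    """Issue an HTTP HEAD request; return (status, lowercase-header dict)."""
    parts = urlparse(uri)
    conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    conn.request("HEAD", parts.path or "/")
    resp = conn.getresponse()
    headers = {k.lower(): v for k, v in resp.getheaders()}
    conn.close()
    return resp.status, headers


def decide(status, headers, max_bytes=MAX_BYTES):
    """Map a HEAD result to ('redirect', location), ('fetch', None),
    or ('skip', reason)."""
    if status == 303:
        # "Other resource": follow the Location header to the RDF.
        return ("redirect", headers.get("location"))
    if status == 200:
        ctype = headers.get("content-type", "")
        if "application/pdf" in ctype:
            return ("skip", "pdf")
        length = headers.get("content-length")
        if length is not None and int(length) > max_bytes:
            return ("skip", "too large")
        return ("fetch", None)
    return ("skip", f"status {status}")
```

With this, a crawler never pulls a multi-megabyte PDF across the wire: the
HEAD result alone tells it whether to follow a 303, fetch the body, or move on.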
>
>
--
---------------------------------------------------------------
Pete DeVries
Department of Entomology
University of Wisconsin - Madison
445 Russell Laboratories
1630 Linden Drive
Madison, WI 53706
TaxonConcept Knowledge Base <http://www.taxonconcept.org/> / GeoSpecies
Knowledge Base <http://lod.geospecies.org/>
About the GeoSpecies Knowledge Base <http://about.geospecies.org/>
------------------------------------------------------------