[tdwg-content] Literal and URI versions of terms (was Re: [tdwg-tag] Do terms in the http://rs.tdwg.org/dwc/terms/ namespace actually resolve?)

Peter DeVries pete.devries at gmail.com
Wed Sep 1 20:19:23 CEST 2010


Hi Steve,

I think that in many cases terms that are currently literals should be
replaced with URIs. The note below explains a major reason, but also
consider all the potential misspellings and variant forms. Choosing a
literal requires every consuming application to know which of the various
ways of writing "University of Louisiana at Monroe Herbarium" mean the same
thing and which mean different things.

Also, I am assuming that the URI below refers to the herbarium named in the
label, and not to a collection within that herbarium?

I would have GBIF/TDWG maintain a controlled list of institutions that are
represented with URIs. The label would occur once in that vocabulary
document and would not be repeated in each occurrence record. Using this
approach the label would still be seen in a LOD browser, but it would not be
duplicated millions of times.
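
A minimal sketch of the idea in Python, for illustration only. The record
layout, the variant spellings, and the helper names are assumptions; only the
biocol.org URI and its label come from Steve's example quoted further down.

```python
# Institution URI taken from the RDF example quoted later in this thread.
ULM = "http://biocol.org/urn:lsid:biocol.org:col:15539"

# Hypothetical controlled vocabulary: the label is stored once, keyed by URI.
INSTITUTIONS = {
    ULM: "University of Louisiana at Monroe Herbarium",
}

# Hypothetical occurrence records that carry the institution as a literal.
# Each spelling is a different string, so a consumer sees three institutions.
literal_records = [
    {"id": "occ-1", "institution": "University of Louisiana at Monroe Herbarium"},
    {"id": "occ-2", "institution": "Univ. of Louisiana at Monroe Herbarium"},
    {"id": "occ-3", "institution": "University of Louisiana at Monroe herbarium"},
]

# The same records carrying the URI instead of the label.
uri_records = [
    {"id": "occ-1", "institution": ULM},
    {"id": "occ-2", "institution": ULM},
    {"id": "occ-3", "institution": ULM},
]


def distinct_institutions(records):
    """Return the set of distinct institution values found in the records."""
    return {record["institution"] for record in records}


def label_for(uri):
    """Look up the display label once, in the vocabulary, not in each record."""
    return INSTITUTIONS.get(uri, uri)


print(len(distinct_institutions(literal_records)))
print(len(distinct_institutions(uri_records)))
print(label_for(ULM))
```

With literals, the consumer has to decide which strings are equivalent. With
URIs, equality is a plain comparison and the label is looked up only when it
needs to be displayed.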

The other issue is that URIs are stored much more efficiently than
literals. See below:

--
The section below provides some insight as to why I think it is good to
replace literals with URIs.

It is from the Virtuoso FAQ page:
http://docs.openlinksw.com/virtuoso/virtuosofaq.html#virtuosofaq1

I also noticed a substantial improvement in performance in the Sesame triple
store when I changed some of my literal fields into URIs.

There are many places within the current Darwin Core where a standard
vocabulary is used. However, these values are represented as literals in the
RDF.

It might be useful to think of making URI versions of some of these, such as
datum, taxon level, lifestage, etc.

Another option would be to have these represented as URIs in some
GBIF-processed version of the originally submitted data.
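
A rough sketch of what such a processing step might look like, in Python.
The example.org URIs, the vocabulary contents, and the function name are all
hypothetical; no such vocabulary is known to exist at these addresses.

```python
# Hypothetical URI vocabulary for lifestage values (illustrative only).
LIFESTAGE_URIS = {
    "adult": "http://example.org/vocab/lifestage/adult",
    "larva": "http://example.org/vocab/lifestage/larva",
}


def normalize(value, vocabulary):
    """Map a submitted literal to a vocabulary URI when one is known.

    Returns a (kind, value) pair: ("uri", <URI>) on a match, otherwise
    ("literal", <original value>) so that unmatched data is not lost.
    """
    key = value.strip().lower()
    if key in vocabulary:
        return ("uri", vocabulary[key])
    return ("literal", value)


print(normalize(" Adult ", LIFESTAGE_URIS))
print(normalize("nymph", LIFESTAGE_URIS))
```

Keeping the original literal when no match is found means the processed
version never discards what the provider submitted.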

Respectfully,

- Pete

-------------------------------------------------------------------------

1.4.1. What is the storage cost per triple?

This depends on the index scheme. If indexed 2 ways, assuming that the graph
will always be stated in queries, this is 31 bytes.

With 4 indices, supporting queries where the graph can be left unspecified
(i.e., triples from any graph will be considered in query evaluation), this
is 39 bytes. The numbers are measured with the LUBM validation data set of
121K triples, with no full-text index on literals.

With 4 indices and a full text index on all literals, the Billion Triples
Challenge data set, 1115M triples, is about 120 GB of database pages. The
database file size is larger due to space in reserve and other factors. 120
GB is the number to use when assessing RAM-to-disk ratio, i.e., how much RAM
the system ought to have in order to provide good response. This data set is
a heterogeneous collection including social network data, conversations
harvested from the Web, DBpedia, Freebase, etc., with relatively numerous
and long text literals.

-----
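
As a back-of-the-envelope check on the FAQ figures above, the following
Python sketch works out the totals they imply. It assumes "120 GB" means
120 x 10^9 bytes and "121K" means 121,000 triples, which the FAQ does not
state. The per-triple figure for the Billion Triples Challenge set also
includes the full-text index and the long literals in that data, so it is
not a pure literal-versus-URI comparison.

```python
# Figures quoted in the FAQ excerpt.
BYTES_PER_TRIPLE_2_INDICES = 31
BYTES_PER_TRIPLE_4_INDICES = 39
LUBM_TRIPLES = 121_000            # assumption: "121K" read as 121,000
BTC_TRIPLES = 1_115_000_000       # "1115M triples"
BTC_BYTES = 120 * 10**9           # assumption: decimal gigabytes


def total_bytes(triples, bytes_per_triple):
    """Storage implied by a fixed per-triple cost."""
    return triples * bytes_per_triple


def bytes_per_triple(total, triples):
    """Average storage per triple for a measured data set."""
    return total / triples


print(total_bytes(LUBM_TRIPLES, BYTES_PER_TRIPLE_2_INDICES))
print(total_bytes(LUBM_TRIPLES, BYTES_PER_TRIPLE_4_INDICES))
print(round(bytes_per_triple(BTC_BYTES, BTC_TRIPLES), 1))
```

Under these assumptions the literal-heavy data set averages well over twice
the 39 bytes per triple measured for the LUBM set with the same four indices.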


On Wed, Sep 1, 2010 at 9:40 AM, Steve Baskauf
<steve.baskauf at vanderbilt.edu> wrote:

>  Pete,
> Thanks for the response about term resolution.  I'm over my head on that
> topic, so I'll let others respond to that part.
>
> With regards to a vocabulary that uses URIs rather than literals, I'm in
> favor of that.  At one point in a previous discussion, I think it was
> suggested that separate terms be created for literal and URI versions of
> terms like dwc:recordedBy.  At first I liked that idea, but after thinking
> about it and playing with it for a while, I think that the suggestion of
> just applying a label property to the resource identified by the URI is
> simpler and wouldn't require a proliferation of new terms.  For example:
>
>             <dcterms:creator>
>                 <rdf:Description rdf:about="http://biocol.org/urn:lsid:biocol.org:col:15539">
>                     <rdfs:label>University of Louisiana at Monroe Herbarium</rdfs:label>
>                 </rdf:Description>
>             </dcterms:creator>
>
> could be used if both a literal and URI were available and
>
>             <dcterms:creator>University of Louisiana at Monroe Herbarium</dcterms:creator>
>
> could be used if a URI were not available.  It seems like it should be
> relatively easy for a linked data client to have contingencies to deal with
> this.  Even with technology that's semantically "dumb" like XSLT, it's
> pretty easy to code for the two possibilities.
>
> But I suppose it would be good to have some kind of consensus that this is
> the preferred approach.  Otherwise, separate terms might be better.  There
> aren't a whole lot of dwc terms to which this situation would apply.
>
> Steve
>
> Peter DeVries wrote:
>
>
>
>  By "efficient", I mean a vocabulary that uses standard resolvable URIs
> instead of literals for standard terms, etc. This solution would also avoid
> the problem that Markus just mentioned.
>
>  I am also wondering if the "individual" definition should be changed to
> mean one individual organism rather than a potential collection of
> individuals. Individuals from the same colony could be represented using a
> separate related vocabulary. Allowing multiple
> individuals will cause problems for consuming applications. For instance,
> is the queen a separate individual or not? How do you differentiate between
> a photo of the queen and a photo of one of the workers? There are also
> potential problems even if the individuals
> are all workers.
>
>  I have been thinking that for some attributes like character states, it
> might be best to have a family level ontology. In this example, you
> might have a "formicidae_ontology", that could be used to deal with
> individuals from the same colony as well as ant specific character states.
>
>
> xmlns:ant="http://rs.gbif.org/family_ontology/ant.owl#"
>
>  <rdf:Description rdf:about="http://example.org/individual/123412">
>
>  <ant:colonyMateOf rdf:resource="http://example.org/individual/123414"/>
> </rdf:Description>
>
>  This could be defined as a subproperty of dc:relation or something
> similar in the gbif/tdwg vocabulary.
>
>  - Pete
>
> On Tue, Aug 24, 2010 at 6:43 PM, Steve Baskauf <
> steve.baskauf at vanderbilt.edu> wrote:
>
>> I was doing some GUID testing using a Linked Data client and I noticed
>> that some Darwin Core terms did not seem to resolve to anything.  I ran
>> a test using
>> http://demo.openlinksw.com/rdfbrowser2/
>> http://api.talis.com/stores/iand-dev1/items/dipper.html
>> http://www5.wiwiss.fu-berlin.de/marbles/
>> and
>> http://dataviewer.zitgist.com/
>> I first looked up
>> http://purl.org/dc/terms/creator
>> and all four clients reported the properties of the term.  Then I tried
>> http://rs.tdwg.org/dwc/terms/basisOfRecord
>> and nothing happened with any of them.  I ran a Vapour
>> http://validator.linkeddata.org/vapour
>> validation on the basisOfRecord URI and got the following message:
>>
>> Vapour was unable to complete the request due to the following exception:
>>
>> ForbiddenAddress: forbidden request from 98.87.45.8 to
>> http://rs.tdwg.org/dwc/terms/basisOfRecord (resolves to IP
>> 192.38.28.106), internal IP addresses are forbidden
>>
>> I have no idea what that means, but all of this seems to mean that
>> Darwin Core is currently "broken" from a Linked Data point of view.
>>
>> Steve
>>
>> --
>> Steven J. Baskauf, Ph.D., Senior Lecturer
>> Vanderbilt University Dept. of Biological Sciences
>>
>> postal mail address:
>> VU Station B 351634
>> Nashville, TN  37235-1634,  U.S.A.
>>
>> delivery address:
>> 2125 Stevenson Center
>> 1161 21st Ave., S.
>> Nashville, TN 37235
>>
>> office: 2128 Stevenson Center
>> phone: (615) 343-4582,  fax: (615) 343-6707
>> http://bioimages.vanderbilt.edu
>>
>> _______________________________________________
>> tdwg-tag mailing list
>> tdwg-tag at lists.tdwg.org
>> http://lists.tdwg.org/mailman/listinfo/tdwg-tag
>>
>
>
>
> --
> ----------------------------------------------------------------
> Pete DeVries
> Department of Entomology
> University of Wisconsin - Madison
> 445 Russell Laboratories
> 1630 Linden Drive
> Madison, WI 53706
> TaxonConcept Knowledge Base <http://www.taxonconcept.org/> / GeoSpecies
> Knowledge Base <http://lod.geospecies.org/>
> About the GeoSpecies Knowledge Base <http://about.geospecies.org/>
> ------------------------------------------------------------
>
>
>
>

