Hi Steve, 

I think that in many cases terms that are currently literals should be replaced with URIs. The note below explains one major reason, but also consider all the potential misspellings and variant forms. Choosing a literal requires every consuming application to know which of the various ways of saying "University of Louisiana at Monroe Herbarium" mean the same thing and which mean different things.

Also, I am assuming that the URI below maps to the same thing listed in the label and not a collection within that Herbarium?

I would have GBIF/TDWG maintain a controlled list of institutions that are represented by URIs. The label would occur once in that vocabulary document and would not be repeated in each occurrence record. With this approach the label would still be seen in a LOD browser, but it would not be duplicated millions of times.
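
As a rough sketch of what I mean (not a proposal for an actual implementation): the label lives once in a vocabulary lookup and each occurrence record carries only the URI. The URI and label are taken at face value from Steve's example below; the record IDs and the helper names label_for and same_institution are made up for illustration.

```python
# Hypothetical controlled vocabulary: one URI per institution, label stored once.
INSTITUTION_LABELS = {
    "http://biocol.org/urn:lsid:biocol.org:col:15539":
        "University of Louisiana at Monroe Herbarium",
}

# Occurrence records carry only the URI (record IDs are invented for this sketch).
occurrences = [
    {"id": "occ-1", "institution": "http://biocol.org/urn:lsid:biocol.org:col:15539"},
    {"id": "occ-2", "institution": "http://biocol.org/urn:lsid:biocol.org:col:15539"},
]


def label_for(record):
    """Look up the display label in the vocabulary, falling back to the URI itself."""
    return INSTITUTION_LABELS.get(record["institution"], record["institution"])


def same_institution(a, b):
    """With URIs, 'same institution' is plain equality on the identifier."""
    return a["institution"] == b["institution"]
```

A consuming application never has to guess whether two differently spelled labels mean the same herbarium; it only compares identifiers and fetches the label when it needs to display one.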

The other issue is that URIs are stored much more efficiently than literals. See below:

--
The section below provides some insight as to why I think it is good to replace literals with URIs.

It is from the Virtuoso FAQ page (http://docs.openlinksw.com/virtuoso/virtuosofaq.html#virtuosofaq1), but I also noticed a substantial improvement in performance in the Sesame triple store when I changed some of my literal fields into URIs.

There are many places within the current Darwin Core where a standard vocabulary is used. However, these are represented as literals in the RDF.

It might be useful to think about making URI versions of some of these, such as datum, taxon level, life stage, etc.

Another option would be to have these represented as URIs in some GBIF-processed version of the originally submitted data.

Respectfully,

- Pete

-------------------------------------------------------------------------

1.4.1. What is the storage cost per triple?

This depends on the index scheme. If indexed 2 ways, assuming that the graph will always be stated in queries, this is 31 bytes.

With 4 indices, supporting queries where the graph can be left unspecified (i.e., triples from any graph will be considered in query evaluation), this is 39 bytes. The numbers are measured with the LUBM validation data set of 121K triples, with no full-text index on literals.

With 4 indices and a full text index on all literals, the Billion Triples Challenge data set, 1115M triples, is about 120 GB of database pages. The database file size is larger due to space in reserve and other factors. 120 GB is the number to use when assessing RAM-to-disk ratio, i.e., how much RAM the system ought to have in order to provide good response. This data set is a heterogeneous collection including social network data, conversations harvested from the Web, DBpedia, Freebase, etc., with relatively numerous and long text literals.

-----
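
For a rough sense of scale, the figures quoted above can be turned into a back-of-the-envelope calculation. This is only arithmetic on the FAQ's numbers, not a measurement, and it assumes "GB" means 10**9 bytes. The 31 and 39 byte figures were measured without a full-text index, so the gap to the Billion Triples Challenge figure reflects the full-text index as well as the long literals.

```python
# Figures quoted from the Virtuoso FAQ excerpt above.
BYTES_PER_TRIPLE_2_INDICES = 31
BYTES_PER_TRIPLE_4_INDICES = 39
BTC_TRIPLES = 1115 * 10**6   # Billion Triples Challenge data set
BTC_BYTES = 120 * 10**9      # assumption: "GB" means 10**9 bytes


def estimated_size_gb(triples, bytes_per_triple):
    """Estimated database page size in GB for a given per-triple cost."""
    return triples * bytes_per_triple / 10**9


# Observed cost per triple for the literal-heavy data set with a full-text index.
btc_bytes_per_triple = BTC_BYTES / BTC_TRIPLES

# What the same number of triples would occupy at the 4-index, no-full-text rate.
btc_size_at_39_bytes = estimated_size_gb(BTC_TRIPLES, BYTES_PER_TRIPLE_4_INDICES)
```

On these numbers the literal-heavy data set costs about 108 bytes per triple, against 39 bytes per triple for the 4-index case without a full-text index.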


On Wed, Sep 1, 2010 at 9:40 AM, Steve Baskauf <steve.baskauf@vanderbilt.edu> wrote:
Pete,
Thanks for the response about term resolution.  I'm over my head on that topic, so I'll let others respond to that part.

With regard to a vocabulary that uses URIs rather than literals, I'm in favor of that.  At one point in a previous discussion, I think it was suggested that separate terms be created for literal and URI versions of terms like dwc:recordedBy.  At first I liked that idea, but after thinking about it and playing with it for a while, I think that the suggestion of just applying a label property to the resource identified by the URI is simpler and wouldn't require a proliferation of new terms.  For example:

            <dcterms:creator>
                <rdf:Description rdf:about="http://biocol.org/urn:lsid:biocol.org:col:15539">
                    <rdfs:label>University of Louisiana at Monroe Herbarium</rdfs:label>
                </rdf:Description>
            </dcterms:creator>

could be used if both a literal and URI were available and

            <dcterms:creator>University of Louisiana at Monroe Herbarium</dcterms:creator>

could be used if a URI were not available.  It seems like it should be relatively easy for a linked data client to have contingencies to deal with this.  Even with technology that's semantically "dumb" like XSLT, it's pretty easy to code for the two possibilities.
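
To illustrate the two possibilities, here is a minimal sketch in Python (rather than XSLT) using only the standard library. The function name creator_info and the way the fragments are wrapped with namespace declarations are my own choices for the example. It handles just the two forms shown above and does not cover other RDF/XML variations such as an rdf:resource attribute.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "dcterms": "http://purl.org/dc/terms/",
}
RDF_ABOUT = "{%s}about" % NS["rdf"]
# Namespace declarations so each fragment below parses on its own.
XMLNS = " ".join('xmlns:%s="%s"' % (prefix, uri) for prefix, uri in NS.items())


def creator_info(creator_elem):
    """Return (uri, label) for a dcterms:creator element.

    uri is None when the creator is given only as a plain literal.
    """
    desc = creator_elem.find("rdf:Description", NS)
    if desc is None:
        text = (creator_elem.text or "").strip()
        return None, text or None
    label = desc.findtext("rdfs:label", default=None, namespaces=NS)
    return desc.get(RDF_ABOUT), label


with_uri = ET.fromstring(
    "<dcterms:creator %s>"
    '<rdf:Description rdf:about="http://biocol.org/urn:lsid:biocol.org:col:15539">'
    "<rdfs:label>University of Louisiana at Monroe Herbarium</rdfs:label>"
    "</rdf:Description>"
    "</dcterms:creator>" % XMLNS
)
literal_only = ET.fromstring(
    "<dcterms:creator %s>University of Louisiana at Monroe Herbarium"
    "</dcterms:creator>" % XMLNS
)
```

Calling creator_info on either element gives the same label; only the first also gives a URI that a client could follow or compare.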

But I suppose it would be good to have some kind of consensus that this is the preferred approach.  Otherwise, separate terms might be better.  There aren't a whole lot of dwc terms to which this situation would apply.

Steve

Peter DeVries wrote:


By "efficient", I mean a vocabulary that uses standard resolvable URIs instead of literals for standard terms etc. This solution would also avoid the problem that Markus just mentioned. 

I am also wondering if the "individual" definition should be changed to mean one individual organism rather than a potential collection of individuals. Individuals from the same colony could be represented using a separate related vocabulary. Allowing multiple individuals will cause problems for consuming applications. For instance, is the queen a separate individual or not? How do you differentiate between a photo of the queen and a photo of one of the workers? There are also potential problems even if the individuals are all workers.

I have been thinking that for some attributes like character states, it might be best to have a family-level ontology. In this example, you might have a "formicidae_ontology" that could be used to deal with individuals from the same colony as well as ant-specific character states.


<rdf:Description rdf:about="http://example.org/individual/123412">
    <ant:colonyMateOf rdf:resource="http://example.org/individual/123414"/>
</rdf:Description>

This could be defined as a subproperty of dc:relation or something similar in the gbif/tdwg vocabulary.

- Pete

On Tue, Aug 24, 2010 at 6:43 PM, Steve Baskauf <steve.baskauf@vanderbilt.edu> wrote:
I was doing some GUID testing using a Linked Data client and I noticed
that some Darwin Core terms did not seem to resolve to anything.  I ran
a test using
http://demo.openlinksw.com/rdfbrowser2/
http://api.talis.com/stores/iand-dev1/items/dipper.html
http://www5.wiwiss.fu-berlin.de/marbles/
and
http://dataviewer.zitgist.com/
First I looked up
http://purl.org/dc/terms/creator
and all four clients reported the properties of the term.  Then I tried
http://rs.tdwg.org/dwc/terms/basisOfRecord
and nothing happened with any of them.  I ran a Vapour
http://validator.linkeddata.org/vapour
validation on the basisOfRecord URI and got the following message:

Vapour was unable to complete the request due to the following exception:

ForbiddenAddress: forbidden request from 98.87.45.8 to http://rs.tdwg.org/dwc/terms/basisOfRecord (resolves to IP 192.38.28.106), internal IP addresses are forbidden

I have no idea what that means, but all of this seems to mean that
Darwin Core is currently "broken" from a Linked Data point of view.
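
For anyone who wants to repeat the check without one of the browsers above, here is a minimal sketch of the kind of request I assume a Linked Data client makes: an HTTP GET on the term URI with an Accept header asking for RDF/XML. The function names are my own, and the check only looks at the HTTP status; it does not verify that the response body is valid RDF.

```python
import urllib.request

RDF_ACCEPT = "application/rdf+xml"


def build_term_request(term_uri):
    """Build a GET request that asks for an RDF/XML description of the term."""
    return urllib.request.Request(term_uri, headers={"Accept": RDF_ACCEPT})


def term_resolves(term_uri, timeout=10):
    """Return True if the URI dereferences to a 2xx response (redirects followed)."""
    try:
        with urllib.request.urlopen(build_term_request(term_uri), timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False
```

Calling term_resolves on http://purl.org/dc/terms/creator and on http://rs.tdwg.org/dwc/terms/basisOfRecord would be one way to compare the two cases described above.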

Steve

--
Steven J. Baskauf, Ph.D., Senior Lecturer
Vanderbilt University Dept. of Biological Sciences

postal mail address:
VU Station B 351634
Nashville, TN  37235-1634,  U.S.A.

delivery address:
2125 Stevenson Center
1161 21st Ave., S.
Nashville, TN 37235

office: 2128 Stevenson Center
phone: (615) 343-4582,  fax: (615) 343-6707
http://bioimages.vanderbilt.edu

_______________________________________________
tdwg-tag mailing list
tdwg-tag@lists.tdwg.org
http://lists.tdwg.org/mailman/listinfo/tdwg-tag



--
----------------------------------------------------------------
Pete DeVries
Department of Entomology
University of Wisconsin - Madison
445 Russell Laboratories
1630 Linden Drive
Madison, WI 53706
TaxonConcept Knowledge Base / GeoSpecies Knowledge Base
About the GeoSpecies Knowledge Base
------------------------------------------------------------
