Re: [tdwg-content] [tdwg-tag] Do terms in the http://rs.tdwg.org/dwc/terms/ namespace actually resolve?
Hi Steve,
I originally replied to this in the tdwg-tag list, rather than the tdwg-content list.
Services like Sindice need ontology terms to be resolvable in order to figure out how to interpret the RDF, as in this example:
http://sig.ma/search?pid=02c02e379a5b11a344ed7519ff198222
Like Bob, I have also had problems incorporating the TDWG vocabulary into Protege, which is one reason that I had trouble getting it to work. My current thinking is that it might be best to keep the TDWG vocabulary as it is for submitting data to GBIF etc, while designing a more efficient vocabulary that works well on the LOD.
By "efficient", I mean a vocabulary that uses standard resolvable URIs instead of literals for standard terms, etc. This solution would also avoid the problem that Markus just mentioned.
I am also wondering if the "individual" definition should be changed to mean one individual organism rather than a potential collection of individuals. Individuals from the same colony could be represented using a separate related vocabulary. Allowing multiple individuals will cause problems for consuming applications. For instance, is the queen a separate individual or not? How do you differentiate between a photo of the queen and a photo of one of the workers? There are also potential problems even if the individuals are all workers.
I have been thinking that for some attributes like character states, it might be best to have a family-level ontology. In this example, you might have a "formicidae_ontology" that could be used to deal with individuals from the same colony as well as ant-specific character states.
xmlns:ant="http://rs.gbif.org/family_ontology/ant.owl#"
<rdf:Description rdf:about="http://example.org/individual/123412">
  <ant:colonyMateOf rdf:resource="http://example.org/individual/123414"/>
</rdf:Description>
This could be defined as a subproperty of dc:relation or something similar in the gbif/tdwg vocabulary.
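A minimal sketch of why the subproperty declaration matters to a consumer: a client that only understands dc:relation can still discover the colony link once the RDFS subPropertyOf rule is applied. The schema triple here is hypothetical (no such ontology is known to be published), and the dc: prefix is assumed to mean the Dublin Core Elements namespace.

```python
# The ant namespace URI is taken from the message; DC_RELATION assumes
# "dc:" means the Dublin Core Elements 1.1 namespace.
ANT = "http://rs.gbif.org/family_ontology/ant.owl#"
DC_RELATION = "http://purl.org/dc/elements/1.1/relation"
RDFS_SUBPROPERTY_OF = "http://www.w3.org/2000/01/rdf-schema#subPropertyOf"

# One data triple from the example plus one hypothetical schema triple.
triples = {
    ("http://example.org/individual/123412",
     ANT + "colonyMateOf",
     "http://example.org/individual/123414"),
    (ANT + "colonyMateOf", RDFS_SUBPROPERTY_OF, DC_RELATION),
}


def infer_superproperties(graph):
    """Apply the RDFS subPropertyOf rule until no new triples appear."""
    inferred = set(graph)
    changed = True
    while changed:
        changed = False
        sub_of = {(s, o) for s, p, o in inferred if p == RDFS_SUBPROPERTY_OF}
        for s, p, o in list(inferred):
            for sub, sup in sub_of:
                if p == sub and (s, sup, o) not in inferred:
                    inferred.add((s, sup, o))
                    changed = True
    return inferred


def related(graph, subject):
    """Return everything linked to subject via dc:relation after inference."""
    return sorted(
        o for s, p, o in infer_superproperties(graph)
        if s == subject and p == DC_RELATION
    )
```

Calling related(triples, "http://example.org/individual/123412") returns the colony mate even though the data never states dc:relation directly.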
- Pete
On Tue, Aug 24, 2010 at 6:43 PM, Steve Baskauf <steve.baskauf@vanderbilt.edu> wrote:
I was doing some GUID testing using a Linked Data client and I noticed that some Darwin Core terms did not seem to resolve to anything. I ran a test using four clients: http://demo.openlinksw.com/rdfbrowser2/ , http://api.talis.com/stores/iand-dev1/items/dipper.html , http://www5.wiwiss.fu-berlin.de/marbles/ and http://dataviewer.zitgist.com/ . I first looked up http://purl.org/dc/terms/creator and all four clients reported the properties of the term. Then I tried http://rs.tdwg.org/dwc/terms/basisOfRecord and nothing happened with any of them. I ran a Vapour (http://validator.linkeddata.org/vapour) validation on the basisOfRecord URI and got the following message:
Vapour was unable to complete the request due to the following exception:
ForbiddenAddress: forbidden request from 98.87.45.8 to http://rs.tdwg.org/dwc/terms/basisOfRecord (resolves to IP 192.38.28.106), internal IP addresses are forbidden
I have no idea what that means, but all of this seems to mean that Darwin Core is currently "broken" from a Linked Data point of view.
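For anyone who wants to repeat this kind of check without a Linked Data browser, here is a rough sketch of dereferencing a term URI with HTTP content negotiation. The Accept header value and the helper names are assumptions made for illustration, not something the clients above are known to use, and check_term needs network access.

```python
import urllib.error
import urllib.request

# Assumed preference order: RDF/XML first, then Turtle, then anything.
RDF_ACCEPT = "application/rdf+xml, text/turtle;q=0.9, */*;q=0.1"
RDF_MEDIA_TYPES = {
    "application/rdf+xml",
    "text/turtle",
    "text/n3",
    "application/n-triples",
}


def build_term_request(term_uri):
    """Build a GET request that asks for RDF rather than HTML."""
    return urllib.request.Request(term_uri, headers={"Accept": RDF_ACCEPT})


def looks_like_rdf(content_type):
    """Return True if a Content-Type header names an RDF serialization."""
    media_type = content_type.split(";")[0].strip().lower()
    return media_type in RDF_MEDIA_TYPES


def check_term(term_uri, timeout=10):
    """Dereference a term URI; return (status, content type, is_rdf).

    urlopen follows redirects, so a 303 to a description document is handled.
    """
    request = build_term_request(term_uri)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            ctype = response.headers.get("Content-Type", "")
            return response.status, ctype, looks_like_rdf(ctype)
    except urllib.error.HTTPError as err:
        return err.code, err.headers.get("Content-Type", ""), False
```

Running check_term on http://purl.org/dc/terms/creator and on http://rs.tdwg.org/dwc/terms/basisOfRecord would show whether each server returns an RDF description; the result depends on how the servers are configured at the time.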
Steve
-- Steven J. Baskauf, Ph.D., Senior Lecturer Vanderbilt University Dept. of Biological Sciences
postal mail address: VU Station B 351634 Nashville, TN 37235-1634, U.S.A.
delivery address: 2125 Stevenson Center 1161 21st Ave., S. Nashville, TN 37235
office: 2128 Stevenson Center phone: (615) 343-4582, fax: (615) 343-6707 http://bioimages.vanderbilt.edu
tdwg-tag mailing list tdwg-tag@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-tag
Pete, Thanks for the response about term resolution. I'm over my head on that topic, so I'll let others respond to that part.
With regard to a vocabulary that uses URIs rather than literals, I'm in favor of that. At one point in a previous discussion, I think it was suggested that separate terms be created for literal and URI versions of terms like dwc:recordedBy. At first I liked that idea, but after thinking about it and playing with it for a while, I think that the suggestion of just applying a label property to the resource identified by the URI is simpler and wouldn't require a proliferation of new terms. For example:
<dcterms:creator>
  <rdf:Description rdf:about="http://biocol.org/urn:lsid:biocol.org:col:15539">
    <rdfs:label>University of Louisiana at Monroe Herbarium</rdfs:label>
  </rdf:Description>
</dcterms:creator>
could be used if both a literal and URI were available and
<dcterms:creator>University of Louisiana at Monroe Herbarium</dcterms:creator>
could be used if a URI were not available. It seems like it should be relatively easy for a linked data client to have contingencies to deal with this. Even with technology that's semantically "dumb" like XSLT, it's pretty easy to code for the two possibilities.
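As a rough illustration of how a client might handle both possibilities, here is a sketch that uses Python's standard XML parser rather than XSLT. The wrapping rdf:RDF document and the http://example.org/record/ URIs are made up for the example; only the two dcterms:creator forms come from the message.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "dcterms": "http://purl.org/dc/terms/",
}

# Hypothetical records: one creator given as URI plus label, one as a literal.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:dcterms="http://purl.org/dc/terms/">
  <rdf:Description rdf:about="http://example.org/record/1">
    <dcterms:creator>
      <rdf:Description rdf:about="http://biocol.org/urn:lsid:biocol.org:col:15539">
        <rdfs:label>University of Louisiana at Monroe Herbarium</rdfs:label>
      </rdf:Description>
    </dcterms:creator>
  </rdf:Description>
  <rdf:Description rdf:about="http://example.org/record/2">
    <dcterms:creator>University of Louisiana at Monroe Herbarium</dcterms:creator>
  </rdf:Description>
</rdf:RDF>"""


def creator_of(creator_elem):
    """Return (uri, label) for a dcterms:creator element in either form."""
    desc = creator_elem.find("rdf:Description", NS)
    if desc is not None:
        # URI form: the label, if any, hangs off the described resource.
        uri = desc.get("{%s}about" % NS["rdf"])
        label = desc.findtext("rdfs:label", default=None, namespaces=NS)
        return uri, label
    # Literal form: the element text is the only information available.
    text = (creator_elem.text or "").strip()
    return None, text or None


root = ET.fromstring(SAMPLE)
creators = [creator_of(e) for e in root.iter("{%s}creator" % NS["dcterms"])]
```

Both records yield the same label; only the first also yields a URI that can be compared or dereferenced.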
But I suppose it would be good to have some kind of consensus that this is the preferred approach. Otherwise, separate terms might be better. There aren't a whole lot of dwc terms to which this situation would apply.
Steve
--
Pete DeVries Department of Entomology University of Wisconsin - Madison 445 Russell Laboratories 1630 Linden Drive Madison, WI 53706 TaxonConcept Knowledge Base http://www.taxonconcept.org/ / GeoSpecies Knowledge Base http://lod.geospecies.org/ About the GeoSpecies Knowledge Base http://about.geospecies.org/
Pete, Well, my thinking on including small populations of individuals in the definition of the proposed dwc:Individual class (http://code.google.com/p/darwincore/issues/detail?id=69 and related issue http://code.google.com/p/darwincore/issues/detail?id=80) was simply based on practicality. I know that collectors put bundles of grass individuals together on an herbarium sheet, bryophyte collectors have numerous moss individuals clumped together in a specimen, and entomologists put several individual insects of the same species collected at the same time and place together in the same jar. I also take images that include several individuals of the same species (e.g. http://bioimages.vanderbilt.edu/baskauf/67323). People collect different individuals at the same place and time and call them "duplicates". In these cases, the person recording the Occurrence isn't interested in separating out the individuals (or perhaps CAN'T, in the case of clonal plants where it might not be clear where one individual starts and another ends).
As I have conceptualized it, the purpose of being able to create instances of the class Individual is to be able to create named nodes that connect other resources. I will not elaborate about that here because you can read about the idea in detail at Biodiversity Informatics 7:17-44 (https://journals.ku.edu/index.php/jbi/article/view/3664). But the point is that any entity that meets my definition of Individual can usefully serve as such a node. If clarification is needed about what "Individual" means in a particular circumstance, another proposed term (individualRemarks) can be used to elaborate.
There are circumstances such as you brought up (queen vs. worker ant) where it would be better to describe the biological individuals as separate dwc:Individual's. In fact, it would probably ALWAYS be better to have Individuals be separate biological individuals if it is possible to do so. But in cases where it's not possible (or if somebody in the past chose not to do so, as in the case of duplicate specimens), allowing small populations to be considered as Individuals still allows the benefits of them sharing common identifications and linking those identifications to multiple Occurrences.
There is of course the problem of "collections" of Occurrences where Individuals of different species end up together in the same resource (see the Conclusions section of the above paper), e.g. contents of a pitfall trap, an image showing multiple species in their habitat, and a specimen with evidence of parasitism by another species. I haven't put enough thought into how to handle these situations to suggest a solution, but it does not seem out of the question to define some other term ("conglomeration" maybe? "bag" and "collection" are already taken) to connect an Occurrence resource to multiple Individuals. This in itself is a strong argument for why determinations should be associated with Individuals rather than Occurrences. In the model that I described in the paper, an Occurrence has an "individualID" property that connects the occurrence to its determinations via the Individual. But that wouldn't have to be the case. An Occurrence could have a "conglomerationID" property that would connect it to the Conglomeration resource, and that resource could then have several "individualID" properties that connect it to the multiple individuals with their separate determinations. Anyway, more thought needs to go into this, but the problem does not seem unsurmountable.
Steve
On Wed, Sep 1, 2010 at 8:33 AM, Steve Baskauf steve.baskauf@vanderbilt.edu wrote:
In the model that I described in the paper, an Occurrence has an "individualID" property that connects the occurrence to its determinations via the Individual. But that wouldn't have to be the case. An Occurrence could have a "conglomerationID" property that would connect it to the Conglomeration resource, and that resource could then have several "individualID" properties that connect it to the multiple individuals with their separate determinations.
The Composite design pattern may be of use here: http://en.wikipedia.org/wiki/Composite_pattern
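A minimal sketch of how the pattern could map onto the model quoted above, assuming a hypothetical Conglomeration class as the composite and Individual as the leaf. The class layout, identifiers, and placeholder determinations are illustrative only and are not part of any Darwin Core proposal.

```python
class Individual:
    """Leaf: one organism (or inseparable clump) with its own determinations."""

    def __init__(self, individual_id, determinations):
        self.individual_id = individual_id
        self.determinations = list(determinations)

    def individuals(self):
        # A leaf yields only itself.
        yield self


class Conglomeration:
    """Composite: a grouping whose members are Individuals or other groupings."""

    def __init__(self, conglomeration_id, members=()):
        self.conglomeration_id = conglomeration_id
        self.members = list(members)

    def individuals(self):
        # Delegate to each member, so nested groupings flatten naturally.
        for member in self.members:
            yield from member.individuals()


class Occurrence:
    """Client: treats a single Individual and a Conglomeration uniformly."""

    def __init__(self, occurrence_id, subject):
        self.occurrence_id = occurrence_id
        self.subject = subject  # an Individual or a Conglomeration

    def determinations(self):
        return [
            det
            for individual in self.subject.individuals()
            for det in individual.determinations
        ]


# Hypothetical pitfall-trap sample containing two species.
trap = Conglomeration("cong-1", [
    Individual("ind-1", ["species A"]),
    Individual("ind-2", ["species B"]),
])
mixed = Occurrence("occ-1", trap)
single = Occurrence("occ-2", Individual("ind-3", ["species C"]))
```

The point of the pattern is that Occurrence.determinations() does not need to know which kind of subject it holds; both expose the same individuals() method.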
///ark Web Applications Developer Center for Applied Biodiversity Informatics California Academy of Sciences
Hi Steve,
I think that in many cases terms that are currently literals should be replaced with URIs. The note below explains a major reason, but also consider all the potential misspellings, etc. The choice of a literal requires that all the potentially consuming applications know which of the various ways of saying "University of Louisiana at Monroe Herbarium" mean the same thing and which mean different things.
Also, I am assuming that the URI below maps to the same thing listed in the label and not a collection within that Herbarium?
I would have GBIF/TDWG maintain a controlled list of institutions that are represented with URIs. The label would occur in that vocabulary document and would not be repeated in each occurrence record. Using this approach, the label would still be seen in a LOD browser but it would not be duplicated millions of times.
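A small sketch of the idea, assuming a hypothetical in-memory vocabulary and made-up occurrence identifiers. The pairing of the biocol URI with this label is taken at face value from the example quoted below and has not been verified.

```python
# Hypothetical controlled vocabulary: each institution label is stored once.
INSTITUTIONS = {
    "http://biocol.org/urn:lsid:biocol.org:col:15539":
        "University of Louisiana at Monroe Herbarium",
}

# Occurrence records carry only the URI (record ids are made up).
occurrences = [
    {"id": "occ-1", "institution": "http://biocol.org/urn:lsid:biocol.org:col:15539"},
    {"id": "occ-2", "institution": "http://biocol.org/urn:lsid:biocol.org:col:15539"},
]


def label_for(record, vocabulary=INSTITUTIONS):
    """Look the display label up in the vocabulary instead of the record."""
    return vocabulary.get(record["institution"])


def same_institution(a, b):
    """URIs compare by simple equality; no spelling variants to reconcile."""
    return a["institution"] == b["institution"]


# With literals, a consumer would have to decide whether these two match.
literal_variants_match = (
    "University of Louisiana at Monroe Herbarium"
    == "Univ. of Louisiana at Monroe Herbarium"
)
```

Here same_institution(occurrences[0], occurrences[1]) holds without any string matching, while the two literal spellings do not compare equal.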
The other issue is that URIs are stored much more efficiently than literals. See below:
-- The section below provides some insight as to why I think it is good to replace literals with URIs.
It is from the Virtuoso FAQ page (http://docs.openlinksw.com/virtuoso/virtuosofaq.html#virtuosofaq1), but I also noticed a substantial improvement in performance in the Sesame triple store when I changed some of my literal fields into URIs.
There are many places within the current DarwinCore where a standard vocabulary is used. However, these are represented as literals in the RDF.
It might be useful to think of making URI versions of some of these, such as datum, taxon level, lifestage, etc.
Another option would be to have these represented as URIs in some GBIF-processed version of the originally submitted data.
Respectfully,
- Pete
-------------------------------------------------------------------------
1.4.1. What is the storage cost per triple?
This depends on the index scheme. If indexed 2 ways, assuming that the graph will always be stated in queries, this is 31 bytes.
With 4 indices, supporting queries where the graph can be left unspecified (i.e., triples from any graph will be considered in query evaluation), this is 39 bytes. The numbers are measured with the LUBM validation data set of 121K triples, with no full-text index on literals.
With 4 indices and a full text index on all literals, the Billion Triples Challenge data set, 1115M triples, is about 120 GB of database pages. The database file size is larger due to space in reserve and other factors. 120 GB is the number to use when assessing RAM-to-disk ratio, i.e., how much RAM the system ought to have in order to provide good response. This data set is a heterogeneous collection including social network data, conversations harvested from the Web, DBpedia, Freebase, etc., with relatively numerous and long text literals.
-----
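To put the FAQ figures side by side, here is a back-of-the-envelope calculation. It assumes "121K" means 121,000 triples, "1115M" means 1,115,000,000 triples, and "GB" means 10^9 bytes; none of that is stated in the excerpt.

```python
# Per-triple index cost quoted in the FAQ, keyed by number of indices.
BYTES_PER_TRIPLE = {2: 31, 4: 39}


def index_storage_bytes(n_triples, n_indices):
    """Estimate storage from the FAQ's per-triple figures (no full-text index)."""
    return n_triples * BYTES_PER_TRIPLE[n_indices]


# LUBM validation data set, assumed to be 121,000 triples.
lubm_two_indices = index_storage_bytes(121_000, 2)   # 3,751,000 bytes
lubm_four_indices = index_storage_bytes(121_000, 4)  # 4,719,000 bytes

# Billion Triples Challenge: about 120 GB for 1115M triples, with 4 indices
# and a full-text index over relatively long text literals.
btc_bytes_per_triple = 120e9 / 1115e6  # roughly 107.6 bytes per triple
```

On these assumptions the literal-heavy data set costs nearly three times as much per triple as the 39-byte figure. The two data sets differ in more than their literals, so this is only suggestive of the point about literal storage.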
On Wed, Sep 1, 2010 at 9:40 AM, Steve Baskauf <steve.baskauf@vanderbilt.edu> wrote:
Pete, Thanks for the response about term resolution. I'm over my head on that topic, so I'll let others respond to that part.
With regards to a vocabulary that uses URIs rather than literals, I'm in favor of that. At one point in a previous discussion, I think it was suggested that separate terms be created for literal and URI versions of terms like dwc:recordedBy. At first I liked that idea, but after thinking about it and playing with it for a while, I think that the suggestion of just applying a label property to the resource identified by the URI is simpler and wouldn't require a proliferation of new terms. For example:
<dcterms:creator>
  <rdf:Description rdf:about="http://biocol.org/urn:lsid:biocol.org:col:15539">
    <rdfs:label>University of Louisiana at Monroe Herbarium</rdfs:label>
  </rdf:Description>
</dcterms:creator>
could be used if both a literal and URI were available and
<dcterms:creator>University of Louisiana at Monroe Herbarium</dcterms:creator>
could be used if a URI were not available. It seems like it should be relatively easy for a linked data client to have contingencies to deal with this. Even with technology that's semantically "dumb" like XSLT, it's pretty easy to code for the two possibilities.
But I suppose it would be good to have some kind of consensus that this is the preferred approach. Otherwise, separate terms might be better. There aren't a whole lot of dwc terms to which this situation would apply.
Steve
participants (3)
- Mark Wilden
- Peter DeVries
- Steve Baskauf