Bioblitz as rdf: does this make sense?
Hi Everyone,
I've posted rdf of the bioblitz data. It's at http://www.cs.umbc.edu/~jsachs/occurrences/TechnoBioblitzOccurrences.rdf .
Individual occurrences can be retrieved via http://www.cs.umbc.edu/~jsachs/occurrences/[occurrence_id], e.g. http://www.cs.umbc.edu/~jsachs/occurrences/tdwg2010bioblitz_1835
Individual identifications can be retrieved via http://www.cs.umbc.edu/~jsachs/identifications/[identification_id], e.g. http://www.cs.umbc.edu/~jsachs/identifications/tdwg2010bioblitz_1835_id_1
The scripts behind this are on the kludgy side, so reports of errors and abnormalities will be warmly welcomed.
Implicit in each of the following notes is the question "Is this a good way to do it?":
1. The data is "normalized" w.r.t. identification. "Normalized" is in quotes because I mean it in the sense that Steve Baskauf was using in his Fall 2010 series of posts. His meaning of the term makes sense to me, but many people (e.g. the OBO folks) take "normalized ontology" to mean "disentangled" (i.e. no multiple inheritance). As an example, here's an occurrence with two crowdsourced determinations: http://www.cs.umbc.edu/~jsachs/occurrences/tdwg2010bioblitz_1644
2. I used sequential integers for observation and identification IDs; in practice, a mechanism needs to be in place to prevent two people from assigning the same id to their respective identifications.
3. My answer to Cam Webb's Question #1 from http://lists.tdwg.org/pipermail/tdwg-content/2010-October/001720.html is "both". In other words, just as "Joel Sachs" is both me and also my name, so http://www.cs.umbc.edu/~jsachs/occurrences/tdwg2010bioblitz_1668 is both an occurrence and an occurrence_id, expressed as:
---
<dwc:Occurrence rdf:about="http://www.cs.umbc.edu/~jsachs/occurrences/tdwg2010bioblitz_1644">
<dwc:occurrenceID>http://www.cs.umbc.edu/~jsachs/occurrences/tdwg2010bioblitz_1644</dwc:occurrenceID>
<blah blah blah/>
</dwc:Occurrence>
---
4. I was surprised to see that the Darwin Core Identification class has no "occurrenceID" or "specimenID" term. How is one supposed to tie an identification to an observation (assuming the identification is not in-lined, of course)? DeVries and Baskauf each mint their own terms for doing this (txn:identificationHasOccurrence, and sernec:basedOnOccurrence, respectively); I used dwc:occurrenceID as if it were a record level term.
5. We had scope for multiple taxonConceptID columns in the Fusion table, and assigned LSIDs where possible. I also mean to work with Pete to assign GUIDs from taxonconcept.org. In addition, I assigned ethan taxon concept IDs, which look like this: http://spire.umbc.edu/ethan/Coffea_arabica
In their argument over opaque vs. transparent taxonConceptIDs, I was sympathetic to both Pete's and Gregor's arguments. Ultimately, if the tooling exists to always display the rdfs:labels every time I'm looking at a list of opaque IDs, then transparent IDs are unnecessary. But, for now, it's really helpful to look at an ID and know what it's referring to.
(For species names not in the spire database, the RDF returned by http://spire.umbc.edu/ethan/$name is simply an rdfs:seeAlso to http://gni.globalnames.org/name_strings?search_term=$name)
6. It was easy to assert membership in RDF classes corresponding to various Cape Cod categories of concern - invasive species, threatened species, indicators, etc. You can see these classes at http://spire.umbc.edu/ontologies/lists (Information on where these lists come from is included as rdfs:comments. I'll add further documentation, e.g. links to EML files.)
Note that "ThingOfConcern" is defined as the superclass of all the other classes in the collection. The idea here is that people can create their own "ThingOfConcern" class, and then query for observations that are of concern to them. You can see sample sparql queries at http://www.csee.umbc.edu/~jsachs/occurrences/queries/sample.txt
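The superclass idea can be sketched in a few lines of Python, with plain dictionaries standing in for an RDF store. This is only an illustration under stated assumptions: apart from "ThingOfConcern", every class name, taxon name, and occurrence ID below is made up, and a real setup would more likely rely on rdfs:subClassOf reasoning in a SPARQL endpoint.

```python
# Hypothetical class hierarchy: child class -> parent class.
# Only "ThingOfConcern" comes from the post; the other names are illustrative.
SUBCLASS_OF = {
    "InvasiveSpecies": "ThingOfConcern",
    "ThreatenedSpecies": "ThingOfConcern",
    "IndicatorSpecies": "ThingOfConcern",
}

# Made-up class memberships for made-up taxa.
TAXON_CLASSES = {
    "Taxon_A": {"InvasiveSpecies"},
    "Taxon_B": set(),
}

# Made-up occurrence -> taxon assignments.
OCCURRENCES = {
    "occ_1": "Taxon_A",
    "occ_2": "Taxon_B",
}


def is_subclass(cls, ancestor):
    """Walk up the hierarchy; True if cls is ancestor or one of its descendants."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False


def occurrences_of_concern(concern_class="ThingOfConcern"):
    """Return sorted IDs of occurrences whose taxon falls under concern_class."""
    return sorted(
        occ_id
        for occ_id, taxon in OCCURRENCES.items()
        if any(is_subclass(c, concern_class) for c in TAXON_CLASSES.get(taxon, ()))
    )


all_concerns = occurrences_of_concern()
threatened_only = occurrences_of_concern("ThreatenedSpecies")
```

Querying against the superclass picks up members of every list at once, while passing a narrower class restricts the result to that one list.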
As an aside, I think we, as a community, should come up with a biodiversity benchmark suite of rdf data and corresponding sparql queries, that can be used to test the suitability and scalability of semantic web knowledge bases. I'll take this up in a future post (unless someone beats me to it).
Comments, questions, and better ideas are welcome.
Thanks - Joel.
Hi Joel,
Cool :-)
I just loaded this into my SPARQL endpoint.
In the named graph urn:org:linkedopenspeciesdata:dataspace:tdwg2010bioblitz
It consists of 19,990 Triples
Here is one of the dwc:taxonConceptID entries.
About: http://spire.umbc.edu/ethan/Ampelopsis_brevipedunculata
http://lsd.taxonconcept.org/describe/?url=http://spire.umbc.edu/ethan/Ampelopsis_brevipedunculata
About: http://www.cs.umbc.edu/~jsachs/occurrences/tdwg2010bioblitz_1627
http://lsd.taxonconcept.org/describe/?url=http://www.cs.umbc.edu/~jsachs/occurrences/tdwg2010bioblitz_1627
This should give you a count of occurrences.
SELECT count(*) WHERE {?s a <http://rs.tdwg.org/dwc/terms/#Occurrence>};
= 1882
SELECT count(*) WHERE {?s a <http://rs.tdwg.org/dwc/terms/#taxonConceptID>};
This should give you a list of occurrences
http://lsd.taxonconcept.org/describe/?url=http://rs.tdwg.org/dwc/terms/%23Oc...
If this did not come through your email system try the bit.ly.
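The describe links in this thread embed a second URI after "?url=", and a "#" in that inner URI has to travel as %23 or it will be read as a fragment of the outer link. A minimal Python sketch of building such a link follows; the helper name describe_url is hypothetical, and the base URL is simply the pattern visible in the links above, not a documented API.

```python
from urllib.parse import quote

# Base of the describe service, copied from the links in this thread.
DESCRIBE_BASE = "http://lsd.taxonconcept.org/describe/?url="


def describe_url(resource_uri):
    """Build a describe link, percent-encoding '#' in the inner URI."""
    # Keep ':', '/' and '~' readable; other reserved characters, including '#', are escaped.
    return DESCRIBE_BASE + quote(resource_uri, safe=":/~")


link = describe_url("http://rs.tdwg.org/dwc/terms/#Occurrence")
```

Mail clients that decode or re-wrap the %23 are the likely reason such links break in transit, which is why a short redirect link is offered as a fallback.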
I tried the following, which should have given me a Google map of all the occurrences, but it did not produce the map.
DESCRIBE ?x WHERE { ?x <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://rs.tdwg.org/dwc/terms/#Occurrence> . }
I looked at the RDF and I think I see the problem.
In the RDF
<geo:latitude>41.53</geo:latitude>
<geo:longitude>-70.67</geo:longitude>
should be
<geo:lat>41.53</geo:lat>
<geo:long>-70.67</geo:long>
See http://www.w3.org/2003/01/geo/
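For anyone with the same property names in their own files, the rename could be done mechanically. The following is a rough Python sketch, assuming the geo prefix is bound to the W3C WGS84 namespace http://www.w3.org/2003/01/geo/wgs84_pos# (the thread does not show the actual namespace declaration); the function name and the sample document are made up.

```python
import xml.etree.ElementTree as ET

# Assumed binding for the "geo" prefix (W3C Basic Geo vocabulary).
GEO_NS = "http://www.w3.org/2003/01/geo/wgs84_pos#"
RENAMES = {"latitude": "lat", "longitude": "long"}


def fix_geo_properties(rdf_xml):
    """Rename geo:latitude / geo:longitude elements to geo:lat / geo:long."""
    ET.register_namespace("geo", GEO_NS)  # keep the "geo" prefix on output
    root = ET.fromstring(rdf_xml)
    for elem in root.iter():
        for old, new in RENAMES.items():
            if elem.tag == f"{{{GEO_NS}}}{old}":
                elem.tag = f"{{{GEO_NS}}}{new}"
    return ET.tostring(root, encoding="unicode")


# Made-up sample record using the coordinates quoted above.
sample = (
    '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" '
    'xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#">'
    '<rdf:Description rdf:about="http://example.org/occ/1">'
    "<geo:latitude>41.53</geo:latitude>"
    "<geo:longitude>-70.67</geo:longitude>"
    "</rdf:Description></rdf:RDF>"
)
fixed = fix_geo_properties(sample)
```

Matching on the full namespace-qualified tag, rather than on the text "geo:latitude", avoids touching elements from other vocabularies that happen to use the same local name.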
I ran the following query to get a list of all the dwc:taxonConceptIDs and have attached them as a .txt file.
select distinct ?o WHERE {?s <http://rs.tdwg.org/dwc/terms/#taxonConceptID> ?o}
Pretty neat :-)
There are some things that I will get back to Joel on.
Here is where you can manually enter a SPARQL query. Click on "Advanced" for the entry window.
http://lsd.taxonconcept.org/isparql/
Respectfully,
- Pete
On Wed, Jan 12, 2011 at 5:55 PM, joel sachs jsachs@csee.umbc.edu wrote:
[...]
tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content
Pete, Thanks - I corrected the geo properties. Joel.
On Wed, 12 Jan 2011, Peter DeVries wrote:
[...]
--
Pete DeVries Department of Entomology University of Wisconsin - Madison 445 Russell Laboratories 1630 Linden Drive Madison, WI 53706 TaxonConcept Knowledge Base http://www.taxonconcept.org/ / GeoSpecies Knowledge Base http://lod.geospecies.org/ About the GeoSpecies Knowledge Base http://about.geospecies.org/
Thanks Joel,
Here is one of the BioBlitz Occurrence Records marked up with Darwin Core
http://lsd.taxonconcept.org/describe/?url=http://www.cs.umbc.edu/~jsachs/occ...
Here is one of the TaxonConcept Records marked up in the txn vocabulary:
http://lsd.taxonconcept.org/describe/?url=http://ocs.taxonconcept.org/ocs/f522444a-2dd9-400e-be59-47213ef38cb9%23Occurrence
http://lsd.taxonconcept.org/about/html/http/ocs.taxonconcept.org/ocs/f522444a-2dd9-400e-be59-47213ef38cb9%01Occurrence
Some people have trouble with how the %23 above is escaped in their email; they might like this bit.ly bundle better: http://bit.ly/hZdUpP
Another issue to discuss is that identifications are made to both a name and a concept, so should these be the same dwc:taxonConcepts?
http://spire.umbc.edu/ethan/Apis_mellifera http://spire.umbc.edu/ethan/Apis_Mellifera ==> http://bit.ly/g1zzJC (Apis mellifera se:z9oqP)
http://spire.umbc.edu/ethan/Ascelpius_syruaca http://spire.umbc.edu/ethan/Asclepias_syriaca ==> HTML page http://lod.taxonconcept.org/ses/tTEIq.html Concept View Bit.ly http://bit.ly/dJHJqj
http://spire.umbc.edu/ethan/Aster_nova-belgii http://spire.umbc.edu/ethan/Aster_nova-gelgii
http://spire.umbc.edu/ethan/Baccharis_halimifolia http://spire.umbc.edu/ethan/Baccharis_halimifolia_L.
http://spire.umbc.edu/ethan/Bartonia_virginia http://spire.umbc.edu/ethan/Bartonia_virginica
http://spire.umbc.edu/ethan/Branta_canadensis http://spire.umbc.edu/ethan/Branta_canadensis_(Linnaeus,_1758)
http://spire.umbc.edu/ethan/Carex_pennsylvanica http://spire.umbc.edu/ethan/Carex_pensylvanica
http://spire.umbc.edu/ethan/Cyperus_esculantus http://spire.umbc.edu/ethan/Cyperus_Esculantus http://spire.umbc.edu/ethan/Cyperus_esculentus http://spire.umbc.edu/ethan/Cyperus_esculentus_L.
http://spire.umbc.edu/ethan/Carpodacus_mexicanus http://spire.umbc.edu/ethan/Carpodacus_mexicanus_(Statius_Muller,_1776)
From the taxon concepts there should be links to the various name strings, vs. modeling the name string as the concept.
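To make the problem concrete, here is a rough Python sketch that groups ethan-style URIs differing only in capitalization or trailing authorship. It assumes the last path segment is a binomial with words joined by underscores, as in the list above; the function names are hypothetical, and this is not how either knowledge base actually reconciles names.

```python
from collections import defaultdict


def name_key(uri):
    """Reduce a URI to a 'Genus epithet' key, ignoring case and authorship."""
    local = uri.rsplit("/", 1)[-1]          # e.g. "Branta_canadensis_(Linnaeus,_1758)"
    binomial = " ".join(local.split("_")[:2])
    return binomial.capitalize()            # "Apis Mellifera" -> "Apis mellifera"


def group_variants(uris):
    """Map each normalized name to the list of URIs that reduce to it."""
    groups = defaultdict(list)
    for uri in uris:
        groups[name_key(uri)].append(uri)
    return dict(groups)


variants = group_variants([
    "http://spire.umbc.edu/ethan/Apis_mellifera",
    "http://spire.umbc.edu/ethan/Apis_Mellifera",
    "http://spire.umbc.edu/ethan/Branta_canadensis",
    "http://spire.umbc.edu/ethan/Branta_canadensis_(Linnaeus,_1758)",
    "http://spire.umbc.edu/ethan/Carex_pennsylvanica",
    "http://spire.umbc.edu/ethan/Carex_pensylvanica",
])
```

Case and authorship variants collapse into one group each, but the two Carex spellings stay separate: misspellings need fuzzy matching (Python's difflib could suggest candidates) plus human review, which is one argument for linking name strings to a concept rather than treating each string as its own concept.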
Also, I think that Darwin Core is good for some things, but maybe not as a semantic web representation.
Respectfully,
- Pete
On Thu, Jan 13, 2011 at 6:53 AM, joel sachs jsachs@csee.umbc.edu wrote:
[...]
I have gotten the fixed records from Joel and added some elements to the RDF.
Also, we need to use URIs for people so you can browse between their records, identifications, etc.
I only did a few of these. I changed all the records from Dima to "Dmitry Mozzherin"; there was at least one misspelling (Dimitry Mozzherin).
<dwc:recordedBy>Dmitry Mozzherin</dwc:recordedBy> *fixed
<txn:hasCollector rdf:resource="http://lod.taxonconcept.org/people/tdwg2010bioblitz#Dmitry_Mozzherin"/> *added
Now you can see all the observations he recorded
http://lsd.taxonconcept.org/describe/?url=http://lod.taxonconcept.org/people...
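A small Python sketch of how recordedBy strings might be turned into person URIs is given below. The URI pattern is inferred from the single example above, and the alias table and function name are hypothetical; it is meant only to show why a canonical-name step is needed before minting the URI.

```python
# Base inferred from the one person URI shown in this message.
PEOPLE_BASE = "http://lod.taxonconcept.org/people/tdwg2010bioblitz#"

# Known misspellings mapped to one canonical spelling (from the message above).
ALIASES = {
    "Dimitry Mozzherin": "Dmitry Mozzherin",
}


def person_uri(recorded_by):
    """Normalize whitespace and known misspellings, then build the person URI."""
    name = " ".join(recorded_by.split())     # collapse stray spaces
    name = ALIASES.get(name, name)           # apply canonical spelling if known
    return PEOPLE_BASE + name.replace(" ", "_")


uri = person_uri("Dimitry Mozzherin")
```

Without the alias step, each spelling would get its own URI and the collector's records would be split across two resources.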
The newer modified versions of the RDF are here: (These are still a work in progress)
BioBlitz Occurrences http://lod.taxonconcept.org/tdwg2010bioblitz/TechnoBioblitzOccurrences.rdf
TDWG People http://lod.taxonconcept.org/people/tdwg2010bioblitz_people.rdf
I added a TaxonConcept species concept to only one set of records - those for the Honey Bee Apis mellifera.
It shows up in this occurrence record
http://lsd.taxonconcept.org/describe/?url=http://www.cs.umbc.edu/~jsachs/occ...
Also, the geoquery works, but it seems to return some strange locations.
Respectfully,
- Pete
On Thu, Jan 13, 2011 at 6:53 AM, joel sachs jsachs@csee.umbc.edu wrote:
Pete, Thanks - I corrected the geo properties. Joel.
On Wed, 12 Jan 2011, Peter DeVries wrote:
Hi Joel,
Cool :-)
I just loaded this into my SPARQL endpoint.
In the named graph urn:org:linkedopenspeciesdata:dataspace:tdwg2010bioblitz
It consists of 19,990 Triples
Here is one of the dwc:taxonConceptID entries.
*About: http://spire.umbc.edu/ethan/Ampelopsis_brevipedunculata*
http://lsd.taxonconcept.org/describe/?url=http://spire.umbc.edu/ethan/Ampelo...
*About: http://www.cs.umbc.edu/~jsachs/occurrences/tdwg2010bioblitz_1627* < http://lsd.taxonconcept.org/describe/?url=http://spire.umbc.edu/ethan/Ampelo...
http://lsd.taxonconcept.org/describe/?url=http://www.cs.umbc.edu/~jsachs/occ...
< http://lsd.taxonconcept.org/describe/?url=http://www.cs.umbc.edu/~jsachs/occ...
This
should give you an a count of occurrences.
SELECT count(*) WHERE {?s a http://rs.tdwg.org/dwc/terms/#Occurrence};
= 1882
SELECT count(*) WHERE {?s a <http://rs.tdwg.org/dwc/terms/#taxonConceptID
};
This should give you a list of occurrences
http://lsd.taxonconcept.org/describe/?url=http://rs.tdwg.org/dwc/terms/%23Oc...
If this did not come through your email system try the bit.ly.
I tried the following that should have given me a google map of all the occurrences but it did not result in the map.
DESCRIBE ?x WHERE { ?x http://www.w3.org/1999/02/22-rdf-syntax-ns#type < http://rs.tdwg.org/dwc/terms/#Occurrence%3E. }
I looked that the RDF and I think I see the problem.
In the RDF
geo:latitude 41.53 </geo:latitude>
geo:longitude -70.67 </geo:longitude>
Should be
geo:lat 41.53 </geo:lat>
geo:long -70.67 </geo:long>
See http://www.w3.org/2003/01/geo/
I did the following query to get a list of all the dwc:taxonConceptID's and have attached them as a .txt file.
select distinct ?o WHERE {?s < http://rs.tdwg.org/dwc/terms/#taxonConceptID%3E ?o}
Pretty neat :-)
There are some things that I will get back to Joel on.
Here is where you can manually enter a SPARQL query. Click on "Advanced" for the entry window.
http://lsd.taxonconcept.org/isparql/
Respectfully,
- Pete
On Wed, Jan 12, 2011 at 5:55 PM, joel sachs jsachs@csee.umbc.edu wrote:
Hi Everyone,
I've posted rdf of the bioblitz data. It's at http://www.cs.umbc.edu/~jsachs/occurrences/TechnoBioblitzOccurrences.rdf.
Individual occurrences can be retrieved via http://www.cs.umbc.edu/~jsachs/occurrences/%5Boccurrence_id]http://www.cs.umbc.edu/~jsachs/occurrences/%5Boccurrence_id%5D e.g. http://www.cs.umbc.edu/~jsachs/occurrences/tdwg2010bioblitz_1835
Individual identifications can be retrieved via http://www.cs.umbc.edu/~jsachs/identifications/%5Bidentification_id]http://www.cs.umbc.edu/~jsachs/identifications/%5Bidentification_id%5D e.g. http://www.cs.umbc.edu/~jsachs/identifications/tdwg2010bioblitz_1835_id_1
The scripts behind this are on the kludgy side, so reports of errors and abnormalities will be warmly welcomed.
Implicit in each of the following notes is the question "Is this a good way to do it?":
- The data is "normalized" w.r.t. identification. "Normalized" is in quotes because I mean it in the sense that Steve Baskauf was using in his Fall 2010 series of posts. His meaning of the term makes sense to me, but many people (e.g. the OBO folks) take "normalized ontology" to mean "disentangled" (i.e. no multiple inheritance). As an example, here's an occurrence with two crowdsourced determinations: http://www.cs.umbc.edu/~jsachs/occurrences/tdwg2010bioblitz_1644
- I used sequential integers for observation and identification IDs; in practice, a mechanism needs to be in place to prevent two people from assigning the same id to their respective identifications.
- My answer to Cam Webb's Question #1 from http://lists.tdwg.org/pipermail/tdwg-content/2010-October/001720.html is "both". In other words, just as "Joel Sachs" is both me and also my name, so http://www.cs.umbc.edu/~jsachs/occurrences/tdwg2010bioblitz_1668 is both an occurrence and an occurrence_id, expressed as:
---
<dwc:Occurrence rdf:about="http://www.cs.umbc.edu/~jsachs/occurrences/tdwg2010bioblitz_1644">
<dwc:occurrenceID>http://www.cs.umbc.edu/~jsachs/occurrences/tdwg2010bioblitz_1644</dwc:occurrenceID>
<blah blah blah/>
</dwc:Occurrence>
---
- I was surprised to see that the Darwin Core Identification class has no "occurrenceID" or "specimenID" term. How is one supposed to tie an identification to an observation (assuming the identification is not in-lined, of course)? DeVries and Baskauf each mint their own terms for doing this (txn:identificationHasOccurrence, and sernec:basedOnOccurrence, respectively); I used dwc:occurrenceID as if it were a record level term.
- We had scope for multiple taxonConceptID columns in the Fusion table, and assigned lsids where possible. I also mean to work with Pete to assign GUIDs from taxonconcept.org. In addition, I assigned ethan taxon concept ids, which look like this: http://spire.umbc.edu/ethan/Coffea_arabica
In their argument over opaque vs. transparent taxonConceptIDs, I was sympathetic to both Pete's and Gregor's arguments. Ultimately, if the tooling exists to always display the rdfs:labels every time I'm looking at a list of opaque IDs, then transparent IDs are unnecessary. But, for now, it's really helpful to look at an ID and know what it's referring to.
(For species names not in the spire database, the rdf returned by http://spire.umbc.edu/ethan/$name is simply an rdfs:seeAlso to http://gni.globalnames.org/name_strings?search_term=$name)
- It was easy to assert membership in RDF classes corresponding to various Cape Cod categories of concern: invasive species, threatened species, indicators, etc. You can see these classes at http://spire.umbc.edu/ontologies/lists (Information on where these lists come from is included as rdfs:comments. I'll add further documentation, e.g. links to eml files.)
Note that "ThingOfConcern" is defined as the superclass of all the other classes in the collection. The idea here is that people can create their own "ThingOfConcern" class, and then query for observations that are of concern to them. You can see sample sparql queries at http://www.csee.umbc.edu/~jsachs/occurrences/queries/sample.txt
As an aside, I think we, as a community, should come up with a biodiversity benchmark suite of rdf data and corresponding sparql queries, that can be used to test the suitability and scalability of semantic web knowledge bases. I'll take this up in a future post (unless someone beats me to it).
Comments, questions, and better ideas are welcome.
Thanks - Joel.
tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content
--
Pete DeVries Department of Entomology University of Wisconsin - Madison 445 Russell Laboratories 1630 Linden Drive Madison, WI 53706 TaxonConcept Knowledge Base http://www.taxonconcept.org/ / GeoSpecies Knowledge Base http://lod.geospecies.org/ About the GeoSpecies Knowledge Base http://about.geospecies.org/
The geolocations in the *stans apparently have a systematic sign error in the longitude. The one on the Turkish coast is more mysterious. If the actual observer and observation time were available, the strategy for quality control on geolocation outliers would start with correlating the locations of all the observations made that day by the observer whose data assert <geo:lat 38.47 geo:long 27.09>. For example, it's fairly unlikely that the observer could get from Massachusetts to Turkey in a few hours.... OTOH, if it's a solitary observation from that observer, it might really be a Turkish observation.
Bob
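The quality-control strategy Bob describes can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Observation` record, the `implausible_pairs` helper, and the 900 km/h speed ceiling are all hypothetical and not part of the bioblitz data or scripts, and the observer and timestamp fields are exactly the information Bob notes may not be available.

```python
from dataclasses import dataclass
from datetime import datetime
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0
MAX_SPEED_KMH = 900.0  # assumed ceiling, roughly airliner cruising speed


@dataclass
class Observation:
    """Hypothetical record; field names are illustrative only."""
    observer: str
    when: datetime
    lat: float
    lon: float


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/long points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def implausible_pairs(observations, max_speed_kmh=MAX_SPEED_KMH):
    """Return consecutive same-observer pairs that imply impossible travel."""
    by_observer = {}
    for obs in observations:
        by_observer.setdefault(obs.observer, []).append(obs)

    flagged = []
    for obs_list in by_observer.values():
        obs_list.sort(key=lambda o: o.when)
        for earlier, later in zip(obs_list, obs_list[1:]):
            hours = (later.when - earlier.when).total_seconds() / 3600
            km = haversine_km(earlier.lat, earlier.lon, later.lat, later.lon)
            # Flag the pair if the distance exceeds what could be covered.
            if km > max_speed_kmh * hours:
                flagged.append((earlier, later))
    return flagged
```

A solitary observation from an observer is never flagged by this check, which matches Bob's caveat that such a record might really be a Turkish observation.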
On Thu, Jan 13, 2011 at 8:55 PM, Peter DeVries pete.devries@gmail.com wrote:
... Also the geoquery works but it seems to return some strange locations? http://bit.ly/dYQXUp
Respectfully,
- Pete
On Thu, Jan 13, 2011 at 6:53 AM, joel sachs jsachs@csee.umbc.edu wrote:
Pete, Thanks - I corrected the geo properties. Joel.
On Wed, 12 Jan 2011, Peter DeVries wrote:
Hi Joel,
Cool :-)
I just loaded this into my SPARQL endpoint.
In the named graph urn:org:linkedopenspeciesdata:dataspace:tdwg2010bioblitz
It consists of 19,990 Triples
Here is one of the dwc:taxonConceptID entries.
*About: http://spire.umbc.edu/ethan/Ampelopsis_brevipedunculata*
http://lsd.taxonconcept.org/describe/?url=http://spire.umbc.edu/ethan/Ampelopsis_brevipedunculata
*About: http://www.cs.umbc.edu/~jsachs/occurrences/tdwg2010bioblitz_1627*
http://lsd.taxonconcept.org/describe/?url=http://www.cs.umbc.edu/~jsachs/occurrences/tdwg2010bioblitz_1627
This should give you a count of occurrences.
SELECT count(*) WHERE {?s a <http://rs.tdwg.org/dwc/terms/#Occurrence>};
= 1882
SELECT count(*) WHERE {?s a <http://rs.tdwg.org/dwc/terms/#taxonConceptID>};
This should give you a list of occurrences
http://lsd.taxonconcept.org/describe/?url=http://rs.tdwg.org/dwc/terms/%23Oc...
If this did not come through your email system try the bit.ly.
I tried the following, which should have given me a Google map of all the occurrences, but it did not result in the map.
DESCRIBE ?x WHERE { ?x <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://rs.tdwg.org/dwc/terms/#Occurrence> . }
I looked at the RDF and I think I see the problem.
In the RDF
<geo:latitude>41.53</geo:latitude>
<geo:longitude>-70.67</geo:longitude>
Should be
<geo:lat>41.53</geo:lat>
<geo:long>-70.67</geo:long>
See http://www.w3.org/2003/01/geo/
I did the following query to get a list of all the dwc:taxonConceptIDs and have attached them as a .txt file.
select distinct ?o WHERE {?s <http://rs.tdwg.org/dwc/terms/#taxonConceptID> ?o}
Pretty neat :-)
There are some things that I will get back to Joel on.
Here is where you can manually enter a SPARQL query. Click on "Advanced" for the entry window.
http://lsd.taxonconcept.org/isparql/
Respectfully,
- Pete
Most of those were from a longitude that was 70 rather than -70.
I fixed that in the RDF and reloaded the data set.
The SPARQL Query is still via http://bit.ly/dYQXUp
If you downloaded the RDF recently from my links above you might want to do so again, since they just changed.
- Pete
On Thu, Jan 13, 2011 at 10:33 PM, Bob Morris morris.bob@gmail.com wrote:
The geolocations in the *stans apparently have a systematic sign error in the longitude. The one on the Turkish coast is more mysterious. If the actual observer and observation time were available, the strategy for quality control on geolocation outliers would start with correlating the locations of all the observations made that day by the observer whose data assert <geo:lat 38.47 geo:long 27.09>. For example, it's fairly unlikely that the observer could get from Massachusetts to Turkey in a few hours.... OTOH, if it's a solitary observation from that observer, it might really be a Turkish observation.
Bob
-- Robert A. Morris Emeritus Professor of Computer Science UMASS-Boston 100 Morrissey Blvd Boston, MA 02125-3390 Associate, Harvard University Herbaria email: morris.bob@gmail.com web: http://bdei.cs.umb.edu/ web: http://etaxonomy.org/mw/FilteredPush http://www.cs.umb.edu/~ram phone (+1) 857 222 7992 (mobile)
Pete -
I'm glad you're doing cool things with the data. A few comments/answers/questions ...
i. The quests for consistently used canonical URIs, and consistently used canonical names are both, for now, somewhat quixotic, and I'm not sure one is more so than the other. GNA/GNI seems to be pursuing both. Is that a fair assessment?
ii. I used opaque taxon concept URIs where they were readily available. Laziness motivated me to also use transparent identifiers for a. making sure each taxonConcept had an rdf:resource as an identifier, and b. asserting membership in the various rdfs:classes that I defined. But laziness has its drawbacks, and you point out a few of them. I think best practice when using either names or transparent URIs as identifiers would be to first run them through a spell checker and normalizer that would produce consistent usage of upper and lower case characters, and probably drop authorship. Of course, this would produce some false positives when doing any sort of querying. But using UUIDs produces false negatives, since they're not in common use. If we (as a community) follow through on the enthusiasm for creating competency cases that we showed in the Fall, it would be interesting to consider each case from the perspective of which is worse: false positives or false negatives. In any event, false positives can be weeded out by using the full scientific names and contextual information in the record, whereas false negatives may never be discovered.
BTW, does anyone know a good spell checker for scientific names?
iii. Mapping names to opaque URIs doesn't resolve the problems you raise below. Will the identifier for Carpodacus mexicanus be the same as the identifier for Carpodacus mexicanus (Statius Muller, 1776)?
How about Arabis laevigata (Muhl. ex Willd.) Poir. vs. Arabis laevigata?
If yes, then why not just use normalized names as identifiers? If no, then we'll get false negatives in the sorts of SPARQL queries that I gave at http://www.csee.umbc.edu/~jsachs/occurrences/queries/
BTW, is there a lookup service for taxonconcept.org identifiers (i.e. give a list of names, get a list of identifiers)?
iv. In what sense do you see Darwin Core as being deficient as a semantic web representation?
Regards - Joel.
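The normalizer mentioned in point ii (consistent case, authorship dropped) might look something like the sketch below. It is a deliberately naive illustration, not the script used to build the bioblitz RDF: `normalize_name` is a hypothetical helper that assumes the first two tokens are genus and specific epithet, so it would also collapse subspecies, varieties, and hybrid formulas, which is one source of the false positives discussed above.

```python
import re


def normalize_name(raw: str) -> str:
    """Reduce a name string (or underscore-separated URI tail) to 'Genus epithet'.

    Naive assumption: the first two tokens are genus and specific epithet,
    and everything after them (authorship, year) can be dropped.
    """
    text = raw.replace("_", " ").strip()
    # Drop parenthesised authorship such as "(Statius Muller, 1776)".
    text = re.sub(r"\([^)]*\)", " ", text)
    tokens = text.split()
    if not tokens:
        return ""
    genus = tokens[0].capitalize()
    if len(tokens) == 1:
        return genus
    epithet = tokens[1].lower()
    return f"{genus} {epithet}"
```

Under these assumptions, "Carpodacus mexicanus (Statius Muller, 1776)" and "Carpodacus_mexicanus" reduce to the same string, as do "Arabis laevigata (Muhl. ex Willd.) Poir." and "Arabis laevigata". Misspellings are not touched; that would need a separate step.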
On Thu, 13 Jan 2011, Peter DeVries wrote:
Thanks Joel,
Here is one of the BioBlitz Occurrence Records marked up with Darwin Core
http://lsd.taxonconcept.org/describe/?url=http://www.cs.umbc.edu/~jsachs/occ...
Here is one of the TaxonConcept Records marked up in the txn vocabulary http://lsd.taxonconcept.org/describe/?url=http://ocs.taxonconcept.org/ocs/f5...
Some people have trouble with how the %23 above is escaped in their email; they might like this bit.ly bundle better.
http://lsd.taxonconcept.org/describe/?url=http://ocs.taxonconcept.org/ocs/f522444a-2dd9-400e-be59-47213ef38cb9%23Occurrence
http://lsd.taxonconcept.org/about/html/http/ocs.taxonconcept.org/ocs/f522444a-2dd9-400e-be59-47213ef38cb9%01Occurrence
http://bit.ly/hZdUpP
Also, an issue to discuss is that identifications are to both a name and a concept, so:
Should these be the same dwc:taxonConcepts?
http://spire.umbc.edu/ethan/Apis_mellifera http://spire.umbc.edu/ethan/Apis_Mellifera ==> http://bit.ly/g1zzJC (Apis mellifera se:z9oqP)
http://spire.umbc.edu/ethan/Ascelpius_syruaca http://spire.umbc.edu/ethan/Asclepias_syriaca ==> HTML page http://lod.taxonconcept.org/ses/tTEIq.html Concept View Bit.ly http://bit.ly/dJHJqj
http://spire.umbc.edu/ethan/Aster_nova-belgii http://spire.umbc.edu/ethan/Aster_nova-gelgii
http://spire.umbc.edu/ethan/Baccharis_halimifolia http://spire.umbc.edu/ethan/Baccharis_halimifolia_L.
http://spire.umbc.edu/ethan/Bartonia_virginia http://spire.umbc.edu/ethan/Bartonia_virginica
http://spire.umbc.edu/ethan/Branta_canadensis http://spire.umbc.edu/ethan/Branta_canadensis_(Linnaeus,_1758)
http://spire.umbc.edu/ethan/Carex_pennsylvanica http://spire.umbc.edu/ethan/Carex_pensylvanica
http://spire.umbc.edu/ethan/Cyperus_esculantus http://spire.umbc.edu/ethan/Cyperus_Esculantus http://spire.umbc.edu/ethan/Cyperus_esculentus http://spire.umbc.edu/ethan/Cyperus_esculentus_L.
http://spire.umbc.edu/ethan/Carpodacus_mexicanus http://spire.umbc.edu/ethan/Carpodacus_mexicanus_(Statius_Muller,_1776)
From the taxon concepts there should be a link to the various name strings, vs. modeling the name string as the concept.
Also, I think that Darwin Core is good for some things, but maybe not as a semantic web representation.
Respectfully,
- Pete
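Several of the URI pairs listed above differ only by a letter or two (for example Carex_pennsylvanica vs. Carex_pensylvanica). As a rough illustration of how such candidates could be surfaced for human review, here is a sketch using Python's general-purpose difflib. It is not a spell checker for scientific names; the `likely_variants` helper and the 0.9 similarity threshold are assumptions made for this example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def likely_variants(names, threshold=0.9):
    """Return pairs of distinct names whose similarity ratio meets the threshold.

    The threshold is an assumed value; real data would need tuning, and
    genuinely different species with similar names would be false positives.
    """
    pairs = []
    for a, b in combinations(sorted(set(names)), 2):
        ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if ratio >= threshold:
            pairs.append((a, b))
    return pairs
```

The pairwise comparison is quadratic in the number of names, so this would only suit a list the size of the bioblitz, not the roughly 20 million names in GNI.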
Hi Joel,
In the next day or so I will upload my TDWG talk to SlideShare, it explains some of this.
*But remember this was just an experiment to see if it would work.*
The main advantage of creating URI's for commonly used things is that then you can browse the connections between them and have them be the subjects of triples.
It also allows you to connect additional information to a person or thing that has a URI.
See attached png
In an ideal world, I might have done something like this.
http://notarealsite.org/genus_epithet
So a semantic web service would look up http://notarealsite.org/Puma_concolor
Retrieve http://notarealsite.org/Puma_concolor.rdf
Which would give you:
1) list of what databases have that name
2) and what various kingdoms or Name Rules the different names fall under (Plant, Animal etc)
For example: "This is both a plant and an animal name"
3) What are the current valid names for each of the species listed. => The list of related names you listed in your email.
=> *Puma concolor* (Linnaeus 1771)
I think this would be very useful since a number of data sources only list the genus and specific epithet.
* For instance the index that is at the back of the Entomology Society of America Annual Meeting Booklet
There are a couple of hurdles however.
1) What is the valid form of the name? Is there a machine that knows this or will this need to be partially curated by humans? In some cases the "valid" name is not clear.
2) Even though there are ~20 million names, there are probably some valid names that are not in the GNI. This is why I would encourage everyone to upload their names.
If these exist as URI's you get the following abilities
<gni_namestring:001> isSynonymOf <gni_namestring:002> etc.
Note that literals cannot be "subjects" of the subject-predicate-object triple.
Also, as currently configured, names like *Callithrix melanura* (É. Geoffroy 1812) and *Callithrix melanura* (E. Geoffroy 1812) are seen as different.
* Also the É. will have to be properly escaped to be part of a valid URL.
So a simple text search might not find them.
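The two problems just described (accented and unaccented forms comparing as different, and the need to escape É in a URL) can be illustrated with a minimal Python sketch. The helper names `fold_diacritics` and `to_uri_segment` are hypothetical and are not part of GNI or TaxonConcept; the sketch only shows one possible way to compare and escape name strings.

```python
import unicodedata
from urllib.parse import quote


def fold_diacritics(name: str) -> str:
    """Return the name with combining accents removed (e.g. É becomes E)."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def to_uri_segment(name: str) -> str:
    """Percent-encode a name string (as UTF-8) for use as a URI path segment."""
    return quote(name.replace(" ", "_"), safe="")


accented = "Callithrix melanura (É. Geoffroy 1812)"
plain = "Callithrix melanura (E. Geoffroy 1812)"

# A plain string comparison treats the two forms as different names.
print(accented == plain)
# After folding diacritics the two strings compare equal.
print(fold_diacritics(accented) == plain)
# The escaped form that could appear in a URL.
print(to_uri_segment(accented))
```

Folding accents is lossy, so it is probably better suited to building a search key than to replacing the stored name string.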
So in summary, if we are going to get past everyone marking up their own name list in Excel and then trying to merge them, we will need some architecture that can scale to millions of names and allow different groups to make assertions about the relationships between those names.
Searching TaxonConcept.org:
For now you have two options for searching for names. (I am mainly concentrating on the RDF representation right now)
You can use Google: "Puma concolor site:lod.taxonconcept.org" (Google seems to prefer the .rdf version even though my sitemap specifies the .html.)
Or use the Knowledge Base http://lsd.taxonconcept.org/fct/
I think I sent out a list of all the species collected at the BioBlitz, with their matching species concept ID.
Did you not get this? I think it is on the Google site.
*> iv. In what sense do you see Darwin Core as being deficient as a semantic web representation?*
You agreed previously that some of the fields that currently use Literals would be more efficiently and unambiguously represented as URI's.
I would characterize DarwinCore as "not optimal" rather than deficient.
There is no reason we can't still use the DarwinCore for what it is good for, but it does not work very well for SPARQL queries and the LOD.
For instance, a search for occurrences of *Callithrix melanura* (É. Geoffroy 1812) will not return occurrences with *Callithrix melanura* (E. Geoffroy 1812).
Respectfully,
- Pete
On Tue, Jan 18, 2011 at 1:42 PM, joel sachs jsachs@csee.umbc.edu wrote:
Pete -
I'm glad you're doing cool things with the data. A few comments/answers/questions ...
i. The quests for consistently used canonical URIs, and consistently used canonical names are both, for now, somewhat quixotic, and I'm not sure one is more so than the other. GNA/GNI seems to be pursuing both. Is that a fair assessment?
ii. I used opaque taxon concept URIs where they were readily available. Laziness motivated me to also use transparent identifiers for a. making sure each taxonConcept had an rdf:resource as an identifier, and b. asserting membership in the various rdfs:classes that I defined. But laziness has its drawbacks, and you point out a few of them. I think best practice when using either names or transparent URIs as identifiers would be to first run them through a spell checker and normalizer that would produce consistent usage of upper and lower case characters, and probably drop authorship. Of course, this would produce some false positives when doing any sort of querying. But using UUIDs produces false negatives since they're not in common use. If we (as a community) follow through on the enthusiasm for creating competency cases that we showed in the Fall, it would be interesting to consider each case from the question of what's worse - false positives or false negatives. In any event, false positives can be weeded out by using the full scientific names and contextual information in the record, whereas false negatives may never be discovered.
BTW, does anyone know a good spell checker for scientific names?
iii. Mapping names to opaque URIs doesn't resolve the problems you raise below. Will the identifier for Carpodacus mexicanus be the same as the identifier for Carpodacus mexicanus (Statius Muller, 1776)?
How about Arabis laevigata (Muhl. ex Willd.) Poir. vs. Arabis laevigata ?
If yes, then why not just use normalized names as identifiers? If no, then we'll get false negatives in the sorts of SPARQL queries that I gave at
http://www.csee.umbc.edu/~jsachs/occurrences/queries/
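As a rough illustration of what "normalized names as identifiers" could mean, here is a minimal Python sketch. The function `normalize_name` is hypothetical and deliberately naive: it assumes the first two tokens are genus and specific epithet, so it would mishandle hybrids, infraspecific names, and anything else that is not a plain binomial, and it does no spell checking.

```python
def normalize_name(raw: str) -> str:
    """Reduce a name string or transparent-URI tail to 'Genus epithet'.

    Assumption: the first two whitespace- or underscore-separated tokens
    are the genus and the specific epithet; everything after them is
    treated as authorship and dropped.
    """
    tokens = raw.replace("_", " ").split()
    if len(tokens) < 2:
        raise ValueError(f"expected at least genus and epithet: {raw!r}")
    genus, epithet = tokens[0], tokens[1]
    return f"{genus.capitalize()} {epithet.lower()}"


examples = [
    "Carpodacus mexicanus",
    "Carpodacus mexicanus (Statius Muller, 1776)",
    "Arabis laevigata (Muhl. ex Willd.) Poir.",
    "Apis_Mellifera",
]
for example in examples:
    print(example, "->", normalize_name(example))
```

With this key the two Carpodacus mexicanus forms collapse to one identifier, which is exactly the trade-off discussed above: fewer false negatives at the cost of possible false positives when the same binomial is used in different senses.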
BTW, is there a lookup service for taxonconcept.org identifiers (i.e. give a list of names, get a list of identifiers)?
iv. In what sense do you see Darwin Core as being deficient as a semantic web representation?
Regards - Joel.
On Thu, 13 Jan 2011, Peter DeVries wrote:
Thanks Joel,
Here is one of the BioBlitz Occurrence Records marked up with Darwin Core
http://lsd.taxonconcept.org/describe/?url=http://www.cs.umbc.edu/~jsachs/occ...
Here is one of the TaxonConcept Records marked up in the txn vocabulary
http://lsd.taxonconcept.org/describe/?url=http://ocs.taxonconcept.org/ocs/f5...
Some people have trouble with how the %23 above is escaped in their email; they might like this bit.ly bundle better:
http://lsd.taxonconcept.org/about/html/http/ocs.taxonconcept.org/ocs/f522444...
http://bit.ly/hZdUpP
Also an issue to discuss is that identifications are to both a name and a concept, so should these be the same dwc:taxonConcepts?
http://spire.umbc.edu/ethan/Apis_mellifera http://spire.umbc.edu/ethan/Apis_Mellifera ==> http://bit.ly/g1zzJC (Apis mellifera se:z9oqP)
http://spire.umbc.edu/ethan/Ascelpius_syruaca http://spire.umbc.edu/ethan/Asclepias_syriaca ==> HTML page http://lod.taxonconcept.org/ses/tTEIq.html Concept View Bit.ly http://bit.ly/dJHJqj
http://spire.umbc.edu/ethan/Aster_nova-belgii http://spire.umbc.edu/ethan/Aster_nova-gelgii
http://spire.umbc.edu/ethan/Baccharis_halimifolia http://spire.umbc.edu/ethan/Baccharis_halimifolia_L.
http://spire.umbc.edu/ethan/Bartonia_virginia http://spire.umbc.edu/ethan/Bartonia_virginica
http://spire.umbc.edu/ethan/Branta_canadensis http://spire.umbc.edu/ethan/Branta_canadensis_(Linnaeus,_1758)
http://spire.umbc.edu/ethan/Carex_pennsylvanica http://spire.umbc.edu/ethan/Carex_pensylvanica
http://spire.umbc.edu/ethan/Cyperus_esculantus http://spire.umbc.edu/ethan/Cyperus_Esculantus http://spire.umbc.edu/ethan/Cyperus_esculentus http://spire.umbc.edu/ethan/Cyperus_esculentus_L.
http://spire.umbc.edu/ethan/Carpodacus_mexicanus http://spire.umbc.edu/ethan/Carpodacus_mexicanus_(Statius_Muller,_1776)
From the taxon concepts there should be a link to the various name strings, vs. modeling the name string as the concept.
Also I think that DarwinCore is good for some things but maybe not as a semantic web representation.
Respectfully,
- Pete
On Thu, Jan 13, 2011 at 6:53 AM, joel sachs jsachs@csee.umbc.edu wrote:
Pete,
Thanks - I corrected the geo properties. Joel.
On Wed, 12 Jan 2011, Peter DeVries wrote:
Hi Joel,
Cool :-)
I just loaded this into my SPARQL endpoint.
In the named graph urn:org:linkedopenspeciesdata:dataspace:tdwg2010bioblitz
It consists of 19,990 Triples
Here is one of the dwc:taxonConceptID entries.
*About: http://spire.umbc.edu/ethan/Ampelopsis_brevipedunculata*
http://lsd.taxonconcept.org/describe/?url=http://spire.umbc.edu/ethan/Ampelo...
*About: http://www.cs.umbc.edu/~jsachs/occurrences/tdwg2010bioblitz_1627*
http://lsd.taxonconcept.org/describe/?url=http://www.cs.umbc.edu/~jsachs/occ...
This should give you a count of occurrences.
SELECT count(*) WHERE {?s a <http://rs.tdwg.org/dwc/terms/#Occurrence> };
= 1882
SELECT count(*) WHERE {?s a <http://rs.tdwg.org/dwc/terms/#taxonConceptID> };
This should give you a list of occurrences
http://lsd.taxonconcept.org/describe/?url=http://rs.tdwg.org/dwc/terms/%23Oc...
If this did not come through your email system try the bit.ly.
I tried the following that should have given me a google map of all the occurrences but it did not result in the map.
DESCRIBE ?x WHERE { ?x <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://rs.tdwg.org/dwc/terms/#Occurrence> . }
I looked at the RDF and I think I see the problem.
In the RDF
<geo:latitude>41.53</geo:latitude>
<geo:longitude>-70.67</geo:longitude>
Should be
<geo:lat>41.53</geo:lat>
<geo:long>-70.67</geo:long>
See http://www.w3.org/2003/01/geo/
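The property rename suggested above could be applied to an RDF/XML file with a short script. This is only a sketch using Python's standard library: it assumes the `geo` prefix is bound to the W3C WGS84 namespace `http://www.w3.org/2003/01/geo/wgs84_pos#`, and the sample document and `example.org` URI are made up for illustration rather than taken from the bioblitz file.

```python
import xml.etree.ElementTree as ET

# Assumption: geo: is bound to the W3C Basic Geo (WGS84) namespace.
GEO = "http://www.w3.org/2003/01/geo/wgs84_pos#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Map the non-existent property names to the ones the vocabulary defines.
RENAMES = {
    f"{{{GEO}}}latitude": f"{{{GEO}}}lat",
    f"{{{GEO}}}longitude": f"{{{GEO}}}long",
}


def fix_geo_properties(rdf_xml: str) -> str:
    """Rename geo:latitude/geo:longitude elements to geo:lat/geo:long."""
    ET.register_namespace("geo", GEO)
    ET.register_namespace("rdf", RDF)
    root = ET.fromstring(rdf_xml)
    for element in root.iter():
        if element.tag in RENAMES:
            element.tag = RENAMES[element.tag]
    return ET.tostring(root, encoding="unicode")


# Made-up sample record, not the actual bioblitz RDF.
sample = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#">
  <rdf:Description rdf:about="http://example.org/occurrence/1">
    <geo:latitude>41.53</geo:latitude>
    <geo:longitude>-70.67</geo:longitude>
  </rdf:Description>
</rdf:RDF>"""

fixed = fix_geo_properties(sample)
print(fixed)
```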
I did the following query to get a list of all the dwc:taxonConceptID's and have attached them as a .txt file.
select distinct ?o WHERE {?s <http://rs.tdwg.org/dwc/terms/#taxonConceptID> ?o}
Pretty neat :-)
There are some things that I will get back to Joel on.
Here is where you can manually enter a SPARQL query. Click on "Advanced" for the entry window.
http://lsd.taxonconcept.org/isparql/
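Besides entering queries by hand, a count query like the ones above can be sent to an endpoint as an HTTP GET with a `query` parameter, as the SPARQL protocol describes. The Python sketch below only builds the request URL and sends nothing. The endpoint address `http://example.org/sparql` is a placeholder, not the address of the TaxonConcept endpoint, and the query uses the SPARQL 1.1 aggregate form rather than the `count(*)` shorthand used in this thread.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Hypothetical endpoint address, used only for illustration.
ENDPOINT = "http://example.org/sparql"


def count_instances_query(class_uri: str) -> str:
    """Build a SPARQL 1.1 query counting the subjects typed as class_uri."""
    return f"SELECT (COUNT(*) AS ?n) WHERE {{ ?s a <{class_uri}> }}"


def sparql_get_url(endpoint: str, query: str) -> str:
    """Encode the query as the 'query' parameter of a GET request URL."""
    return f"{endpoint}?{urlencode({'query': query})}"


query = count_instances_query("http://rs.tdwg.org/dwc/terms/#Occurrence")
url = sparql_get_url(ENDPOINT, query)
print(query)
print(url)

# Decoding the URL recovers the original query text unchanged.
decoded = parse_qs(urlparse(url).query)["query"][0]
print(decoded == query)
```

Letting `urlencode` do the escaping avoids the hand-escaped `%23` and `%3E` problems that show up when such URLs are pasted into email.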
Respectfully,
- Pete
On Wed, Jan 12, 2011 at 5:55 PM, joel sachs jsachs@csee.umbc.edu wrote:
Hi Everyone,
I've posted rdf of the bioblitz data. It's at
http://www.cs.umbc.edu/~jsachs/occurrences/TechnoBioblitzOccurrences.rdf .
Individual occurrences can be retrieved via http://www.cs.umbc.edu/~jsachs/occurrences/[occurrence_id]
e.g. http://www.cs.umbc.edu/~jsachs/occurrences/tdwg2010bioblitz_1835
Individual identifications can be retrieved via http://www.cs.umbc.edu/~jsachs/identifications/[identification_id]
e.g. http://www.cs.umbc.edu/~jsachs/identifications/tdwg2010bioblitz_1835_id_1
The scripts behind this are on the kludgy side, so reports of errors and abnormalities will be warmly welcomed.
Implicit in each of the following notes is the question "Is this a good way to do it?":
- The data is "normalized" w.r.t. identification. "Normalized" is in
quotes because I mean it in the sense that Steve Baskauf was using in his Fall 2010 series of posts. His meaning of the term makes sense to me, but many people (e.g. the OBO folks), take "normalized ontology" to mean "disentangled" (i.e. no multiple inheritance.) As an example, here's an occurrence with two crowdsourced determinations: http://www.cs.umbc.edu/~jsachs/occurrences/tdwg2010bioblitz_1644
- I used sequential integers for observation and identification IDs;
in practice, a mechanism needs to be in place to prevent two people from assigning the same id to their respective identifications.
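One possible mechanism for avoiding collisions, sketched here in Python, is to mint identification IDs from random UUIDs instead of sequential integers, so that two people working independently need no shared counter. This is an illustration only, not what the bioblitz scripts do; the function name `mint_identification_uri` is hypothetical, and the resulting IDs are opaque, unlike the `tdwg2010bioblitz_1835_id_1` pattern.

```python
import uuid

# Base path taken from the identification URLs in this post.
BASE = "http://www.cs.umbc.edu/~jsachs/identifications/"


def mint_identification_uri(base: str = BASE) -> str:
    """Return a new identification URI ending in a random (version 4) UUID."""
    return f"{base}{uuid.uuid4()}"


first = mint_identification_uri()
second = mint_identification_uri()
print(first)
print(second)
```

The alternative is a central service that hands out the next integer, which keeps the IDs short and readable but requires everyone to be online and to trust one allocator.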
- My answer to Cam Webb's Question #1 from
http://lists.tdwg.org/pipermail/tdwg-content/2010-October/001720.html is "both". In other words, just as "Joel Sachs" is both me and also my name, so http://www.cs.umbc.edu/~jsachs/occurrences/tdwg2010bioblitz_1668 is both an occurrence and an occurrence_id, expressed as:
---
<dwc:Occurrence rdf:about="http://www.cs.umbc.edu/~jsachs/occurrences/tdwg2010bioblitz_1644">
<dwc:occurrenceID>http://www.cs.umbc.edu/~jsachs/occurrences/tdwg2010bioblitz_1644</dwc:occurrenceID>
<blah blah blah/>
</dwc:Occurrence>
---
- I was surprised to see that the Darwin Core Identification class has
no "occurrenceID" or "specimenID" term. How is one supposed to tie an identification to an observation (assuming the identification is not in-lined, of course)? DeVries and Baskauf each mint their own terms for doing this (txn:identificationHasOccurrence, and sernec:basedOnOccurrence, respectively); I used dwc:occurrenceID as if it were a record level term.
- We had scope for multiple taxonConceptID columns in the Fusion table, and assigned lsids where possible. I also mean to work with Pete to assign GUIDs from taxonconcept.org. In addition, I assigned ethan taxon concept ids, which look like this: http://spire.umbc.edu/ethan/Coffea_arabica
In their argument over opaque vs. transparent taxonConceptIDs, I was sympathetic to both Pete's and Gregor's arguments. Ultimately, if the tooling exists to always display the rdfs:labels every time I'm looking at a list of opaque IDs, then transparent IDs are unnecessary. But, for now, it's really helpful to look at an ID and know what it's referring to.
(For species names not in the spire database, the rdf returned by http://spire.umbc.edu/ethan/$name is simply an rdfs:seeAlso to http://gni.globalnames.org/name_strings?search_term=$name)
- It was easy to assert membership in RDF classes corresponding to various Cape Cod categories of concern - invasive species, threatened species, indicators, etc. You can see these classes at http://spire.umbc.edu/ontologies/lists (Information on where these lists come from is included as rdfs:comments. I'll add further documentation, e.g. links to eml files.)
Note that "ThingOfConcern" is defined as the superclass of all the other classes in the collection. The idea here is that people can create their own "ThingOfConcern" class, and then query for observations that are of concern to them. You can see sample sparql queries at http://www.csee.umbc.edu/~jsachs/occurrences/queries/sample.txt
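The superclass idea can be sketched without a triple store. In the Python below, a dictionary stands in for rdfs:subClassOf assertions, and a query for "ThingOfConcern" returns any occurrence whose taxon belongs to one of its subclasses. The class names other than ThingOfConcern, the taxon-to-class assignments, and the occurrence pairings are all made up for illustration; they are not taken from the lists ontology or the bioblitz data.

```python
# Hypothetical subclass assertions (child -> parent), standing in for
# rdfs:subClassOf triples in the lists ontology.
SUBCLASS_OF = {
    "InvasiveSpecies": "ThingOfConcern",
    "ThreatenedSpecies": "ThingOfConcern",
    "Indicator": "ThingOfConcern",
}

# Made-up class memberships for two taxa; a third taxon has none.
TAXON_CLASSES = {
    "Ampelopsis_brevipedunculata": ["InvasiveSpecies"],
    "Bartonia_virginica": ["ThreatenedSpecies"],
}


def is_subclass(cls, ancestor):
    """Walk up the subclass chain from cls, looking for ancestor."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False


def occurrences_of_concern(occurrences, concern="ThingOfConcern"):
    """Return IDs of occurrences whose taxon falls under the concern class."""
    return [
        occurrence_id
        for occurrence_id, taxon in occurrences
        if any(is_subclass(c, concern) for c in TAXON_CLASSES.get(taxon, ()))
    ]


# Made-up (occurrence ID, taxon) pairs.
observations = [
    ("tdwg2010bioblitz_1627", "Ampelopsis_brevipedunculata"),
    ("tdwg2010bioblitz_1644", "Branta_canadensis"),
    ("tdwg2010bioblitz_1668", "Bartonia_virginica"),
]
print(occurrences_of_concern(observations))
print(occurrences_of_concern(observations, "InvasiveSpecies"))
```

In an actual knowledge base the same effect would come from RDFS subclass inference in the store, or from a property path in the query, rather than from hand-written traversal.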
As an aside, I think we, as a community, should come up with a biodiversity benchmark suite of rdf data and corresponding sparql queries, that can be used to test the suitability and scalability of semantic web knowledge bases. I'll take this up in a future post (unless someone beats me to it).
Comments, questions, and better ideas are welcome.
Thanks - Joel.
--
Pete DeVries Department of Entomology University of Wisconsin - Madison 445 Russell Laboratories 1630 Linden Drive Madison, WI 53706 TaxonConcept Knowledge Base http://www.taxonconcept.org/ / GeoSpecies Knowledge Base http://lod.geospecies.org/ About the GeoSpecies Knowledge Base http://about.geospecies.org/
Re: Searching
Also you can download the entire open data set
See http://www.ckan.net/package/taxonconcept
- Pete
participants (3): Bob Morris, joel sachs, Peter DeVries