[tdwg-content] Bioblitz as rdf: does this make sense?

12 Jan 2011

Hi Everyone,

I've posted rdf of the bioblitz data. It's at
http://www.cs.umbc.edu/~jsachs/occurrences/TechnoBioblitzOccurrences.rdf .

Individual occurrences can be retrieved via
http://www.cs.umbc.edu/~jsachs/occurrences/[occurrence_id]
e.g. http://www.cs.umbc.edu/~jsachs/occurrences/tdwg2010bioblitz_1835

Individual identifications can be retrieved via
http://www.cs.umbc.edu/~jsachs/identifications/[identification_id]
e.g. 
http://www.cs.umbc.edu/~jsachs/identifications/tdwg2010bioblitz_1835_id_1
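
The two URL patterns above can be sketched as small helpers. This is only an illustration: the function names are hypothetical, and requesting RDF through an Accept header is an assumption about the server, not something the scripts are known to support.

```python
from urllib.request import Request, urlopen

# Base path taken from the URLs quoted in this post.
BASE = "http://www.cs.umbc.edu/~jsachs"


def occurrence_url(occurrence_id: str) -> str:
    """Build the URL for one occurrence, e.g. tdwg2010bioblitz_1835."""
    return f"{BASE}/occurrences/{occurrence_id}"


def identification_url(identification_id: str) -> str:
    """Build the URL for one identification, e.g. tdwg2010bioblitz_1835_id_1."""
    return f"{BASE}/identifications/{identification_id}"


def fetch_rdf(url: str, timeout: float = 10.0) -> str:
    """Fetch a resource as text. The Accept header is an assumption."""
    request = Request(url, headers={"Accept": "application/rdf+xml"})
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    print(occurrence_url("tdwg2010bioblitz_1835"))
    print(identification_url("tdwg2010bioblitz_1835_id_1"))
```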

The scripts behind this are on the kludgy side, so reports of errors and 
abnormalities will be warmly welcomed.

Implicit in each of the following notes is the question "Is this a good 
way to do it?":

1. The data is "normalized" w.r.t. identification. "Normalized" is in 
quotes because I mean it in the sense that Steve Baskauf was using in his 
Fall 2010 series of posts. His meaning of the term makes sense to me, but 
many people (e.g. the OBO folks) take "normalized ontology" to mean 
"disentangled" (i.e. no multiple inheritance).
As an example, here's an occurrence with two crowdsourced determinations:
http://www.cs.umbc.edu/~jsachs/occurrences/tdwg2010bioblitz_1644

2. I used sequential integers for observation and identification IDs; in 
practice, a mechanism needs to be in place to prevent two people from 
assigning the same ID to their respective identifications.
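
One possible mechanism, sketched here purely as an illustration and not what was used for the bioblitz data, is to replace the sequential integer with a random UUID, so that two people can mint identification IDs independently without coordinating. The function name and the "_id_" suffix layout (copied from the example IDs above) are assumptions.

```python
import uuid


def mint_identification_id(occurrence_id: str) -> str:
    """Return an identification ID that needs no shared counter.

    uuid4() is random, so independent contributors are very unlikely
    to produce the same suffix for the same occurrence.
    """
    return f"{occurrence_id}_id_{uuid.uuid4().hex}"


if __name__ == "__main__":
    print(mint_identification_id("tdwg2010bioblitz_1835"))
```

The trade-off is that such IDs no longer show the order in which identifications were made; a central counter or a per-contributor prefix would be alternatives.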

3. My answer to Cam Webb's Question #1 from
http://lists.tdwg.org/pipermail/tdwg-content/2010-October/001720.html
is "both". In other words, just as "Joel Sachs" is both me and also my 
name, so
http://www.cs.umbc.edu/~jsachs/occurrences/tdwg2010bioblitz_1668 is both 
an occurrence and an occurrence_id, expressed as:
---
<dwc:Occurrence 
rdf:about="http://www.cs.umbc.edu/~jsachs/occurrences/tdwg2010bioblitz_1644">
<dwc:occurrenceID>
http://www.cs.umbc.edu/~jsachs/occurrences/tdwg2010bioblitz_1644
</dwc:occurrenceID>
<blah blah blah/>
</dwc:Occurrence>
---
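
As a rough check of the "both" convention, the sketch below parses a snippet like the one above and confirms that each occurrence's rdf:about URI equals its dwc:occurrenceID value. The rdf:RDF wrapper and namespace declarations are assumptions added so the fragment parses, and the placeholder element is left out. The value is stripped before comparison because an RDF parser would most likely keep the surrounding line breaks as part of the literal.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DWC = "http://rs.tdwg.org/dwc/terms/"

# Wrapper element and xmlns declarations are assumed, not from the post.
SAMPLE = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:dwc="{DWC}">
  <dwc:Occurrence rdf:about="http://www.cs.umbc.edu/~jsachs/occurrences/tdwg2010bioblitz_1644">
    <dwc:occurrenceID>
      http://www.cs.umbc.edu/~jsachs/occurrences/tdwg2010bioblitz_1644
    </dwc:occurrenceID>
  </dwc:Occurrence>
</rdf:RDF>"""


def uri_matches_id(rdf_xml: str) -> bool:
    """True if every dwc:Occurrence has rdf:about equal to its occurrenceID."""
    root = ET.fromstring(rdf_xml)
    occurrences = root.findall(f"{{{DWC}}}Occurrence")
    return bool(occurrences) and all(
        occ.get(f"{{{RDF}}}about")
        == (occ.findtext(f"{{{DWC}}}occurrenceID") or "").strip()
        for occ in occurrences
    )


if __name__ == "__main__":
    print(uri_matches_id(SAMPLE))
```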

4. I was surprised to see that the Darwin Core Identification class has no 
"occurrenceID" or "specimenID" term. How is one supposed to tie an 
identification to an observation (assuming the identification is not 
in-lined, of course)? DeVries and Baskauf each mint their own terms for 
doing this (txn:identificationHasOccurrence, and sernec:basedOnOccurrence, 
respectively); I used dwc:occurrenceID as if it were a record level term.
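
A minimal sketch of that last option, building a stand-alone dwc:Identification that points back to its occurrence through dwc:occurrenceID. The helper name is hypothetical, and writing the occurrence URI as element text (rather than as an rdf:resource attribute) is an assumption; the posted data may serialize it differently.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DWC = "http://rs.tdwg.org/dwc/terms/"

# Make the serializer emit the familiar rdf: and dwc: prefixes.
ET.register_namespace("rdf", RDF)
ET.register_namespace("dwc", DWC)


def identification_element(identification_uri: str, occurrence_uri: str) -> ET.Element:
    """Build a dwc:Identification linked to an occurrence by dwc:occurrenceID."""
    ident = ET.Element(
        f"{{{DWC}}}Identification", {f"{{{RDF}}}about": identification_uri}
    )
    link = ET.SubElement(ident, f"{{{DWC}}}occurrenceID")
    link.text = occurrence_uri
    return ident


if __name__ == "__main__":
    base = "http://www.cs.umbc.edu/~jsachs"
    element = identification_element(
        f"{base}/identifications/tdwg2010bioblitz_1835_id_1",
        f"{base}/occurrences/tdwg2010bioblitz_1835",
    )
    print(ET.tostring(element, encoding="unicode"))
```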

5. We had scope for multiple taxonConceptID columns in the Fusion table, 
and assigned LSIDs where possible. I also mean to work with Pete to assign 
GUIDs from taxonconcept.org. In addition, I assigned ethan taxon concept 
IDs, which look like this:
http://spire.umbc.edu/ethan/Coffea_arabica

In their argument over opaque vs. transparent taxonConceptIDs, I was 
sympathetic to both Pete's and Gregor's arguments. Ultimately, if the 
tooling exists to always display the rdfs:labels every time I'm looking 
at a list of opaque IDs, then transparent IDs are unnecessary. But, for 
now, it's really helpful to look at an ID and know what it's referring to.

(For species names not in the spire database, the rdf returned by
http://spire.umbc.edu/ethan/$name
is simply an rdfs:seeAlso to
http://gni.globalnames.org/name_strings?search_term=$name)
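
The transparent ID scheme and its fallback link can be sketched as two URL builders. The function names are hypothetical, and replacing spaces with underscores is inferred from the single Coffea_arabica example rather than from any documented rule.

```python
from urllib.parse import quote, urlencode


def ethan_concept_uri(scientific_name: str) -> str:
    """Transparent taxon concept ID built from a scientific name."""
    slug = scientific_name.strip().replace(" ", "_")
    return "http://spire.umbc.edu/ethan/" + quote(slug)


def gni_search_url(scientific_name: str) -> str:
    """Target of the rdfs:seeAlso returned for names not in the database."""
    query = urlencode({"search_term": scientific_name})
    return "http://gni.globalnames.org/name_strings?" + query


if __name__ == "__main__":
    print(ethan_concept_uri("Coffea arabica"))
    print(gni_search_url("Coffea arabica"))
```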

6. It was easy to assert membership in RDF classes corresponding to 
various Cape Cod categories of concern - invasive species, threatened 
species, indicators, etc. You can see these classes at
http://spire.umbc.edu/ontologies/lists (Information on where these lists 
come from is included as rdfs:comments. I'll add further documentation, 
e.g. links to EML files.)

Note that "ThingOfConcern" is defined as the superclass of all the other 
classes in the collection. The idea here is that people can create their 
own "ThingOfConcern" class, and then query for observations that are of 
concern to them. You can see sample sparql queries at
http://www.csee.umbc.edu/~jsachs/occurrences/queries/sample.txt
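
The idea behind the superclass can be illustrated without a triple store. The toy below walks a subclass table to find observations whose class falls under "ThingOfConcern". Only that class name comes from the ontology described here; the other class names, the observation IDs, and the helper functions are made up for the example, and this is not one of the sample queries linked above.

```python
# Hypothetical subclass table: child class -> parent class.
SUBCLASS_OF = {
    "InvasiveSpecies": "ThingOfConcern",
    "ThreatenedSpecies": "ThingOfConcern",
}

# Hypothetical observations: (observation ID, asserted class).
OBSERVATIONS = [
    ("obs_1", "InvasiveSpecies"),
    ("obs_2", "Unlisted"),
    ("obs_3", "ThreatenedSpecies"),
]


def is_a(cls, ancestor):
    """Follow parent links upward; the seen set guards against cycles."""
    seen = set()
    while cls is not None and cls not in seen:
        if cls == ancestor:
            return True
        seen.add(cls)
        cls = SUBCLASS_OF.get(cls)
    return False


def of_concern(observations, concern="ThingOfConcern"):
    """Return IDs of observations whose class is the concern or a subclass."""
    return [obs_id for obs_id, cls in observations if is_a(cls, concern)]


if __name__ == "__main__":
    print(of_concern(OBSERVATIONS))
```

Someone with their own list would pass a different top class as the concern argument, which mirrors defining a personal "ThingOfConcern" class and querying against it.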

As an aside, I think we, as a community, should come up with a 
biodiversity benchmark suite of rdf data and corresponding sparql queries 
that can be used to test the suitability and scalability of semantic web 
knowledge bases. I'll take this up in a future post (unless someone beats 
me to it).

Comments, questions, and better ideas are welcome.

Thanks -
Joel.

joel sachs