[tdwg-content] Bioblitz as rdf: does this make sense?

Peter DeVries pete.devries at gmail.com
Wed Jan 19 04:36:14 CET 2011


Re: Searching

Also you can download the entire open data set

See http://www.ckan.net/package/taxonconcept

- Pete

On Tue, Jan 18, 2011 at 9:32 PM, Peter DeVries <pete.devries at gmail.com> wrote:

> Hi Joel,
>
> In the next day or so I will upload my TDWG talk to SlideShare, it explains
> some of this.
>
> *But remember this was just an experiment to see if it would work.*
>
> The main advantage of creating URI's for commonly used things is that then
> you can browse the connections between them and have them be the subjects of
> triples.
>
> See http://bit.ly/hTJkWz
>
> It also allows you to connect additional information to a person or thing
> that has a URI.
>
> See attached png
>
> In an ideal world, I might have done something like this.
>
> http://notarealsite.org/genus_epithet
>
> So a semantic web service would look up
> http://notarealsite.org/Puma_concolor
>
> Retrieve http://notarealsite.org/Puma_concolor.rdf
>
> Which would give you:
>
> 1) list of what databases have that name
>
> 2) and what various kingdoms or Name Rules the different names fall under
> (Plant, Animal etc)
>
>    For example: "This is both a plant and an animal name"
>
> 3) What are the current valid names for each of the species listed. => The
> list of related names you listed in your email.
>
> => *Puma concolor* (Linnaeus 1771)
>
> I think this would be very useful since a number of data sources only list
> the genus and specific epithet.
>
> * For instance the index that is at the back of the Entomology Society of
> America Annual Meeting Booklet
>
> There are a couple of hurdles however.
>
> 1) What is the valid form of the name? Is there a machine that knows this
> or will this need to be partially curated by humans?
>     In some cases the "valid" name is not clear.
>
> 2) Even though there are ~20 million names, there are probably some valid
> names that are not in the GNI.
>     This is why I would encourage everyone to upload their names.
>
> If these exist as URI's you get the following abilities
>
> <gni_namestring:001> isSynonymOf <gni_namestring:002> etc.
>
> Note that literals cannot be "subjects" of the subject-predicate-object
> triple.
>
> Also, as currently configured, names like *Callithrix melanura* (É. Geoffroy
> 1812) and *Callithrix melanura* (E. Geoffroy 1812) are seen as different.
>
> * Also the É. will have to be properly escaped to be part of a valid URL.
>
> So a simple text search might not find them.
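
To illustrate the two name-string issues above (the accented É makes otherwise identical strings differ, and it has to be escaped before it can go into a URL), here is a minimal Python sketch. It reuses the placeholder host notarealsite.org from this thread; the underscore-joined URI layout and the strip_accents helper are assumptions made for illustration, not a description of how GNI or TaxonConcept.org build their identifiers.

```python
import unicodedata
from urllib.parse import quote

def strip_accents(text: str) -> str:
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

name_a = "Callithrix melanura (É. Geoffroy 1812)"
name_b = "Callithrix melanura (E. Geoffroy 1812)"

# A plain string comparison treats the two spellings as different names.
exact_match = name_a == name_b

# After accent folding, the two strings compare equal.
folded_match = strip_accents(name_a) == strip_accents(name_b)

# Hypothetical URI layout: placeholder base + underscore-joined name.
# quote() percent-encodes the UTF-8 bytes of "É" and the parentheses.
BASE = "http://notarealsite.org/"
uri_a = BASE + quote(name_a.replace(" ", "_"))

print(exact_match, folded_match)
print(uri_a)
```

Accent folding is only one possible matching rule; whether the folded strings should be treated as the same name is exactly the kind of assertion that would need to be recorded explicitly.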
>
> So in summary, if we are going to get past everyone marking up their own
> name list in Excel and then trying to merge them, we will need some
> architecture that can scale to millions of names and allow different groups
> to make assertions about the relationships between those names.
>
>
>
> Searching TaxonConcept.org:
>
> For now you have two options for searching for names. (I am mainly
> concentrating on the RDF representation right now)
>
> You can use Google: "Puma concolor site:lod.taxonconcept.org". (Google seems
> to prefer the .rdf version even though my sitemap specifies the .html.)
>
> Or use the Knowledge Base http://lsd.taxonconcept.org/fct/
>
> I think I sent out a list of all the species collected at the BioBlitz,
> with their matching species concept ID.
>
> Did you not get this? I think it is on the Google site.
>
>
> *> iv. In what sense do you see Darwin Core as being deficient as a
> semantic web representation?*
>
> You agreed previously that some of the fields that currently use Literals
> would be more efficiently and unambiguously represented as URI's.
>
> I would characterize DarwinCore as "not optimal" rather than deficient.
>
> There is no reason we can't still use the DarwinCore for what it is good
> for, but it does not work very well for SPARQL queries and the LOD.
>
> For instance, a search of occurrences of this *Callithrix melanura* (É.
> Geoffroy 1812) will not return occurrences with *Callithrix melanura* (E.
> Geoffroy 1812).
>
> Respectfully,
>
> - Pete
>
>
>
> On Tue, Jan 18, 2011 at 1:42 PM, joel sachs <jsachs at csee.umbc.edu> wrote:
>
>> Pete -
>>
>> I'm glad you're doing cool things with the data. A few
>> comments/answers/questions ...
>>
>> i. The quests for consistently used canonical URIs, and consistently used
>> canonical names are both, for now, somewhat quixotic, and I'm not sure one
>> is more so than the other. GNA/GNI seems to be pursuing both. Is that a fair
>> assessment?
>>
>> ii. I used opaque taxon concept URIs where they were readily available.
>> Laziness motivated me to also use transparent identifiers for
>> a. making sure each taxonConcept had an rdf:resource as an identifier, and
>> b. asserting membership in the various rdfs:classes that I defined. But
>> laziness has its drawbacks, and you point out a few of them. I think best
>> practice when using either names or transparent URIs as identifiers would be
>> to first run them through a spell checker and normalizer that would produce
>> consistent usage of upper and lower case characters, and probably drop
>> authorship. Of course, this would produce some false positives when doing
>> any sort of querying. But using UUIDs produces false negatives since they're
>> not in common use. If we (as a community) follow through on the enthusiasm
>> for creating competency cases that we showed in the Fall, it would be
>> interesting to consider each case from the question of what's worse - false
>> positives or false negatives. In any event, false positives can be weeded
>> out by using the full scientific names and contextual information in the
>> record, whereas false negatives may never be discovered.
>>
>> BTW, does anyone know a good spell checker for scientific names?
>>
>> iii. Mapping names to opaque URIs doesn't resolve the problems you raise
>> below. Will the identifier for
>> Carpodacus mexicanus
>> be the same as the identifier for
>> Carpodacus mexicanus (Statius Muller, 1776)?
>>
>> How about
>> Arabis laevigata (Muhl. ex Willd.) Poir.
>> vs.
>> Arabis laevigata ?
>>
>> If yes, then why not just use normalized names as identifiers? If no, then
>> we'll get false negatives in the sorts of SPARQL queries that I gave at
>>
>> http://www.csee.umbc.edu/~jsachs/occurrences/queries/
>>
>> BTW, is there a lookup service for taxonconcept.org identifiers (i.e.
>> give a list of names, get a list of identifiers)?
>>
>> iv. In what sense do you see Darwin Core as being deficient as a semantic
>> web representation?
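
The normalizer suggested in point ii could be sketched roughly as below. This is only an illustration under stated assumptions: it keeps just the first two tokens (genus and specific epithet), so it drops authorship but also any infraspecific rank, and it makes no attempt at spell checking. The function name normalize_name is hypothetical.

```python
import re

def normalize_name(raw: str) -> str:
    """Reduce a name string or transparent-URI suffix to 'Genus epithet'.

    Assumption: the first token is the genus and the second the specific
    epithet; everything after that (authorship, year) is discarded.
    """
    tokens = [t for t in re.split(r"[\s_]+", raw.strip()) if t]
    if not tokens:
        return ""
    genus = tokens[0].capitalize()
    if len(tokens) == 1:
        return genus
    return f"{genus} {tokens[1].lower()}"

examples = [
    "Apis_Mellifera",
    "Branta_canadensis_(Linnaeus,_1758)",
    "Carpodacus mexicanus (Statius Muller, 1776)",
    "Cyperus_Esculantus",  # case is fixed, the misspelling is not
]
for raw in examples:
    print(raw, "->", normalize_name(raw))
```

Because authorship is dropped, homonyms published by different authors collapse to the same key, which is the false-positive risk described in point ii.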
>>
>> Regards -
>> Joel.
>>
>>
>>
>> On Thu, 13 Jan 2011, Peter DeVries wrote:
>>
>>> Thanks Joel,
>>>
>>> Here is one of the BioBlitz Occurrence Records marked up with Darwin Core
>>>
>>>
>>> http://lsd.taxonconcept.org/describe/?url=http://www.cs.umbc.edu/~jsachs/occurrences/tdwg2010bioblitz_5&sid=217&urilookup=1
>>>
>>> Here is one of the TaxonConcept Records marked up in the txn vocabulary
>>>
>>> http://lsd.taxonconcept.org/describe/?url=http://ocs.taxonconcept.org/ocs/f522444a-2dd9-400e-be59-47213ef38cb9%23Occurrence
>>>
>>> Some people have trouble with how the %23 above is escaped in their
>>> email; they might like this bit.ly bundle better.
>>>
>>> http://bit.ly/hZdUpP
>>>
>>> Also, an issue to discuss is that identifications are to both a name and
>>> a concept, so should these be the same dwc:taxonConcepts?
>>>
>>> http://spire.umbc.edu/ethan/Apis_mellifera
>>> http://spire.umbc.edu/ethan/Apis_Mellifera
>>> ==> http://bit.ly/g1zzJC (Apis mellifera se:z9oqP)
>>>
>>> http://spire.umbc.edu/ethan/Ascelpius_syruaca
>>> http://spire.umbc.edu/ethan/Asclepias_syriaca
>>> ==> HTML page http://lod.taxonconcept.org/ses/tTEIq.html Concept View
>>> Bit.ly http://bit.ly/dJHJqj
>>>
>>> http://spire.umbc.edu/ethan/Aster_nova-belgii
>>> http://spire.umbc.edu/ethan/Aster_nova-gelgii
>>>
>>> http://spire.umbc.edu/ethan/Baccharis_halimifolia
>>> http://spire.umbc.edu/ethan/Baccharis_halimifolia_L.
>>>
>>> http://spire.umbc.edu/ethan/Bartonia_virginia
>>> http://spire.umbc.edu/ethan/Bartonia_virginica
>>>
>>> http://spire.umbc.edu/ethan/Branta_canadensis
>>> http://spire.umbc.edu/ethan/Branta_canadensis_(Linnaeus,_1758)
>>>
>>> http://spire.umbc.edu/ethan/Carex_pennsylvanica
>>> http://spire.umbc.edu/ethan/Carex_pensylvanica
>>>
>>> http://spire.umbc.edu/ethan/Cyperus_esculantus
>>> http://spire.umbc.edu/ethan/Cyperus_Esculantus
>>> http://spire.umbc.edu/ethan/Cyperus_esculentus
>>> http://spire.umbc.edu/ethan/Cyperus_esculentus_L.
>>>
>>> http://spire.umbc.edu/ethan/Carpodacus_mexicanus
>>> http://spire.umbc.edu/ethan/Carpodacus_mexicanus_(Statius_Muller,_1776)
>>>
>>> From the taxonconcepts there should be a link to the various name
>>> strings, vs. modeling the namestring as the concept.
>>>
>>> Also I think that DarwinCore is good for some things but maybe not as a
>>> semantic web representation.
>>>
>>> Respectfully,
>>>
>>> - Pete
>>>
>>>
>>> On Thu, Jan 13, 2011 at 6:53 AM, joel sachs <jsachs at csee.umbc.edu>
>>> wrote:
>>>
>>>> Pete,
>>>> Thanks - I corrected the geo properties.
>>>> Joel.
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, 12 Jan 2011, Peter DeVries wrote:
>>>>
>>>>  Hi Joel,
>>>>
>>>>>
>>>>> Cool :-)
>>>>>
>>>>> I just loaded this into my SPARQL endpoint.
>>>>>
>>>>> In the named graph
>>>>> urn:org:linkedopenspeciesdata:dataspace:tdwg2010bioblitz
>>>>>
>>>>> It consists of 19,990 Triples
>>>>>
>>>>> Here is one of the dwc:taxonConceptID entries.
>>>>>
>>>>> *About: http://spire.umbc.edu/ethan/Ampelopsis_brevipedunculata*
>>>>>
>>>>>
>>>>> http://lsd.taxonconcept.org/describe/?url=http://spire.umbc.edu/ethan/Ampelopsis_brevipedunculata
>>>>>
>>>>> *About:
>>>>> http://www.cs.umbc.edu/~jsachs/occurrences/tdwg2010bioblitz_1627*
>>>>>
>>>>> http://lsd.taxonconcept.org/describe/?url=http://www.cs.umbc.edu/~jsachs/occurrences/tdwg2010bioblitz_1627
>>>>>
>>>>> This should give you a count of occurrences.
>>>>>
>>>>> SELECT count(*) WHERE {?s a <http://rs.tdwg.org/dwc/terms/#Occurrence>};
>>>>>
>>>>> = 1882
>>>>>
>>>>> SELECT count(*) WHERE {?s a <http://rs.tdwg.org/dwc/terms/#taxonConceptID>};
>>>>>
>>>>> This should give you a list of occurrences
>>>>>
>>>>>
>>>>>
>>>>> http://lsd.taxonconcept.org/describe/?url=http://rs.tdwg.org/dwc/terms/%23Occurrence
>>>>>
>>>>> If this did not come through your email system try the bit.ly.
>>>>>
>>>>> http://bit.ly/g9BcoL
>>>>>
>>>>> I tried the following that should have given me a google map of all the
>>>>> occurrences but it did not result in the map.
>>>>>
>>>>> DESCRIBE ?x WHERE {
>>>>>  ?x <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://rs.tdwg.org/dwc/terms/#Occurrence>.
>>>>> }
>>>>>
>>>>> I looked at the RDF and I think I see the problem.
>>>>>
>>>>> In the RDF
>>>>>
>>>>> <geo:latitude>
>>>>> 41.53
>>>>> </geo:latitude>
>>>>>
>>>>> <geo:longitude>
>>>>> -70.67
>>>>> </geo:longitude>
>>>>>
>>>>> Should be
>>>>>
>>>>> <geo:lat>
>>>>> 41.53
>>>>> </geo:lat>
>>>>>
>>>>> <geo:long>
>>>>> -70.67
>>>>> </geo:long>
>>>>>
>>>>> See http://www.w3.org/2003/01/geo/
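
A rough Python sketch of the rename described above, using only the standard library. It assumes the geo prefix in the file is bound to the W3C Basic Geo namespace http://www.w3.org/2003/01/geo/wgs84_pos#; the sample document, its example.org subject URI, and the fix_geo_terms name are made up for illustration. It also trims the whitespace around the coordinate values, which is an extra step not mentioned in the thread.

```python
import xml.etree.ElementTree as ET

# Assumption: "geo" is bound to the W3C Basic Geo (WGS84) namespace.
GEO_NS = "http://www.w3.org/2003/01/geo/wgs84_pos#"
RENAMES = {"latitude": "lat", "longitude": "long"}

def fix_geo_terms(rdf_xml: str) -> str:
    """Rename geo:latitude/geo:longitude to geo:lat/geo:long."""
    ET.register_namespace("geo", GEO_NS)  # keep the "geo" prefix on output
    root = ET.fromstring(rdf_xml)
    for elem in root.iter():
        for wrong, right in RENAMES.items():
            if elem.tag == f"{{{GEO_NS}}}{wrong}":
                elem.tag = f"{{{GEO_NS}}}{right}"
                if elem.text:
                    # Extra step: drop the newlines around the number.
                    elem.text = elem.text.strip()
    return ET.tostring(root, encoding="unicode")

# Hypothetical sample record; only the coordinates come from the thread.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#">
  <rdf:Description rdf:about="http://example.org/occurrence/1">
    <geo:latitude>
41.53
</geo:latitude>
    <geo:longitude>
-70.67
</geo:longitude>
  </rdf:Description>
</rdf:RDF>"""

fixed = fix_geo_terms(SAMPLE)
print(fixed)
```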
>>>>>
>>>>> I did the following query to get a list of all the dwc:taxonConceptIDs
>>>>> and have attached them as a .txt file.
>>>>>
>>>>> select distinct ?o WHERE {?s <http://rs.tdwg.org/dwc/terms/#taxonConceptID> ?o}
>>>>>
>>>>> Pretty neat :-)
>>>>>
>>>>> There are some things that I will get back to Joel on.
>>>>>
>>>>> Here is where you can manually enter a SPARQL query. Click on
>>>>> "Advanced" for the entry window.
>>>>>
>>>>> http://lsd.taxonconcept.org/isparql/
>>>>>
>>>>> Respectfully,
>>>>>
>>>>> - Pete
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Jan 12, 2011 at 5:55 PM, joel sachs <jsachs at csee.umbc.edu>
>>>>> wrote:
>>>>>
>>>>>  Hi Everyone,
>>>>>
>>>>>>
>>>>>> I've posted rdf of the bioblitz data. It's at
>>>>>>
>>>>>> http://www.cs.umbc.edu/~jsachs/occurrences/TechnoBioblitzOccurrences.rdf
>>>>>> .
>>>>>>
>>>>>> Individual occurrences can be retrieved via
>>>>>> http://www.cs.umbc.edu/~jsachs/occurrences/[occurrence_id]
>>>>>>
>>>>>> e.g. http://www.cs.umbc.edu/~jsachs/occurrences/tdwg2010bioblitz_1835
>>>>>>
>>>>>> Individual identifications can be retrieved via
>>>>>> http://www.cs.umbc.edu/~jsachs/identifications/[identification_id]
>>>>>>
>>>>>> e.g.
>>>>>>
>>>>>> http://www.cs.umbc.edu/~jsachs/identifications/tdwg2010bioblitz_1835_id_1
>>>>>>
>>>>>> The scripts behind this are on the kludgy side, so reports of errors
>>>>>> and
>>>>>> abnormalities will be warmly welcomed.
>>>>>>
>>>>>> Implicit in each of the following notes is the question "Is this a
>>>>>> good
>>>>>> way to do it?":
>>>>>>
>>>>>> 1. The data is "normalized" w.r.t. identification. "Normalized" is in
>>>>>> quotes because I mean it in the sense that Steve Baskauf was using in
>>>>>> his
>>>>>> Fall 2010 series of posts. His meaning of the term makes sense to me,
>>>>>> but
>>>>>> many people (e.g. the OBO folks), take "normalized ontology" to mean
>>>>>> "disentangled" (i.e. no multiple inheritance.)
>>>>>> As an example, here's an occurrence with two crowdsourced
>>>>>> determinations:
>>>>>> http://www.cs.umbc.edu/~jsachs/occurrences/tdwg2010bioblitz_1644
>>>>>>
>>>>>> 2. I used sequential integers for observation and identification IDs;
>>>>>> in
>>>>>> practice, a mechanism needs to be in place to prevent two people from
>>>>>> assigning the same id to their respective identifications.
>>>>>>
>>>>>> 3. My answer to Cam Webb's Question #1 from
>>>>>> http://lists.tdwg.org/pipermail/tdwg-content/2010-October/001720.html
>>>>>> is "both". In other words, just as "Joel Sachs" is both me and also my
>>>>>> name, so
>>>>>> http://www.cs.umbc.edu/~jsachs/occurrences/tdwg2010bioblitz_1668 is
>>>>>> both
>>>>>> an occurrence and an occurrence_id, expressed as:
>>>>>> ---
>>>>>> <dwc:Occurrence
>>>>>> rdf:about="http://www.cs.umbc.edu/~jsachs/occurrences/tdwg2010bioblitz_1644">
>>>>>> <dwc:occurrenceID>
>>>>>> http://www.cs.umbc.edu/~jsachs/occurrences/tdwg2010bioblitz_1644
>>>>>> </dwc:occurrenceID>
>>>>>> <blah blah blah/>
>>>>>> </dwc:Occurrence>
>>>>>> ---
>>>>>>
>>>>>> 4. I was surprised to see that the Darwin Core Identification class has
>>>>>> no "occurrenceID" or "specimenID" term. How is one supposed to tie an
>>>>>> identification to an observation (assuming the identification is not
>>>>>> in-lined, of course)? DeVries and Baskauf each mint their own terms for
>>>>>> doing this (txn:identificationHasOccurrence and
>>>>>> sernec:basedOnOccurrence, respectively); I used dwc:occurrenceID as if
>>>>>> it were a record-level term.
>>>>>>
>>>>>> 5. We had scope for multiple taxonConceptID columns in the Fusion
>>>>>> table, and assigned LSIDs where possible. I also mean to work with Pete
>>>>>> to assign GUIDs from taxonconcept.org. In addition, I assigned ethan
>>>>>> taxon concept ids, which look like this:
>>>>>> http://spire.umbc.edu/ethan/Coffea_arabica
>>>>>>
>>>>>> In their argument over opaque vs. transparent taxonConceptIDs, I was
>>>>>> sympathetic to both Pete's and Gregor's arguments. Ultimately, if the
>>>>>> tooling exists to always display the rdfs:labels every time I'm looking
>>>>>> at a list of opaque IDs, then transparent IDs are unnecessary. But, for
>>>>>> now, it's really helpful to look at an ID and know what it's referring
>>>>>> to.
>>>>>>
>>>>>> (For species names not in the spire database, the rdf returned by
>>>>>> http://spire.umbc.edu/ethan/$name
>>>>>> is simply an rdfs:seeAlso to
>>>>>> http://gni.globalnames.org/name_strings?search_term=$name)
>>>>>>
>>>>>> 6. It was easy to assert membership in RDF classes corresponding to
>>>>>> various Cape Cod categories of concern - invasive species, threatened
>>>>>> species, indicators, etc. You can see these classes at
>>>>>> http://spire.umbc.edu/ontologies/lists (Information on where these
>>>>>> lists come from is included as rdfs:comments. I'll add further
>>>>>> documentation, e.g. links to eml files.)
>>>>>>
>>>>>> Note that "ThingOfConcern" is defined as the superclass of all the
>>>>>> other
>>>>>> classes in the collection. The idea here is that people can create
>>>>>> their
>>>>>> own "ThingOfConcern" class, and then query for observations that are
>>>>>> of
>>>>>> concern to them. You can see sample sparql queries at
>>>>>> http://www.csee.umbc.edu/~jsachs/occurrences/queries/sample.txt
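
The "ThingOfConcern" idea in point 6 can be mimicked in a few lines of plain Python, as a sketch of what an rdfs:subClassOf-aware query does. All class names, taxon names, occurrence ids, and memberships below are made up for illustration; the real classes are the ones published at the spire.umbc.edu URL above, and a real knowledge base would do this through inference or SPARQL rather than dictionaries. The sketch also assumes each class has a single parent and that the hierarchy has no cycles.

```python
# Hypothetical class hierarchy: child class -> parent class.
SUBCLASS_OF = {
    "InvasiveSpecies": "ThingOfConcern",
    "ThreatenedSpecies": "ThingOfConcern",
    "IndicatorSpecies": "ThingOfConcern",
}

# Made-up memberships: taxon -> class it was asserted to belong to.
TAXON_CLASS = {
    "Taxon_A": "InvasiveSpecies",
    "Taxon_B": "ThreatenedSpecies",
}

# Made-up occurrence records: (occurrence id, taxon).
OCCURRENCES = [("occ_1", "Taxon_A"), ("occ_2", "Taxon_C"), ("occ_3", "Taxon_B")]

def is_subclass_of(cls, ancestor):
    """Walk up the parent chain until the ancestor is found or the chain ends."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False

def occurrences_of_concern(occurrences, concern="ThingOfConcern"):
    """Return ids of occurrences whose taxon falls under the given class."""
    return [
        occ_id
        for occ_id, taxon in occurrences
        if is_subclass_of(TAXON_CLASS.get(taxon), concern)
    ]

print(occurrences_of_concern(OCCURRENCES))
```

Someone with a different notion of concern would swap in their own SUBCLASS_OF mapping and reuse the same query function.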
>>>>>>
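
On point 2, one possible mechanism for keeping two people from assigning the same id is to mint random UUIDs instead of sequential integers. A minimal sketch follows; the example.org base URL and the mint_identification_uri name are placeholders, and this is not what the bioblitz scripts actually do.

```python
import uuid

# Placeholder base URL, not the real bioblitz location.
BASE = "http://example.org/identifications/"

def mint_identification_uri(base: str = BASE) -> str:
    """Return a new URI whose last segment is a random (version 4) UUID."""
    return f"{base}{uuid.uuid4()}"

first = mint_identification_uri()
second = mint_identification_uri()
print(first)
print(second)
```

The trade-off is the one discussed elsewhere in this thread: such ids are opaque, so tooling has to display labels alongside them.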
>>>>>>
>>>>>> As an aside, I think we, as a community, should come up with a
>>>>>> biodiversity benchmark suite of rdf data and corresponding sparql
>>>>>> queries,
>>>>>> that can be
>>>>>> used to test the suitability and scalability of semantic web knowledge
>>>>>> bases. I'll take this up in a future post (unless someone beats me to
>>>>>> it).
>>>>>>
>>>>>> Comments, questions, and better ideas are welcome.
>>>>>>
>>>>>> Thanks -
>>>>>> Joel.
>>>>>>
>>>>>> _______________________________________________
>>>>>> tdwg-content mailing list
>>>>>> tdwg-content at lists.tdwg.org
>>>>>> http://lists.tdwg.org/mailman/listinfo/tdwg-content
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> ---------------------------------------------------------------
>>>>> Pete DeVries
>>>>> Department of Entomology
>>>>> University of Wisconsin - Madison
>>>>> 445 Russell Laboratories
>>>>> 1630 Linden Drive
>>>>> Madison, WI 53706
>>>>> TaxonConcept Knowledge Base <http://www.taxonconcept.org/> /
>>>>> GeoSpecies
>>>>> Knowledge Base <http://lod.geospecies.org/>
>>>>> About the GeoSpecies Knowledge Base <http://about.geospecies.org/>
>>>>> ------------------------------------------------------------
>>>>>
>>>>>
>>>>>
>>>
>>>
>>>
>
>
>



-- 
---------------------------------------------------------------
Pete DeVries
Department of Entomology
University of Wisconsin - Madison
445 Russell Laboratories
1630 Linden Drive
Madison, WI 53706
TaxonConcept Knowledge Base <http://www.taxonconcept.org/> / GeoSpecies
Knowledge Base <http://lod.geospecies.org/>
About the GeoSpecies Knowledge Base <http://about.geospecies.org/>
------------------------------------------------------------