[tdwg-content] Spell checker for scientific names?

19 Jan 2011

joel sachs wrote:
...
BTW, does anyone know a good spell checker for scientific names?

Sounds like you may be interested in accessing my TAXAMATCH algorithm, which is installed over my IRMNG database of generic and species names, available here:

http://www.cmar.csiro.au/datacentre/irmng/

The database is roughly 90% or more complete for generic names (including non-valid ones) and maybe 50% complete for valid species names, plus some synonyms (the latter based mainly on the Catalogue of Life 2006 edition, with more to come).

If you enter a single name for checking, fuzzy matches are automatically returned together with any exact matches. If you enter, e.g., a line-delimited list, the first pass does exact matches only (for speed); a fuzzy search on any name not matched is then offered as a pre-formatted link.
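
A minimal sketch of the exact-first, fuzzy-second flow described above, in Python. It is not TAXAMATCH: it swaps in the standard library's difflib similarity ratio as a stand-in, and the reference list, the function name check_names, and the 0.75 cutoff are all assumptions made for illustration.

```python
import difflib

# Illustrative reference list (an assumption); a real checker would query a
# large names database such as the one described above.
REFERENCE_NAMES = [
    "Apis mellifera",
    "Asclepias syriaca",
    "Bartonia virginica",
    "Carex pensylvanica",
    "Cyperus esculentus",
]


def check_names(names, reference=REFERENCE_NAMES, cutoff=0.75):
    """Map each input name to a list of candidate matches.

    Exact matches are tried first (cheap set lookup); only names without an
    exact match go through the slower fuzzy comparison.
    """
    exact = set(reference)
    results = {}
    for name in names:
        if name in exact:
            results[name] = [name]
        else:
            # difflib ranks candidates by similarity ratio, best first.
            results[name] = difflib.get_close_matches(
                name, reference, n=3, cutoff=cutoff
            )
    return results


if __name__ == "__main__":
    for name, candidates in check_names(
        ["Apis mellifera", "Ascelpius syruaca", "Carex pennsylvanica"]
    ).items():
        print(name, "->", candidates)
```

An empty candidate list means nothing in the reference list was similar enough, which is where a human would need to look at the name.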

Currently it returns human-readable HTML, but I will probably add an option for XML output if there is sufficient demand.

Take a look and let me know if this is of interest, or if not, what you would prefer to see...

Regards - Tony Rees, Tasmania

________________________________________
From: tdwg-content-bounces@lists.tdwg.org [tdwg-content-bounces@lists.tdwg.org] On Behalf Of joel sachs [jsachs@csee.umbc.edu]
Sent: Wednesday, 19 January 2011 6:42 AM
To: Peter DeVries
Cc: tdwg-content@lists.tdwg.org
Subject: Re: [tdwg-content] Bioblitz as rdf: does this make sense?

Pete -

I'm glad you're doing cool things with the data. A few
comments/answers/questions ...

i. The quests for consistently used canonical URIs, and
consistently used canonical names are both, for now, somewhat quixotic,
and I'm not sure one is more so than the other. GNA/GNI seems to be
pursuing both. Is that a fair assessment?

ii. I used opaque taxon concept URIs where they were readily available.
Laziness motivated me to also use transparent identifiers for
a. making sure each taxonConcept had an rdf:resource as an identifier, and
b. asserting membership in the various rdfs:classes that I defined. But
laziness has its drawbacks, and you point out a few of them. I think best
practice when using either names or transparent URIs as identifiers would
be to first run them through a spell checker and normalizer that would
produce consistent usage of upper and lower case characters, and probably
drop authorship. Of course, this would produce some false positives when
doing any sort of querying. But using UUIDs produces false negatives since
they're not in common use. If we (as a community) follow through on the
enthusiasm for creating competency cases that we showed in the Fall, it
would be interesting to consider each case from the question of what's
worse - false positives or false negatives. In any event, false positives
can be weeded out by using the full scientific names and contextual
information in the record, whereas false negatives may never be
discovered.
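
As a rough illustration of the normalizer described in (ii), here is a minimal Python sketch. The function name normalize_name and its rules are assumptions, not an existing tool: it only handles simple binomials, keeps the first two tokens, and would therefore also discard infraspecific epithets.

```python
import re


def normalize_name(raw):
    """Reduce a name string or URI fragment to 'Genus epithet' (hypothetical helper)."""
    # Transparent URIs in this thread use underscores in place of spaces.
    text = raw.replace("_", " ").strip()
    # Drop parenthesised authorship such as "(Linnaeus, 1758)".
    text = re.sub(r"\([^)]*\)", " ", text)
    tokens = text.split()
    if not tokens:
        return ""
    # Genus: first letter upper case, the rest lower case.
    genus = tokens[0].capitalize()
    if len(tokens) == 1:
        return genus
    # Specific epithet: all lower case. Anything after it (e.g. "L." or
    # "Poir.") is treated as authorship and dropped.
    return f"{genus} {tokens[1].lower()}"


if __name__ == "__main__":
    for variant in [
        "Apis_Mellifera",
        "Baccharis_halimifolia_L.",
        "Carpodacus mexicanus (Statius Muller, 1776)",
    ]:
        print(variant, "->", normalize_name(variant))
```

Case and authorship variants of the same binomial collapse to one string, which is exactly what produces the false positives mentioned above when two distinct concepts share a name.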

BTW, does anyone know a good spell checker for scientific names?

iii. Mapping names to opaque URIs doesn't resolve the problems you raise
below. Will the identifier for
Carpodacus mexicanus
be the same as the identifier for
Carpodacus mexicanus (Statius Muller, 1776)?

How about
Arabis laevigata (Muhl. ex Willd.) Poir.
vs.
Arabis laevigata ?

If yes, then why not just use normalized names as identifiers? If no, then
we'll get false negatives in the sorts of SPARQL queries that I gave at
http://www.csee.umbc.edu/~jsachs/occurrences/queries/

BTW, is there a lookup service for taxonconcept.org identifiers (i.e. give
a list of names, get a list of identifiers)?

iv. In what sense do you see Darwin Core as being deficient as a semantic
web representation?

Regards -
Joel.

On Thu, 13 Jan 2011, Peter DeVries wrote:
...
Thanks Joel,
Here is one of the BioBlitz Occurrence Records marked up with Darwin Core
http://lsd.taxonconcept.org/describe/?url=http://www.cs.umbc.edu/~jsachs/occ...
Here is one of the TaxonConcept Records marked up in the txn vocabulary
http://lsd.taxonconcept.org/describe/?url=http://ocs.taxonconcept.org/ocs/f5...
Some people have trouble with how the %23 above is escaped in their email;
they might like this bit.ly bundle better.
<http://lsd.taxonconcept.org/describe/?url=http://ocs.taxonconcept.org/ocs/f522444a-2dd9-400e-be59-47213ef38cb9%23Occurrence>
<http://lsd.taxonconcept.org/about/html/http/ocs.taxonconcept.org/ocs/f522444a-2dd9-400e-be59-47213ef38cb9%01Occurrence>
http://bit.ly/hZdUpP
Also an issue to discuss is that identifications are to both a name and a
concept, so should these be the same dwc:taxonConcepts?
http://spire.umbc.edu/ethan/Apis_mellifera
http://spire.umbc.edu/ethan/Apis_Mellifera
==> http://bit.ly/g1zzJC (Apis mellifera se:z9oqP)
http://spire.umbc.edu/ethan/Ascelpius_syruaca
http://spire.umbc.edu/ethan/Asclepias_syriaca
==> HTML page http://lod.taxonconcept.org/ses/tTEIq.html Concept View
Bit.ly http://bit.ly/dJHJqj
http://spire.umbc.edu/ethan/Aster_nova-belgii
http://spire.umbc.edu/ethan/Aster_nova-gelgii
http://spire.umbc.edu/ethan/Baccharis_halimifolia
http://spire.umbc.edu/ethan/Baccharis_halimifolia_L.
http://spire.umbc.edu/ethan/Bartonia_virginia
http://spire.umbc.edu/ethan/Bartonia_virginica
http://spire.umbc.edu/ethan/Branta_canadensis
http://spire.umbc.edu/ethan/Branta_canadensis_(Linnaeus,_1758)
http://spire.umbc.edu/ethan/Carex_pennsylvanica
http://spire.umbc.edu/ethan/Carex_pensylvanica
http://spire.umbc.edu/ethan/Cyperus_esculantus
http://spire.umbc.edu/ethan/Cyperus_Esculantus
http://spire.umbc.edu/ethan/Cyperus_esculentus
http://spire.umbc.edu/ethan/Cyperus_esculentus_L.
http://spire.umbc.edu/ethan/Carpodacus_mexicanus
http://spire.umbc.edu/ethan/Carpodacus_mexicanus_(Statius_Muller,_1776)
...
From the taxon concepts there should be a link to the various name strings,
vs. modeling the name string as the concept.
Also, I think that Darwin Core is good for some things, but maybe not as a
semantic web representation.
Respectfully,
- Pete
On Thu, Jan 13, 2011 at 6:53 AM, joel sachs <jsachs@csee.umbc.edu> wrote:
...
Pete,
Thanks - I corrected the geo properties.
Joel.
On Wed, 12 Jan 2011, Peter DeVries wrote:
Hi Joel,
...
Cool :-)
I just loaded this into my SPARQL endpoint.
In the named graph
urn:org:linkedopenspeciesdata:dataspace:tdwg2010bioblitz
It consists of 19,990 Triples
Here is one of the dwc:taxonConceptID entries.
*About: http://spire.umbc.edu/ethan/Ampelopsis_brevipedunculata*
http://lsd.taxonconcept.org/describe/?url=http://spire.umbc.edu/ethan/Ampelo...
*About: http://www.cs.umbc.edu/~jsachs/occurrences/tdwg2010bioblitz_1627*
<
http://lsd.taxonconcept.org/describe/?url=http://spire.umbc.edu/ethan/Ampelo...
...
http://lsd.taxonconcept.org/describe/?url=http://www.cs.umbc.edu/~jsachs/occ...
<
http://lsd.taxonconcept.org/describe/?url=http://www.cs.umbc.edu/~jsachs/occ...
...
This should give you a count of occurrences.
SELECT count(*) WHERE {?s a <http://rs.tdwg.org/dwc/terms/#Occurrence>};
= 1882
SELECT count(*) WHERE {?s a <http://rs.tdwg.org/dwc/terms/#taxonConceptID
...
};
This should give you a list of occurrences
http://lsd.taxonconcept.org/describe/?url=http://rs.tdwg.org/dwc/terms/%23Oc...
If this did not come through your email system try the bit.ly.
http://bit.ly/g9BcoL
I tried the following that should have given me a google map of all the
occurrences but it did not result in the map.
DESCRIBE ?x WHERE {
 ?x <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://rs.tdwg.org/dwc/terms/#Occurrence>.
}
I looked at the RDF and I think I see the problem.
In the RDF
<geo:latitude>
41.53
</geo:latitude>
<geo:longitude>
-70.67
</geo:longitude>
Should be
<geo:lat>
41.53
</geo:lat>
<geo:long>
-70.67
</geo:long>
See http://www.w3.org/2003/01/geo/
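The rename can be sketched as a small text substitution in Python. This assumes the geo prefix is bound to the W3C Basic Geo namespace (http://www.w3.org/2003/01/geo/wgs84_pos#) and that the properties appear as plain element tags as shown above; fix_geo_properties is a hypothetical helper name, and an XML parser would be more robust for anything beyond a one-off fix.

```python
import re


def fix_geo_properties(rdf_xml):
    """Rename geo:latitude/geo:longitude tags to geo:lat/geo:long (hypothetical helper)."""
    # Only opening and closing tags are matched, so literal values and any
    # other text containing the word "latitude" are left untouched.
    rdf_xml = re.sub(r"(</?geo:)latitude\b", r"\1lat", rdf_xml)
    rdf_xml = re.sub(r"(</?geo:)longitude\b", r"\1long", rdf_xml)
    return rdf_xml


if __name__ == "__main__":
    sample = (
        "<geo:latitude>\n41.53\n</geo:latitude>\n"
        "<geo:longitude>\n-70.67\n</geo:longitude>"
    )
    print(fix_geo_properties(sample))
```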
I did the following query to get a list of all the dwc:taxonConceptIDs and
have attached them as a .txt file.
select distinct ?o WHERE {?s <http://rs.tdwg.org/dwc/terms/#taxonConceptID> ?o}
Pretty neat :-)
There are some things that I will get back to Joel on.
Here is where you can manually enter a SPARQL query. Click on "Advanced"
for the entry window.
http://lsd.taxonconcept.org/isparql/
Respectfully,
- Pete
On Wed, Jan 12, 2011 at 5:55 PM, joel sachs <jsachs@csee.umbc.edu> wrote:
Hi Everyone,
...
I've posted rdf of the bioblitz data. It's at
http://www.cs.umbc.edu/~jsachs/occurrences/TechnoBioblitzOccurrences.rdf.
Individual occurrences can be retrieved via
http://www.cs.umbc.edu/~jsachs/occurrences/[occurrence_id]
e.g. http://www.cs.umbc.edu/~jsachs/occurrences/tdwg2010bioblitz_1835
Individual identifications can be retrieved via
http://www.cs.umbc.edu/~jsachs/identifications/[identification_id]
e.g.
http://www.cs.umbc.edu/~jsachs/identifications/tdwg2010bioblitz_1835_id_1
The scripts behind this are on the kludgy side, so reports of errors and
abnormalities will be warmly welcomed.
Implicit in each of the following notes is the question "Is this a good
way to do it?":
1. The data is "normalized" w.r.t. identification. "Normalized" is in
quotes because I mean it in the sense that Steve Baskauf was using in his
Fall 2010 series of posts. His meaning of the term makes sense to me, but
many people (e.g. the OBO folks), take "normalized ontology" to mean
"disentangled" (i.e. no multiple inheritance.)
As an example, here's an occurrence with two crowdsourced determinations:
http://www.cs.umbc.edu/~jsachs/occurrences/tdwg2010bioblitz_1644
2. I used sequential integers for observation and identification IDs; in
practice, a mechanism needs to be in place to prevent two people from
assigning the same id to their respective identifications.
3. My answer to Cam Webb's Question #1 from
http://lists.tdwg.org/pipermail/tdwg-content/2010-October/001720.html
is "both". In other words, just as "Joel Sachs" is both me and also my
name, so
http://www.cs.umbc.edu/~jsachs/occurrences/tdwg2010bioblitz_1668 is both
an occurrence and an occurrence_id, expressed as:
---
<dwc:Occurrence rdf:about="http://www.cs.umbc.edu/~jsachs/occurrences/tdwg2010bioblitz_1644">
<dwc:occurrenceID>
http://www.cs.umbc.edu/~jsachs/occurrences/tdwg2010bioblitz_1644
</dwc:occurrenceID>
<blah blah blah/>
</dwc:Occurrence>
---
4. I was surprised to see that the Darwin Core Identification class has no
"occurrenceID" or "specimenID" term. How is one supposed to tie an
identification to an observation (assuming the identification is not
in-lined, of course)? DeVries and Baskauf each mint their own terms for
doing this (txn:identificationHasOccurrence and sernec:basedOnOccurrence,
respectively); I used dwc:occurrenceID as if it were a record level term.
5. We had scope for multiple taxonConceptID columns in the Fusion table,
and assigned lsids where possible. I also mean to work with Pete to assign
GUIDs from taxonconcept.org. In addition, I assigned ethan taxon concept
ids, which look like this:
http://spire.umbc.edu/ethan/Coffea_arabica
In their argument over opaque vs. transparent taxonConceptIDs, I was
sympathetic to both Pete's and Gregor's arguments. Ultimately, if the
tooling exists to always display the rdfs:labels every time I'm looking
at a list of opaque IDs, then transparent IDs are unnecessary. But, for
now, it's really helpful to look at an ID and know what it's referring to.
(For species names not in the spire database, the rdf returned by
http://spire.umbc.edu/ethan/$name
is simply an rdfs:seeAlso to
http://gni.globalnames.org/name_strings?search_term=$name)
6. It was easy to assert membership in RDF classes corresponding to
various Cape Cod categories of concern - invasive species, threatened
species, indicators, etc. You can see these classes at
http://spire.umbc.edu/ontologies/lists (Information of where these lists
come from is included as rdfs:comments. I'll add further documentation,
e.g. links to eml files.)
Note that "ThingOfConcern" is defined as the superclass of all the other
classes in the collection. The idea here is that people can create their
own "ThingOfConcern" class, and then query for observations that are of
concern to them. You can see sample sparql queries at
http://www.csee.umbc.edu/~jsachs/occurrences/queries/sample.txt
As an aside, I think we, as a community, should come up with a
biodiversity benchmark suite of rdf data and corresponding sparql queries
that can be used to test the suitability and scalability of semantic web
knowledge bases. I'll take this up in a future post (unless someone beats
me to it).
Comments, questions, and better ideas are welcome.
Thanks -
Joel.
_______________________________________________
tdwg-content mailing list
tdwg-content@lists.tdwg.org
http://lists.tdwg.org/mailman/listinfo/tdwg-content
--
---------------------------------------------------------------
Pete DeVries
Department of Entomology
University of Wisconsin - Madison
445 Russell Laboratories
1630 Linden Drive
Madison, WI 53706
TaxonConcept Knowledge Base <http://www.taxonconcept.org/> / GeoSpecies
Knowledge Base <http://lod.geospecies.org/>
About the GeoSpecies Knowledge Base <http://about.geospecies.org/>
------------------------------------------------------------

Tony.Rees＠csiro.au