Re: [tdwg-content] Another example of non-overlapping concepts

17 May 2011

Hi Matt,

It took me a while to ponder your question. There is a long answer, which
is complex and easily misinterpreted, and there is a shorter answer.

For now I think the "shorter" answer set in a historical context is best.

The best use of my abilities seems to be recognizing an "ability gap" and
figuring out a technical solution or tool to address it.

The most visible of these involved microscopy and visualization tools that
make complex ideas understandable.

My interest in the species problem dates back to when I had the opportunity
to talk with E.O. Wilson in 1991/1992.

At that time he said that if you have a knack for computers we need all this
information in databases so it is accessible.

*One of his former Ph.D. students is on my committee.

Years later I had the opportunity to work on questions like this and started
to think about how to connect all these disparate facts about species
together in a usable queryable knowledge base.

I noticed that several groups and individuals were marking up data sets
including observations with different scientific names even though they were
clearly meaning the same "species".

* These groups would agree that they were communicating about the same
species, but would not always agree on the name.

This prevents large-scale data integration and analysis, which is described
in part here: http://about.geospecies.org/

With the advent of the web, and the semantic web in particular, this
"database" could be global and almost infinitely scalable.

I started lobbying TDWG in 2006 for two things:

1) A GUID for the "species" that was not tied to a particular name string
2) A system that followed semantic web best practices which LSID etc. do
not.

Since my TDWG efforts were not successful, I started GeoSpecies and, based
on comments from a semantic web expert, modified these somewhat into what is
now TaxonConcept.org.

The TCS is an XML standard for transmitting information about taxon
concepts that I think maps best to a "name use concept" (Rich's TNUs).

The TaxonConcepts are identified with semantic web GUIDs that follow
semantic web best practices and resolve to informative documents.

In their current form these documents are not ideal because they do not do a
good enough job clearing up what would be the best concept match for a given
individual or specimen.

They do however have most of the plumbing for this in that they allow
semantic web links to name uses, specimens, occurrence records, images, DNA,
authors and publications including the original description.

They also link to similar entities that are on the semantic web, most
notably DBpedia, Uniprot, Freebase, Bio2RDF etc.

This linking may not seem valuable to humans, but it is valuable for
machines that need to determine what entities are similar and what entities
are different.

This also increases the "findability" of these other data sets.

I see my current set of about 105,000 species as an example set that people
can use to try out these models.

In their final form these should be authored by editors that determine what
specimens and other data are good examples of instances of these concepts.

The editors will be linked via a URI so it is easy to track attribution.

The final concepts do not have to be in one place; they could be
distributed. But to avoid the kinds of nomenclatural differences that have
occurred between zoology, botany, etc., it would be best to have one code
base for now.

They don't have to have the same underlying stack, which now is based on
Ruby on Rails, but could be ported to anything.

What they do need is a common structure and a common understanding as to
what each attribute means and how it can be appropriately used.

For some use cases it is appropriate to consider the following the same
"thing":

 http://lod.taxonconcept.org/ses/v6n7p#Species
 http://purl.uniprot.org/taxonomy/9696
 http://www.freebase.com/view/en/cougar
 http://sw.opencyc.org/concept/Mx4rvVj5o5wpEbGdrcN5Y29ycA
 http://www.bbc.co.uk/nature/species/Cougar#species

For other use cases, this sameAs is not appropriate.
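
A minimal Python sketch of use-case-dependent equivalence follows. The five
URIs are the ones listed in this message; the use-case names and the
`expand` helper are hypothetical illustrations, not part of TaxonConcept.

```python
# The five URIs from this message that some use cases treat as one "thing".
COUGAR_LINKS = frozenset({
    "http://lod.taxonconcept.org/ses/v6n7p#Species",
    "http://purl.uniprot.org/taxonomy/9696",
    "http://www.freebase.com/view/en/cougar",
    "http://sw.opencyc.org/concept/Mx4rvVj5o5wpEbGdrcN5Y29ycA",
    "http://www.bbc.co.uk/nature/species/Cougar#species",
})

# Hypothetical use-case names: only these may treat the links as sameAs.
LOOSE_USE_CASES = {"find-related-data"}


def expand(uri, use_case):
    """Return the set of URIs to search for under the given use case."""
    if use_case in LOOSE_USE_CASES and uri in COUGAR_LINKS:
        return set(COUGAR_LINKS)
    # Stricter use cases keep every URI distinct.
    return {uri}


uniprot = "http://purl.uniprot.org/taxonomy/9696"
loose = expand(uniprot, "find-related-data")
strict = expand(uniprot, "specimen-identification")
print(len(loose), len(strict))
```

The point of the sketch is only that the equivalence is a choice made per
use case, not a property baked into the identifiers themselves.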

Wikipedia is very valuable, but if someone changes the article title then
the URI changes in DBpedia.

Uniprot and Bio2RDF are useful in that they link to lots of related data but
they don't really give you any information about what specimens are
instances of that concept and they only have those species which have NCBI
ID's.

What I want is a set of GUID's that resolve to a human readable HTML page
and an RDF representation that people can use to "tag" their data.

For instance:

*I am going to assert that what I have under the microscope is an instance
of the concept described on this page. I do not tie this assertion to a
particular name or classification hierarchy.*

Because it makes no sense to replicate the functionality of the Encyclopedia
of Life etc., I am mainly concentrating on the RDF representations and
testing if they behave as expected in SPARQL queries.

* The HTML pages are not really pretty or as informative as the RDF or as
the concept as viewed in the knowledge base.

I have been working with the Encyclopedia of Life and GNI groups for a while
exploring how these may or may not be useful to them.

During my visit to Woods Hole I said that I have no interest in building an
empire; I just want to build a solution and would like to partner with them
and GBIF.

Although I remain active on TDWG I find the most valuable suggestions seem
to come from the LOD community since we seem to have a common goal - that is
creating something that works in a reasonable amount of time.

Also, in the LOD cloud every linked data set increases the value of all the
other data sets.

This is probably more than your question required, but it provides some
explanation as to what these are and why I have implemented them in the way
I have.

Respectfully,

- Pete

On Fri, May 13, 2011 at 4:14 PM, Matt Jones <jones@nceas.ucsb.edu> wrote:
...
Hi Peter,
Does your idea of #ObjectiveSpeciesModel correspond 1:1 with the TCS
standard's idea of a Nominal Concept (i.e., <TaxonConcept type="nominal">) ?
 Can you outline how your concept types differ from TCS concept types?
Thanks,
Matt
On Fri, May 13, 2011 at 12:41 PM, Peter DeVries <pete.devries@gmail.com> wrote:
...
Hi Nico,
Thanks for posting this.
I have something in the concept model to indicate the basis for the
species concept.
For now I have three types. An individual species concept can have a
combination of one, two or all three.
In the RDF they look like this:
<txn:speciesConceptBasedOn rdf:resource="http://lod.taxonconcept.org/ontology/txn.owl#ObjectiveSpeciesModel"/>
The first is what I call the #ObjectiveSpeciesModel - this indicates that
it is a species concept because we say it is.
All the species concepts are at least an #ObjectiveSpeciesModel.
*This is in part a way to handle things like the domestic cat which you
want to be seen as different from the African Wildcat.
There are also tags for
txn:PhylogeneticSpeciesModel
txn:BiologicalSpeciesModel
For now I don't have these other models set in the example data, but the
fields are in the database and the code, so that an editor could state the
basis for the model.
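
As a rough sketch of how such statements could be read back, the Python
below parses an RDF/XML fragment and collects the models each concept is
based on. The `txn:` namespace is assumed from the ontology URL above, and
the fragment itself (including the second model on this concept) is a
hypothetical example, not copied from the live data.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Assumed namespace for the txn: prefix, inferred from the ontology URL.
TXN_NS = "http://lod.taxonconcept.org/ontology/txn.owl#"

# Hypothetical fragment for illustration only.
FRAGMENT = f"""
<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:txn="{TXN_NS}">
  <rdf:Description rdf:about="http://lod.taxonconcept.org/ses/v6n7p#Species">
    <txn:speciesConceptBasedOn rdf:resource="{TXN_NS}ObjectiveSpeciesModel"/>
    <txn:speciesConceptBasedOn rdf:resource="{TXN_NS}BiologicalSpeciesModel"/>
  </rdf:Description>
</rdf:RDF>
"""


def concept_bases(rdf_xml):
    """Map each concept URI to the set of model names it is based on."""
    bases = {}
    root = ET.fromstring(rdf_xml)
    for desc in root.iter(f"{{{RDF_NS}}}Description"):
        about = desc.get(f"{{{RDF_NS}}}about")
        for stmt in desc.findall(f"{{{TXN_NS}}}speciesConceptBasedOn"):
            resource = stmt.get(f"{{{RDF_NS}}}resource")
            # Keep only the fragment after '#', e.g. "ObjectiveSpeciesModel".
            bases.setdefault(about, set()).add(resource.split("#", 1)[1])
    return bases


bases = concept_bases(FRAGMENT)
print(bases)
```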
I can think of a couple of different ways to handle the issue of
alternative species concepts.
* Note that the identifications as proposed by DarwinCore don't seem to
indicate what kind of model the identifications were based on.
  So it is not clear to me if a straight DarwinCore data set would allow
the analysis above.
Instead of having multiple different statements like
*txn:occurrenceHasSpeciesConcept <>* in the record for each occurrence,
one could use different predicates to link to different kinds of species
concepts.
*txn:occurrenceHasUniprotConcept* => <http://purl.uniprot.org/taxonomy/9696>
This would allow someone to query for the occurrences of
<http://purl.uniprot.org/taxonomy/9696>.
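
A small Python sketch of that kind of query, with triples held as plain
tuples. The two predicate names come from this message; the occurrence URIs
under example.org and the `occurrences_of` helper are hypothetical.

```python
# Triples as (subject, predicate, object). Occurrence URIs are placeholders.
TRIPLES = [
    ("http://example.org/occurrence/1", "txn:occurrenceHasSpeciesConcept",
     "http://lod.taxonconcept.org/ses/v6n7p#Species"),
    ("http://example.org/occurrence/1", "txn:occurrenceHasUniprotConcept",
     "http://purl.uniprot.org/taxonomy/9696"),
    ("http://example.org/occurrence/2", "txn:occurrenceHasSpeciesConcept",
     "http://lod.taxonconcept.org/ses/v6n7p#Species"),
]


def occurrences_of(triples, predicate, concept):
    """Return the sorted subjects linked to `concept` through `predicate`."""
    return sorted({s for s, p, o in triples if p == predicate and o == concept})


uniprot_hits = occurrences_of(
    TRIPLES,
    "txn:occurrenceHasUniprotConcept",
    "http://purl.uniprot.org/taxonomy/9696",
)
species_hits = occurrences_of(
    TRIPLES,
    "txn:occurrenceHasSpeciesConcept",
    "http://lod.taxonconcept.org/ses/v6n7p#Species",
)
print(uniprot_hits)
print(species_hits)
```

Because each kind of concept has its own predicate, a query for one kind
never picks up links that were asserted against another.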
That said, it is not clear to me what people mean by different
identifications.
Is the intent for identifications with different homotypic synonyms to be
identifications of the same thing or not?
The way it works now in many data sets is that Felis concolor, Puma
concolor and Puma conncolor are treated as identifications of different
things.
This is another way of saying: *is the namestring the concept?*
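
To make the difference concrete, here is a minimal Python sketch that counts
the same observations two ways. The lookup table and the observation list
are invented for illustration, and the concept URI is assumed to be the
cougar concept cited elsewhere in this thread.

```python
from collections import Counter

# Assumed to be the cougar concept URI mentioned elsewhere in this thread.
CONCEPT = "http://lod.taxonconcept.org/ses/v6n7p#Species"

# Hypothetical curated lookup from name strings to one concept URI.
NAME_TO_CONCEPT = {
    "Felis concolor": CONCEPT,
    "Puma concolor": CONCEPT,
    "Puma conncolor": CONCEPT,  # misspelled variant that occurs in data sets
}

# Invented observation records, identified only by name string.
OBSERVED_NAMES = [
    "Felis concolor",
    "Puma concolor",
    "Puma conncolor",
    "Puma concolor",
]

# Grouping by name string splits one species into several "things".
by_name = Counter(OBSERVED_NAMES)
# Grouping by concept URI keeps them together.
by_concept = Counter(NAME_TO_CONCEPT[name] for name in OBSERVED_NAMES)

print(len(by_name), len(by_concept))
```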
My understanding of the eBird project is that it allows citizen scientists
to contribute their own observations. This creates a much larger data set
for analysis etc.
They have created a curated list of species and a ~6 letter code for
each. This serves as a guide for observers on how to encode their
observations.
I think their progress would be inhibited, the occurrence coding
inconsistent, and contributors frustrated, if they had a list that included
many overlapping species concepts.
Thanks again for your comments,
- Pete
On Fri, May 13, 2011 at 3:05 AM, Nico Franz <nico.franz@upr.edu> wrote:
...
Hello Pete (et al.):
For birds, Town Peterson at KU and colleagues have published these
papers showing how alternative bird taxonomies affect the ranking of
conservation priorities.
http://specify5.specifysoftware.org/Informatics/bios/biostownpeterson/PN_CB_...
http://specify5.specifysoftware.org/Informatics/bios/biostownpeterson/NP_BN_...
http://specify5.specifysoftware.org/Informatics/bios/biostownpeterson/P_BCI_...
Here's the abstract of the 1999 paper:
Analysis of geographic concentrations of endemic taxa is often used to
determine priorities for conservation action; nevertheless, assumptions
inherent in the taxonomic authority list used as the basis for analysis are
not always considered. We analyzed foci of avian endemism in Mexico under
two alternate species concepts. Under the biological species concept, 101
bird species are endemic to Mexico and are concentrated in the mountains of
the western and southern portions of the country. Under the phylogenetic
species concept, however, total endemic species rises to 249, which are
concentrated in the mountains and lowlands of western Mexico. Twenty-four
narrow endemic biological species are concentrated on offshore islands, but
97 narrow endemic phylogenetic species show a concentration in the
Transvolcanic Belt of the mainland and on several offshore islands. Our
study demonstrates that conservation priorities based on concentrations of
endemic taxa depend critically on the particular taxonomic authority
employed and that biodiversity evaluations need to be developed in
collaboration or consultation with practicing systematic specialists.
There was a debate recently on Taxacom that was started and
subsequently neatly summarized by Fabian Haas. The topic was "let's
summarize reasons why 'donors' seem to not fund taxonomy". One point from
the summary was this:
3) Taxonomy is over-accurate for most applications
Most (not all) decisions in e.g. modelling and conservation are done and
can be done without complete knowledge of taxa. As it is, decisions for
conservation areas are often based on flagship species (e.g. elephants), on
taxa which have an excellent research background, e.g. birds (IBAs), on
availability of land (e.g. land with a high Tsetse burden), importance as
corridor and other factors, but never on a complete view of all
biodiversity in a specific area. Even if an inventory existed, it would be
an illusion that we could collect data on ecological requirements and
population dynamics for most of the species necessary for informed
decisions. A complete inventory does not seem to provide an advantage for
conservation.
I personally think there's some truth to that. I also think that,
while it's understandable that an accurate representation of the (sometimes)
fleetingness of taxonomic consensus is not a priority for applied ecological
projects, if taxonomists themselves don't find better ways to document and
link these alternative perspectives, then it's not the best science we can
do. That would be fine too if adopted outright as a pragmatic stance.
Regards,
Nico
On 5/13/2011 1:08 AM, Peter DeVries wrote:
I thought that I would also mention that in addition to The Plants List,
the eBird project also uses non-overlapping concepts in its bird list (it
does have concepts for common hybrids).
What is clear to me is that you cannot create graphs like these if
every observation can have X number of species (especially those that
are overlapping) without any indication of which is the most appropriate one.
eBird Occurrence Maps Northern Cardinal
http://ebird.org/content/ebird/about/occurrence-maps/northern-cardinal
NCBI is also similar.
Perhaps a member of the consensus committee can comment?
-- Pete
------------------------------------------------------------------------------------
Pete DeVries
Department of Entomology
University of Wisconsin - Madison
445 Russell Laboratories
1630 Linden Drive
Madison, WI 53706
Email: pdevries@wisc.edu
TaxonConcept <http://www.taxonconcept.org/>  &  GeoSpecies<http://about.geospecies.org/> Knowledge
Bases
A Semantic Web, Linked Open Data <http://linkeddata.org/>  Project
--------------------------------------------------------------------------------------
_______________________________________________
tdwg-content mailing list
tdwg-content@lists.tdwg.org
http://lists.tdwg.org/mailman/listinfo/tdwg-content