[tdwg-content] Another example of non-overlapping concepts

Wed May 18 00:00:39 CEST 2011

Hi Matt,

It took me a while to ponder your question. There is a long answer, which is
complex and easily misinterpreted, and there is a shorter answer.

For now I think the "shorter" answer set in a historical context is best.

The best use of my abilities seems to be recognizing an "ability gap" and
figuring out a technical solution or tool to address it.

The most visible of these involved microscopy and visualization tools
that make complex ideas understandable.

My interest in the species problem dates back to when I had the opportunity
to talk with E.O. Wilson in 1991/1992.

At that time he said that if you have a knack for computers, we need all this
information in databases so that it is accessible.

*One of his former Ph.D. students is on my committee.

Years later I had the opportunity to work on questions like this and started
to think about how to connect all these disparate facts about species
together in a usable, queryable knowledge base.

I noticed that several groups and individuals were marking up data sets
including observations with different scientific names even though they were
clearly meaning the same "species".

* These groups would agree that they were communicating about the same
species, but not always agree on the name

This prevents large-scale data integration and analysis, which is described
in part here: http://about.geospecies.org/

With the advent of the web, and the semantic web in particular, this
"database" could be global and almost infinitely scalable.

I started lobbying TDWG in 2006 for two things:

1) A GUID for the "species" that was not tied to a particular name string
2) A system that followed semantic web best practices which LSID etc. do
not.

Since my TDWG efforts were not successful, I started GeoSpecies and, based on
comments from a semantic web expert, modified these somewhat into what is now
TaxonConcept.org.

The TCS is an XML standard for transmitting information about taxon
concepts, which I think map best to a "name use concept" (Rich's TNUs).

The TaxonConcepts are identified with semantic web GUIDs that follow
semantic web best practices and resolve to informative documents.

In their current form these documents are not ideal because they do not do a
good enough job clearing up what would be the best concept match for a given
individual or specimen.

They do however have most of the plumbing for this in that they allow
semantic web links to name uses, specimens, occurrence records, images, DNA,
authors and publications including the original description.

They also link to similar entities that are on the semantic web, most
notably DBpedia, Uniprot, Freebase, Bio2RDF etc.

This linking may not seem valuable to humans, but it is valuable for machines
that need to determine what entities are similar and what entities are
different.

This also increases the "findability" of these other data sets.

I see my current set of about 105,000 species as an example set that people
can use to try out these models.

In their final form these should be authored by editors that determine what
specimens and other data are good examples of instances of these concepts.

The editors will be linked via a URI so it is easy to track attribution.

The final concepts do not have to be in one place; they could be distributed.
But to avoid the kinds of nomenclatural differences that have occurred
between zoology, botany, etc., it would be best to have one code base for now.

They don't have to have the same underlying stack, which now is based on
Ruby on Rails, but could be ported to anything.

What they do need is a common structure and a common understanding as to
what each attribute means and how it can be appropriately used.

For some use cases it is appropriate to consider the following the same
"thing":

 http://lod.taxonconcept.org/ses/v6n7p#Species

 http://purl.uniprot.org/taxonomy/9696

 http://www.freebase.com/view/en/cougar

 http://sw.opencyc.org/concept/Mx4rvVj5o5wpEbGdrcN5Y29ycA

 http://www.bbc.co.uk/nature/species/Cougar#species

For other use cases, this sameAs is not appropriate.
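
A minimal sketch of that use-case-dependent equivalence, written in Python
rather than RDF and not taken from the TaxonConcept code base. The function
names and the merge_matches flag are assumptions made for illustration; only
the URIs come from the list above.

```python
# Hypothetical sketch: external URIs collapse onto a TaxonConcept URI only
# when the caller's use case opts in to treating them as the same "thing".
CONCEPT = "http://lod.taxonconcept.org/ses/v6n7p#Species"

# External identifiers listed above as matches for the concept.
MATCHES = {
    CONCEPT: {
        "http://purl.uniprot.org/taxonomy/9696",
        "http://www.freebase.com/view/en/cougar",
        "http://sw.opencyc.org/concept/Mx4rvVj5o5wpEbGdrcN5Y29ycA",
        "http://www.bbc.co.uk/nature/species/Cougar#species",
    }
}


def canonical(uri, merge_matches):
    """Return the URI that `uri` should be filed under.

    With merge_matches=True, a listed external URI collapses onto its
    TaxonConcept URI. With merge_matches=False, every URI stays distinct.
    """
    if merge_matches:
        for concept, matches in MATCHES.items():
            if uri == concept or uri in matches:
                return concept
    return uri


def same_thing(a, b, merge_matches):
    """True if the two URIs count as the same entity under this policy."""
    return canonical(a, merge_matches) == canonical(b, merge_matches)
```

A data integration job might call same_thing(a, b, True), while a job that
needs to keep the sources apart would pass False and see five distinct
entities.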

Wikipedia is very valuable, but if someone changes the article title then
the URI changes in DBpedia.

Uniprot and Bio2RDF are useful in that they link to lots of related data but
they don't really give you any information about what specimens are
instances of that concept and they only have those species which have NCBI
ID's.

What I want is a set of GUIDs that resolve to a human-readable HTML page
and an RDF representation that people can use to "tag" their data.

For instance:

*I am going to assert that what I have under the microscope is an instance
of the concept described on this page. I do not tie this assertion to a
particular name or classification hierarchy.*
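
As a rough illustration of why tagging with a concept GUID helps, here is a
small Python sketch. The record layout and the observation ids are made up
for this example; the three name strings are the ones from my earlier message
quoted below, and the URI is the cougar concept listed above.

```python
from collections import defaultdict

PUMA = "http://lod.taxonconcept.org/ses/v6n7p#Species"

# Hypothetical occurrence records: each carries whatever name string the
# observer used, plus the concept URI the observer asserted.
records = [
    {"id": "obs-1", "name": "Felis concolor", "concept": PUMA},
    {"id": "obs-2", "name": "Puma concolor", "concept": PUMA},
    {"id": "obs-3", "name": "Puma conncolor", "concept": PUMA},
]


def group_by(rows, key):
    """Group record ids by the value stored under `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row["id"])
    return dict(groups)


# Grouping on the name string splits the observations three ways.
by_name = group_by(records, "name")

# Grouping on the concept URI keeps them together, whatever name was used.
by_concept = group_by(records, "concept")
```

The name-based grouping yields three separate "species", while the
concept-based grouping yields one, which is the behaviour needed for
large-scale integration.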

Because it makes no sense to replicate the functionality of the Encyclopedia
of Life etc., I am mainly concentrating on the RDF representations and
testing if they behave as expected in SPARQL queries.

* The HTML pages are not really pretty or as informative as the RDF or as
the concept as viewed in the knowledge base.

I have been working with the Encyclopedia of Life and GNI groups for a while
exploring how these may or may not be useful to them.

During my visit to Woods Hole I said that I have no interest in building an
empire; I just want to build a solution and would like to partner with them
and GBIF.

Although I remain active on TDWG, I find that the most valuable suggestions seem
to come from the LOD community, since we seem to have a common goal: creating
something that works in a reasonable amount of time.

Also, in the LOD cloud every linked data set increases the value of all the
other data sets.

This is probably more than your question required, but it provides some
explanation as to what these are and why I have implemented them in the way
I have.

Respectfully,

- Pete

On Fri, May 13, 2011 at 4:14 PM, Matt Jones <jones at nceas.ucsb.edu> wrote:

> Hi Peter,
>
> Does your idea of #ObjectiveSpeciesModel correspond 1:1 with the TCS
> standard's idea of a Nominal Concept (i.e., <TaxonConcept type="nominal">) ?
>  Can you outline how your concept types differ from TCS concept types?
>
> Thanks,
> Matt
>
> On Fri, May 13, 2011 at 12:41 PM, Peter DeVries <pete.devries at gmail.com> wrote:
>
>> Hi Nico,
>>
>> Thanks for posting this.
>>
>> I have something in the concept model to indicate the basis for the
>> species concept.
>>
>> For now I have three types. An individual species concept can have a
>> combination of one, two or all three
>>
>> In the RDF they look like this
>>
>> <txn:speciesConceptBasedOn rdf:resource="http://lod.taxonconcept.org/ontology/txn.owl#ObjectiveSpeciesModel"/>
>>
>> The first is what I call the #ObjectiveSpeciesModel - this indicates that
>> it is a species concept because we say it is.
>>
>> All the species concepts are at least an #ObjectiveSpeciesModel
>>
>> *This is in part a way to handle things like the domestic cat which you
>> want to be seen as different from the African Wildcat.
>>
>> There are also tags for
>>
>> txn:PhylogeneticSpeciesModel
>> txn:BiologicalSpeciesModel
>>
>> For now I don't have these other models set in the example data, but the
>> fields are in the database and the code, so that an editor could state the
>> basis for the model.
>>
>> I can think of a couple of different ways to handle the issue of
>> alternative species concepts.
>>
>> * Note that the identifications as proposed by DarwinCore don't seem to
>> indicate what kind of model the identifications were based on.
>>   So it is not clear to me if a straight DarwinCore data set would allow
>> the analysis above.
>>
>> Instead of having multiple different statements like
>>
>> *txn:occurrenceHasSpeciesConcept <> *in the record for each occurrence
>>
>> one could use different predicates to link to different kinds of species
>> concepts.
>>
>> *txn:occurrenceHasUniprotConcept* => <http://purl.uniprot.org/taxonomy/9696>
>>
>> This would allow someone to query for the occurrences of
>> <http://purl.uniprot.org/taxonomy/9696>.
>>
>> That said, it is not clear to me what people mean by different
>> identifications.
>>
>> Is the intent for identifications with different homotypic synonyms to
>> be identifications of the same thing or not?
>>
>> The way it works now in many data sets is that Felis concolor, Puma
>> concolor and Puma conncolor are treated as identifications of different
>> things.
>>
>> This is another way of saying: *is the namestring the concept?*
>>
>> My understanding of the eBird project is that it allows citizen scientists
>> to contribute their own observations. This creates a much larger data set
>> for analysis etc.
>>
>> They have created a curated list of species and a ~6 letter code for
>> each. This serves as a guide for observers on how to encode their
>> observations.
>>
>> I think their progress would be inhibited, the occurrence coding
>> inconsistent, and contributors frustrated, if they had a list that included
>> many overlapping species concepts.
>>
>> Thanks again for your comments,
>>
>> - Pete
>>
>> On Fri, May 13, 2011 at 3:05 AM, Nico Franz <nico.franz at upr.edu> wrote:
>>
>>>  Hello Pete (et al.):
>>>
>>>    For birds, Town Peterson at KU and colleagues have published these
>>> papers showing how alternative bird taxonomies affect the ranking of
>>> conservation priorities.
>>>
>>>
>>> http://specify5.specifysoftware.org/Informatics/bios/biostownpeterson/PN_CB_1999.pdf
>>>
>>> http://specify5.specifysoftware.org/Informatics/bios/biostownpeterson/NP_BN_2004.pdf
>>>
>>> http://specify5.specifysoftware.org/Informatics/bios/biostownpeterson/P_BCI_2006.pdf
>>>
>>>    Here's the abstract of the 1999 paper:
>>>
>>> Analysis of geographic concentrations of endemic taxa is often used to
>>> determine priorities for conservation
>>> action; nevertheless, assumptions inherent in the taxonomic authority
>>> list used as the basis for
>>> analysis are not always considered. We analyzed foci of avian endemism in
>>> Mexico under two alternate species
>>> concepts. Under the biological species concept, 101 bird species are
>>> endemic to Mexico and are concentrated
>>> in the mountains of the western and southern portions of the country.
>>> Under the phylogenetic species
>>> concept, however, total endemic species rises to 249, which are
>>> concentrated in the mountains and lowlands
>>> of western Mexico. Twenty-four narrow endemic biological species are
>>> concentrated on offshore islands, but
>>> 97 narrow endemic phylogenetic species show a concentration in the
>>> Transvolcanic Belt of the mainland and
>>> on several offshore islands. Our study demonstrates that conservation
>>> priorities based on concentrations of
>>> endemic taxa depend critically on the particular taxonomic authority
>>> employed and that biodiversity evaluations
>>> need to be developed in collaboration or consultation with practicing
>>> systematic specialists.
>>>
>>>    There was a debate recently on Taxacom that was started and
>>> subsequently neatly summarized by Fabian Haas. The topic was "let's
>>> summarize reasons why 'donors' seem to not fund taxonomy". One point from
>>> the summary was this:
>>>
>>> 3) Taxonomy is over-accurate for most applications
>>>
>>> Most (not all) decisions in e.g. modelling and conservation are done and
>>> can be done without complete knowledge of taxa. As it is, decisions for
>>> conservation areas are often based on flagship species (e.g. elephants), on
>>> taxa which have an excellent research background, e.g. birds (IBAs), on
>>> availability of land (e.g. land with a high Tsetse burden), importance as
>>> corridor and other factors, but never on a complete view on an all
>>> biodiversity in a specific area. Even if an inventory existed, it would be
>>> an illusion that we could collect data on ecological requirements and
>>> population dynamics for most of the species necessary for informed
>>> decisions. A complete inventory does not seem to provide an advantage for
>>> conservation.
>>>
>>>    I personally think there's some truth to that. I also think that,
>>> while it's understandable that an accurate representation of the (sometimes)
>>> fleetingness of taxonomic consensus is not a priority for applied ecological
>>> projects, if taxonomists themselves don't find better ways to document and
>>> link these alternative perspectives, then it's not the best science we can
>>> do. That would be fine too if adopted outright as a pragmatic stance.
>>>
>>> Regards,
>>>
>>> Nico
>>>
>>>
>>>
>>> On 5/13/2011 1:08 AM, Peter DeVries wrote:
>>>
>>> I thought that I would also mention that in addition to The Plant List,
>>> the eBird project also uses non-overlapping concepts in its bird list (it
>>> does have concepts for common hybrids).
>>>
>>>  What is clear to me is that you cannot create graphs like these if
>>> every observation can have X number of species (especially ones that
>>> overlap) without any indication of which is the most appropriate one.
>>>
>>>  eBird Occurrence Maps Northern Cardinal
>>> http://ebird.org/content/ebird/about/occurrence-maps/northern-cardinal
>>>
>>>  NCBI is also similar.
>>>
>>>  Perhaps a member of the consensus committee can comment?
>>>
>>> -- Pete
>>>
>>> ------------------------------------------------------------------------------------
>>> Pete DeVries
>>> Department of Entomology
>>> University of Wisconsin - Madison
>>> 445 Russell Laboratories
>>> 1630 Linden Drive
>>> Madison, WI 53706
>>> Email: pdevries at wisc.edu
>>> TaxonConcept <http://www.taxonconcept.org/>  &  GeoSpecies<http://about.geospecies.org/> Knowledge
>>> Bases
>>> A Semantic Web, Linked Open Data <http://linkeddata.org/>  Project
>>>
>>> --------------------------------------------------------------------------------------
>>>
>>>
>>> _______________________________________________
>>> tdwg-content mailing list
>>> tdwg-content at lists.tdwg.org
>>> http://lists.tdwg.org/mailman/listinfo/tdwg-content
>>>
>>>
>>>
>>
>>
>
