Re: [tdwg-tag] [tdwg-content] Canonical name parsing

14 Mar 2012

      I do not have an opinion on this issue but wanted to note that the 
TaxonName part of the TDWG Ontology appears to be fully functional.  
Given that the TaxonName and TaxonConcept ontologies are based on TCS, 
there may be existing terms (based on TCS) with stable URIs to represent 
exactly what people want to say.  They wouldn't be Darwin Core terms, 
but they would be defined and have stable URIs nonetheless.  For example,

http://rs.tdwg.org/ontology/voc/TaxonName#nameComplete
which can be abbreviated tn:nameComplete
where tn:=http://rs.tdwg.org/ontology/voc/TaxonName#

is defined as "The complete uninomial, binomial or trinomial name 
without any authority or year components."

Thus one could mark up data as
<tn:nameComplete>Homo sapiens</tn:nameComplete>
and theoretically this would have meaning to the extent to which people 
take the TDWG Ontology seriously.  But that is a different item for 
discussion...

Steve

To view the rdf, see: 
http://code.google.com/p/tdwg-ontology/source/browse/trunk/ontology/voc/Taxo...
and 
http://code.google.com/p/tdwg-ontology/source/browse/trunk/ontology/voc/Taxo...
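
As a rough illustration of the markup described above, the following sketch builds the tn:nameComplete element with Python's standard library. The function name is made up for this example. Only the namespace URI and the term name come from the message; a real RDF document would also need a surrounding rdf:RDF/rdf:Description structure, which is not shown.

```python
import xml.etree.ElementTree as ET

# Namespace URI quoted in the message; "tn" is the abbreviation used there.
TN = "http://rs.tdwg.org/ontology/voc/TaxonName#"


def name_complete_element(name: str) -> str:
    """Serialize a single tn:nameComplete element (hypothetical helper)."""
    ET.register_namespace("tn", TN)
    elem = ET.Element(f"{{{TN}}}nameComplete")
    elem.text = name
    return ET.tostring(elem, encoding="unicode")


if __name__ == "__main__":
    print(name_complete_element("Homo sapiens"))
```

The serialized element carries an explicit xmlns:tn declaration, so it can be read without the separate prefix definition.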

On 3/14/2012 3:15 PM, Kennedy, Jessie wrote:
...
Peter, you just described what TCS offered. This was all covered in 
the discussion on TCS (as were many more things that have been 
discussed recently).
The guide to using it covers some of the thoughts behind these issues, 
I think:
http://www.tdwg.org/fileadmin/subgroups/tnc/User_Guide.pdf
*From:* tdwg-tag-bounces@lists.tdwg.org 
[mailto:tdwg-tag-bounces@lists.tdwg.org] *On Behalf Of* Peter Desmet
*Sent:* 14 March 2012 20:11
*To:* Paul Kirk
*Cc:* TDWG content mailing list; Donald Hobern (GBIF); TDWG TAG 
mailing list; Christian Gendreau; dev Developers
*Subject:* Re: [tdwg-tag] [tdwg-content] Canonical name parsing
Hi Paul,
Higher taxon: "Magnoliidae Novák ex Takhtajan" (a subclass).
- scientificName: Magnoliidae Novák ex Takhtajan
- taxonRank: subclass
But there are no terms to share the canonical name "Magnoliidae". The 
only available options are kingdom, phylum, class, order, family, 
genus, subgenus, specificEpithet, infraspecificEpithet, none of which 
are appropriate.
Solution:
- canonicalScientificName: Magnoliidae
Infrageneric taxon: "Abies sect. Amabilis (Matzenko) Farjon & 
Rushforth" (a section)
- scientificName: Abies sect. Amabilis (Matzenko) Farjon & Rushforth
- taxonRank: section
- genus: Abies
But there are no terms to share "Abies Amabilis", "Abies sect. 
Amabilis", "Abies section Amabilis" or even "Amabilis". The only 
available options are kingdom, phylum, class, order, family, genus, 
subgenus, specificEpithet, infraspecificEpithet, none of which are 
appropriate. Why we have subgenus, but not *infragenericEpithet* is 
another issue. I would at least be able to share "Amabilis".
Solution:
- canonicalScientificName: Abies Amabilis
- taxonRank: section
Peter
On Wed, Mar 14, 2012 at 14:37, Paul Kirk <p.kirk@cabi.org> wrote:
'For higher taxa or infrageneric taxa, these terms are not sufficient' 
... why?
Paul
------------------------------------------------------------------------
*From:* tdwg-tag-bounces@lists.tdwg.org on behalf of Peter Desmet 
[peter.desmet@umontreal.ca]
*Sent:* 14 March 2012 18:26
*To:* Richard Pyle
*Cc:* TDWG content mailing list; Donald Hobern (GBIF); dev Developers; 
Christian Gendreau; TDWG TAG mailing list
*Subject:* Re: [tdwg-tag] [tdwg-content] Canonical name parsing
Rich,
I wished those terms were sufficient, but as mentioned in the 
justification for 
http://code.google.com/p/darwincore/issues/detail?id=150:
genus, specificEpithet, infraspecificEpithet: concatenated, these terms 
are identical to the canonicalScientificName for genera, species and 
infraspecific taxa. For higher taxa or infrageneric taxa, these terms 
are not sufficient. In addition, there is some ambiguity regarding the 
genus definition: for synonyms, is it the accepted genus or the genus 
that is part of the synonym name? See: 
http://lists.tdwg.org/pipermail/tdwg-content/2010-November/002052.html. 
In the former case, the genus cannot be used to concatenate a 
canonicalScientificName.
To give an example for a higher taxon:
scientificName: Magnoliidae Novák ex Takhtajan
taxonRank: subclass
There is no place to share the canonical name "Magnoliidae" for this 
taxon.
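
The gap described here can be sketched in a few lines. This is only an illustration under stated assumptions: records are plain dicts keyed by Darwin Core term names, the helper canonical_from_parts is hypothetical, and the authorship in the species example is added for illustration. The other two records are the examples discussed in this thread.

```python
from typing import Optional

PARSED_TERMS = ("genus", "specificEpithet", "infraspecificEpithet")


def canonical_from_parts(record: dict) -> Optional[str]:
    """Concatenate the pre-parsed terms; return None when no genus is given."""
    parts = [(record.get(term) or "").strip() for term in PARSED_TERMS]
    if not parts[0]:
        return None  # nothing to build on, e.g. a taxon above genus rank
    return " ".join(p for p in parts if p)


species = {
    "scientificName": "Homo sapiens Linnaeus, 1758",
    "genus": "Homo",
    "specificEpithet": "sapiens",
}
subclass = {
    "scientificName": "Magnoliidae Novák ex Takhtajan",
    "taxonRank": "subclass",
}
section = {
    "scientificName": "Abies sect. Amabilis (Matzenko) Farjon & Rushforth",
    "taxonRank": "section",
    "genus": "Abies",
}

if __name__ == "__main__":
    for rec in (species, subclass, section):
        print(rec["scientificName"], "->", canonical_from_parts(rec))
```

For the species the concatenation works. For the subclass there is nothing to concatenate, and for the section only "Abies" comes back, so "Amabilis" is lost.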
Peter
On Wed, Mar 14, 2012 at 14:13, Richard Pyle <deepreef@bishopmuseum.org> 
wrote:
I guess the parts that confuse me are:
1) What providers are able to produce a canonicalScientificName as per 
Peter’s definition, but are unable to provide the pre-parsed elements 
of genus | subgenus | specificEpithet | infraspecificEpithet?
2) What consumers could make use of a canonicalScientificName as per 
Peter’s definition, but are unable to make (even better) use of the 
pre-parsed elements of genus | subgenus | specificEpithet | 
infraspecificEpithet?
Aloha,
Rich
From: tdwg-content-bounces@lists.tdwg.org 
[mailto:tdwg-content-bounces@lists.tdwg.org] On Behalf Of Peter Desmet
Sent: Wednesday, March 14, 2012 7:03 AM
To: Donald Hobern (GBIF)
Cc: TDWG content mailing list; Christian Gendreau; Tim Robertson 
[GBIF]; TDWG TAG mailing list; dev Developers
Subject: Re: [tdwg-content] Canonical name parsing
Hi Donald,
scientificName, with its current definition [1], is a great term and 
should continue to be used as such. As with most Darwin Core terms, 
it offers flexibility, so it's not an impediment for publishing data. 
In the GBIF context, this term is considered mandatory: records 
without it are ignored during indexing (I believe). All of this can stay.
canonicalScientificName would be an additional term with a clear rule 
(see my proposed definition [2]). This is the case for other Darwin 
Core terms as well, such as
decimalLatitude [3], minimumElevationInMeters [4] or countryCode [5]. 
They serve as a ready-to-use addition/alternative to verbatimLatitude 
[6], verbatimElevation [7] and country [8] respectively. These terms 
don't stop anyone from publishing data, but data publishers who can 
provide this kind of information have the choice to do so. It would be 
the same for canonicalScientificName.
And yes, an aggregator like GBIF can play an important role in 
providing consistent data to its users and figuring out what they 
really need, but not all data is consumed that way. In addition, I 
hope a user would be able to download cleaned data from the GBIF 
portal as Darwin Core. Wouldn't it be nice that the parsed 
canonicalScientificName created by GBIF can be provided in its proper 
term? There are users out there who want this!
Regards,
Peter
[1] http://rs.tdwg.org/dwc/terms/index.htm#scientificName
[2] http://code.google.com/p/darwincore/issues/detail?id=150
[3] http://rs.tdwg.org/dwc/terms/index.htm#decimalLatitude
[4] http://rs.tdwg.org/dwc/terms/index.htm#minimumElevationInMeters
[5] http://rs.tdwg.org/dwc/terms/index.htm#countryCode
[6] http://rs.tdwg.org/dwc/terms/index.htm#verbatimLatitude
[7] http://rs.tdwg.org/dwc/terms/index.htm#verbatimElevation
[8] http://rs.tdwg.org/dwc/terms/index.htm#country
On Wed, Mar 14, 2012 at 11:19, Donald Hobern (GBIF) <dhobern@gbif.org> 
wrote:
...
Hi Peter.
I certainly agree that aggregators only represent one use case here
but, having seen a lot of the mess of real-world data, I don't believe 
that simply adding a new term will fix this problem for the users you 
describe.  To get the results you want, we would need a sufficiently 
large majority of data sets to follow the rules perfectly that we 
could ignore those that were non-conformant.  This would mean we 
should mandate that every data set must use the new element (with or 
without the existing scientificName element) and that they must 
present scientific names in the expected way (or else have their data 
considered non-compliant). Until now, the philosophy on publishing 
Darwin Core data has been to make it as easy as possible for data 
providers to expose their data, even at the expense of greater 
complexity for consumers.  I suspect that we would have a lot less 
data available for use now if we had taken a more stringent approach.
...
In some ways, this proposal reminds me of the structures in ABCD
which seek to offer users verbatim and more normalised ways to 
represent several types of information.  This actually makes consuming 
all the possible forms of such data very complex, since a record may 
contain all variant forms or just any one of them.  If multiple forms 
are available, which one should be considered the primary version?
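
The "which one is primary?" question can be made concrete with a small sketch. It assumes a consumer that simply fixes a preference order between the proposed canonicalScientificName and the existing scientificName. Neither Darwin Core nor ABCD prescribes that order, and the function name is made up for this example.

```python
from typing import Optional, Sequence, Tuple

# Assumed consumer-side preference: normalised form first, then the flexible one.
NAME_PREFERENCE = ("canonicalScientificName", "scientificName")


def primary_name(
    record: dict, preference: Sequence[str] = NAME_PREFERENCE
) -> Optional[Tuple[str, str]]:
    """Return (term, value) for the first non-empty term in the preference order."""
    for term in preference:
        value = record.get(term)
        if value:
            return term, value
    return None


if __name__ == "__main__":
    both = {
        "scientificName": "Magnoliidae Novák ex Takhtajan",
        "canonicalScientificName": "Magnoliidae",
    }
    only_full = {"scientificName": "Magnoliidae Novák ex Takhtajan"}
    print(primary_name(both))
    print(primary_name(only_full))
```

Every consumer would have to carry a rule like this, and two consumers with different orders could read the same record differently, which is the complexity raised above.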
...
I suspect that things may also get complicated as soon as you
discuss botanical subspecies, varieties, subvarieties, forms and 
subforms.  There are recommended ways to abbreviate the rank markers 
in these cases but some variation can be expected.
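
To illustrate the kind of variation meant here, the sketch below maps a few spellings of infraspecific rank markers onto one form. The lookup table is an incomplete assumption, not an authoritative list, and the placeholder names ("Aus bus ...") are not real taxa. It also assumes the input carries no authorship, since a token such as "f." can appear in author strings as well.

```python
# Hypothetical, deliberately incomplete table of rank-marker variants.
RANK_MARKERS = {
    "subsp.": "subsp.",
    "ssp.": "subsp.",
    "var.": "var.",
    "subvar.": "subvar.",
    "f.": "f.",
    "fo.": "f.",
    "forma": "f.",
    "subf.": "subf.",
}


def normalize_rank_markers(name: str) -> str:
    """Replace known rank-marker variants token by token; leave the rest as is."""
    tokens = name.split()
    return " ".join(RANK_MARKERS.get(tok.lower(), tok) for tok in tokens)


if __name__ == "__main__":
    for raw in ("Aus bus ssp. cus", "Aus bus forma cus", "Aus bus var. cus"):
        print(raw, "->", normalize_rank_markers(raw))
```

A lookup like this only covers the variants someone thought to list, which is why some residual variation should be expected in practice.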
...
Of course aggregators should be providing more robust services for
accessing exactly what you want in a consistent, predictable way and I 
would suggest that the best place to attack the problem is to define 
exactly what a typical user needs to see and then for GBIF and similar 
projects to work on delivering predictable data downloads and web 
services that clean out all of these nomenclatural inconsistencies - 
and perhaps also add value in other ways such as augmenting the data 
with associated environmental values (as the Atlas of Living Australia 
does).  This would allow us all to work together on developing a 
consistent and predictable algorithm for handling interpretation of 
name strings, including synonymy, misspellings, virus names and 
everything else that makes this such a difficult problem.
...
Best wishes,
Donald
----------------------------------------------------------------------
Donald Hobern - GBIF Director - dhobern@gbif.org
...
Global Biodiversity Information Facility http://www.gbif.org/
GBIF Secretariat, Universitetsparken 15, DK-2100 Copenhagen Ø, Denmark
Tel: +45 3532 1471  Mob: +45 2875 1471  Fax: +45 2875 1480
----------------------------------------------------------------------
-----Original Message-----
From: peter.desmet.cubc@gmail.com 
[mailto:peter.desmet.cubc@gmail.com] On Behalf Of Peter Desmet
Sent: Wednesday, March 14, 2012 3:41 PM
To: Tim Robertson [GBIF]
Cc: Donald Hobern (GBIF); dev Developers; TDWG content mailing list; 
TDWG TAG mailing list; Christian Gendreau
Subject: Re: Canonical name parsing
Hi Tim,
I agree, aggregators like GBIF and Canadensys will have to deal with 
clean and dirty data in each field anyway: they need code libraries to 
deal with this and it is good that these are being developed. But 
that doesn't help someone who wants to use data from a Darwin Core 
Archive with his data in Excel or a Roderic Page who wants to get 
things done for a prototype.
Having to use Java libraries or even the Name Parser [1] (though both 
great) is a barrier to data use. Darwin Core (Archives) is not only 
used for machine to machine interaction, humans use it too, and I 
think we should allow easy hacking (I mean this in the good sense), 
especially for something as important as the scientific name.
In addition, as a data publisher (e.g. for our VASCAN checklist) I 
*do* have the information to provide a clean and simple to use 
canonicalScientificName, but I just can't share it via the otherwise 
excellent biodiversity sharing standard Darwin Core. I think that's a 
pity.
...
Peter
[1] http://tools.gbif.org/nameparser/
[2] http://data.canadensys.net/vascan
PS: Yes, Canadensys will use the GBIF interpretation libraries.
Since we develop in Java as well, using those libraries is as easy as 
the proverbial "one line of code". We're looking forward to testing 
them and providing patches to enhance them. Open source FTW! :-)
...
On Wed, Mar 14, 2012 at 07:32, Tim Robertson [GBIF] 
<trobertson@gbif.org> wrote:
Hi Peter,
I'm replying off the TDWG list, since it is a bit of a tangent to 
your discussion.  If you feel it is relevant, please CC the list again.
At GBIF as you know, we have to interpret all kinds of quality of 
content.  I tend to agree with Donald that this would not really help 
in consumption, as in my experience we will have to deal with both 
clean and dirty data in each field *anyway* when this is used at 
network scale.  I would rather see us evolve the interpretation 
libraries to handle all the corner cases, which we need to develop 
anyway.  We already do a pretty decent job at extracting canonicals. 
This is further enhanced when you couple the extracted canonical with 
a fuzzy match against the "authoritative names" we can now index 
thanks to the availability of checklists in DwC-A format.
I know you are a Java shop.  Are you using the GBIF interpretation 
libraries [1] at the moment?  If not, is there a reason why you don't?
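
The idea mentioned in this message, coupling an extracted canonical with a fuzzy match against a list of authoritative names, might look roughly like the sketch below. It is not the GBIF implementation: Python's difflib stands in for whatever matching the GBIF libraries use, and the checklist, cutoff value and function name are assumptions made for the example.

```python
import difflib
from typing import Optional, Sequence


def match_canonical(
    canonical: str, authoritative: Sequence[str], cutoff: float = 0.85
) -> Optional[str]:
    """Return the closest authoritative name above the cutoff, or None."""
    matches = difflib.get_close_matches(canonical, authoritative, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Stand-in for names indexed from checklists published in DwC-A format.
checklist = ["Homo sapiens", "Magnoliidae", "Abies"]

if __name__ == "__main__":
    print(match_canonical("Homo sapien", checklist))  # misspelled input
    print(match_canonical("Xyz", checklist))          # nothing close enough
```

The cutoff trades missed matches against false ones. A production matcher would presumably also take rank and authorship into account.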
...
...
They are used in all GBIF projects (portal, checklistbank etc), 
and the more we enhance them, the better it is for everyone.  We have 
a significant test coverage [2,3] and there have been quite some man 
months (years?) spent already in their development and with some real 
regular expression experts (most notably Markus D. and Dave M.).  All 
our work is Maven-ized, versioned and available in our Maven 
repository [4].
I hope these are interesting to you.  We would welcome any patches 
to enhance them, or assistance in identifying the corner cases and 
capturing those as unit tests.
Hope this helps,
Tim
[1] http://code.google.com/p/gbif-ecat/source/browse/trunk/ecat-common/src/main/java/org/gbif/ecat/parser/NameParser.java
[2] http://code.google.com/p/gbif-ecat/source/browse/trunk/ecat-common/src/test/java/org/gbif/ecat/parser/NameParserTest.java
[3] http://code.google.com/p/gbif-ecat/source/browse/trunk/ecat-common/src/#src%2Ftest%2Fresources
[4] http://repository.gbif.org/index.html#nexus-search;quick~ecat-common
...
--
Peter Desmet
Biodiversity Informatics Manager
Canadensys - www.canadensys.net
Université de Montréal Biodiversity Centre
4101 rue Sherbrooke est
Montreal, QC, H1X2B2
Canada
Phone: 514-343-6111 #82354
Fax: 514-343-2288
Email: peter.desmet@umontreal.ca / peter.desmet.cubc@gmail.com
Skype: anderhalv
Public profile: http://www.linkedin.com/in/peterdesmet
_______________________________________________
tdwg-content mailing list
tdwg-content@lists.tdwg.org
http://lists.tdwg.org/mailman/listinfo/tdwg-content
_______________________________________________
tdwg-tag mailing list
tdwg-tag@lists.tdwg.org
http://lists.tdwg.org/mailman/listinfo/tdwg-tag
-- 
Steven J. Baskauf, Ph.D., Senior Lecturer
Vanderbilt University Dept. of Biological Sciences

postal mail address:
VU Station B 351634
Nashville, TN  37235-1634,  U.S.A.

delivery address:
2125 Stevenson Center
1161 21st Ave., S.
Nashville, TN 37235

office: 2128 Stevenson Center
phone: (615) 343-4582,  fax: (615) 343-6707
http://bioimages.vanderbilt.edu

Re: [tdwg-tag] [tdwg-content] Canonical name parsing

Steve Baskauf