Re: [tdwg-tag] [tdwg-content] Canonical name parsing

14 Mar 2012


Thanks everyone for the feedback so far!
Now, if you want to +1 the proposal, become a friend of Timon lepidus:
https://plus.google.com/114672072317054763788/posts/Nph2ksggNZW

:-) Peter

On Wed, Mar 14, 2012 at 16:15, Dmitry Mozzherin <dmozzherin@eol.org> wrote:
...
mmm where is +1 button? :)
...
Hi Paul,
Higher taxon: "Magnoliidae Novák ex Takhtajan" (a subclass).
- scientificName: Magnoliidae Novák ex Takhtajan
- taxonRank: subclass
But there are no terms to share the canonical name "Magnoliidae". The only
available options are kingdom, phylum, class, order, family, genus,
subgenus, specificEpithet, infraspecificEpithet, none of which are
appropriate.
Solution:
- canonicalScientificName: Magnoliidae
Infrageneric taxon: "Abies sect. Amabilis (Matzenko) Farjon & Rushforth"
(a section)
- scientificName: Abies sect. Amabilis (Matzenko) Farjon & Rushforth
- taxonRank: section
- genus: Abies
But there are no terms to share "Abies Amabilis", "Abies sect. Amabilis",
"Abies section Amabilis" or even "Amabilis". The only available options are
kingdom, phylum, class, order, family, genus, subgenus, specificEpithet,
infraspecificEpithet, none of which are appropriate. Why we have subgenus
but not infragenericEpithet is another issue; with the latter I would at
least be able to share "Amabilis".
Solution:
- canonicalScientificName: Abies Amabilis
- taxonRank: section
Peter
On Wed, Mar 14, 2012 at 14:37, Paul Kirk <p.kirk@cabi.org> wrote:
...
'For higher taxa or infrageneric taxa, these terms are not sufficient'
...
...
why?
Paul
________________________________
From: tdwg-tag-bounces@lists.tdwg.org [tdwg-tag-bounces@lists.tdwg.org] on
behalf of Peter Desmet [peter.desmet@umontreal.ca]
Sent: 14 March 2012 18:26
To: Richard Pyle
Cc: TDWG content mailing list; Donald Hobern (GBIF); dev Developers;
Christian Gendreau; TDWG TAG mailing list
Subject: Re: [tdwg-tag] [tdwg-content] Canonical name parsing
Rich,
I wish those terms were sufficient, but as mentioned in the justification
for http://code.google.com/p/darwincore/issues/detail?id=150:
genus, specificEpithet, infraspecificEpithet: concatenated, these terms are
identical to the canonicalScientificName for genera, species and
infraspecific taxa. For higher taxa or infrageneric taxa, these terms are
not sufficient. In addition, there is some ambiguity regarding the genus
definition: for synonyms, is it the accepted genus or the genus that is
part of the synonym name? See:
http://lists.tdwg.org/pipermail/tdwg-content/2010-November/002052.html.
In the former case, the genus cannot be used to concatenate a
canonicalScientificName.
To give an example for a higher taxon:
scientificName: Magnoliidae Novák ex Takhtajan
taxonRank: subclass
There is no place to share the canonical name "Magnoliidae" for this
taxon.
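
The concatenation argument above can be sketched in a few lines. This is only an illustration in Python (the thread contains no code): records are plain dictionaries keyed by Darwin Core term names, and the function name concatenate_name_parts is made up for this sketch. It assumes the genus term holds the genus of the name itself, which is exactly the ambiguity raised above for synonyms.

```python
# Terms that, concatenated, give the canonical name for genera, species
# and infraspecific taxa (per the justification quoted above).
NAME_PART_TERMS = ("genus", "specificEpithet", "infraspecificEpithet")


def concatenate_name_parts(record):
    """Join the non-empty genus/epithet terms of a record with single spaces."""
    parts = (record.get(term, "").strip() for term in NAME_PART_TERMS)
    return " ".join(part for part in parts if part)


# A species: concatenation yields the canonical name.
species = {"genus": "Timon", "specificEpithet": "lepidus"}

# A higher taxon: none of the three terms applies, so nothing comes out.
subclass = {
    "scientificName": "Magnoliidae Novák ex Takhtajan",
    "taxonRank": "subclass",
}

# An infrageneric taxon: only the genus is available, so the section
# epithet "Amabilis" is lost.
section = {
    "scientificName": "Abies sect. Amabilis (Matzenko) Farjon & Rushforth",
    "taxonRank": "section",
    "genus": "Abies",
}

for record in (species, subclass, section):
    print(repr(concatenate_name_parts(record)))
```

For the species the result is usable as is; for the subclass it is an empty string, and for the section it is just the genus, which is not the name of the taxon.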
Peter
On Wed, Mar 14, 2012 at 14:13, Richard Pyle <deepreef@bishopmuseum.org>
wrote:
...
I guess the parts that confuse me are:
1) What providers are able to produce a canonicalScientificName as per
Peter’s definition, but are unable to provide the pre-parsed elements of
genus | subgenus | specificEpithet | infraspecificEpithet?
2) What consumers could make use of a canonicalScientificName as per
Peter’s definition, but are unable to make (even better) use of the
pre-parsed elements of genus | subgenus | specificEpithet |
infraspecificEpithet?
Aloha,
Rich
From: tdwg-content-bounces@lists.tdwg.org
[mailto:tdwg-content-bounces@lists.tdwg.org] On Behalf Of Peter Desmet
Sent: Wednesday, March 14, 2012 7:03 AM
To: Donald Hobern (GBIF)
Cc: TDWG content mailing list; Christian Gendreau; Tim Robertson [GBIF];
TDWG TAG mailing list; dev Developers
Subject: Re: [tdwg-content] Canonical name parsing
Hi Donald,
scientificName, with its current definition [1], is a great term and
should continue to be used as such. As with most Darwin Core terms, it
offers flexibility, so it's not an impediment to publishing data. In the
GBIF context, this term is considered mandatory: records without it are
ignored during indexing (I believe). All of this can stay.
canonicalScientificName would be an additional term with a clear rule
(see my proposed definition [2]). This is the case for other Darwin Core
terms as well, such as decimalLatitude [3], minimumElevationInMeters [4]
or countryCode [5]. They serve as a ready-to-use addition/alternative to
verbatimLatitude [6], verbatimElevation [7] and country [8] respectively.
These terms don't stop anyone from publishing data, but data publishers
who can provide this kind of information have the choice to do so. It
would be the same for canonicalScientificName.
And yes, an aggregator like GBIF can play an important role in providing
consistent data to its users and figuring out what they really need, but
not all data is consumed that way. In addition, I hope a user would be
able to download cleaned data from the GBIF portal as Darwin Core.
Wouldn't it be nice if the parsed canonicalScientificName created by GBIF
could be provided in its proper term? There are users out there who want
this!
Regards,
Peter
[1] http://rs.tdwg.org/dwc/terms/index.htm#scientificName
[2] http://code.google.com/p/darwincore/issues/detail?id=150
[3] http://rs.tdwg.org/dwc/terms/index.htm#decimalLatitude
[4] http://rs.tdwg.org/dwc/terms/index.htm#minimumElevationInMeters
[5] http://rs.tdwg.org/dwc/terms/index.htm#countryCode
[6] http://rs.tdwg.org/dwc/terms/index.htm#verbatimLatitude
[7] http://rs.tdwg.org/dwc/terms/index.htm#verbatimElevation
[8] http://rs.tdwg.org/dwc/terms/index.htm#country
On Wed, Mar 14, 2012 at 11:19, Donald Hobern (GBIF) <dhobern@gbif.org>
wrote:
...
Hi Peter.
I certainly agree that aggregators only represent one use case here but,
having seen a lot of the mess of real-world data, I don't believe that
simply adding a new term will fix this problem for the users you describe.
To get the results you want, we would need a sufficiently large majority of
data sets to follow the rules perfectly that we could ignore those that
were non-conformant. This would mean we should mandate that every data set
must use the new element (with or without the existing scientificName
element) and that they must present scientific names in the expected way
(or else have their data considered non-compliant). Until now, the
philosophy on publishing Darwin Core data has been to make it as easy as
possible for data providers to expose their data, even at the expense of
greater complexity for consumers. I suspect that we would have a lot less
data available for use now if we had taken a more stringent approach.
In some ways, this proposal reminds me of the structures in ABCD which
seek to offer users verbatim and more normalised ways to represent several
types of information. This actually makes consuming all the possible forms
of such data very complex, since a record may contain all variant
On Wed, Mar 14, 2012 at 4:11 PM, Peter Desmet <peter.desmet@umontreal.ca>
wrote:
forms or just any one of them. If multiple forms are available, which one
should be considered the primary version?
I suspect that things may also get complicated as soon as you discuss
botanical subspecies, varieties, subvarieties, forms and subforms. There
are recommended ways to abbreviate the rank markers in these cases but
some variation can be expected.
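
To illustrate the kind of variation meant here, a minimal sketch in Python of mapping a few rank-marker spellings onto one form. The mapping is an assumption for illustration and is far from complete, the function name normalize_rank_markers is made up, and the names used are nomenclatural placeholders ("Aus bus"), not real taxa.

```python
# A few spellings of infraspecific rank markers mapped to one preferred form.
# Illustrative only: real data contains many more variants than these.
RANK_MARKER_VARIANTS = {
    "subsp.": "subsp.", "subsp": "subsp.", "ssp.": "subsp.", "ssp": "subsp.",
    "var.": "var.", "var": "var.",
    "f.": "f.", "forma": "f.",
}


def normalize_rank_markers(name):
    """Replace known rank-marker spellings, leaving all other tokens as is."""
    return " ".join(RANK_MARKER_VARIANTS.get(token, token) for token in name.split())


for raw in ("Aus bus ssp. cus", "Aus bus subsp cus", "Aus bus forma cus"):
    print(raw, "->", normalize_rank_markers(raw))
```

A token-by-token lookup like this is deliberately naive: it cannot tell a rank marker from an identical author abbreviation, which is part of why the problem is harder than it looks.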
Of course aggregators should be providing more robust services for
accessing exactly what you want in a consistent, predictable way and I
would suggest that the best place to attack the problem is to define
exactly what a typical user needs to see and then for GBIF and similar
projects to work on delivering predictable data downloads and web services
that clean out all of these nomenclatural inconsistencies - and perhaps
also add value in other ways such as augmenting the data with associated
environmental values (as the Atlas of Living Australia does). This would
allow us all to work together on developing a consistent and predictable
algorithm for handling interpretation of name strings, including synonymy,
misspellings, virus names and everything else that makes this such a
difficult problem.
Best wishes,
Donald

...
...
...
...
Donald Hobern - GBIF Director - dhobern@gbif.org
Global Biodiversity Information Facility http://www.gbif.org/
GBIF Secretariat, Universitetsparken 15, DK-2100 Copenhagen Ø,
Denmark
Tel: +45 3532 1471  Mob: +45 2875 1471  Fax: +45 2875 1480

...
...
...
...
-----Original Message-----
From: peter.desmet.cubc@gmail.com [mailto:peter.desmet.cubc@gmail.com]
On Behalf Of Peter Desmet
Sent: Wednesday, March 14, 2012 3:41 PM
To: Tim Robertson [GBIF]
Cc: Donald Hobern (GBIF); dev Developers; TDWG content mailing list;
TDWG TAG mailing list; Christian Gendreau
Subject: Re: Canonical name parsing
Hi Tim,
I agree, aggregators like GBIF and Canadensys will have to deal with
clean and dirty data in each field anyway: they need code libraries to
deal with this and it is good that these are being developed. But that
doesn't help someone who wants to use data from a Darwin Core Archive with
his data in Excel or a Roderic Page who wants to get things done for a
prototype. Having to use Java libraries or even the Name Parser [1]
(though both great) is a barrier to data use. Darwin Core (Archives) is
not only used for machine to machine interaction, humans use it too, and I
think we should allow easy hacking (I mean this in the good sense),
especially for something as important as the scientific name.
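
As a rough idea of what such easy hacking could look like, here is a sketch in Python using only the standard library. It assumes the archive's data file is comma-separated and already carries a canonicalScientificName column (the proposed term, not an existing one); the in-memory text, the user's name list and the function name index_by_canonical are stand-ins for illustration.

```python
import csv
import io

# Stand-in for the core data file of a Darwin Core Archive.
# canonicalScientificName is the proposed column, not an existing term.
ARCHIVE_CSV = """scientificName,taxonRank,canonicalScientificName
Magnoliidae Novák ex Takhtajan,subclass,Magnoliidae
Abies sect. Amabilis (Matzenko) Farjon & Rushforth,section,Abies Amabilis
"""

# Names as a user might have them in their own spreadsheet (hypothetical).
OWN_NAMES = ["Magnoliidae", "Abies Amabilis", "Timon lepidus"]


def index_by_canonical(csv_text):
    """Build a lookup from canonicalScientificName to the full record."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["canonicalScientificName"]: row for row in reader}


index = index_by_canonical(ARCHIVE_CSV)
for name in OWN_NAMES:
    row = index.get(name)
    print(name, "->", row["scientificName"] if row else "no match")
```

This is the same exact-match lookup a spreadsheet user would do with a lookup formula; no name parser is involved because the publisher already supplied the clean string.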
In addition, as a data publisher (e.g. for our VASCAN checklist [2]) I
*do* have the information to provide a clean and simple to use
canonicalScientificName, but I just can't share it via the otherwise
excellent biodiversity sharing standard Darwin Core. I think that's a pity.
Peter
[1] http://tools.gbif.org/nameparser/
[2] http://data.canadensys.net/vascan
PS: Yes, Canadensys will use the GBIF interpretation libraries. Since
we develop in Java as well, using those libraries is as easy as the
proverbial "one line of code". We're looking forward to testing them and
providing patches to enhance them. Open source FTW! :-)
On Wed, Mar 14, 2012 at 07:32, Tim Robertson [GBIF]
<trobertson@gbif.org> wrote:
...
Hi Peter,
I'm replying off the TDWG list, since it is a bit of a tangent to
your discussion. If you feel it is relevant, please CC the list again.
At GBIF, as you know, we have to interpret content of all kinds of
quality. I tend to agree with Donald that this would not really help in
consumption, as in my experience we will have to deal with both clean and
dirty data in each field *anyway* when this is used at network scale. I
would rather see us evolve the interpretation libraries to handle all the
corner cases, which we need to develop anyway. We already do a pretty
decent job at extracting canonicals. This is further enhanced when you
couple the extracted canonical with a fuzzy match against the
"authoritative names" we can now index thanks to the availability of
checklists in DwC-A format.
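
The idea of pairing an extracted canonical with a fuzzy match can be sketched as follows. This is not the GBIF implementation: it swaps in Python's standard difflib module as a stand-in matcher, and the name list, the 0.8 cutoff and the function name best_match are assumptions made for illustration.

```python
import difflib

# Stand-in for an index of "authoritative names" built from checklists.
AUTHORITATIVE_NAMES = ["Magnoliidae", "Abies", "Timon lepidus"]


def best_match(canonical, names=AUTHORITATIVE_NAMES, cutoff=0.8):
    """Return the closest authoritative name, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(canonical, names, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Misspelled canonicals resolve to the indexed spelling; unrelated names do not.
for candidate in ("Timon lepidas", "Magnolidae", "Quercus robur"):
    print(candidate, "->", best_match(candidate))
```

The cutoff trades recall against false matches: a lower value catches more misspellings but also starts pairing names that merely look alike.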
I know you are a Java shop. Are you using the GBIF interpretation
libraries [1] at the moment? If not, is there a reason why you don't?
They are used in all GBIF projects (portal, checklistbank etc), and
the more we enhance them, the better it is for everyone. We have
significant test coverage [2,3] and quite a few man-months (years?) have
already been spent on their development, with some real regular
expression experts (most notably Markus D. and Dave M.). All our work is
Maven-ized, versioned and available in our Maven repository [4].
I hope these are interesting to you. We would welcome any patches to
enhance them, or assistance in identifying the corner cases and capturing
those as unit tests.
Hope this helps,
Tim
[1] http://code.google.com/p/gbif-ecat/source/browse/trunk/ecat-common/src/main/java/org/gbif/ecat/parser/NameParser.java
[2] http://code.google.com/p/gbif-ecat/source/browse/trunk/ecat-common/src/test/java/org/gbif/ecat/parser/NameParserTest.java
[3] http://code.google.com/p/gbif-ecat/source/browse/trunk/ecat-common/src/#src%2Ftest%2Fresources
[4] http://repository.gbif.org/index.html#nexus-search;quick~ecat-common
...
--
Peter Desmet
Biodiversity Informatics Manager
Canadensys - www.canadensys.net
Université de Montréal Biodiversity Centre
4101 rue Sherbrooke est
Montreal, QC, H1X2B2
Canada
Phone: 514-343-6111 #82354
Fax: 514-343-2288
Email: peter.desmet@umontreal.ca / peter.desmet.cubc@gmail.com
Skype: anderhalv
Public profile: http://www.linkedin.com/in/peterdesmet
_______________________________________________
tdwg-content mailing list
tdwg-content@lists.tdwg.org
http://lists.tdwg.org/mailman/listinfo/tdwg-content
_______________________________________________
tdwg-tag mailing list
tdwg-tag@lists.tdwg.org
http://lists.tdwg.org/mailman/listinfo/tdwg-tag
-- 
Peter Desmet
Biodiversity Informatics Manager
Canadensys - www.canadensys.net

Université de Montréal Biodiversity Centre
4101 rue Sherbrooke est
Montreal, QC, H1X2B2
Canada

Phone: 514-343-6111 #82354
Fax: 514-343-2288
Email: peter.desmet@umontreal.ca / peter.desmet.cubc@gmail.com
Skype: anderhalv
Public profile: http://www.linkedin.com/in/peterdesmet
