[tdwg-content] [tdwg-tag] Canonical name parsing

Wed Mar 14 21:15:58 CET 2012

Mmm, where is the +1 button? :)

On Wed, Mar 14, 2012 at 4:11 PM, Peter Desmet <peter.desmet at umontreal.ca> wrote:
> Hi Paul,
>
> Higher taxon: "Magnoliidae Novák ex Takhtajan" (a subclass).
> - scientificName: Magnoliidae Novák ex Takhtajan
> - taxonRank: subclass
> But there are no terms to share the canonical name "Magnoliidae". The only
> available options are kingdom, phylum, class, order, family, genus,
> subgenus, specificEpithet, infraspecificEpithet, none of which are
> appropriate.
>
> Solution:
> - canonicalScientificName: Magnoliidae
>
> Infrageneric taxon: "Abies sect. Amabilis (Matzenko) Farjon & Rushforth" (a
> section)
> - scientificName: Abies sect. Amabilis (Matzenko) Farjon & Rushforth
> - taxonRank: section
> - genus: Abies
> But there are no terms to share "Abies Amabilis", "Abies sect. Amabilis",
> "Abies section Amabilis" or even "Amabilis". The only available options are
> kingdom, phylum, class, order, family, genus, subgenus, specificEpithet,
> infraspecificEpithet, none of which are appropriate. Why we have subgenus
> but not infragenericEpithet is another issue: with such a term I would at
> least be able to share "Amabilis".
>
> Solution:
> - canonicalScientificName: Abies Amabilis
> - taxonRank: section
>
> Peter
>
>
> On Wed, Mar 14, 2012 at 14:37, Paul Kirk <p.kirk at cabi.org> wrote:
>>
>> 'For higher taxa or infrageneric taxa, these terms are not sufficient' ...
>> why?
>>
>>
>>
>> Paul
>>
>>
>>
>> ________________________________
>>
>> From: tdwg-tag-bounces at lists.tdwg.org [tdwg-tag-bounces at lists.tdwg.org] on
>> behalf of Peter Desmet [peter.desmet at umontreal.ca]
>> Sent: 14 March 2012 18:26
>> To: Richard Pyle
>> Cc: TDWG content mailing list; Donald Hobern (GBIF); dev Developers;
>> Christian Gendreau; TDWG TAG mailing list
>>
>> Subject: Re: [tdwg-tag] [tdwg-content] Canonical name parsing
>>
>> Rich,
>>
>> I wish those terms were sufficient, but as mentioned in the
>> justification for http://code.google.com/p/darwincore/issues/detail?id=150:
>>
>> genus, specificEpithet, infraspecificEpithet: concatenated, these terms are
>> identical to the canonicalScientificName for genera, species and
>> infraspecific taxa. For higher taxa or infrageneric taxa, these terms are
>> not sufficient. In addition, there is some ambiguity regarding the genus
>> definition: for synonyms, is it the accepted genus or the genus that is part
>> of the synonym name? See:
>> http://lists.tdwg.org/pipermail/tdwg-content/2010-November/002052.html. In
>> the former case, the genus cannot be used to concatenate a
>> canonicalScientificName.
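
A minimal sketch of the concatenation described above, in Python. The function name `concat_canonical` and the dict-based records are hypothetical and only serve as illustration; the keys are the Darwin Core terms named in this thread, and the species record is an assumed example. It shows that the three terms yield a usable name for a species, but nothing for the subclass and only the genus for the section.

```python
def concat_canonical(record):
    """Join genus, specificEpithet and infraspecificEpithet, skipping blanks.

    Returns an empty string when none of the three terms is present,
    which is the situation for higher taxa.
    """
    terms = ("genus", "specificEpithet", "infraspecificEpithet")
    parts = [record.get(term, "").strip() for term in terms]
    return " ".join(part for part in parts if part)


# Assumed species record, added for contrast with the two taxa below.
species = {"genus": "Abies", "specificEpithet": "amabilis"}

# The two taxa discussed in this thread.
subclass = {
    "scientificName": "Magnoliidae Novák ex Takhtajan",
    "taxonRank": "subclass",
}
section = {
    "scientificName": "Abies sect. Amabilis (Matzenko) Farjon & Rushforth",
    "taxonRank": "section",
    "genus": "Abies",
}

print(repr(concat_canonical(species)))   # 'Abies amabilis'
print(repr(concat_canonical(subclass)))  # ''
print(repr(concat_canonical(section)))   # 'Abies'
```

For the section, the result is the genus name alone, which is not the name of the taxon being described.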
>>
>> To give an example for a higher taxon:
>> scientificName: Magnoliidae Novák ex Takhtajan
>> taxonRank: subclass
>>
>> There is no place to share the canonical name "Magnoliidae" for this
>> taxon.
>>
>> Peter
>>
>> On Wed, Mar 14, 2012 at 14:13, Richard Pyle <deepreef at bishopmuseum.org>
>> wrote:
>>>
>>>
>>> I guess the parts that confuse me are:
>>>
>>> 1) What providers are able to produce a canonicalScientificName as per
>>> Peter’s definition, but are unable to provide the pre-parsed elements of
>>> genus | subgenus | specificEpithet | infraspecificEpithet?
>>>
>>> 2) What consumers could make use of a canonicalScientificName as per
>>> Peter’s definition, but are unable to make (even better) use of the
>>> pre-parsed elements of genus | subgenus | specificEpithet |
>>> infraspecificEpithet?
>>>
>>> Aloha,
>>> Rich
>>>
>>>
>>>
>>>
>>> From: tdwg-content-bounces at lists.tdwg.org
>>> [mailto:tdwg-content-bounces at lists.tdwg.org] On Behalf Of Peter Desmet
>>> Sent: Wednesday, March 14, 2012 7:03 AM
>>> To: Donald Hobern (GBIF)
>>> Cc: TDWG content mailing list; Christian Gendreau; Tim Robertson [GBIF];
>>> TDWG TAG mailing list; dev Developers
>>> Subject: Re: [tdwg-content] Canonical name parsing
>>>
>>> Hi Donald,
>>>
>>> scientificName, with its current definition [1], is a great term and
>>> should continue to be used as such. As with most Darwin Core terms, it
>>> offers flexibility, so it's not an impediment to publishing data. In the
>>> GBIF context, this term is considered mandatory: records without it are
>>> ignored during indexing (I believe). All of this can stay.
>>>
>>> canonicalScientificName would be an additional term with a clear rule
>>> (see my proposed definition [2]). This is the case for other Darwin Core
>>> terms as well, such as
>>> decimalLatitude [3], minimumElevationInMeters [4] or countryCode [5].
>>> They serve as a ready-to-use addition/alternative to verbatimLatitude [6],
>>> verbatimElevation [7] and country [8] respectively. These terms don't stop
>>> anyone from publishing data, but data publishers who can provide this kind
>>> of information have the choice to do so. It would be the same for
>>> canonicalScientificName.
>>>
>>> And yes, an aggregator like GBIF can play an important role in providing
>>> consistent data to its users and figuring out what they really need, but not
>>> all data is consumed that way. In addition, I hope a user would be able to
>>> download cleaned data from the GBIF portal as Darwin Core. Wouldn't it be
>>> nice if the parsed canonicalScientificName created by GBIF could be provided
>>> in its own term? There are users out there who want this!
>>>
>>> Regards,
>>>
>>> Peter
>>>
>>> [1] http://rs.tdwg.org/dwc/terms/index.htm#scientificName
>>> [2] http://code.google.com/p/darwincore/issues/detail?id=150
>>> [3] http://rs.tdwg.org/dwc/terms/index.htm#decimalLatitude
>>> [4] http://rs.tdwg.org/dwc/terms/index.htm#minimumElevationInMeters
>>> [5] http://rs.tdwg.org/dwc/terms/index.htm#countryCode
>>> [6] http://rs.tdwg.org/dwc/terms/index.htm#verbatimLatitude
>>> [7] http://rs.tdwg.org/dwc/terms/index.htm#verbatimElevation
>>> [8] http://rs.tdwg.org/dwc/terms/index.htm#country
>>>
>>> On Wed, Mar 14, 2012 at 11:19, Donald Hobern (GBIF) <dhobern at gbif.org>
>>> wrote:
>>> >
>>> > Hi Peter.
>>> >
>>> > I certainly agree that aggregators only represent one use case here
>>> > but, having seen a lot of the mess of real-world data, I don't believe that
>>> > simply adding a new term will fix this problem for the users you describe.
>>> >  To get the results you want, we would need a sufficiently large majority of
>>> > data sets to follow the rules perfectly that we could ignore those that were
>>> > non-conformant.  This would mean we should mandate that every data set must
>>> > use the new element (with or without the existing scientificName element)
>>> > and that they must present scientific names in the expected way (or else
>>> > have their data considered non-compliant). Until now, the philosophy on
>>> > publishing Darwin Core data has been to make it as easy as possible for data
>>> > providers to expose their data, even at the expense of greater complexity
>>> > for consumers.  I suspect that we would have a lot less data available for
>>> > use now if we had taken a more stringent approach.
>>> >
>>> > In some ways, this proposal reminds me of the structures in ABCD which
>>> > seek to offer users verbatim and more normalised ways to represent several
>>> > types of information.  This actually makes consuming all the possible forms
>>> > of such data very complex, since a record may contain all variant forms or
>>> > just any one of them.  If multiple forms are available, which one should be
>>> > considered the primary version?
>>> >
>>> > I suspect that things may also get complicated as soon as you discuss
>>> > botanical subspecies, varieties, subvarieties, forms and subforms.  There
>>> > are recommended ways to abbreviate the rank markers in these cases but some
>>> > variation can be expected.
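
A rough sketch of what coping with that variation could look like, in Python. The `RANK_MARKERS` table and the function `normalize_rank_markers` are hypothetical, the variant spellings are illustrative assumptions rather than an authoritative list, and "Aus bus ssp. cus" is a placeholder name. The sketch assumes the input has no authorship, since a token such as "f." can also appear in an author string.

```python
# Hypothetical table mapping variant spellings to one marker per rank.
# The variants listed are illustrative assumptions, not an authoritative list.
RANK_MARKERS = {
    "subsp.": "subsp.", "ssp.": "subsp.", "subspecies": "subsp.",
    "var.": "var.", "variety": "var.",
    "subvar.": "subvar.",
    "f.": "f.", "fo.": "f.", "forma": "f.",
    "subf.": "subf.",
}


def normalize_rank_markers(name):
    """Replace known rank-marker variants token by token; keep other tokens."""
    tokens = name.split()
    return " ".join(RANK_MARKERS.get(token.lower(), token) for token in tokens)


# Placeholder name, not a real taxon.
print(normalize_rank_markers("Aus bus ssp. cus"))  # Aus bus subsp. cus
```

Anything not in the table passes through unchanged, so unknown markers are preserved rather than silently dropped.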
>>> >
>>> > Of course aggregators should be providing more robust services for
>>> > accessing exactly what you want in a consistent, predictable way and I would
>>> > suggest that the best place to attack the problem is to define exactly what
>>> > a typical user needs to see and then for GBIF and similar projects to work
>>> > on delivering predictable data downloads and web services that clean out all
>>> > of these nomenclatural inconsistencies - and perhaps also add value in other
>>> > ways such as augmenting the data with associated environmental values (as
>>> > the Atlas of Living Australia does).  This would allow us all to work
>>> > together on developing a consistent and predictable algorithm for handling
>>> > interpretation of name strings, including synonymy, misspellings, virus
>>> > names and everything else that makes this such a difficult problem.
>>> >
>>> > Best wishes,
>>> >
>>> > Donald
>>> >
>>> > ----------------------------------------------------------------------
>>> > Donald Hobern - GBIF Director - dhobern at gbif.org
>>> > Global Biodiversity Information Facility http://www.gbif.org/
>>> > GBIF Secretariat, Universitetsparken 15, DK-2100 Copenhagen Ø, Denmark
>>> > Tel: +45 3532 1471  Mob: +45 2875 1471  Fax: +45 2875 1480
>>> > ----------------------------------------------------------------------
>>> >
>>> >
>>> > -----Original Message-----
>>> > From: peter.desmet.cubc at gmail.com [mailto:peter.desmet.cubc at gmail.com]
>>> > On Behalf Of Peter Desmet
>>> > Sent: Wednesday, March 14, 2012 3:41 PM
>>> > To: Tim Robertson [GBIF]
>>> > Cc: Donald Hobern (GBIF); dev Developers; TDWG content mailing list;
>>> > TDWG TAG mailing list; Christian Gendreau
>>> > Subject: Re: Canonical name parsing
>>> >
>>> > Hi Tim,
>>> >
>>> > I agree, aggregators like GBIF and Canadensys will have to deal with
>>> > clean and dirty data in each field anyway: they need code libraries to deal
>>> > with this and it is good that these are being developed. But, that doesn't
>>> > help someone who wants to use data from a Darwin Core Archive with his data
>>> > in Excel or a Roderic Page who wants to get things done for a prototype.
>>> > Having to use Java libraries or even the Name Parser [1] (though both
>>> > great) is a barrier to data use. Darwin Core (Archives) is not only
>>> > used for machine to machine interaction, humans use it too, and I think we
>>> > should allow easy hacking (I mean this in the good sense), especially for
>>> > something as important as the scientific name.
>>> > In addition, as a data publisher (e.g. for our VASCAN checklist [2]) I
>>> > *do* have the information to provide a clean and simple to use
>>> > canonicalScientificName, but I just can't share it via the otherwise
>>> > excellent biodiversity sharing standard Darwin Core. I think that's a pity.
>>> >
>>> > Peter
>>> >
>>> > [1] http://tools.gbif.org/nameparser/
>>> > [2] http://data.canadensys.net/vascan
>>> >
>>> > PS: Yes, Canadensys will use the GBIF interpretation libraries. Since
>>> > we develop in Java as well, using those libraries is as easy as the
>>> > proverbial "one line of code". We're looking forward to testing them and
>>> > providing patches to enhance them. Open source FTW! :-)
>>> >
>>> >
>>> > On Wed, Mar 14, 2012 at 07:32, Tim Robertson [GBIF]
>>> > <trobertson at gbif.org> wrote:
>>> > > Hi Peter,
>>> > >
>>> > > I'm replying off the TDWG list, since it is a bit of a tangent to
>>> > > your discussion.  If you feel it is relevant, please CC the list again.
>>> > >
>>> > > At GBIF, as you know, we have to interpret content of all kinds of
>>> > > quality.  I tend to agree with Donald that this would not really help in
>>> > > consumption, as in my experience we will have to deal with both clean and
>>> > > dirty data in each field *anyway* when this is used at network scale.  I
>>> > > would rather see us evolve the interpretation libraries to handle all the
>>> > > corner cases, which we need to develop anyway.  We already do a pretty
>>> > > decent job at extracting canonicals.  This is further enhanced when you
>>> > > couple the extracted canonical with a fuzzy match against the "authoritative
>>> > > names" we can now index thanks to the availability of checklists in DwC-A
>>> > > format.
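
To illustrate the idea of coupling an extracted canonical with a fuzzy match, here is a small Python sketch. It is not the GBIF implementation: it uses `difflib.get_close_matches` from the Python standard library, and the `AUTHORITATIVE` list, the function name `fuzzy_match` and the 0.85 cutoff are all assumptions chosen for the example.

```python
import difflib

# Illustrative stand-in for an index of authoritative canonical names,
# such as one built from checklists (assumed data, not a real index).
AUTHORITATIVE = ["Magnoliidae", "Abies amabilis", "Abies alba", "Abies"]


def fuzzy_match(canonical, names=AUTHORITATIVE, cutoff=0.85):
    """Return the closest authoritative name, or None if none is close enough."""
    matches = difflib.get_close_matches(canonical, names, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(fuzzy_match("Abies amabillis"))  # misspelled input
print(fuzzy_match("Quercus robur"))    # not in the list
```

The cutoff trades recall against false matches; a real service would need to tune it and would probably apply additional nomenclatural rules.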
>>> > >
>>> > > I know you are a Java shop.  Are you using the GBIF interpretation
>>> > > libraries [1] at the moment?  If not, is there a reason why you don't?
>>> > > They are used in all GBIF projects (portal, checklistbank etc), and
>>> > > the more we enhance them, the better it is for everyone.  We have a
>>> > > significant test coverage [2,3] and there have been quite some man months
>>> > > (years?) spent already in their development and with some real regular
>>> > > expression experts (most notably Markus D. and Dave M.).  All our work is
>>> > > Maven-ized, versioned and available in our Maven repository [4].
>>> > >
>>> > > I hope these are interesting to you.  We would welcome any patches to
>>> > > enhance them, or assistance in identifying the corner cases and capturing
>>> > > those as unit tests.
>>> > >
>>> > > Hope this helps,
>>> > > Tim
>>> > >
>>> > > [1] http://code.google.com/p/gbif-ecat/source/browse/trunk/ecat-common/src/main/java/org/gbif/ecat/parser/NameParser.java
>>> > > [2] http://code.google.com/p/gbif-ecat/source/browse/trunk/ecat-common/src/test/java/org/gbif/ecat/parser/NameParserTest.java
>>> > > [3] http://code.google.com/p/gbif-ecat/source/browse/trunk/ecat-common/src/#src%2Ftest%2Fresources
>>> > > [4] http://repository.gbif.org/index.html#nexus-search;quick~ecat-common
>>> > >
>>> >
>>> >
>>> >
>>> > --
>>> > Peter Desmet
>>> > Biodiversity Informatics Manager
>>> > Canadensys - www.canadensys.net
>>> >
>>> > Université de Montréal Biodiversity Centre
>>> > 4101 rue Sherbrooke est
>>> > Montreal, QC, H1X2B2
>>> > Canada
>>> >
>>> > Phone: 514-343-6111 #82354
>>> > Fax: 514-343-2288
>>> > Email: peter.desmet at umontreal.ca / peter.desmet.cubc at gmail.com
>>> > Skype: anderhalv
>>> > Public profile: http://www.linkedin.com/in/peterdesmet
>>> >
>>> >
>>> > _______________________________________________
>>> > tdwg-content mailing list
>>> > tdwg-content at lists.tdwg.org
>>> > http://lists.tdwg.org/mailman/listinfo/tdwg-content
>>>
>>>
>>
>>
>> _______________________________________________
>> tdwg-tag mailing list
>> tdwg-tag at lists.tdwg.org
>> http://lists.tdwg.org/mailman/listinfo/tdwg-tag
>>
>