[tdwg-content] [tdwg-tag] Canonical name parsing

Peter Desmet peter.desmet at umontreal.ca
Thu Mar 15 00:19:54 CET 2012


Thanks everyone for the feedback so far!
Now, if you want to +1 the proposal, become a friend of Timon lepidus:
https://plus.google.com/114672072317054763788/posts/Nph2ksggNZW

:-) Peter

On Wed, Mar 14, 2012 at 16:15, Dmitry Mozzherin <dmozzherin at eol.org> wrote:

> mmm where is +1 button? :)
>
> On Wed, Mar 14, 2012 at 4:11 PM, Peter Desmet <peter.desmet at umontreal.ca>
> wrote:
> > Hi Paul,
> >
> > Higher taxon: "Magnoliidae Novák ex Takhtajan" (a subclass).
> > - scientificName: Magnoliidae Novák ex Takhtajan
> > - taxonRank: subclass
> > But there are no terms to share the canonical name "Magnoliidae". The only
> > available options are kingdom, phylum, class, order, family, genus,
> > subgenus, specificEpithet, infraspecificEpithet, none of which are
> > appropriate.
> >
> > Solution:
> > - canonicalScientificName: Magnoliidae
> >
> > Infrageneric taxon: "Abies sect. Amabilis (Matzenko) Farjon & Rushforth"
> > (a section)
> > - scientificName: Abies sect. Amabilis (Matzenko) Farjon & Rushforth
> > - taxonRank: section
> > - genus: Abies
> > But there are no terms to share "Abies Amabilis", "Abies sect. Amabilis",
> > "Abies section Amabilis" or even "Amabilis". The only available options are
> > kingdom, phylum, class, order, family, genus, subgenus, specificEpithet,
> > infraspecificEpithet, none of which are appropriate. Why we have subgenus
> > but not infragenericEpithet is another issue. With such a term, I would at
> > least be able to share "Amabilis".
> >
> > Solution:
> > - canonicalScientificName: Abies Amabilis
> > - taxonRank: section
> >
> > Peter
> >
> > On Wed, Mar 14, 2012 at 14:37, Paul Kirk <p.kirk at cabi.org> wrote:
> >>
> >> 'For higher taxa or infrageneric taxa, these terms are not sufficient' ...
> >> why?
> >>
> >>
> >>
> >> Paul
> >>
> >>
> >>
> >> ________________________________
> >>
> >> From: tdwg-tag-bounces at lists.tdwg.org [tdwg-tag-bounces at lists.tdwg.org]
> >> on behalf of Peter Desmet [peter.desmet at umontreal.ca]
> >> Sent: 14 March 2012 18:26
> >> To: Richard Pyle
> >> Cc: TDWG content mailing list; Donald Hobern (GBIF); dev Developers;
> >> Christian Gendreau; TDWG TAG mailing list
> >>
> >> Subject: Re: [tdwg-tag] [tdwg-content] Canonical name parsing
> >>
> >> Rich,
> >>
> >> I wish those terms were sufficient, but as mentioned in the
> >> justification for http://code.google.com/p/darwincore/issues/detail?id=150:
> >>
> >> genus, specificEpithet, infraspecificEpithet: concatenated, these terms are
> >> identical to the canonicalScientificName for genera, species and
> >> infraspecific taxa. For higher taxa or infrageneric taxa, these terms are
> >> not sufficient. In addition, there is some ambiguity regarding the genus
> >> definition: for synonyms, is it the accepted genus or the genus that is
> >> part of the synonym name? See:
> >> http://lists.tdwg.org/pipermail/tdwg-content/2010-November/002052.html.
> >> In the former case, the genus cannot be used to concatenate a
> >> canonicalScientificName.
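
To make the concatenation argument above concrete, here is a minimal Python sketch. It assumes a record is a plain dict keyed by Darwin Core term names; the function name `concatenated_canonical` is hypothetical and is not part of Darwin Core or of any GBIF library. It only shows where the existing terms are enough and where they fall short.

```python
from typing import Optional


def concatenated_canonical(record: dict) -> Optional[str]:
    """Join genus, specificEpithet and infraspecificEpithet with spaces.

    Returns None when the record carries no genus, which is what
    happens for a higher taxon such as a subclass.
    """
    genus = record.get("genus")
    if not genus:
        return None
    parts = [
        genus,
        record.get("specificEpithet"),
        record.get("infraspecificEpithet"),
    ]
    # Skip terms that are absent or empty.
    return " ".join(p for p in parts if p)


# Example records (illustrative only), using names that appear in this thread.
species = {"taxonRank": "species", "genus": "Timon", "specificEpithet": "lepidus"}
subclass = {
    "scientificName": "Magnoliidae Novák ex Takhtajan",
    "taxonRank": "subclass",
}
section = {
    "scientificName": "Abies sect. Amabilis (Matzenko) Farjon & Rushforth",
    "taxonRank": "section",
    "genus": "Abies",
}

for rec in (species, subclass, section):
    print(rec["taxonRank"], "->", concatenated_canonical(rec))
```

For the species the three terms reproduce the canonical name. For the subclass nothing can be built, and for the section only "Abies" comes out, so the epithet "Amabilis" is lost.
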
> >>
> >> To give an example for a higher taxon:
> >> scientificName: Magnoliidae Novák ex Takhtajan
> >> taxonRank: subclass
> >>
> >> There is no place to share the canonical name "Magnoliidae" for this
> >> taxon.
> >>
> >> Peter
> >>
> >> On Wed, Mar 14, 2012 at 14:13, Richard Pyle <deepreef at bishopmuseum.org>
> >> wrote:
> >>>
> >>>
> >>> I guess the parts that confuse me are:
> >>>
> >>> 1) What providers are able to produce a canonicalScientificName as per
> >>> Peter’s definition, but are unable to provide the pre-parsed elements of
> >>> genus | subgenus | specificEpithet | infraspecificEpithet?
> >>>
> >>> 2) What consumers could make use of a canonicalScientificName as per
> >>> Peter’s definition, but are unable to make (even better) use of the
> >>> pre-parsed elements of genus | subgenus | specificEpithet |
> >>> infraspecificEpithet?
> >>>
> >>> Aloha,
> >>> Rich
> >>>
> >>>
> >>>
> >>>
> >>> From: tdwg-content-bounces at lists.tdwg.org
> >>> [mailto:tdwg-content-bounces at lists.tdwg.org] On Behalf Of Peter Desmet
> >>> Sent: Wednesday, March 14, 2012 7:03 AM
> >>> To: Donald Hobern (GBIF)
> >>> Cc: TDWG content mailing list; Christian Gendreau; Tim Robertson [GBIF];
> >>> TDWG TAG mailing list; dev Developers
> >>> Subject: Re: [tdwg-content] Canonical name parsing
> >>>
> >>> Hi Donald,
> >>>
> >>> scientificName, with its current definition [1], is a great term and
> >>> should continue to be used as such. As with most Darwin Core terms, it
> >>> offers flexibility, so it's not an impediment to publishing data. In the
> >>> GBIF context, this term is considered mandatory: records without it are
> >>> ignored during indexing (I believe). All of this can stay.
> >>>
> >>> canonicalScientificName would be an additional term with a clear rule
> >>> (see my proposed definition [2]). This is the case for other Darwin Core
> >>> terms as well, such as decimalLatitude [3], minimumElevationInMeters [4]
> >>> or countryCode [5]. They serve as a ready-to-use addition/alternative to
> >>> verbatimLatitude [6], verbatimElevation [7] and country [8] respectively.
> >>> These terms don't stop anyone from publishing data, but data publishers
> >>> who can provide this kind of information have the choice to do so. It
> >>> would be the same for canonicalScientificName.
> >>>
> >>> And yes, an aggregator like GBIF can play an important role in providing
> >>> consistent data to its users and figuring out what they really need, but
> >>> not all data is consumed that way. In addition, I hope a user would be
> >>> able to download cleaned data from the GBIF portal as Darwin Core.
> >>> Wouldn't it be nice if the parsed canonicalScientificName created by GBIF
> >>> could be provided in its proper term? There are users out there who want
> >>> this!
> >>>
> >>> Regards,
> >>>
> >>> Peter
> >>>
> >>> [1] http://rs.tdwg.org/dwc/terms/index.htm#scientificName
> >>> [2] http://code.google.com/p/darwincore/issues/detail?id=150
> >>> [3] http://rs.tdwg.org/dwc/terms/index.htm#decimalLatitude
> >>> [4] http://rs.tdwg.org/dwc/terms/index.htm#minimumElevationInMeters
> >>> [5] http://rs.tdwg.org/dwc/terms/index.htm#countryCode
> >>> [6] http://rs.tdwg.org/dwc/terms/index.htm#verbatimLatitude
> >>> [7] http://rs.tdwg.org/dwc/terms/index.htm#verbatimElevation
> >>> [8] http://rs.tdwg.org/dwc/terms/index.htm#country
> >>>
> >>> On Wed, Mar 14, 2012 at 11:19, Donald Hobern (GBIF) <dhobern at gbif.org>
> >>> wrote:
> >>> >
> >>> > Hi Peter.
> >>> >
> >>> > I certainly agree that aggregators only represent one use case here
> >>> > but, having seen a lot of the mess of real-world data, I don't believe
> >>> > that simply adding a new term will fix this problem for the users you
> >>> > describe.  To get the results you want, we would need a sufficiently
> >>> > large majority of data sets to follow the rules perfectly that we could
> >>> > ignore those that were non-conformant.  This would mean we should
> >>> > mandate that every data set must use the new element (with or without
> >>> > the existing scientificName element) and that they must present
> >>> > scientific names in the expected way (or else have their data
> >>> > considered non-compliant). Until now, the philosophy on publishing
> >>> > Darwin Core data has been to make it as easy as possible for data
> >>> > providers to expose their data, even at the expense of greater
> >>> > complexity for consumers.  I suspect that we would have a lot less data
> >>> > available for use now if we had taken a more stringent approach.
> >>> >
> >>> > In some ways, this proposal reminds me of the structures in ABCD which
> >>> > seek to offer users verbatim and more normalised ways to represent
> >>> > several types of information.  This actually makes consuming all the
> >>> > possible forms of such data very complex, since a record may contain
> >>> > all variant forms or just any one of them.  If multiple forms are
> >>> > available, which one should be considered the primary version?
> >>> >
> >>> > I suspect that things may also get complicated as soon as you discuss
> >>> > botanical subspecies, varieties, subvarieties, forms and subforms.
> >>> > There are recommended ways to abbreviate the rank markers in these
> >>> > cases but some variation can be expected.
> >>> >
> >>> > Of course aggregators should be providing more robust services for
> >>> > accessing exactly what you want in a consistent, predictable way and I
> >>> > would suggest that the best place to attack the problem is to define
> >>> > exactly what a typical user needs to see and then for GBIF and similar
> >>> > projects to work on delivering predictable data downloads and web
> >>> > services that clean out all of these nomenclatural inconsistencies -
> >>> > and perhaps also add value in other ways such as augmenting the data
> >>> > with associated environmental values (as the Atlas of Living Australia
> >>> > does).  This would allow us all to work together on developing a
> >>> > consistent and predictable algorithm for handling interpretation of
> >>> > name strings, including synonymy, misspellings, virus names and
> >>> > everything else that makes this such a difficult problem.
> >>> >
> >>> > Best wishes,
> >>> >
> >>> > Donald
> >>> >
> >>> >
> >>> > ----------------------------------------------------------------------
> >>> > Donald Hobern - GBIF Director - dhobern at gbif.org
> >>> > Global Biodiversity Information Facility http://www.gbif.org/
> >>> > GBIF Secretariat, Universitetsparken 15, DK-2100 Copenhagen Ø, Denmark
> >>> > Tel: +45 3532 1471  Mob: +45 2875 1471  Fax: +45 2875 1480
> >>> > ----------------------------------------------------------------------
> >>> >
> >>> >
> >>> > -----Original Message-----
> >>> > From: peter.desmet.cubc at gmail.com [mailto:peter.desmet.cubc at gmail.com]
> >>> > On Behalf Of Peter Desmet
> >>> > Sent: Wednesday, March 14, 2012 3:41 PM
> >>> > To: Tim Robertson [GBIF]
> >>> > Cc: Donald Hobern (GBIF); dev Developers; TDWG content mailing list;
> >>> > TDWG TAG mailing list; Christian Gendreau
> >>> > Subject: Re: Canonical name parsing
> >>> >
> >>> > Hi Tim,
> >>> >
> >>> > I agree, aggregators like GBIF and Canadensys will have to deal with
> >>> > clean and dirty data in each field anyway: they need code libraries to
> >>> > deal with this and it is good that these are being developed. But that
> >>> > doesn't help someone who wants to use data from a Darwin Core Archive
> >>> > with his data in Excel or a Roderic Page who wants to get things done
> >>> > for a prototype. Having to use Java libraries or even the Name Parser
> >>> > [1] (though both great) is a barrier to data use. Darwin Core
> >>> > (Archives) is not only used for machine to machine interaction, humans
> >>> > use it too, and I think we should allow easy hacking (I mean this in
> >>> > the good sense), especially for something as important as the
> >>> > scientific name.
> >>> > In addition, as a data publisher (e.g. for our VASCAN checklist [2]) I
> >>> > *do* have the information to provide a clean and simple to use
> >>> > canonicalScientificName, but I just can't share it via the otherwise
> >>> > excellent biodiversity sharing standard Darwin Core. I think that's a
> >>> > pity.
> >>> >
> >>> > Peter
> >>> >
> >>> > [1] http://tools.gbif.org/nameparser/
> >>> > [2] http://data.canadensys.net/vascan
> >>> >
> >>> > PS: Yes, Canadensys will use the GBIF interpretation libraries. Since
> >>> > we develop in Java as well, using those libraries is as easy as the
> >>> > proverbial "one line of code". We're looking forward to testing them
> >>> > and providing patches to enhance them. Open source FTW! :-)
> >>> >
> >>> >
> >>> > On Wed, Mar 14, 2012 at 07:32, Tim Robertson [GBIF]
> >>> > <trobertson at gbif.org> wrote:
> >>> > > Hi Peter,
> >>> > >
> >>> > > I'm replying off the TDWG list, since it is a bit of a tangent to
> >>> > > your discussion.  If you feel it is relevant, please CC the list
> >>> > > again.
> >>> > >
> >>> > > At GBIF as you know, we have to interpret all kinds of quality of
> >>> > > content.  I tend to agree with Donald that this would not really
> >>> > > help in consumption, as in my experience we will have to deal with
> >>> > > both clean and dirty data in each field *anyway* when this is used
> >>> > > at network scale.  I would rather see us evolve the interpretation
> >>> > > libraries to handle all the corner cases, which we need to develop
> >>> > > anyway.  We already do a pretty decent job at extracting canonicals.
> >>> > > This is further enhanced when you couple the extracted canonical
> >>> > > with a fuzzy match against the "authoritative names" we can now
> >>> > > index thanks to the availability of checklists in DwC-A format.
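
As a rough illustration of coupling an extracted canonical with a fuzzy match against a list of authoritative names, here is a small Python sketch. It uses the standard-library `difflib` module as a stand-in and is not the GBIF implementation. The name list, the function name `fuzzy_match` and the 0.8 cutoff are all assumptions made for the example.

```python
import difflib
from typing import Optional, Sequence

# Hypothetical authoritative names, e.g. canonicals taken from a checklist.
AUTHORITATIVE_NAMES = ["Magnoliidae", "Abies", "Timon lepidus"]


def fuzzy_match(
    canonical: str,
    names: Sequence[str] = AUTHORITATIVE_NAMES,
    cutoff: float = 0.8,
) -> Optional[str]:
    """Return the closest authoritative name, or None if none is close enough.

    The cutoff is a similarity ratio between 0 and 1; 0.8 is an
    arbitrary choice for this sketch.
    """
    matches = difflib.get_close_matches(canonical, names, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# A misspelled canonical is mapped to the checklist name;
# an unrelated name finds no match.
print(fuzzy_match("Timon lepidis"))
print(fuzzy_match("Quercus"))
```

A production matcher would need to treat genus and epithets separately and handle homonyms, which this sketch ignores.
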
> >>> > >
> >>> > > I know you are a Java shop.  Are you using the GBIF interpretation
> >>> > > libraries [1] at the moment?  If not, is there a reason why you
> >>> > > don't?  They are used in all GBIF projects (portal, checklistbank
> >>> > > etc), and the more we enhance them, the better it is for everyone.
> >>> > > We have a significant test coverage [2,3] and there have been quite
> >>> > > some man months (years?) spent already in their development and
> >>> > > with some real regular expression experts (most notably Markus D.
> >>> > > and Dave M.).  All our work is Maven-ized, versioned and available
> >>> > > in our Maven repository [4].
> >>> > >
> >>> > > I hope these are interesting to you.  We would welcome any patches
> >>> > > to enhance them, or assistance in identifying the corner cases and
> >>> > > capturing those as unit tests.
> >>> > >
> >>> > > Hope this helps,
> >>> > > Tim
> >>> > >
> >>> > > [1] http://code.google.com/p/gbif-ecat/source/browse/trunk/ecat-common/src/main/java/org/gbif/ecat/parser/NameParser.java
> >>> > > [2] http://code.google.com/p/gbif-ecat/source/browse/trunk/ecat-common/src/test/java/org/gbif/ecat/parser/NameParserTest.java
> >>> > > [3] http://code.google.com/p/gbif-ecat/source/browse/trunk/ecat-common/src/#src%2Ftest%2Fresources
> >>> > > [4] http://repository.gbif.org/index.html#nexus-search;quick~ecat-common
> >>> > >
> >>> >
> >>> >
> >>> >
> >>> > --
> >>> > Peter Desmet
> >>> > Biodiversity Informatics Manager
> >>> > Canadensys - www.canadensys.net
> >>> >
> >>> > Université de Montréal Biodiversity Centre
> >>> > 4101 rue Sherbrooke est
> >>> > Montreal, QC, H1X2B2
> >>> > Canada
> >>> >
> >>> > Phone: 514-343-6111 #82354
> >>> > Fax: 514-343-2288
> >>> > Email: peter.desmet at umontreal.ca / peter.desmet.cubc at gmail.com
> >>> > Skype: anderhalv
> >>> > Public profile: http://www.linkedin.com/in/peterdesmet
> >>> >
> >>> >
> >>> > _______________________________________________
> >>> > tdwg-content mailing list
> >>> > tdwg-content at lists.tdwg.org
> >>> > http://lists.tdwg.org/mailman/listinfo/tdwg-content
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>
> >>
> >>
> >>
> >>
> >>
> >> _______________________________________________
> >> tdwg-tag mailing list
> >> tdwg-tag at lists.tdwg.org
> >> http://lists.tdwg.org/mailman/listinfo/tdwg-tag
> >>
> >
> >
> >
> >
>



-- 
Peter Desmet
Biodiversity Informatics Manager
Canadensys - www.canadensys.net

Université de Montréal Biodiversity Centre
4101 rue Sherbrooke est
Montreal, QC, H1X2B2
Canada

Phone: 514-343-6111 #82354
Fax: 514-343-2288
Email: peter.desmet at umontreal.ca / peter.desmet.cubc at gmail.com
Skype: anderhalv
Public profile: http://www.linkedin.com/in/peterdesmet