[tdwg-content] Canonical name parsing

Peter Desmet peter.desmet at umontreal.ca
Wed Mar 14 18:02:54 CET 2012


Hi Donald,

scientificName, with its current definition [1], is a great term and should
continue to be used as such. As with most Darwin Core terms, it offers
flexibility, so it's not an impediment to publishing data. In the GBIF
context, the term is considered mandatory: records without it are ignored
during indexing (I believe). All of this can stay.

canonicalScientificName would be an *additional* term with a *clear rule*
(see my proposed definition [2]). This is already the case for other Darwin
Core terms, such as decimalLatitude [3], minimumElevationInMeters [4] and
countryCode [5]: they serve as a ready-to-use addition/alternative to
verbatimLatitude [6], verbatimElevation [7] and country [8] respectively.
These terms don't stop anyone from publishing data, but data publishers who
can provide this kind of information have the choice to do so. It would be
the same for canonicalScientificName.
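
For illustration only (the exact rule would of course be whatever the
proposed definition [2] ends up saying), a record could then carry both
forms side by side, just like country and countryCode:

    scientificName:          Abies alba Mill.
    canonicalScientificName: Abies alba

    scientificName:          Quercus robur L.
    canonicalScientificName: Quercus robur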

And yes, an aggregator like GBIF can play an important role in providing
consistent data to its users and figuring out what they really need, but
not all data is consumed that way. In addition, I hope a user would be able
to download cleaned data from the GBIF portal as Darwin Core. Wouldn't it
be nice if the canonicalScientificName parsed by GBIF could be provided in
its own, proper term? There are users out there who want this!

Regards,

Peter

[1] http://rs.tdwg.org/dwc/terms/index.htm#scientificName
[2] http://code.google.com/p/darwincore/issues/detail?id=150
[3] http://rs.tdwg.org/dwc/terms/index.htm#decimalLatitude
[4] http://rs.tdwg.org/dwc/terms/index.htm#minimumElevationInMeters
[5] http://rs.tdwg.org/dwc/terms/index.htm#countryCode
[6] http://rs.tdwg.org/dwc/terms/index.htm#verbatimLatitude
[7] http://rs.tdwg.org/dwc/terms/index.htm#verbatimElevation
[8] http://rs.tdwg.org/dwc/terms/index.htm#country

On Wed, Mar 14, 2012 at 11:19, Donald Hobern (GBIF) <dhobern at gbif.org>
wrote:
>
> Hi Peter.
>
> I certainly agree that aggregators only represent one use case here but,
having seen a lot of the mess of real-world data, I don't believe that
simply adding a new term will fix this problem for the users you describe.
 To get the results you want, we would need such a large majority of data
sets to follow the rules perfectly that we could ignore the non-conformant
ones.  This would mean mandating that every data set use the new element
(with or without the existing scientificName element) and present
scientific names in the expected way (or else have its data considered
non-compliant). Until now, the
philosophy on publishing Darwin Core data has been to make it as easy as
possible for data providers to expose their data, even at the expense of
greater complexity for consumers.  I suspect that we would have a lot less
data available for use now if we had taken a more stringent approach.
>
> In some ways, this proposal reminds me of the structures in ABCD which
seek to offer users verbatim and more normalised ways to represent several
types of information.  This actually makes consuming all the possible forms
of such data very complex, since a record may contain all variant forms or
just any one of them.  If multiple forms are available, which one should be
considered the primary version?
>
> I suspect that things may also get complicated as soon as you discuss
botanical subspecies, varieties, subvarieties, forms and subforms.  There
are recommended ways to abbreviate the rank markers in these cases but some
variation can be expected.
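>
> For example, one might encounter all of these renderings of the same
> subspecies (purely illustrative):
>
>     Poa pratensis subsp. irrigata
>     Poa pratensis ssp. irrigata
>     Poa pratensis subspecies irrigata
>
> Any rule for a canonical form has to say how each of these is treated.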
>
> Of course aggregators should be providing more robust services for
accessing exactly what you want in a consistent, predictable way. I would
suggest that the best place to attack the problem is to define exactly what
a typical user needs to see, and then for GBIF and similar projects to work
on delivering predictable data downloads and web services that clean out
all of these nomenclatural inconsistencies - and perhaps also add value in
other ways, such as augmenting the data with associated environmental
values (as the Atlas of Living Australia does).  This would
allow us all to work together on developing a consistent and predictable
algorithm for handling interpretation of name strings, including synonymy,
misspellings, virus names and everything else that makes this such a
difficult problem.
>
> Best wishes,
>
> Donald
>
> ----------------------------------------------------------------------
> Donald Hobern - GBIF Director - dhobern at gbif.org
> Global Biodiversity Information Facility http://www.gbif.org/
> GBIF Secretariat, Universitetsparken 15, DK-2100 Copenhagen Ø, Denmark
> Tel: +45 3532 1471  Mob: +45 2875 1471  Fax: +45 2875 1480
> ----------------------------------------------------------------------
>
>
> -----Original Message-----
> From: peter.desmet.cubc at gmail.com [mailto:peter.desmet.cubc at gmail.com] On
Behalf Of Peter Desmet
> Sent: Wednesday, March 14, 2012 3:41 PM
> To: Tim Robertson [GBIF]
> Cc: Donald Hobern (GBIF); dev Developers; TDWG content mailing list; TDWG
TAG mailing list; Christian Gendreau
> Subject: Re: Canonical name parsing
>
> Hi Tim,
>
> I agree, aggregators like GBIF and Canadensys will have to deal with
clean and dirty data in each field anyway: they need code libraries to deal
with this, and it is good that these are being developed. But that doesn't
help someone who wants to use data from a Darwin Core Archive with their
data in Excel, or a Roderic Page who wants to get things done for a
prototype.
> Having to use Java libraries or even the Name Parser [1] (though both are
great) is a barrier to data use. Darwin Core (Archives) is not only used
for machine-to-machine interaction; humans use it too, and I think we
should allow easy hacking (I mean this in the good sense), especially for
something as important as the scientific name.
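>
> To make that concrete: with a ready-made canonicalScientificName column
> in the data file, consuming the name takes nothing more than a few lines
> of plain code (or a spreadsheet import). A rough sketch, assuming a
> tab-delimited occurrence file and a made-up column position:
>
>     import java.io.BufferedReader;
>     import java.io.FileReader;
>
>     public class PrintCanonicalNames {
>         public static void main(String[] args) throws Exception {
>             // Index of the hypothetical canonicalScientificName column, as
>             // it would be declared in the archive's meta.xml (made up here).
>             int canonicalColumn = 5;
>             try (BufferedReader in =
>                     new BufferedReader(new FileReader("occurrence.txt"))) {
>                 String line;
>                 while ((line = in.readLine()) != null) {
>                     String[] fields = line.split("\t", -1);
>                     // The value is ready to use: no name-parsing library needed.
>                     if (fields.length > canonicalColumn) {
>                         System.out.println(fields[canonicalColumn]);
>                     }
>                 }
>             }
>         }
>     }
>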
> In addition, as a data publisher (e.g. for our VASCAN checklist [2]) I
> *do* have the information to provide a clean and simple-to-use
canonicalScientificName, but I just can't share it via Darwin Core, an
otherwise excellent standard for sharing biodiversity data. I think that's
a pity.
>
> Peter
>
> [1] http://tools.gbif.org/nameparser/
> [2] http://data.canadensys.net/vascan
>
> PS: Yes, Canadensys will use the GBIF interpretation libraries. Since we
develop in Java as well, using those libraries is as easy as the proverbial
"one line of code". We're looking forward to testing them and providing
patches to enhance them. Open source FTW! :-)
>
>
> On Wed, Mar 14, 2012 at 07:32, Tim Robertson [GBIF] <trobertson at gbif.org>
wrote:
> > Hi Peter,
> >
> > I'm replying off the TDWG list, since it is a bit of a tangent to your
discussion.  If you feel it is relevant, please CC the list again.
> >
> > At GBIF, as you know, we have to interpret content of all kinds of
quality.  I tend to agree with Donald that this would not really help in
consumption, as in my experience we will have to deal with both clean and
dirty data in each field *anyway* when this is used at network scale.  I
would rather see us evolve the interpretation libraries to handle all the
corner cases, which we need to develop anyway.  We already do a pretty
decent job at extracting canonicals.  This is further enhanced when you
couple the extracted canonical with a fuzzy match against the
"authoritative names" we can now index thanks to the availability of
checklists in DwC-A format.
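> >
> > Purely as an illustration of the idea (this is not the actual GBIF code;
> > the real logic lives in the NameParser [1] and the checklist index), a
> > fuzzy match of an extracted canonical against an indexed list could look
> > roughly like this:
> >
> >     import java.util.Arrays;
> >     import java.util.List;
> >
> >     public class FuzzyNameMatch {
> >
> >         // Plain Levenshtein edit distance between two strings.
> >         static int editDistance(String a, String b) {
> >             int[][] d = new int[a.length() + 1][b.length() + 1];
> >             for (int i = 0; i <= a.length(); i++) d[i][0] = i;
> >             for (int j = 0; j <= b.length(); j++) d[0][j] = j;
> >             for (int i = 1; i <= a.length(); i++) {
> >                 for (int j = 1; j <= b.length(); j++) {
> >                     int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
> >                     d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
> >                             d[i - 1][j - 1] + cost);
> >                 }
> >             }
> >             return d[a.length()][b.length()];
> >         }
> >
> >         // Return the closest indexed canonical within a small distance, or null.
> >         static String bestMatch(String canonical, List<String> index) {
> >             String best = null;
> >             int bestDistance = Integer.MAX_VALUE;
> >             for (String candidate : index) {
> >                 int distance =
> >                         editDistance(canonical.toLowerCase(), candidate.toLowerCase());
> >                 if (distance < bestDistance) {
> >                     bestDistance = distance;
> >                     best = candidate;
> >                 }
> >             }
> >             return bestDistance <= 2 ? best : null; // arbitrary threshold for this sketch
> >         }
> >
> >         public static void main(String[] args) {
> >             List<String> index = Arrays.asList("Abies alba", "Poa pratensis");
> >             System.out.println(bestMatch("Abies albba", index)); // -> Abies alba
> >         }
> >     }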
> >
> > I know you are a Java shop.  Are you using the GBIF interpretation
libraries [1] at the moment?  If not, is there a reason why you don't?
> > They are used in all GBIF projects (portal, checklistbank etc.), and the
more we enhance them, the better it is for everyone.  We have significant
test coverage [2,3], and quite a few person-months (years?) have already
gone into their development, including work by some real regular-expression
experts (most notably Markus D. and Dave M.).  All our work is Maven-ized,
versioned and available in our Maven repository [4].
> >
> > I hope these are interesting to you.  We would welcome any patches to
enhance them, or assistance in identifying the corner cases and capturing
those as unit tests.
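> >
> > To show what capturing a corner case could look like, here is a rough
> > JUnit sketch; the parser call is a hypothetical placeholder, since the
> > real method names are in NameParserTest [2]:
> >
> >     import static org.junit.Assert.assertEquals;
> >
> >     import org.junit.Test;
> >
> >     public class CanonicalCornerCaseTest {
> >
> >         // Hypothetical stand-in; the real calls against
> >         // org.gbif.ecat.parser.NameParser are shown in NameParserTest [2].
> >         private String parseCanonical(String scientificName) {
> >             throw new UnsupportedOperationException("plug in the real parser here");
> >         }
> >
> >         @Test
> >         public void rankMarkerVariantsGiveSameCanonical() {
> >             // Corner case from this thread: "subsp." and "ssp." should not
> >             // yield two different canonical forms for the same subspecies.
> >             assertEquals(
> >                     parseCanonical("Poa pratensis subsp. irrigata (Lindm.) H.Lindb."),
> >                     parseCanonical("Poa pratensis ssp. irrigata (Lindm.) H.Lindb."));
> >         }
> >     }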
> >
> > Hope this helps,
> > Tim
> >
> > [1] http://code.google.com/p/gbif-ecat/source/browse/trunk/ecat-common/src/main/java/org/gbif/ecat/parser/NameParser.java
> > [2] http://code.google.com/p/gbif-ecat/source/browse/trunk/ecat-common/src/test/java/org/gbif/ecat/parser/NameParserTest.java
> > [3] http://code.google.com/p/gbif-ecat/source/browse/trunk/ecat-common/src/#src%2Ftest%2Fresources
> > [4] http://repository.gbif.org/index.html#nexus-search;quick~ecat-common
> >
>
>
>
> --
> Peter Desmet
> Biodiversity Informatics Manager
> Canadensys - www.canadensys.net
>
> Université de Montréal Biodiversity Centre
> 4101 rue Sherbrooke est
> Montreal, QC, H1X2B2
> Canada
>
> Phone: 514-343-6111 #82354
> Fax: 514-343-2288
> Email: peter.desmet at umontreal.ca / peter.desmet.cubc at gmail.com
> Skype: anderhalv
> Public profile: http://www.linkedin.com/in/peterdesmet
>
>




--
Peter Desmet
Biodiversity Informatics Manager
Canadensys - www.canadensys.net

Université de Montréal Biodiversity Centre
4101 rue Sherbrooke est
Montreal, QC, H1X2B2
Canada

Phone: 514-343-6111 #82354
Fax: 514-343-2288
Email: peter.desmet at umontreal.ca / peter.desmet.cubc at gmail.com
Skype: anderhalv
Public profile: http://www.linkedin.com/in/peterdesmet