[tdwg-content] Canonical name parsing
Peter Desmet
peter.desmet at umontreal.ca
Wed Mar 14 15:40:37 CET 2012
Hi Tim,
I agree, aggregators like GBIF and Canadensys will have to deal with
clean and dirty data in each field anyway: they need code libraries to
deal with this and it is good that these are being developed. But,
that doesn't help someone who wants to use data from a Darwin Core
Archive with his data in Excel or a Roderic Page who wants to get
things done for a prototype.
Having to use Java libraries or even the Name Parser [1] (though both
great) is a barrier to data use. Darwin Core (Archives) is not only
used for machine-to-machine interaction; humans use it too, and I
think we should allow easy hacking (I mean this in the good sense),
especially for something as important as the scientific name.
In addition, as a data publisher (e.g. for our VASCAN checklist [2]) I
*do* have the information to provide a clean and simple to use
canonicalScientificName, but I just can't share it via the otherwise
excellent biodiversity sharing standard Darwin Core. I think that's a
pity.
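As a rough illustration of the kind of easy hacking meant here, the
sketch below reads a small tab-delimited excerpt and takes the clean
name straight from a column, with no parsing library involved. It
assumes a hypothetical core file that carries a canonicalScientificName
column; that term is the one proposed in this thread and is not an
existing Darwin Core term. The sample rows and the function name are
made up for the example.

```python
import csv
import io

# Hypothetical excerpt of a tab-delimited Darwin Core Archive core file.
# ASSUMPTION: canonicalScientificName is the term proposed in this thread,
# not a term that Darwin Core currently defines.
SAMPLE = (
    "taxonID\tscientificName\tcanonicalScientificName\n"
    "1\tAcer saccharum Marshall\tAcer saccharum\n"
    "2\tPicea glauca (Moench) Voss\tPicea glauca\n"
)


def canonical_names(tsv_text):
    """Return the canonical name of every row, read directly from its column."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row["canonicalScientificName"] for row in reader]


if __name__ == "__main__":
    for name in canonical_names(SAMPLE):
        print(name)
```

Without such a column, the same user would have to strip authorship
from scientificName, which is exactly where a name parser becomes
necessary.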
Peter
[1] http://tools.gbif.org/nameparser/
[2] http://data.canadensys.net/vascan
PS: Yes, Canadensys will use the GBIF interpretation libraries. Since
we develop in Java as well, using those libraries is as easy as the
proverbial "one line of code". We're looking forward to testing them
and providing patches to enhance them. Open source FTW! :-)
On Wed, Mar 14, 2012 at 07:32, Tim Robertson [GBIF] <trobertson at gbif.org> wrote:
> Hi Peter,
>
> I'm replying off the TDWG list, since it is a bit of a tangent to your discussion. If you feel it is relevant, please CC the list again.
>
> At GBIF as you know, we have to interpret all kinds of quality of content. I tend to agree with Donald that this would not really help in consumption, as in my experience we will have to deal with both clean and dirty data in each field *anyway* when this is used at network scale. I would rather see us evolve the interpretation libraries to handle all the corner cases, which we need to develop anyway. We already do a pretty decent job at extracting canonicals. This is further enhanced when you couple the extracted canonical with a fuzzy match against the "authoritative names" we can now index thanks to the availability of checklists in DwC-A format.
>
> I know you are a Java shop. Are you using the GBIF interpretation libraries [1] at the moment? If not, is there a reason why you don't?
> They are used in all GBIF projects (portal, checklistbank etc), and the more we enhance them, the better it is for everyone. We have a significant test coverage [2,3] and there have been quite some man months (years?) spent already in their development and with some real regular expression experts (most notably Markus D. and Dave M.). All our work is Maven-ized, versioned and available in our Maven repository [4].
>
> I hope these are interesting to you. We would welcome any patches to enhance them, or assistance in identifying the corner cases and capturing those as unit tests.
>
> Hope this helps,
> Tim
>
> [1] http://code.google.com/p/gbif-ecat/source/browse/trunk/ecat-common/src/main/java/org/gbif/ecat/parser/NameParser.java
> [2] http://code.google.com/p/gbif-ecat/source/browse/trunk/ecat-common/src/test/java/org/gbif/ecat/parser/NameParserTest.java
> [3] http://code.google.com/p/gbif-ecat/source/browse/trunk/ecat-common/src/#src%2Ftest%2Fresources
> [4] http://repository.gbif.org/index.html#nexus-search;quick~ecat-common
>
--
Peter Desmet
Biodiversity Informatics Manager
Canadensys - www.canadensys.net
Université de Montréal Biodiversity Centre
4101 rue Sherbrooke est
Montreal, QC, H1X2B2
Canada
Phone: 514-343-6111 #82354
Fax: 514-343-2288
Email: peter.desmet at umontreal.ca / peter.desmet.cubc at gmail.com
Skype: anderhalv
Public profile: http://www.linkedin.com/in/peterdesmet