[tdwg-content] Canonical name parsing
Dmitry Mozzherin
dmozzherin at eol.org
Wed Mar 14 15:59:34 CET 2012
I agree with Peter. Having clear disambiguation in DarwinCore between
terms for 'canonical name' and 'name with authorship' would make my
life much easier too. Right now I have to go through quite a bit of
fuzzy logic every time I open a DarwinCore Archive file, trying to
figure out what the authors of the file meant when they said
'scientificName'.
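
To give a rough idea of the guesswork involved, here is a minimal Python
sketch of one such check. It is only an illustration under simplifying
assumptions, not the GBIF Name Parser or any existing library: the
function name looks_like_canonical, the RANK_MARKERS set and the sample
strings are all hypothetical, and it only handles plain ASCII names
without hybrid markers.

```python
import re

# Hypothetical, deliberately incomplete list of infraspecific rank markers.
RANK_MARKERS = {"subsp.", "ssp.", "var.", "subvar.", "f."}

# Genus or uninomial: one capital letter followed by lowercase letters.
GENUS_RE = re.compile(r"[A-Z][a-z]+")
# Epithet: lowercase letters, optionally hyphenated.
EPITHET_RE = re.compile(r"[a-z]+(-[a-z]+)*")


def looks_like_canonical(name: str) -> bool:
    """Guess whether a scientificName string is a bare canonical name.

    Returns False as soon as a token looks like authorship (capitalized
    word, abbreviation with a period, year, parentheses, '&', ...).
    """
    tokens = name.split()
    if not tokens or not GENUS_RE.fullmatch(tokens[0]):
        return False
    for token in tokens[1:]:
        if token in RANK_MARKERS:
            continue  # rank markers are allowed inside a canonical name
        if not EPITHET_RE.fullmatch(token):
            return False  # anything else is treated as authorship or noise
    return True


if __name__ == "__main__":
    # Illustrative inputs only.
    for sample in ["Poa annua", "Poa annua L.", "Poa annua var. reptans"]:
        print(sample, "->", looks_like_canonical(sample))
```

A check like this still misclassifies plenty of real-world strings
(lowercase author particles, accented epithets, hybrids), which is why
separate terms in the standard would be so much simpler than guessing.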
Dima
On Wed, Mar 14, 2012 at 10:40 AM, Peter Desmet
<peter.desmet at umontreal.ca> wrote:
> Hi Tim,
>
> I agree, aggregators like GBIF and Canadensys will have to deal with
> clean and dirty data in each field anyway: they need code libraries to
> deal with this and it is good that these are being developed. But,
> that doesn't help someone who wants to use data from a Darwin Core
> Archive with his data in Excel or a Roderic Page who wants to get
> things done for a prototype.
> Having to use Java libraries or even the Name Parser [1] (though both
> great) is a barrier to data use. Darwin Core (Archives) is not only
> used for machine to machine interaction, humans use it too, and I
> think we should allow easy hacking (I mean this in the good sense),
> especially for something as important as the scientific name.
> In addition, as a data publisher (e.g. for our VASCAN checklist [2]) I
> *do* have the information to provide a clean and simple to use
> canonicalScientificName, but I just can't share it via the otherwise
> excellent biodiversity sharing standard Darwin Core. I think that's a
> pity.
>
> Peter
>
> [1] http://tools.gbif.org/nameparser/
> [2] http://data.canadensys.net/vascan
>
> PS: Yes, Canadensys will use the GBIF interpretation libraries. Since
> we develop in Java as well, using those libraries is as easy as the
> proverbial "one line of code". We're looking forward to testing them
> and providing patches to enhance them. Open source FTW! :-)
>
>
> On Wed, Mar 14, 2012 at 07:32, Tim Robertson [GBIF] <trobertson at gbif.org> wrote:
>> Hi Peter,
>>
>> I'm replying off the TDWG list, since it is a bit of a tangent to your discussion. If you feel it is relevant, please CC the list again.
>>
>> At GBIF, as you know, we have to interpret content of widely varying quality. I tend to agree with Donald that this would not really help in consumption, as in my experience we will have to deal with both clean and dirty data in each field *anyway* when this is used at network scale. I would rather see us evolve the interpretation libraries to handle all the corner cases, which we need to develop anyway. We already do a pretty decent job at extracting canonicals. This is further enhanced when you couple the extracted canonical with a fuzzy match against the "authoritative names" we can now index thanks to the availability of checklists in DwC-A format.
>>
>> I know you are a Java shop. Are you using the GBIF interpretation libraries [1] at the moment? If not, is there a reason why you don't?
>> They are used in all GBIF projects (portal, checklistbank etc), and the more we enhance them, the better it is for everyone. We have significant test coverage [2,3], and quite a few man months (years?) have already been spent on their development, with input from some real regular expression experts (most notably Markus D. and Dave M.). All our work is Maven-ized, versioned and available in our Maven repository [4].
>>
>> I hope these are interesting to you. We would welcome any patches to enhance them, or assistance in identifying the corner cases and capturing those as unit tests.
>>
>> Hope this helps,
>> Tim
>>
>> [1] http://code.google.com/p/gbif-ecat/source/browse/trunk/ecat-common/src/main/java/org/gbif/ecat/parser/NameParser.java
>> [2] http://code.google.com/p/gbif-ecat/source/browse/trunk/ecat-common/src/test/java/org/gbif/ecat/parser/NameParserTest.java
>> [3] http://code.google.com/p/gbif-ecat/source/browse/trunk/ecat-common/src/#src%2Ftest%2Fresources
>> [4] http://repository.gbif.org/index.html#nexus-search;quick~ecat-common
>>
>
>
>
> --
> Peter Desmet
> Biodiversity Informatics Manager
> Canadensys - www.canadensys.net
>
> Université de Montréal Biodiversity Centre
> 4101 rue Sherbrooke est
> Montreal, QC, H1X2B2
> Canada
>
> Phone: 514-343-6111 #82354
> Fax: 514-343-2288
> Email: peter.desmet at umontreal.ca / peter.desmet.cubc at gmail.com
> Skype: anderhalv
> Public profile: http://www.linkedin.com/in/peterdesmet
> _______________________________________________
> tdwg-content mailing list
> tdwg-content at lists.tdwg.org
> http://lists.tdwg.org/mailman/listinfo/tdwg-content