Hi Tim,
I agree, aggregators like GBIF and Canadensys will have to deal with clean and dirty data in each field anyway: they need code libraries to deal with this, and it is good that these are being developed. But that doesn't help someone who wants to combine data from a Darwin Core Archive with their own data in Excel, or a Roderic Page who wants to get things done for a prototype. Having to use Java libraries or even the Name Parser [1] (though both are great) is a barrier to data use. Darwin Core (Archives) is not only used for machine-to-machine interaction; humans use it too, and I think we should allow easy hacking (I mean this in the good sense), especially for something as important as the scientific name. In addition, as a data publisher (e.g. for our VASCAN checklist [2]) I *do* have the information to provide a clean and simple-to-use canonicalScientificName, but I just can't share it via the otherwise excellent biodiversity sharing standard Darwin Core. I think that's a pity.
Peter
[1] http://tools.gbif.org/nameparser/
[2] http://data.canadensys.net/vascan
PS: Yes, Canadensys will use the GBIF interpretation libraries. Since we develop in Java as well, using those libraries is as easy as the proverbial "one line of code". We're looking forward to testing them and providing patches to enhance them. Open source FTW! :-)
On Wed, Mar 14, 2012 at 07:32, Tim Robertson [GBIF] trobertson@gbif.org wrote:
Hi Peter,
I'm replying off the TDWG list, since it is a bit of a tangent to your discussion. If you feel it is relevant, please CC the list again.
At GBIF, as you know, we have to interpret content of every level of quality. I tend to agree with Donald that this would not really help in consumption, as in my experience we will have to deal with both clean and dirty data in each field *anyway* when this is used at network scale. I would rather see us evolve the interpretation libraries to handle all the corner cases, which we need to develop anyway. We already do a pretty decent job at extracting canonicals. This is further enhanced when you couple the extracted canonical with a fuzzy match against the "authoritative names" we can now index thanks to the availability of checklists in DwC-A format.
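As a rough sketch of the "extract a canonical, then fuzzy match" idea, here is a small Python illustration. It is not the GBIF interpretation libraries (those are Java and handle far more cases); the function names, the deliberately naive regular expression, the sample checklist and the 0.85 cutoff are all assumptions made up for this example. It ignores hybrids, lowercase author particles and many other corner cases.

```python
import difflib
import re

# Hypothetical, deliberately naive pattern: a capitalized genus, an optional
# lowercase species epithet, and an optional infraspecific epithet that is
# introduced by a rank marker (the marker itself is dropped).
NAME_PATTERN = re.compile(
    r"^([A-Z][a-z]+)"                                   # genus
    r"(?:\s+([a-z][a-z-]+))?"                           # species epithet
    r"(?:\s+(?:subsp\.|var\.|f\.)\s+([a-z][a-z-]+))?"   # infraspecific epithet
)


def extract_canonical(scientific_name):
    """Return the name without authorship or rank marker, or None if unparsable."""
    match = NAME_PATTERN.match(scientific_name.strip())
    if match is None:
        return None
    return " ".join(part for part in match.groups() if part)


def match_authoritative(canonical, authoritative_names, cutoff=0.85):
    """Return the closest authoritative name at or above the cutoff, else None."""
    candidates = difflib.get_close_matches(
        canonical, authoritative_names, n=1, cutoff=cutoff
    )
    return candidates[0] if candidates else None


# Assumed sample data standing in for an indexed checklist.
checklist = ["Picea glauca", "Picea mariana", "Acer saccharum"]

canonical = extract_canonical("Picea glauka (Moench) Voss")  # misspelled epithet
resolved = match_authoritative(canonical, checklist)
```

With the sample data, the misspelled "Picea glauka" is extracted as the canonical and then resolves to "Picea glauca" from the checklist; a name with no close counterpart returns None rather than a forced match.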
I know you are a Java shop. Are you using the GBIF interpretation libraries [1] at the moment? If not, is there a reason why you don't? They are used in all GBIF projects (portal, checklistbank etc.), and the more we enhance them, the better it is for everyone. We have significant test coverage [2,3], and quite a few man-months (years?) have already gone into their development, with input from some real regular expression experts (most notably Markus D. and Dave M.). All our work is Maven-ized, versioned and available in our Maven repository [4].
I hope these are interesting to you. We would welcome any patches to enhance them, or assistance in identifying the corner cases and capturing those as unit tests.
Hope this helps,
Tim
[1] http://code.google.com/p/gbif-ecat/source/browse/trunk/ecat-common/src/main/...
[2] http://code.google.com/p/gbif-ecat/source/browse/trunk/ecat-common/src/test/...
[3] http://code.google.com/p/gbif-ecat/source/browse/trunk/ecat-common/src/#src%...
[4] http://repository.gbif.org/index.html#nexus-search;quick~ecat-common