Dave & Tim,
Can you describe the linkage between the Drupal-based GBIF vocabulary server and the dictionaries in your parsers? Is the former used to seed the latter? How often does the latter get refreshed from data produced in the former? Does all that work take place in Refine? If you have published a white paper on this workflow already, could you point me to it so I can better understand the depth of the maintenance costs?
Cheers,
David
On Sat, May 18, 2013 at 8:50 PM, David Remsen dremsen@gbif.org wrote:
David,
You might like to use the GBIF vocabulary server. It has a multi-lingual country name thesaurus based on ISO 3166 and has over 23K terms for 226 ISO countries. You can download the data or use the service. It may have some lexical variants and misspellings. You can also get an account and add any you might know of. And all presented to you in your old friend Drupal. Perhaps you might like to serve as curator. Maybe? Diamond in the rough here, I'm sure of it.
http://vocabularies.gbif.org/vocabularies/country
Best, Dave
David Remsen
Global Biodiversity Information Facility Secretariat
Universitetsparken 15, DK-2100 Copenhagen, Denmark
Tel: +1 508 289 7477
Fax: +1 508 289 7900
Mobile: +1 508 274 4055
Skype: dremsen
On May 17, 2013, at 10:39 AM, Matt Jones wrote:
A good official list of countries is available from the Library of Congress:
http://www.loc.gov/standards/codelists/countries.xml
For background, see:
http://www.loc.gov/marc/countries/
And of course there's ISO 3166, the list of country codes:
http://www.iso.org/iso/home/standards/country_codes/country_names_and_code_e...
http://www.iso.org/iso/country_codes
Not sure about the alternate representations and misspellings, though.
Matt
On Fri, May 17, 2013 at 5:57 AM, Shorthouse, David davidpshorthouse@gmail.com wrote:
Folks,
The Canadensys development team (http://www.canadensys.net) is looking for efficient, low-maintenance ways to validate and reconcile data in its national cache of occurrence data. We are working on a Java library, https://github.com/Canadensys/narwhal-processor, that will initially tackle single-field Darwin Core validations. We hope this library is sufficiently generalized for use outside our project.
Our current challenge is to reconcile country names, which requires access to an up-to-date, well-maintained knowledge base of country names, their alternative representations (possibly multilingual), and mappings from known misspellings. For performance reasons, we'd like this thesaurus to be embedded in the library, but with the capacity to be refreshed periodically with data pulled from external resources such as dbpedia.org. This clearly has ties to semantic web thinking and, because we're new to the tools and services in this space, we'd like to solicit pointers and feedback so that we build this part of our library with maximal benefit to other projects. We started collecting thoughts here: https://github.com/Canadensys/narwhal-processor/issues/14.
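To make the idea concrete, here is a minimal sketch of an embedded thesaurus that can be swapped out on refresh. It is only an illustration under stated assumptions: the class name CountryThesaurus and its methods are hypothetical and are not part of narwhal-processor, the normalization rules are one possible choice, and the caller is assumed to supply variant-to-code pairs obtained from whichever external source is chosen.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch of an embedded, refreshable country-name thesaurus. */
final class CountryThesaurus {

    // Normalized variant (name, translation, or known misspelling) -> ISO 3166-1 alpha-2 code.
    // The map is replaced wholesale on refresh, so readers never see a half-built dictionary.
    private volatile Map<String, String> variantToCode = new HashMap<>();

    /** Strips diacritics, lowercases, turns punctuation into spaces, and trims. */
    static String normalize(String raw) {
        String decomposed = Normalizer.normalize(raw, Normalizer.Form.NFD);
        return decomposed
                .replaceAll("\\p{M}+", "")              // drop combining marks (accents)
                .toLowerCase(Locale.ROOT)
                .replaceAll("[^\\p{L}\\p{Nd}]+", " ")   // non-letter, non-digit runs -> one space
                .trim();
    }

    /**
     * Rebuilds the dictionary from raw variant -> code pairs, for example after a
     * periodic pull from an external resource (how that pull works is left open here).
     */
    void refresh(Map<String, String> rawVariantToCode) {
        Map<String, String> fresh = new HashMap<>();
        rawVariantToCode.forEach((variant, code) -> fresh.put(normalize(variant), code));
        variantToCode = fresh;
    }

    /** Returns the country code for a verbatim value, or empty if it is not recognized. */
    Optional<String> reconcile(String verbatimCountry) {
        if (verbatimCountry == null) {
            return Optional.empty();
        }
        return Optional.ofNullable(variantToCode.get(normalize(verbatimCountry)));
    }
}
```

With illustrative entries such as "Canada", "Kanada" and "Côte d'Ivoire" loaded through refresh, a call like reconcile("  CANADA ") would resolve to the same code as the canonical spelling, while an unknown string yields an empty result that can be flagged for review. Exact-match lookup on a normalized key only covers misspellings that have been recorded; fuzzy matching would be a separate design decision.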
Cheers,
David P. Shorthouse
Christian Gendreau
_______________________________________________
tdwg mailing list
tdwg@lists.tdwg.org
http://lists.tdwg.org/mailman/listinfo/tdwg