[tdwg] Country name reconciliation

Tim Robertson [GBIF] trobertson at gbif.org
Thu May 23 09:23:51 CEST 2013


Hi David S,

The short answer is that there is currently no linkage; we maintain the dictionaries in the parser periodically using Google Refine.

John W. and I are discussing merging our separate dictionaries to benefit all downstream consumers (Canadensys, for example). We're only talking about the tab-file dictionaries, so the GBIF parser API won't change. I'll pull in the Drupal content as well.
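
To make the tab-file idea concrete, here is a minimal sketch of how a dictionary like that could be loaded and queried in Java. It is not the GBIF parser or the narwhal-processor code: the two-column layout (name variant, then ISO 3166 alpha-2 code), the class name CountryDictionary, and the sample rows are all assumptions made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical class: maps country-name variants to ISO 3166 alpha-2 codes.
public class CountryDictionary {
    private final Map<String, String> variantToIso = new HashMap<>();

    // Strip accents, trim, collapse whitespace and lower-case so that
    // "  CANADA " and "canada" resolve to the same dictionary key.
    public static String normalize(String raw) {
        String decomposed = Normalizer.normalize(raw, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "")
                .trim()
                .replaceAll("\\s+", " ")
                .toLowerCase(Locale.ROOT);
    }

    // Assumed file layout: one "variant<TAB>ISO code" pair per line.
    // Blank lines, '#' comment lines and malformed rows are skipped.
    public static CountryDictionary load(BufferedReader reader) throws IOException {
        CountryDictionary dict = new CountryDictionary();
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.trim().isEmpty() || line.startsWith("#")) {
                continue;
            }
            String[] cols = line.split("\t");
            if (cols.length < 2) {
                continue;
            }
            dict.variantToIso.put(normalize(cols[0]),
                    cols[1].trim().toUpperCase(Locale.ROOT));
        }
        return dict;
    }

    // Returns the ISO code for a known variant, or empty if unrecognised.
    public Optional<String> lookup(String name) {
        if (name == null) {
            return Optional.empty();
        }
        return Optional.ofNullable(variantToIso.get(normalize(name)));
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample rows, including one misspelling ("Canda").
        String sample = "# variant\tiso\nCanada\tCA\nKanada\tCA\nCanda\tCA\nDanmark\tDK\n";
        CountryDictionary dict = load(new BufferedReader(new StringReader(sample)));
        System.out.println(dict.lookup("  CANADA "));
        System.out.println(dict.lookup("Atlantis"));
    }
}
```

Because the lookup table is built from a plain reader, the same file could ship inside a library and be regenerated from an external source without touching the calling API, which is the property discussed above.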

Cheers,
Tim

On May 22, 2013, at 7:12 PM, Shorthouse, David wrote:

> Dave & Tim,
> 
> Can you describe the linkage between the Drupal-based GBIF vocabulary
> server and the dictionaries in your parsers? Is the former used to
> seed the latter? How often does the latter get refreshed from data
> produced in the former? Does all that work take place in Refine? If
> you have published a white paper on this workflow already, could you
> point me to it so I can better understand the depth of the maintenance
> costs?
> 
> Cheers,
> 
> David
> 
> On Sat, May 18, 2013 at 8:50 PM, David Remsen <dremsen at gbif.org> wrote:
>> David,
>> 
>> You might like to use the GBIF vocabulary server.  It has a multi-lingual
>> country name thesaurus based on ISO 3166 and has over 23K terms for 226 ISO
>> countries.  You can download the data or use the service.  It may have some
>> lexical variants and misspellings.  You can also get an account and add any
>> you might know of.   And all presented to you in your old friend Drupal.
>> Perhaps you might like to serve as curator.  Maybe?  Diamond in the rough
>> here, I'm sure of it.
>> 
>> http://vocabularies.gbif.org/vocabularies/country
>> 
>> Best,
>> Dave
>> 
>> ----------------------------------------------------------------------------
>> David Remsen
>> Global Biodiversity Information Facility Secretariat
>> Universitetsparken 15, DK-2100 Copenhagen, Denmark
>> Tel: +1 508 289 7477   Fax: +1 508 289 7900
>> Mobile +1 508 274 4055
>> Skype: dremsen
>> ----------------------------------------------------------------------------
>> 
>> 
>> 
>> 
>> 
>> On May 17, 2013, at 10:39 AM, Matt Jones wrote:
>> 
>> A good official list of countries is available from the Library of Congress:
>>  http://www.loc.gov/standards/codelists/countries.xml
>>  For background, see: http://www.loc.gov/marc/countries/
>> 
>> And of course there's ISO 3166, the list of country codes:
>> 
>> http://www.iso.org/iso/home/standards/country_codes/country_names_and_code_elements_xml.htm
>>  http://www.iso.org/iso/country_codes
>> 
>> Not sure about the alternate representations and misspellings, though.
>> 
>> Matt
>> 
>> 
>> On Fri, May 17, 2013 at 5:57 AM, Shorthouse, David
>> <davidpshorthouse at gmail.com> wrote:
>>> 
>>> Folks,
>>> 
>>> The Canadensys development team, http://www.canadensys.net, is looking
>>> for efficient, low-maintenance ways to validate and reconcile data in
>>> its national cache of occurrence data. We are working on a Java
>>> library to initially tackle single-field Darwin Core validations,
>>> https://github.com/Canadensys/narwhal-processor. We hope this library
>>> is sufficiently generalized for uses outside our project.
>>> 
>>> Our current challenge is to reconcile country names, which requires
>>> access to an up-to-date, well-maintained knowledge base of country
>>> names, their alternative representations (possibly multilingual), and
>>> mappings to known misspellings. For performance reasons, we'd like
>>> this thesaurus to be embedded in the library, but with the capacity to
>>> be periodically refreshed with data pulled from external resources
>>> such as dbpedia.org. This clearly has ties to semantic web thinking
>>> and, because we're new to the tools and services in this space, we'd
>>> like to solicit pointers and feedback such that we build this part of
>>> our library with maximal benefit to other projects. We started
>>> collecting thoughts here:
>>> https://github.com/Canadensys/narwhal-processor/issues/14.
>>> 
>>> Cheers,
>>> 
>>> David P. Shorthouse
>>> Christian Gendreau
>>> _______________________________________________
>>> tdwg mailing list
>>> tdwg at lists.tdwg.org
>>> http://lists.tdwg.org/mailman/listinfo/tdwg
>> 
>> 
> 


