[tdwg] Country name reconciliation

Shorthouse, David david.shorthouse at umontreal.ca
Fri May 17 21:09:01 CEST 2013


Tim,

Indeed, the GBIF country parser is an extremely valuable contribution
and as you mention, is a dependency in the processing library we made
available on GitHub. We use it to help reconcile country names and it
works great. Nonetheless, it also appears that folks have been
collecting misspelled country names with the ultimate goal of
standardizing data before they get incorporated into products or
repositories, http://bit.ly/Z1wQmC.

John Deck's recent reply to this thread is a much more coherent
phrasing of what I had intended to express. We have an interested
party wishing to access our aggregated occurrence data as RDF. Doing
so effectively requires layering parsed & reconciled strings with
their URI representations. And, having done so, I hope that this
approach would open the door to interesting multi-field validations
(eg Nouveau-Brunswick is in fact a province in Canada), which is of
interest to FilteredPush workflows, the validation work that ALA has
done, and many others. So, we need a well-documented and advertised
process by which all this work can be harmonized AND be baked with the
ingredients necessary for eventual semantic reasoning, independent of
what step throughout the data publication process such a
reconciliation/validation library might be used.
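
To make the layering concrete, here is a minimal Python sketch of what I
have in mind. It is not code from narwhal-processor or the GBIF parsers:
the lookup tables, function names, and the choice of dbpedia.org URIs
are all assumptions made up for illustration, and a real thesaurus would
be far larger and maintained elsewhere.

```python
import unicodedata

# Hypothetical, hand-made lookup tables (illustration only, not GBIF data).
COUNTRY_VARIANTS = {
    "canada": "CA",
    "canda": "CA",    # a misspelling of the kind people have been collecting
    "kanada": "CA",   # alternative-language form
    "加拿大": "CA",
    "france": "FR",
}
# Assumed URI layer: one URI per ISO 3166-1 alpha-2 code.
COUNTRY_URIS = {
    "CA": "http://dbpedia.org/resource/Canada",
    "FR": "http://dbpedia.org/resource/France",
}
# Province variant -> (ISO 3166-2 code, parent country code).
PROVINCE_VARIANTS = {
    "new brunswick": ("CA-NB", "CA"),
    "nouveau-brunswick": ("CA-NB", "CA"),
}


def normalize(raw):
    """Unicode-normalize, case-fold, and collapse whitespace before lookup."""
    text = unicodedata.normalize("NFKC", raw).strip().casefold()
    return " ".join(text.split())


def reconcile_country(raw):
    """Return the verbatim string layered with its code and URI, or None."""
    code = COUNTRY_VARIANTS.get(normalize(raw))
    if code is None:
        return None
    return {"verbatim": raw, "code": code, "uri": COUNTRY_URIS.get(code)}


def province_in_country(raw_province, raw_country):
    """Multi-field check: True or False when both values reconcile, else None."""
    province = PROVINCE_VARIANTS.get(normalize(raw_province))
    country = reconcile_country(raw_country)
    if province is None or country is None:
        return None  # at least one value is unknown, so we cannot decide
    return province[1] == country["code"]
```

Keeping the verbatim string next to the code and URI means nothing is
lost if a mapping later turns out to be wrong, and returning None rather
than False for unknown values keeps "not reconciled" distinct from
"reconciled and inconsistent".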

I see several parts to this:

1. What is the mechanism by which we should pool the dirty strings and
their (possibly ambiguous) mappings to authoritative lists (eg should we
all use Google Refine to dump these)?
2. What is the source and structure of those authoritative lists, who
maintains them, and how do we make sure we don't break older mappings
created at #1?
3. How do we provide a common interface & suite of APIs to the above,
agnostic to domain or data publication pipeline?
4. Who should take the lead?
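
As one possible angle on #2, a refresh of the authoritative list could
be checked against the pooled mappings before it is accepted. The sketch
below is Python with a hypothetical function name and toy data; it
assumes mappings are stored as variant -> ISO 3166-1 alpha-2 code, which
is only one of several structures we could agree on.

```python
def find_broken_mappings(pooled, authoritative_codes):
    """Return the pooled mappings whose target code is no longer authoritative."""
    return {
        variant: code
        for variant, code in pooled.items()
        if code not in authoritative_codes
    }


# Toy data: "AN" (Netherlands Antilles) was withdrawn from ISO 3166-1,
# so any mapping that still points at it needs a human decision.
pooled = {"Canda": "CA", "Netherlands Antilles": "AN"}
refreshed = {"CA", "CW", "SX", "BQ"}
broken = find_broken_mappings(pooled, refreshed)
```

Here broken holds only the "Netherlands Antilles" entry, which would be
flagged for review instead of being silently dropped or remapped.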

And, as a last practical recommendation, this is a more modern home
for GBIF code, https://github.com/gbif.

Dave

On Fri, May 17, 2013 at 10:48 AM, Tim Robertson [GBIF]
<trobertson at gbif.org> wrote:
> Hi David,
>
> You've built your other libraries using GBIF parsers.  Have you looked at
> how the GBIF country names interpretation works?  It would be helpful to
> know why it is not suitable for your use.
>
> The GBIF library concatenates known lists (such as ISO) along with about
> 2500 variations we've collected through periodic review of what we observe
> while indexing, and then using Google Refine we've mapped them to the ISO
> codes and we follow the ISO code changes as best we can.  Your
> narwhal-processor already has a software dependency on the GBIF code.
>
> Please remember that patches and additions are always welcome to the GBIF
> code, if you felt it could be improved.  I'm biased of course, but I'd
> rather see something that is broken fixed than watching a recreation of
> something that already exists.
>
> Cheers,
> Tim
>
>
> On May 17, 2013, at 4:39 PM, Matt Jones wrote:
>
> A good official list of countries is available from the Library of Congress:
>   http://www.loc.gov/standards/codelists/countries.xml
>   For background, see: http://www.loc.gov/marc/countries/
>
> And of course there's ISO 3166, the list of country codes:
>
> http://www.iso.org/iso/home/standards/country_codes/country_names_and_code_elements_xml.htm
>   http://www.iso.org/iso/country_codes
>
> Not sure about the alternate representations and misspellings, though.
>
> Matt
>
>
> On Fri, May 17, 2013 at 5:57 AM, Shorthouse, David
> <davidpshorthouse at gmail.com> wrote:
>>
>> Folks,
>>
>> The Canadensys development team, http://www.canadensys.net is looking
>> for efficient, low-maintenance ways to validate and reconcile data in
>> its National cache of occurrence data. We are working on a Java
>> library to initially tackle single-field Darwin Core validations,
>> https://github.com/Canadensys/narwhal-processor. We hope this library
>> is sufficiently generalized for uses outside our project.
>>
>> Our current challenge is to reconcile country names, which requires
>> access to an up-to-date, well-maintained knowledge base of country
>> names, their alternative representations (possibly multilingual), and
>> mappings to known misspellings. For performance reasons, we'd like
>> this thesaurus to be embedded in the library, but with the capacity to
>> be periodically refreshed with data pulled from external resources
>> such as dbpedia.org. This clearly has ties to semantic web thinking
>> and, because we're new to the tools and services in this space, we'd
>> like to solicit pointers and feedback such that we build this part of
>> our library with maximal benefit to other projects. We started
>> collecting thoughts here:
>> https://github.com/Canadensys/narwhal-processor/issues/14.
>>
>> Cheers,
>>
>> David P. Shorthouse
>> Christian Gendreau
>> _______________________________________________
>> tdwg mailing list
>> tdwg at lists.tdwg.org
>> http://lists.tdwg.org/mailman/listinfo/tdwg
>
>

