[tdwg] Ideas on having Harvesters like GBIF clean, flag inconsistencies, and add additional value to the data

Donald.Hobern at csiro.au Donald.Hobern at csiro.au
Mon May 11 04:31:07 CEST 2009


Thanks, Peter.

In fact GBIF is already doing much of what you describe, and the ALA (and I am sure several other projects) are planning to go further in the same direction.

The GBIF data portal already flags records with apparent coordinate issues and with apparent mismatches between coordinates and stated country, as well as flagging records with scientific names with impossible formats, records without a clear indication of the basis of record, impossible dates, etc.  I believe that this information is accessible as part of the responses from the web service.
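
As a rough illustration of this kind of flagging (not the GBIF portal's actual code), here is a minimal Python sketch. The Darwin Core-style field names, the flag names, and the single bounding box are all assumptions made for the example; the box values are approximate and not authoritative.

```python
from datetime import date

# Hypothetical bounding boxes as (min_lat, max_lat, min_lon, max_lon).
# Values are rough illustrations only, not authoritative boundaries.
COUNTRY_BOXES = {
    "AU": (-44.0, -10.0, 112.0, 154.0),
}


def flag_record(record, today=None):
    """Return a list of flag names describing apparent problems in one record."""
    today = today or date.today()
    flags = []

    lat = record.get("decimalLatitude")
    lon = record.get("decimalLongitude")
    if lat is None or lon is None:
        flags.append("COORDINATES_MISSING")
    elif not (-90 <= lat <= 90 and -180 <= lon <= 180):
        flags.append("COORDINATES_OUT_OF_RANGE")
    else:
        # Only compare against the stated country when a box is known for it.
        box = COUNTRY_BOXES.get(record.get("countryCode"))
        if box and not (box[0] <= lat <= box[1] and box[2] <= lon <= box[3]):
            flags.append("COUNTRY_COORDINATE_MISMATCH")

    event = record.get("eventDate")
    if event is not None:
        try:
            parsed = date.fromisoformat(event)
        except ValueError:
            # Covers malformed strings and impossible dates such as 30 February.
            flags.append("DATE_INVALID")
        else:
            if parsed > today:
                flags.append("DATE_IN_FUTURE")

    return flags
```

A record is kept and annotated rather than rejected, so consumers can decide for themselves which flags matter for their analysis.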

The ALA has been building on the ALA portal software (in particular to index records against smaller administrative units within Australia).  We are also planning to identify records which are potentially sensitive (species that are endangered, or considered a biosecurity threat, nationally or in the region where they were recorded) and to report these to data providers.  My current thinking is that we should offer this as a service which can be executed both during harvesting and as a stand-alone service to which users can submit a batch of Darwin Core-style records (probably tab-delimited) and get back a report for whichever set of tests or value-add operations they choose.  This could help providers with data cleaning even before they share their data (and could also help them to make sure there are no known sensitivity issues around their data).  Such a service could be extended more or less indefinitely to report more and more aspects of interest.  One of the major options could be to cross-reference records to accepted taxonomic authorities (via LSIDs or other identifiers).
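
To make the stand-alone batch idea a little more concrete, here is a minimal Python sketch under some loud assumptions: the test names, the report shape, and the one-entry sensitive-species list (with a made-up species name) are hypothetical, and a real service would draw on national or regional authorities rather than a hard-coded set.

```python
import csv
import io

# Hypothetical sensitive-species list; the name below is invented for the example.
SENSITIVE_SPECIES = {"Examplea sensitiva"}


def missing_basis_of_record(row):
    """True when the record gives no basis of record."""
    return not (row.get("basisOfRecord") or "").strip()


def potentially_sensitive(row):
    """True when the scientific name appears on the sensitive-species list."""
    return (row.get("scientificName") or "").strip() in SENSITIVE_SPECIES


# Registry of tests a user may choose from; more can be added over time.
AVAILABLE_TESTS = {
    "missing_basis_of_record": missing_basis_of_record,
    "potentially_sensitive": potentially_sensitive,
}


def run_report(tsv_text, test_names):
    """Run the chosen tests over tab-delimited records with a header row.

    Returns a list of (row_number, [names of tests that flagged the row]),
    omitting rows that no test flagged. Row numbers start at 1 and exclude
    the header line.
    """
    tests = [(name, AVAILABLE_TESTS[name]) for name in test_names]
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    report = []
    for row_number, row in enumerate(reader, start=1):
        flagged = [name for name, test in tests if test(row)]
        if flagged:
            report.append((row_number, flagged))
    return report
```

Because the tests live in a simple registry, the same functions could be called by a harvester during ingestion or by a provider checking a file before sharing it.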

Best wishes,

Donald


Donald Hobern, Director, Atlas of Living Australia
CSIRO Entomology, GPO Box 1700, Canberra, ACT 2601
Phone: (02) 62464352 Mobile: 0437990208 
Email: Donald.Hobern at csiro.au
Web: http://www.ala.org.au/ 
 

-----Original Message-----
Date: Fri, 8 May 2009 19:23:32 -0500
From: Peter DeVries <pete.devries at gmail.com>
Subject: [tdwg] Ideas on having Harvesters like GBIF clean, flag inconsistencies, and add additional value to the data
To: tdwg at lists.tdwg.org
Message-ID:
	<3833bf630905081723l2f1d5369je8af6b0e4a26324d at mail.gmail.com>
Content-Type: text/plain; charset="iso-8859-1"

Arthur Chapman sent me some good comments regarding Datums etc.
The discussion made me realize that there may be a need for two types of
formats. One for the providers and a second one that is output by the
harvesting service.

This is because the needs and abilities of the data providers are different
from the needs and abilities of those who would like to consume the data.

Consumers, who analyze and map the data, would like something that is easy
to process, standardized and as error free as possible.

It could work in the following way.

Data harvesters, like GBIF, collect the records and run them through
cleaning algorithms that check attributes, including whether the lat and
long actually match the location described.

These harvesters would then expose this cleaned data via XML and RDF with
tags that flag possible inconsistencies. The harvesters would also add a
field for the lat and long in WGS84 if the original record contains a valid
Datum. Those records without a Datum would still be exposed but the added
geo:latitude and geo:longitude fields would be empty.
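
A minimal Python sketch of that last step might look like the following. It
assumes Darwin Core-style input fields (decimalLatitude, decimalLongitude,
geodeticDatum), and the `transformers` mapping is a hypothetical hook for
supplying real datum conversions; none of this is an existing harvester API.

```python
# Spellings treated as already being WGS84 (an assumption for this sketch).
WGS84_NAMES = {"WGS84", "WGS 84", "EPSG:4326"}


def add_wgs84_fields(record, transformers=None):
    """Return a copy of the record with geo:latitude / geo:longitude added.

    `transformers` is a hypothetical mapping from an upper-cased datum name to
    a function (lat, lon) -> (lat, lon) in WGS84. Records with a missing or
    unrecognised datum are still returned, with the added fields left empty.
    """
    transformers = transformers or {}
    out = dict(record)
    out["geo:latitude"] = ""
    out["geo:longitude"] = ""

    lat = record.get("decimalLatitude")
    lon = record.get("decimalLongitude")
    datum = (record.get("geodeticDatum") or "").strip().upper()
    if lat is None or lon is None or not datum:
        return out

    if datum in WGS84_NAMES:
        # Already in WGS84: copy the coordinates through unchanged.
        out["geo:latitude"], out["geo:longitude"] = lat, lon
    elif datum in transformers:
        # A real implementation would delegate to a proper geodesy library.
        out["geo:latitude"], out["geo:longitude"] = transformers[datum](lat, lon)
    return out
```

The original coordinates and datum are left untouched, so nothing the
provider supplied is lost.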

I can imagine that data uploaded to GBIF and other harvester services
will be replete with typos and inconsistencies that will frustrate people
trying to analyze or simply map the data. The harvester services could add
value by minimizing these frustrations.

Originally, it seemed that a global service should standardize on a global
Datum like WGS84. After all, we have standardized on meters. However, after
discussing this with Arthur, I realize that this is not possible for a
number of reasons. That said, I think the data would be much more valuable
and less likely to be misinterpreted if a version of it were available in
WGS84. This solution would eventually encourage data providers to understand
what a Datum is and include it in their data. It would also help solve a
number of other data integration problems.

Respectfully,

Pete

