Hi Peter,
We had a release date to meet with the IPT, so we used the "new" DarwinCore as it was submitted to the peer review for the release candidate 1.0 version you just downloaded. When DwC is ratified, we will modify the IPT to support it.
In fact, John W., who is leading the DwC standard, will also be contributing to the IPT codebase along with his developers during 2009.
Cheers,
Tim
On 11 May 2009, at 22:15, Peter DeVries wrote:
Very Cool and Thanks,
I downloaded (http://code.google.com/p/gbif-providertoolkit/) and got it working on one of my test machines.
Is there a plan to move or not move this to the new DarwinCore?
Thanks!
Pete
On Mon, May 11, 2009 at 2:48 AM, Tim Robertson trobertson@gbif.org wrote: Hi Peter,
Just to expand on what Donald has written here:
My current thinking is that we should offer this as a service which can be executed during harvesting and also run stand-alone: users submit a batch of Darwin Core-style records (probably tab-delimited) and get back a report for whichever set of tests or value-add operations they choose. This could help providers clean their data even before they share it (and also help them make sure there are no known sensitivity issues around their data). Such a service could be extended more or less indefinitely to report on more and more aspects of interest. One of the major options could be to cross-reference records against accepted taxonomic authorities (via LSIDs or other identifiers).
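As a rough sketch of how such a batch service might work, the snippet below runs a chosen set of checks over tab-delimited records and returns a report. The field names and the coordinate check are purely illustrative, not an actual IPT API:

```python
import csv
import io

# Hypothetical check a batch quality service might offer; term names are
# illustrative Darwin Core-style fields, not a fixed contract.
def check_coordinates(rec):
    try:
        lat = float(rec["decimalLatitude"])
        lon = float(rec["decimalLongitude"])
    except (KeyError, TypeError, ValueError):
        return "coordinates missing or non-numeric"
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return "coordinates out of range"
    return None

def run_report(tab_text, checks):
    """Run the chosen checks over tab-delimited records; return a report."""
    reader = csv.DictReader(io.StringIO(tab_text), delimiter="\t")
    report = []
    for row_num, rec in enumerate(reader, start=1):
        for check in checks:
            issue = check(rec)
            if issue:
                report.append((row_num, check.__name__, issue))
    return report

sample = "decimalLatitude\tdecimalLongitude\n52.5\t13.4\n95.0\t13.4\n"
print(run_report(sample, [check_coordinates]))
```

Providers could run the same checks locally before sharing, which is the data-cleaning benefit described above.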
GBIF recently launched an early release of a biodiversity data publishing tool (http://code.google.com/p/gbif-providertoolkit/) which allows for serving occurrence- and species-oriented data in a "star schema" format with Darwin Core as the core of the star. The tool has an embedded database, which allows it to serve text files (CSV, tab-delimited, etc.) and also to sit in front of an existing database, offering DwC through a complete archive, TAPIR, and WFS/WMS services.

As you publish data through this tool, it currently does very basic type checking of the input data and creates "annotations" on the records that have issues (e.g. http://ipt.gbif.org/annotations.html?resource_id=11). As the tool matures in the coming months, we plan to open up an API so that data providers can call external services and have them push back annotations - e.g. check my coordinates, check my names against IPNI, etc.

By publishing the complete dataset as an "archive" (a zipped dump with an XML file describing the columns, http://rs.tdwg.org/dwc/terms/guides/text/index.htm as Donald mentions), the technical threshold for the data transfer needed to implement such a quality service is reduced to a minimum, while also ensuring decent harvesting performance. It is in the current GBIF workplan to register such quality services in the GBIF registry (now under development), so that they may be discovered and used by all, including the GBIF publishing toolkit and portals. This way, the roles of checking data and implementing quality services are not centralised in a GBIF portal but can be used by the data owner before sharing with GBIF or other networks.
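To illustrate why the zipped "archive" keeps the technical threshold low: a consumer only needs to unzip, read the column descriptor, and map columns to terms. The descriptor below is a deliberately simplified stand-in (the real text guidelines at the URL above define a richer meta file), but the mechanics are the same:

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# A much-simplified sketch of the archive idea: a zip containing a data
# file plus an XML descriptor mapping column indices to term names.
# (Illustrative only; the real meta.xml format is richer.)
meta = """<archive>
  <core file="occurrence.txt">
    <field index="0" term="scientificName"/>
    <field index="1" term="decimalLatitude"/>
    <field index="2" term="decimalLongitude"/>
  </core>
</archive>"""
data = "Puma concolor\t43.07\t-89.40\n"

# Build the archive in memory so the example is self-contained.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("meta.xml", meta)
    z.writestr("occurrence.txt", data)

# A harvester unzips, reads the descriptor, and maps columns to terms.
with zipfile.ZipFile(buf) as z:
    core = ET.fromstring(z.read("meta.xml")).find("core")
    fields = {int(f.get("index")): f.get("term") for f in core.findall("field")}
    for line in z.read(core.get("file")).decode().splitlines():
        record = {fields[i]: v for i, v in enumerate(line.split("\t"))}
        print(record)
```

No wire protocol or live service is involved, which is what makes harvesting a full dump both simple and fast.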
Additionally, by allowing for remote annotations, we can aim to ultimately push all feedback from the GBIF portal (or others) back into the publishing tools, as opposed to through email, which is the current feedback mechanism. This relates to other topics, such as uniquely identifying resources as they are shared through various networks. It would then be trivial to have (for example) a Google map with a clickable point which opens a detail view holding a link ("this record has bad coordinates") or a form to fill in. Feedback could take the form of free text or, perhaps even better, of "structured annotations" where possible (e.g. "this record would be correct if the isoCountryCode was 'DE'"), which could then be automatically removed should the source be updated to meet the annotation criteria.
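A minimal sketch of the structured-annotation idea: the annotation carries the condition under which it is resolved, so it can be dropped automatically once the source record satisfies it. The field names and annotation shape are assumptions for illustration, not a proposed schema:

```python
# The annotation states the condition that would make the record correct,
# so resolution can be checked mechanically against the current source.
def is_resolved(annotation, record):
    return record.get(annotation["field"]) == annotation["expected"]

annotation = {
    "field": "isoCountryCode",
    "expected": "DE",
    "note": "this record would be correct if the isoCountryCode was DE",
}

record = {"isoCountryCode": "GE"}
print(is_resolved(annotation, record))   # annotation still open

record["isoCountryCode"] = "DE"          # provider updates the source
print(is_resolved(annotation, record))   # annotation can now be dropped
```

Free-text feedback would still need a human to close the loop; the structured form is what enables the automatic removal described above.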
Best wishes,
Tim
Best wishes,
Donald
Donald Hobern, Director, Atlas of Living Australia CSIRO Entomology, GPO Box 1700, Canberra, ACT 2601 Phone: (02) 62464352 Mobile: 0437990208 Email: Donald.Hobern@csiro.au Web: http://www.ala.org.au/
-----Original Message----- Date: Fri, 8 May 2009 19:23:32 -0500 From: Peter DeVries pete.devries@gmail.com Subject: [tdwg] Ideas on having Harvesters like GBIF clean, flag inconsistencies, and add additional value to the data To: tdwg@lists.tdwg.org Message-ID: 3833bf630905081723l2f1d5369je8af6b0e4a26324d@mail.gmail.com Content-Type: text/plain; charset="iso-8859-1"
Arthur Chapman sent me some good comments regarding datums etc. The discussion made me realize that there may be a need for two types of formats: one for the providers and a second one that is output by the harvesting service.
This is because the needs and abilities of the data providers are different from the needs and abilities of those who would like to consume the data.
Consumers, who analyze and map the data, would like something that is easy to process, standardized, and as error-free as possible.
It could work in the following way.
Data harvesters, like GBIF, collect the records and run them through cleaning algorithms that check attributes, including whether the lat and long actually match the location described.
These harvesters would then expose this cleaned data via XML and RDF with tags that flag possible inconsistencies. The harvesters would also add a field for the lat and long in WGS84 if the original record contains a valid Datum. Those records without a Datum would still be exposed but the added geo:latitude and geo:longitude fields would be empty.
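The datum rule above could be sketched as follows. A real service would reproject coordinates from other valid datums into WGS84; this simplified version only passes WGS84 through and leaves the geo fields empty when no datum is declared (field names are illustrative):

```python
# Sketch of the proposed harvester behaviour: populate the added
# geo:latitude/geo:longitude fields only when the source record declares
# a datum we can handle; otherwise expose the record with empty geo fields.
def add_wgs84_fields(record):
    out = dict(record)
    if record.get("geodeticDatum") == "WGS84":
        out["geo:latitude"] = record["decimalLatitude"]
        out["geo:longitude"] = record["decimalLongitude"]
    else:
        # No (or unsupported) datum: record is still exposed, geo fields empty.
        out["geo:latitude"] = ""
        out["geo:longitude"] = ""
    return out

with_datum = {"decimalLatitude": "43.07", "decimalLongitude": "-89.40",
              "geodeticDatum": "WGS84"}
without_datum = {"decimalLatitude": "43.07", "decimalLongitude": "-89.40"}
print(add_wgs84_fields(with_datum)["geo:latitude"])
print(add_wgs84_fields(without_datum)["geo:latitude"])
```

Keeping the original fields untouched and adding the WGS84 pair separately means consumers can trust the geo fields without losing the provider's source values.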
I can imagine that the data uploaded to GBIF and other harvester services will be replete with typos and inconsistencies that will frustrate people trying to analyze or simply map the data; the harvester services could add value by minimizing these frustrations.
Originally, it seemed that a global service should standardize on a global datum like WGS84. After all, we have standardized on meters. However, after discussing this with Arthur, I realize that this is not possible for a number of reasons. That said, I think the data would be much more valuable and less likely to be misinterpreted if a version of it was available in WGS84. This solution would eventually encourage data providers to understand what a datum is and include it in their data. It would also help solve a number of other data integration problems.
Respectfully,
Pete _______________________________________________ tdwg mailing list tdwg@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg
--
Pete DeVries Department of Entomology University of Wisconsin - Madison 445 Russell Laboratories 1630 Linden Drive Madison, WI 53706