[tdwg] Ideas on having Harvesters like GBIF clean, flag inconsistencies, and add additional value to the data

Peter DeVries pete.devries at gmail.com
Mon May 11 22:15:34 CEST 2009


Very cool, and thanks.
I downloaded the toolkit (http://code.google.com/p/gbif-providertoolkit/)
and got it working on one of my test machines.

Is there a plan to move this to the new Darwin Core?

Thanks!

Pete


On Mon, May 11, 2009 at 2:48 AM, Tim Robertson <trobertson at gbif.org> wrote:

> Hi Peter,
>
> Just to expand on what Donald has written here:
>
> > My current thinking is that we should offer this as a service which
> > can both be executed during harvesting and also as a stand-alone
> > service for which users can submit a batch of Darwin Core-style
> > records (probably tab-delimited) and get back a report for whichever
> > set of tests or value-add operations they choose.  This could help
> > providers with data cleaning even before they share their data (and
> > also could help them to make sure there are no known sensitivity
> > issues around their data).  Such a service could be extended more or
> > less indefinitely to report more and more aspects of interest.  One
> > of the major options could be to cross-reference records to accepted
> > taxonomic authorities (via LSIDs or other identifiers).
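
The stand-alone batch service described above could be sketched roughly as follows. This is only an illustration, not anything GBIF has published: the function names (`make_range_check`, `run_report`), the `TESTS` registry and the report layout are hypothetical, and the column names assume the Darwin Core terms scientificName, decimalLatitude and decimalLongitude are used as tab-delimited headers.

```python
import csv
import io


def make_range_check(term, low, high):
    """Build a test that flags a missing, non-numeric or out-of-range value."""
    def check(record):
        value = record.get(term)
        try:
            number = float(value)
        except (TypeError, ValueError):
            return f"{term} is not a number: {value!r}"
        if not low <= number <= high:
            return f"{term} out of range: {number}"
        return None
    return check


def check_name(record):
    """Flag records with no scientific name."""
    if not (record.get("scientificName") or "").strip():
        return "scientificName is empty"
    return None


# Hypothetical registry of tests a user can choose from.
TESTS = {
    "latitude": make_range_check("decimalLatitude", -90.0, 90.0),
    "longitude": make_range_check("decimalLongitude", -180.0, 180.0),
    "name": check_name,
}


def run_report(tsv_text, selected):
    """Run the selected tests over tab-delimited text.

    Returns a list of (row number, test name, issue) tuples; rows are
    numbered from 1, not counting the header line.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    report = []
    for row_number, record in enumerate(reader, start=1):
        for name in selected:
            issue = TESTS[name](record)
            if issue:
                report.append((row_number, name, issue))
    return report
```

A provider could call `run_report(text, ["latitude", "name"])` on a file before sharing it; new tests are added by registering another function in `TESTS`.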
>
> GBIF recently launched an early release of a biodiversity data
> publishing tool (http://code.google.com/p/gbif-providertoolkit/) which
> allows for serving of occurrence and species oriented data, in a "star
> schema" format with Darwin Core as the core of the star.  This tool
> has an embedded database, which allows for serving of text files (csv,
> tab delimited etc) and also the ability to sit in front of an existing
> database to offer DwC through a complete archive, TAPIR, WFS and WMS
> services.  As you publish data through this tool, it currently does
> very basic type checking of input data, and creates "annotations" on
> the records that have issues (e.g.
> http://ipt.gbif.org/annotations.html?resource_id=11).  As the tool
> matures in the coming months, we plan to open up an API so that data
> providers can call external services and have them push
> back annotations - e.g. check my coordinates, check my names with IPNI
> etc.  By publishing the complete dataset as an "archive" (a zipped
> dump with an XML file describing the columns,
> http://rs.tdwg.org/dwc/terms/guides/text/index.htm
> as Donald mentions) the technical threshold for the data transfer
> needed to implement such a quality service is reduced to a minimum,
> while also ensuring decent harvesting performance.  It is in the current GBIF
> workplan to register such quality services in the GBIF registry which
> is undergoing development now, so that they may be discovered and used
> by all, including the GBIF publishing toolkit, and portals.  By doing
> this, the roles of checking data, or implementing quality services are
> not centralised in a GBIF portal, but can be used by the data owner
> before sharing with GBIF or other networks.
>
> Additionally, by allowing for remote annotations, we can aim to
> ultimately push back all feedback from the GBIF portal (or others)
> into the publishing tools as opposed to through email as is the
> current feedback mechanism - this is related to other topics such as
> uniquely identifying resources as they are shared through various
> networks for example.  It would then be trivial to have (for example)
> a google map with a clickable point which opens the details holding a
> link "this record has bad coordinates", or a form to fill in.
> Feedback could take the form of free text or, perhaps even better,
> "structured annotations" where possible (e.g. this record would be
> correct if the isoCountryCode were "DE"), which could then be
> automatically removed should the source be updated to meet the
> annotation criteria.
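
One way to picture a structured annotation that clears itself once the source is corrected is sketched below. The `StructuredAnnotation` class, its attribute names and the `prune_resolved` helper are hypothetical and are not part of the publishing toolkit; only the isoCountryCode example comes from the message above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StructuredAnnotation:
    """'This record would be correct if <term> was <expected>.'"""
    record_id: str
    term: str
    expected: str

    def is_satisfied_by(self, record):
        # The annotation is met once the source carries the expected value.
        return record.get(self.term) == self.expected


def prune_resolved(annotations, records_by_id):
    """Drop annotations whose criteria the updated source now meets.

    Annotations for records that are no longer present are kept, since
    they cannot be verified.
    """
    remaining = []
    for annotation in annotations:
        record = records_by_id.get(annotation.record_id)
        if record is None or not annotation.is_satisfied_by(record):
            remaining.append(annotation)
    return remaining
```

Running `prune_resolved` after each re-publish of a resource would leave only the annotations that still need attention; free-text feedback has no such machine-checkable condition and would have to be cleared by hand.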
>
> Best wishes,
>
> Tim
>
>
>
>
> >
> >
> > Best wishes,
> >
> > Donald
> >
> >
> > Donald Hobern, Director, Atlas of Living Australia
> > CSIRO Entomology, GPO Box 1700, Canberra, ACT 2601
> > Phone: (02) 62464352 Mobile: 0437990208
> > Email: Donald.Hobern at csiro.au
> > Web: http://www.ala.org.au/
> >
> >
> > -----Original Message-----
> > Date: Fri, 8 May 2009 19:23:32 -0500
> > From: Peter DeVries <pete.devries at gmail.com>
> > Subject: [tdwg] Ideas on having Harvesters like GBIF clean, flag
> >       inconsistencies, and add additional value to the data
> > To: tdwg at lists.tdwg.org
> > Message-ID:
> >       <3833bf630905081723l2f1d5369je8af6b0e4a26324d at mail.gmail.com>
> > Content-Type: text/plain; charset="iso-8859-1"
> >
> > Arthur Chapman sent me some good comments regarding Datums etc.
> > The discussion made me realize that there may be a need for two
> > types of formats: one for the providers and a second one that is
> > output by the harvesting service.
> >
> > This is because the needs and abilities of the data providers are
> > different from the needs and abilities of those who would like to
> > consume the data.
> >
> > Consumers, who analyze and map the data, would like something that
> > is easy to process, standardized and as error free as possible.
> >
> > It could work in the following way.
> >
> > Data harvesters, like GBIF, collect the records and run them through
> > cleaning algorithms that check attributes, including whether the lat
> > and long actually match the location described.
> >
> > These harvesters would then expose this cleaned data via XML and RDF
> > with tags that flag possible inconsistencies. The harvesters would
> > also add a field for the lat and long in WGS84 if the original record
> > contains a valid Datum. Those records without a Datum would still be
> > exposed, but the added geo:latitude and geo:longitude fields would be
> > empty.
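
The rule for the added WGS84 fields could look something like the sketch below. It is an assumption-laden illustration: the input column names (decimalLatitude, decimalLongitude, geodeticDatum), the `flags` list and the `TO_WGS84` registry are hypothetical. Only the identity case (data already in WGS84) is supplied; converting from any other datum would need a real geodesy library, plugged in as another entry in the registry.

```python
def identity(lat, lon):
    """Coordinates already in WGS84 need no conversion."""
    return lat, lon


# Hypothetical registry: datum name -> function returning WGS84 (lat, lon).
# Transformations for other datums are deliberately not invented here.
TO_WGS84 = {"WGS84": identity}


def add_wgs84_fields(record, transforms=None):
    """Return a copy of the record with geo:latitude / geo:longitude added.

    The two fields stay empty, and a flag is recorded, when the datum is
    missing or unrecognised or the coordinates cannot be parsed.
    """
    transforms = TO_WGS84 if transforms is None else transforms
    out = dict(record)
    out["geo:latitude"] = ""
    out["geo:longitude"] = ""
    out["flags"] = []

    datum = (record.get("geodeticDatum") or "").strip().upper()
    transform = transforms.get(datum)
    if transform is None:
        out["flags"].append("missing or unrecognised datum")
        return out

    try:
        lat = float(record["decimalLatitude"])
        lon = float(record["decimalLongitude"])
    except (KeyError, TypeError, ValueError):
        out["flags"].append("unparseable coordinates")
        return out

    wgs_lat, wgs_lon = transform(lat, lon)
    out["geo:latitude"] = str(wgs_lat)
    out["geo:longitude"] = str(wgs_lon)
    return out
```

The original coordinate and datum columns are left untouched, so consumers can always see what the provider actually supplied alongside the harvester's additions.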
> >
> > I can imagine that data uploaded to GBIF and other harvester services
> > will be replete with typos and inconsistencies that will frustrate
> > people trying to analyze or simply map the data. The harvester
> > services could add value by minimizing these frustrations.
> >
> > Originally, it seemed that a global service should standardize on a
> > global Datum like WGS84. After all, we have standardized on meters?
> > However, after discussing this with Arthur, I realize that this is not
> > possible for a number of reasons. That said, I think the data would be
> > much more valuable and less likely to be misinterpreted if a version
> > of it was available in WGS84. This solution would eventually encourage
> > data providers to understand what a Datum is and include it in their
> > data. It would also help solve a number of other data integration
> > problems.
> >
> > Respectfully,
> >
> > Pete
> > _______________________________________________
> > tdwg mailing list
> > tdwg at lists.tdwg.org
> > http://lists.tdwg.org/mailman/listinfo/tdwg
> >
>



-- 
---------------------------------------------------------------
Pete DeVries
Department of Entomology
University of Wisconsin - Madison
445 Russell Laboratories
1630 Linden Drive
Madison, WI 53706
------------------------------------------------------------