[tdwg] Ideas on having Harvesters like GBIF clean, flag inconsistencies, and add additional value to the data

Tim Robertson trobertson at gbif.org
Mon May 11 09:48:02 CEST 2009


Hi Peter,

Just to expand on what Donald has written here:

> My current thinking is that we should offer this as a service which  
> can both be executed during harvesting and also as a stand-alone  
> service for which users can submit a batch of Darwin Core-style  
> records (probably tab-delimited) and get back a report for whichever  
> set of tests or value-add operations they choose.  This could help  
> providers with data cleaning even before they share their data (and  
> also could help them to make sure there are no known sensitivity  
> issues around their data).  Such a service could be extended more or  
> less indefinitely to report more and more aspects of interest.  One  
> of the major options could be to cross-reference records to accepted  
> taxonomic authorities (via LSIDs or other identifiers).

GBIF recently launched an early release of a biodiversity data  
publishing tool (http://code.google.com/p/gbif-providertoolkit/) which  
allows for serving of occurrence and species oriented data, in a "star  
schema" format with Darwin Core as the core of the star.  This tool
has an embedded database, which allows it to serve text files (CSV,
tab-delimited, etc.), and it can also sit in front of an existing
database to offer DwC through a complete archive, TAPIR, WFS and WMS
services.  As you publish data through this tool, it currently does
very basic type checking of input data, and creates "annotations" on
the records that have issues (e.g. http://ipt.gbif.org/annotations.html?resource_id=11).
As the tool matures in the coming months, we plan to open up an API
so that data providers can call external services and have them push
back annotations - e.g. check my coordinates, check my names with IPNI
etc.  By publishing the complete dataset as an "archive" (a zipped
dump with an XML file describing the columns,
http://rs.tdwg.org/dwc/terms/guides/text/index.htm, as Donald
mentions) the technical threshold for the data transfer needed to
implement such a quality service is reduced to a minimum, while also
ensuring decent harvesting performance.  It is in the current GBIF
workplan to register such quality services in the GBIF registry which  
is undergoing development now, so that they may be discovered and used  
by all, including the GBIF publishing toolkit and portals.  By doing
this, checking data and implementing quality services are not
centralised in a GBIF portal; the services can be used by the data
owner before sharing with GBIF or other networks.
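
To illustrate the kind of basic type checking described above, here is
a minimal Python sketch. It is not the toolkit's actual code: the
function names, the report layout and the choice of the Darwin Core
terms decimalLatitude and decimalLongitude as the checked columns are
all assumptions made for the example. It reads tab-delimited records
and returns "annotations" keyed by line number.

```python
import csv
import io

# Hypothetical rules: term name -> (minimum, maximum) allowed value.
NUMERIC_RANGES = (
    ("decimalLatitude", -90.0, 90.0),
    ("decimalLongitude", -180.0, 180.0),
)


def check_record(row):
    """Return a list of (term, message) annotations for one record."""
    annotations = []
    for term, low, high in NUMERIC_RANGES:
        value = (row.get(term) or "").strip()
        if not value:
            continue  # empty values are tolerated in this sketch
        try:
            number = float(value)
        except ValueError:
            annotations.append((term, "not a number: %r" % value))
            continue
        if not low <= number <= high:
            annotations.append((term, "out of range: %s" % value))
    return annotations


def annotate(tab_text):
    """Check every record in a tab-delimited text with a header row.

    Returns {line_number: [(term, message), ...]} for records with
    problems; the header is line 1, so the first record is line 2.
    """
    reader = csv.DictReader(io.StringIO(tab_text), delimiter="\t")
    report = {}
    for line_no, row in enumerate(reader, start=2):
        found = check_record(row)
        if found:
            report[line_no] = found
    return report
```

An external quality service of the kind mentioned above could return
the same sort of report for a submitted batch of records.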

Additionally, by allowing for remote annotations, we can aim to  
ultimately push back all feedback from the GBIF portal (or others)  
into the publishing tools as opposed to through email as is the  
current feedback mechanism - this is related to other topics such as  
uniquely identifying resources as they are shared through various  
networks for example.  It would then be trivial to have (for example)  
a Google map with a clickable point which opens the details holding a
link "this record has bad coordinates", or a form to fill in.
Feedback could take the form of free text or, perhaps even better,
"structured annotations" where possible (this record would be correct
if the isoCountryCode was "DE") which could then be automatically  
removed should the source be updated to meet the annotation criteria.
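
As a rough sketch of how such a structured annotation might be
represented and automatically cleared, consider the Python below. The
class, its fields and the helper function are hypothetical and only
illustrate the idea; they do not describe any existing GBIF API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StructuredAnnotation:
    """Hypothetical annotation: record would be correct if term == expected."""
    record_id: str
    term: str
    expected: str
    note: str = ""

    def is_satisfied_by(self, record):
        # The criterion is met once the source holds the expected value.
        return record.get(self.term) == self.expected


def outstanding(annotations, records):
    """Keep only annotations whose criterion the current source has not met.

    `records` maps record identifiers to dicts of term -> value.
    Annotations on records missing from the source are kept.
    """
    kept = []
    for annotation in annotations:
        record = records.get(annotation.record_id)
        if record is None or not annotation.is_satisfied_by(record):
            kept.append(annotation)
    return kept
```

For example, an annotation stating that isoCountryCode should be "DE"
would remain while the source says "DK", and would drop out of the
list once the provider republishes the record with "DE".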

Best wishes,

Tim




>
>
> Best wishes,
>
> Donald
>
>
> Donald Hobern, Director, Atlas of Living Australia
> CSIRO Entomology, GPO Box 1700, Canberra, ACT 2601
> Phone: (02) 62464352 Mobile: 0437990208
> Email: Donald.Hobern at csiro.au
> Web: http://www.ala.org.au/
>
>
> -----Original Message-----
> Date: Fri, 8 May 2009 19:23:32 -0500
> From: Peter DeVries <pete.devries at gmail.com>
> Subject: [tdwg] Ideas on having Harvesters like GBIF clean, flag
> inconsistencies, and add additional value to the data
> To: tdwg at lists.tdwg.org
> Message-ID:
> 	<3833bf630905081723l2f1d5369je8af6b0e4a26324d at mail.gmail.com>
> Content-Type: text/plain; charset="iso-8859-1"
>
> Arthur Chapman sent me some good comments regarding Datums etc.
> The discussion made me realize that there may be a need for two types of
> formats. One for the providers and a second one that is output by the
> harvesting service.
>
> This is because the needs and abilities of the data providers are different
> than the needs and abilities of those who would like to consume the data.
>
> Consumers, who analyze and map the data, would like something that is easy
> to process, standardized and as error free as possible.
>
> It could work in the following way.
>
> Data harvesters, like GBIF, collect the records and run them through
> cleaning algorithms that check attributes, including that the lat and long
> actually match the location described.
>
> These harvesters would then expose this cleaned data via XML and RDF with
> tags that flag possible inconsistencies. The harvesters would also add a
> field for the lat and long in WGS84 if the original record contains a valid
> Datum. Those records without a Datum would still be exposed but the added
> geo:latitude and geo:longitude fields would be empty.
>
> I can imagine that data uploaded to GBIF and other harvester services
> will be replete with typos and inconsistencies that will frustrate people
> trying to analyze or simply map the data. The harvester services could add
> value by minimizing these frustrations.
>
> Originally, it seemed that a global service should standardize on a global
> Datum like WGS84. After all, we have standardized on meters? However, after
> discussing this with Arthur, I realize that this is not possible for a
> number of reasons. That said, I think the data would be much more valuable
> and less likely to be misinterpreted if a version of it was available in
> WGS84. This solution would eventually encourage data providers to understand
> what a Datum is and include it in their data. It would also help solve a
> number of other data integration problems.
>
> Respectfully,
>
> Pete
> _______________________________________________
> tdwg mailing list
> tdwg at lists.tdwg.org
> http://lists.tdwg.org/mailman/listinfo/tdwg
>


