[tdwg] Ideas on having Harvesters like GBIF clean, flag inconsistencies, and add additional value to the data

Javier de la Torre jatorre at gmail.com
Tue May 12 10:55:37 CEST 2009


Hi,

I am late on this discussion, but would like to add another issue.

What about errors that are not in the data itself but in the way the
data is transferred to the network? I am talking about data
transformations and things like that. Ideally, as a data consumer,
I would prefer that UTM coordinates not be transformed into
latitude/longitude, but that providers offer me the possibility to use
them directly or to apply the transformation myself. I know ABCD and
Darwin Core allow them to be specified in an informal way, but it seems
that most GBIF providers (at least in Europe) are applying those
transformations without letting the world know. The result is that for
most of Europe, most points in GBIF look gridified. And that's because
they actually are, but there is no way to distinguish them from real
points.
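
To put a rough number on what is lost when a grid cell is published as a
point, here is a minimal sketch. It assumes square cells and that the
published point is the cell centre (some providers may use a corner
instead, which would double the figure). The function name is
hypothetical, not part of any GBIF tool.

```python
import math

def centroid_uncertainty_m(cell_size_m: float) -> float:
    """Largest distance from the centre of a square grid cell to any point in it.

    Assumption: the published point is the cell centre, so the worst case
    is half the cell diagonal.
    """
    if cell_size_m <= 0:
        raise ValueError("cell size must be positive")
    return cell_size_m * math.sqrt(2) / 2

# Typical UTM grid resolutions: 1 km and 10 km cells.
for size in (1_000, 10_000):
    worst = centroid_uncertainty_m(size)
    print(f"{size} m cell: true location may be up to {worst:.0f} m from the point")
```

A provider could publish that figure alongside the point (Darwin Core has
a coordinateUncertaintyInMeters term for this), so that consumers can at
least tell a gridified point from a measured one.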

I have discussed the errors this issue introduces, and the
complications it causes for data visualization, in a blog post:

http://biodivertido.blogspot.com/2009/02/grid-data-shared-as-point-data-errors.html

For me this is becoming an even bigger issue than datums in some cases.

Javier.

On May 12, 2009, at 9:21 AM, Tim Robertson wrote:

> Hi Peter,
>
> We had a release date to meet with the IPT, so we used the "new"  
> DarwinCore as it was submitted to the peer review for the release  
> candidate 1.0 version you just downloaded.  When DwC is ratified, we  
> will modify the IPT to support it.
>
> In fact, John W. who is leading the DwC standard along with his  
> developers, will also be contributing to the IPT codebase during 2009.
>
> Cheers,
>
> Tim
>
> On 11 May 2009, at 22:15, Peter DeVries wrote:
>
>> Very Cool and Thanks,
>>
>> I downloaded (http://code.google.com/p/gbif-providertoolkit/)
>> and got it working on one of my test machines.
>>
>> Is there a plan to move or not move this to the new DarwinCore?
>>
>> Thanks!
>>
>> Pete
>>
>>
>> On Mon, May 11, 2009 at 2:48 AM, Tim Robertson  
>> <trobertson at gbif.org> wrote:
>> Hi Peter,
>>
>> Just to expand on what Donald has written here:
>>
>> > My current thinking is that we should offer this as a service which
>> > can both be executed during harvesting and also as a stand-alone
>> > service for which users can submit a batch of Darwin Core-style
>> > records (probably tab-delimited) and get back a report for
>> > whichever set of tests or value-add operations they choose.  This
>> > could help providers with data cleaning even before they share their
>> > data (and also could help them to make sure there are no known
>> > sensitivity issues around their data).  Such a service could be
>> > extended more or less indefinitely to report more and more aspects
>> > of interest.  One of the major options could be to cross-reference
>> > records to accepted taxonomic authorities (via LSIDs or other
>> > identifiers).
>>
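The batch service Donald describes could be sketched roughly as follows.
This is only an illustration under stated assumptions: the two checks,
the CHECKS registry and the report() function are hypothetical names, and
the column names are the Darwin Core terms scientificName and
decimalLatitude.

```python
import csv
import io

def check_latitude(rec):
    """Return a message if decimalLatitude is present but not a valid latitude."""
    raw = rec.get("decimalLatitude") or ""
    if raw == "":
        return None  # nothing to test
    try:
        lat = float(raw)
    except ValueError:
        return f"decimalLatitude is not a number: {raw!r}"
    return None if -90 <= lat <= 90 else f"decimalLatitude out of range: {lat}"

def check_has_name(rec):
    """Return a message if scientificName is missing or blank."""
    name = rec.get("scientificName") or ""
    return None if name.strip() else "scientificName is empty"

# Hypothetical registry: users pick tests by name.
CHECKS = {"latitude": check_latitude, "name": check_has_name}

def report(tsv_text, selected):
    """Run the selected checks over tab-delimited records.

    Returns a list of (row number, check name, message) tuples.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    findings = []
    for row_number, rec in enumerate(reader, start=1):
        for name in selected:
            message = CHECKS[name](rec)
            if message:
                findings.append((row_number, name, message))
    return findings

sample = (
    "scientificName\tdecimalLatitude\n"
    "Apis mellifera\t43.07\n"
    "\t95.2\n"
)
for finding in report(sample, ["latitude", "name"]):
    print(finding)
```

Keeping each test as a small independent function is what would let such
a service grow "more or less indefinitely": a new check is one more entry
in the registry.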
>> GBIF recently launched an early release of a biodiversity data
>> publishing tool (http://code.google.com/p/gbif-providertoolkit/)
>> which allows for serving of occurrence and species oriented data, in
>> a "star schema" format with Darwin Core as the core of the star.
>> This tool has an embedded database, which allows for serving of text
>> files (csv, tab delimited etc) and also the ability to sit in front
>> of an existing database to offer DwC through a complete archive,
>> TAPIR and WFS, WMS services.  As you publish data through this tool,
>> it currently does very basic type checking of input data, and creates
>> "annotations" on the records that have issues
>> (e.g. http://ipt.gbif.org/annotations.html?resource_id=11).  As the
>> tool matures in the coming months, we plan to open up an API so that
>> data providers can call external services and have them push back
>> annotations - e.g. check my coordinates, check my names with IPNI
>> etc.  By publishing the complete dataset as an "archive" (a zipped
>> dump with an xml file describing the columns,
>> http://rs.tdwg.org/dwc/terms/guides/text/index.htm as Donald
>> mentions) the technical threshold is reduced to a minimum for the
>> data transfer to implement such a quality service, while also
>> ensuring decent harvesting performance.  It is in the current GBIF
>> workplan to register such quality services in the GBIF registry which
>> is undergoing development now, so that they may be discovered and
>> used by all, including the GBIF publishing toolkit, and portals.  By
>> doing this, the roles of checking data, or implementing quality
>> services are not centralised in a GBIF portal, but can be used by the
>> data owner before sharing with GBIF or other networks.
>>
>> Additionally, by allowing for remote annotations, we can aim to
>> ultimately push back all feedback from the GBIF portal (or others)
>> into the publishing tools as opposed to through email as is the
>> current feedback mechanism - this is related to other topics such as
>> uniquely identifying resources as they are shared through various
>> networks for example.  It would then be trivial to have (for example)
>> a google map with a clickable point which opens the details holding a
>> link "this record has bad coordinates", or a form to fill in.
>> Feedback could take the form of free text or perhaps even better, as
>> "structured annotations" where possible (this record would be correct
>> if the isoCountryCode was "DE") which could then be automatically
>> removed should the source be updated to meet the annotation criteria.
>>
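The "structured annotation" idea above could look something like the
following sketch. It is not the IPT's actual data model: the Annotation
class, its fields and the still_open() function are assumptions made for
illustration, using Tim's isoCountryCode example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Annotation:
    record_id: str
    field: str
    expected: str  # the record "would be correct if" field == expected
    note: str = ""

def still_open(annotations, records):
    """Keep only annotations whose criterion the current source does not yet meet.

    records maps record_id -> dict of field values as currently published.
    """
    remaining = []
    for ann in annotations:
        rec = records.get(ann.record_id)
        if rec is not None and rec.get(ann.field) == ann.expected:
            continue  # source was updated to match; drop the annotation
        remaining.append(ann)
    return remaining

annotations = [
    Annotation("occ-1", "isoCountryCode", "DE", "coordinates fall inside Germany"),
]
before_fix = {"occ-1": {"isoCountryCode": "FR"}}
after_fix = {"occ-1": {"isoCountryCode": "DE"}}

print(len(still_open(annotations, before_fix)))  # annotation still applies
print(len(still_open(annotations, after_fix)))   # annotation is cleared
```

Free-text feedback cannot be cleared this way, which is the practical
advantage of the structured form.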
>> Best wishes,
>>
>> Tim
>>
>>
>>
>>
>> >
>> >
>> > Best wishes,
>> >
>> > Donald
>> >
>> >
>> > Donald Hobern, Director, Atlas of Living Australia
>> > CSIRO Entomology, GPO Box 1700, Canberra, ACT 2601
>> > Phone: (02) 62464352 Mobile: 0437990208
>> > Email: Donald.Hobern at csiro.au
>> > Web: http://www.ala.org.au/
>> >
>> >
>> > -----Original Message-----
>> > Date: Fri, 8 May 2009 19:23:32 -0500
>> > From: Peter DeVries <pete.devries at gmail.com>
>> > Subject: [tdwg] Ideas on having Harvesters like GBIF clean, flag
>> >       inconsistencies, and add additional value to the data
>> > To: tdwg at lists.tdwg.org
>> > Message-ID:
>> >       <3833bf630905081723l2f1d5369je8af6b0e4a26324d at mail.gmail.com>
>> > Content-Type: text/plain; charset="iso-8859-1"
>> >
>> > Arthur Chapman sent me some good comments regarding Datums etc.
>> > The discussion made me realize that there may be a need for two
>> > types of formats: one for the providers and a second one that is
>> > output by the harvesting service.
>> >
>> > This is because the needs and abilities of the data providers are
>> > different from the needs and abilities of those who would like to
>> > consume the data.
>> >
>> > Consumers, who analyze and map the data, would like something that
>> > is easy to process, standardized and as error free as possible.
>> >
>> > It could work in the following way.
>> >
>> > Data harvesters, like GBIF, collect the records and run them through
>> > cleaning algorithms that check attributes, including that the lat
>> > and long actually match the location described.
>> >
>> > These harvesters would then expose this cleaned data via XML and RDF
>> > with tags that flag possible inconsistencies. The harvesters would
>> > also add a field for the lat and long in WGS84 if the original
>> > record contains a valid Datum. Those records without a Datum would
>> > still be exposed but the added geo:latitude and geo:longitude fields
>> > would be empty.
>> >
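Pete's rule for the added WGS84 fields can be sketched as below. This is
a minimal illustration, not a real harvester: the input keys "latitude",
"longitude" and "datum" are hypothetical, and the actual datum shift is
left to a caller-supplied to_wgs84 callback, since a correct
transformation needs a proper geodesy library.

```python
WGS84_NAMES = {"WGS84", "WGS 84", "EPSG:4326"}

def add_wgs84_fields(record, to_wgs84=None):
    """Return a copy of record with geo:latitude / geo:longitude and flags added.

    to_wgs84 is a hypothetical callback (lat, lon, datum) -> (lat, lon),
    or None when it cannot handle the datum.
    """
    out = dict(record)
    out["geo:latitude"] = ""   # stays empty unless a valid datum is known
    out["geo:longitude"] = ""
    flags = []
    out["flags"] = flags

    datum = (record.get("datum") or "").strip().upper()
    if not datum:
        flags.append("missing datum")
        return out
    try:
        lat = float(record["latitude"])
        lon = float(record["longitude"])
    except (KeyError, TypeError, ValueError):
        flags.append("coordinates missing or not numeric")
        return out

    if datum in WGS84_NAMES:
        converted = (lat, lon)          # already WGS84, copy through
    elif to_wgs84 is not None:
        converted = to_wgs84(lat, lon, datum)
    else:
        converted = None

    if converted is None:
        flags.append(f"no transformation available for datum {datum}")
    else:
        out["geo:latitude"], out["geo:longitude"] = converted
    return out

print(add_wgs84_fields({"latitude": "43.07", "longitude": "-89.40", "datum": "WGS84"}))
print(add_wgs84_fields({"latitude": "43.07", "longitude": "-89.40"}))
```

The original values are never overwritten, so a consumer who prefers the
provider's own coordinates still has them.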
>> > I can imagine that data uploaded to GBIF and other harvester
>> > services will be replete with typos and inconsistencies that will
>> > frustrate people trying to analyze or simply map the data. The
>> > harvester services could add value by minimizing these frustrations.
>> >
>> > Originally, it seemed that a global service should standardize on a
>> > global Datum like WGS84. After all, we have standardized on meters?
>> > However, after discussing this with Arthur, I realize that this is
>> > not possible for a number of reasons. That said, I think the data
>> > would be much more valuable and less likely to be misinterpreted if
>> > a version of it was available in WGS84. This solution would
>> > eventually encourage data providers to understand what a Datum is
>> > and include it in their data. It would also help solve a number of
>> > other data integration problems.
>> >
>> > Respectfully,
>> >
>> > Pete
>> > _______________________________________________
>> > tdwg mailing list
>> > tdwg at lists.tdwg.org
>> > http://lists.tdwg.org/mailman/listinfo/tdwg
>> >
>>
>>
>>
>>
>> -- 
>> ---------------------------------------------------------------
>> Pete DeVries
>> Department of Entomology
>> University of Wisconsin - Madison
>> 445 Russell Laboratories
>> 1630 Linden Drive
>> Madison, WI 53706
>> ------------------------------------------------------------
>
