On Mon, May 11, 2009 at 2:48 AM, Tim Robertson <trobertson@gbif.org> wrote:

Hi Peter,

Just to expand on what Donald has written here:

> My current thinking is that we should offer this as a service which
> can both be executed during harvesting and also as a stand-alone
> service for which users can submit a batch of Darwin Core-style
> records (probably tab-delimited) and get back a report for whichever
> set of tests or value-add operations they choose. This could help
> providers with data cleaning even before they share their data (and
> also could help them to make sure there are no known sensitivity
> issues around their data). Such a service could be extended more or
> less indefinitely to report more and more aspects of interest. One
> of the major options could be to cross-reference records to accepted
> taxonomic authorities (via LSIDs or other identifiers).

GBIF recently launched an early release of a biodiversity data
publishing tool (http://code.google.com/p/gbif-providertoolkit/) which
allows for serving of occurrence and species oriented data, in a "star
schema" format with Darwin Core as the core of the star. This tool
has an embedded database, which allows for serving of text files (csv,
tab delimited etc) and also the ability to sit in front of an existing
database to offer DwC through a complete archive, TAPIR, and WFS/WMS
services. As you publish data through this tool, it currently does
very basic type checking of input data, and creates "annotations" on
the records that have issues (e.g.
http://ipt.gbif.org/annotations.html?resource_id=11). As the tool
matures in the coming months, we plan to open up an API so that data
providers can call external services and have them push back
annotations - e.g. check my coordinates, check my names with IPNI
etc. By publishing the complete dataset as an "archive" (a zipped
dump with an XML file describing the columns,
http://rs.tdwg.org/dwc/terms/guides/text/index.htm, as Donald
mentions), the technical threshold for transferring data to such a
quality service is reduced to a minimum, while also ensuring decent
harvesting performance. It is in the current GBIF
workplan to register such quality services in the GBIF registry which
is undergoing development now, so that they may be discovered and used
by all, including the GBIF publishing toolkit and portals. By doing
this, data checking and quality services are not centralised in a
GBIF portal, but can be used by the data owner before sharing with
GBIF or other networks.
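
To make the idea concrete, here is a minimal Python sketch of what
such an external quality service might do with a tab-delimited file.
It is only an illustration: the function names (check_coordinates,
annotate) and the shape of the returned annotations are assumptions
made for this example, not the toolkit's API, which is not open yet.
The column names are Darwin Core terms.

```python
import csv
import io


def check_coordinates(record):
    """Return a list of issue strings for the coordinate fields of one record."""
    issues = []
    for field, limit in (("decimalLatitude", 90.0), ("decimalLongitude", 180.0)):
        raw = (record.get(field) or "").strip()
        if not raw:
            continue  # this check does not flag missing values
        try:
            value = float(raw)
        except ValueError:
            issues.append(f"{field} is not a number: {raw!r}")
            continue
        if abs(value) > limit:
            issues.append(f"{field} out of range: {value}")
    return issues


def annotate(tab_text, checks):
    """Run every check over every data row; return (row_number, issue) pairs."""
    reader = csv.DictReader(io.StringIO(tab_text), delimiter="\t")
    annotations = []
    for row_number, record in enumerate(reader, start=1):
        for check in checks:
            for issue in check(record):
                annotations.append((row_number, issue))
    return annotations


# Hypothetical two-record sample; the second row has two problems.
sample = (
    "scientificName\tdecimalLatitude\tdecimalLongitude\n"
    "Puma concolor\t43.07\t-89.41\n"
    "Apis mellifera\t143.07\tabc\n"
)
report = annotate(sample, [check_coordinates])
for row_number, issue in report:
    print(row_number, issue)
```

Because each check is just a function from a record to a list of
issues, further tests (names, dates, sensitivity) could be added to
the list passed to annotate without touching the rest.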

Additionally, by allowing for remote annotations, we can aim to
ultimately push back all feedback from the GBIF portal (or others)
into the publishing tools as opposed to through email as is the
current feedback mechanism - this is related to other topics such as
uniquely identifying resources as they are shared through various
networks. It would then be trivial to have (for example) a Google
map with a clickable point which opens the details holding a link
"this record has bad coordinates", or a form to fill in. Feedback
could take the form of free text or, perhaps even better,
"structured annotations" where possible (e.g. this record would be
correct if the isoCountryCode was "DE"), which could then be
automatically removed should the source be updated to meet the
annotation criteria.
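
A rough Python sketch of how such a structured annotation could be
represented and cleared automatically follows. The class and function
names (StructuredAnnotation, prune_resolved) and the record
identifiers are hypothetical, chosen for this example only; nothing
here reflects an existing GBIF interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StructuredAnnotation:
    """Says: this record would be correct if `term` had the value `expected`."""
    record_id: str
    term: str
    expected: str

    def is_resolved_by(self, record):
        # The annotation criteria are met once the source holds the expected value.
        return record.get(self.term) == self.expected


def prune_resolved(annotations, records_by_id):
    """Keep only the annotations whose criteria the current source does not meet."""
    still_open = []
    for annotation in annotations:
        record = records_by_id.get(annotation.record_id)
        if record is not None and annotation.is_resolved_by(record):
            continue  # source was updated, so the annotation can be dropped
        still_open.append(annotation)
    return still_open


# Hypothetical feedback on two records, and the source after the owner fixed one.
annotations = [
    StructuredAnnotation("rec-1", "isoCountryCode", "DE"),
    StructuredAnnotation("rec-2", "isoCountryCode", "FR"),
]
updated_source = {
    "rec-1": {"isoCountryCode": "DE"},
    "rec-2": {"isoCountryCode": "ES"},
}
remaining = prune_resolved(annotations, updated_source)
```

Free-text feedback cannot be checked this way, which is one argument
for preferring the structured form wherever it is possible.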

Best wishes,

Tim

>
>
> Best wishes,
>
> Donald
>
>
> Donald Hobern, Director, Atlas of Living Australia
> CSIRO Entomology, GPO Box 1700, Canberra, ACT 2601
> Phone: (02) 62464352 Mobile: 0437990208
> Email: Donald.Hobern@csiro.au
> Web: http://www.ala.org.au/
>
>
> -----Original Message-----
> Date: Fri, 8 May 2009 19:23:32 -0500
> From: Peter DeVries <pete.devries@gmail.com>
> Subject: [tdwg] Ideas on having Harvesters like GBIF clean, flag
> inconsistencies, and add additional value to the data
> To: tdwg@lists.tdwg.org
> Message-ID:
> <3833bf630905081723l2f1d5369je8af6b0e4a26324d@mail.gmail.com>
> Content-Type: text/plain; charset="iso-8859-1"
>
> Arthur Chapman sent me some good comments regarding Datums etc.
> The discussion made me realize that there may be a need for two
> types of formats. One for the providers and a second one that is
> output by the harvesting service.
>
> This is because the needs and abilities of the data providers are
> different than the needs and abilities of those who would like to
> consume the data.
>
> Consumers, who analyze and map the data, would like something that
> is easy to process, standardized and as error free as possible.
>
> It could work in the following way.
>
> Data harvesters, like GBIF, collect the records and run them through
> cleaning algorithms that check attributes, including that the lat
> and long actually match the location described.
>
> These harvesters would then expose this cleaned data via XML and RDF
> with tags that flag possible inconsistencies. The harvesters would
> also add a field for the lat and long in WGS84 if the original
> record contains a valid Datum. Those records without a Datum would
> still be exposed but the added geo:latitude and geo:longitude fields
> would be empty.
>
> I can imagine that data uploaded to GBIF and other harvester
> services will be replete with typos and inconsistencies that will
> frustrate people trying to analyze or simply map the data. The
> harvester services could add value by minimizing these frustrations.
>
> Originally, it seemed that a global service should standardize on a
> global Datum like WGS84. After all, we have standardized on meters?
> However, after discussing this with Arthur, I realize that this is
> not possible for a number of reasons. That said, I think the data
> would be much more valuable and less likely to be misinterpreted if
> a version of it was available in WGS84. This solution would
> eventually encourage data providers to understand what a Datum is
> and include it in their data. It would also help solve a number of
> other data integration problems.
>
> Respectfully,
>
> Pete
> _______________________________________________
> tdwg mailing list
> tdwg@lists.tdwg.org
> http://lists.tdwg.org/mailman/listinfo/tdwg
>

--
---------------------------------------------------------------
Pete DeVries
Department of Entomology
University of Wisconsin - Madison
445 Russell Laboratories
1630 Linden Drive
Madison, WI 53706
------------------------------------------------------------