 
Very cool, and thanks. I downloaded the toolkit (http://code.google.com/p/gbif-providertoolkit/) and got it working on one of my test machines. Is there a plan to move this to the new Darwin Core, or not?
Thanks!
Pete
On Mon, May 11, 2009 at 2:48 AM, Tim Robertson <trobertson@gbif.org> wrote:
Hi Peter,
Just to expand on what Donald has written here:
My current thinking is that we should offer this as a service which can both be executed during harvesting and also as a stand-alone service for which users can submit a batch of Darwin Core-style records (probably tab-delimited) and get back a report for whichever set of tests or value-add operations they choose. This could help providers with data cleaning even before they share their data (and also could help them to make sure there are no known sensitivity issues around their data). Such a service could be extended more or less indefinitely to report more and more aspects of interest. One of the major options could be to cross-reference records to accepted taxonomic authorities (via LSIDs or other identifiers).
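As a rough illustration of the stand-alone mode described above, here is a minimal Python sketch that takes a batch of tab-delimited records and returns a report for whichever tests the caller selects. The test names, the report layout and the choice of the columns scientificName, decimalLatitude and decimalLongitude are assumptions made for this example, not part of any existing GBIF service.

```python
import csv
import io


def check_coordinates(record):
    """Return a message if latitude/longitude are missing, non-numeric or out of range."""
    try:
        lat = float(record.get("decimalLatitude", ""))
        lon = float(record.get("decimalLongitude", ""))
    except (TypeError, ValueError):
        return "coordinates missing or not numeric"
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return "coordinates out of range"
    return None


def check_scientific_name(record):
    """Return a message if the scientificName column is empty."""
    if not (record.get("scientificName") or "").strip():
        return "scientificName is empty"
    return None


# Hypothetical registry of available tests; a real service could keep extending it.
TESTS = {"coordinates": check_coordinates, "name": check_scientific_name}


def run_report(tab_text, selected):
    """Run the selected tests over tab-delimited text whose first line is a header."""
    reader = csv.DictReader(io.StringIO(tab_text), delimiter="\t")
    report = []
    for row_number, record in enumerate(reader, start=1):
        for test_name in selected:
            message = TESTS[test_name](record)
            if message:
                report.append((row_number, test_name, message))
    return report
```

A provider could run run_report(text, ["coordinates"]) on a file before sharing it, and the same function could be called record by record during harvesting.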
GBIF recently launched an early release of a biodiversity data publishing tool (http://code.google.com/p/gbif-providertoolkit/) which allows for serving of occurrence- and species-oriented data in a "star schema" format with Darwin Core as the core of the star. This tool has an embedded database, which allows for serving of text files (CSV, tab-delimited, etc.), and it can also sit in front of an existing database to offer DwC through a complete archive, TAPIR, WFS and WMS services. As you publish data through this tool, it currently does very basic type checking of input data, and creates "annotations" on the records that have issues (e.g. http://ipt.gbif.org/annotations.html?resource_id=11). As the tool matures in the coming months, we plan to open up an API so that data providers can call external services and have them push back annotations - e.g. check my coordinates, check my names with IPNI etc. By publishing the complete dataset as an "archive" (a zipped dump with an xml file describing the columns, http://rs.tdwg.org/dwc/terms/guides/text/index.htm as Donald mentions), the technical threshold for the data transfer needed to implement such a quality service is reduced to a minimum, while also ensuring decent harvesting performance. It is in the current GBIF workplan to register such quality services in the GBIF registry, which is under development now, so that they may be discovered and used by all, including the GBIF publishing toolkit and portals. By doing this, the roles of checking data or implementing quality services are not centralised in a GBIF portal, but can be used by the data owner before sharing with GBIF or other networks.
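To show why the "archive" keeps the threshold low for a quality service, here is a small Python sketch that reads a zipped dump in which an xml file describes the columns of a tab-delimited data file. The descriptor layout used here (the file name meta.xml and the elements core, location and field) is a simplified stand-in assumed for this example; the actual format is the one defined in the text guide linked above.

```python
import csv
import io
import zipfile
import xml.etree.ElementTree as ET


def read_archive(zip_bytes):
    """Return the records of a zipped dump as dicts keyed by the terms in the descriptor."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        # Assumed descriptor name and layout; see the lead-in.
        descriptor = ET.fromstring(archive.read("meta.xml"))
        core = descriptor.find("core")
        data_file = core.findtext("location")
        columns = {int(f.get("index")): f.get("term") for f in core.findall("field")}
        text = archive.read(data_file).decode("utf-8")
    records = []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        records.append({term: row[i] for i, term in columns.items() if i < len(row)})
    return records


def build_example_archive():
    """Build a tiny in-memory archive so the sketch can be run without any files."""
    meta = (
        "<archive><core><location>occurrence.txt</location>"
        '<field index="0" term="occurrenceID"/>'
        '<field index="1" term="scientificName"/>'
        "</core></archive>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("meta.xml", meta)
        archive.writestr("occurrence.txt", "1\tPuma concolor\n2\tApis mellifera\n")
    return buffer.getvalue()
```

A quality service only needs to download one file and parse it like this; it does not have to speak TAPIR or page through a web service.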
Additionally, by allowing for remote annotations, we can aim to ultimately push back all feedback from the GBIF portal (or others) into the publishing tools rather than by email, which is the current feedback mechanism. This is related to other topics, for example uniquely identifying resources as they are shared through various networks. It would then be trivial to have (for example) a Google map with a clickable point which opens the details holding a link "this record has bad coordinates", or a form to fill in. Feedback could take the form of free text or, perhaps even better, "structured annotations" where possible (this record would be correct if the isoCountryCode was "DE"), which could then be automatically removed should the source be updated to meet the annotation criteria.
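One way to picture a "structured annotation" that removes itself once the source is corrected is the Python sketch below. The Annotation class and its field names are hypothetical and chosen only for illustration; free-text feedback is modelled as an annotation with no field, which therefore never clears automatically.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Annotation:
    record_id: str
    message: str
    field: Optional[str] = None     # None means free-text feedback
    expected: Optional[str] = None  # value that would make the record correct

    def is_satisfied_by(self, record):
        """True if the record now carries the value the annotation asked for."""
        if self.field is None:
            return False
        return record.get(self.field) == self.expected


def open_annotations(annotations, records_by_id):
    """Return the annotations that still apply after the source has been updated."""
    remaining = []
    for annotation in annotations:
        record = records_by_id.get(annotation.record_id)
        if record is not None and annotation.is_satisfied_by(record):
            continue  # criteria met, so the annotation is dropped
        remaining.append(annotation)
    return remaining
```

With an annotation such as Annotation("r1", "wrong country", field="isoCountryCode", expected="DE"), the annotation stays open while record r1 says "FR" and disappears on the next pass after the provider changes it to "DE".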
Best wishes,
Tim
Best wishes,
Donald
Donald Hobern, Director, Atlas of Living Australia
CSIRO Entomology, GPO Box 1700, Canberra, ACT 2601
Phone: (02) 62464352
Mobile: 0437990208
Email: Donald.Hobern@csiro.au
Web: http://www.ala.org.au/
-----Original Message-----
Date: Fri, 8 May 2009 19:23:32 -0500
From: Peter DeVries <pete.devries@gmail.com>
Subject: [tdwg] Ideas on having Harvesters like GBIF clean, flag inconsistencies, and add additional value to the data
To: tdwg@lists.tdwg.org
Message-ID: <3833bf630905081723l2f1d5369je8af6b0e4a26324d@mail.gmail.com>
Content-Type: text/plain; charset="iso-8859-1"
Arthur Chapman sent me some good comments regarding Datums etc. The discussion made me realize that there may be a need for two types of formats. One for the providers and a second one that is output by the harvesting service.
This is because the needs and abilities of the data providers are different than the needs and abilities of those who would like to consume the data.
Consumers, who analyze and map the data, would like something that is easy to process, standardized and as error-free as possible.
It could work in the following way.
Data harvesters, like GBIF, collect the records and run them through cleaning algorithms that check attributes, including whether the lat and long actually match the location described.
These harvesters would then expose this cleaned data via XML and RDF with tags that flag possible inconsistencies. The harvesters would also add a field for the lat and long in WGS84 if the original record contains a valid Datum. Those records without a Datum would still be exposed but the added geo:latitude and geo:longitude fields would be empty.
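The rule for the added fields could be sketched in Python roughly as follows. The input column names (decimalLatitude, decimalLongitude, geodeticDatum), the flags list and the to_wgs84 callback are assumptions for this example. The actual datum transformation is deliberately left to that caller-supplied function, since a real one needs a geodesy library.

```python
def add_wgs84_fields(record, to_wgs84=None):
    """Return a copy of the record with geo:latitude / geo:longitude and a list of flags.

    to_wgs84 is a hypothetical callback (lat, lon, datum) -> (lat, lon) or None.
    """
    out = dict(record)
    out["geo:latitude"] = ""   # stays empty unless a valid datum lets us fill it
    out["geo:longitude"] = ""
    flags = []
    out["flags"] = flags

    datum = (record.get("geodeticDatum") or "").strip().upper().replace(" ", "")
    if not datum:
        flags.append("no datum supplied")
        return out

    try:
        lat = float(record["decimalLatitude"])
        lon = float(record["decimalLongitude"])
    except (KeyError, TypeError, ValueError):
        flags.append("coordinates missing or not numeric")
        return out

    if datum != "WGS84":
        converted = to_wgs84(lat, lon, datum) if to_wgs84 else None
        if converted is None:
            flags.append("no transformation available for datum " + datum)
            return out
        lat, lon = converted

    out["geo:latitude"] = lat
    out["geo:longitude"] = lon
    return out
```

The original values are never overwritten, so a consumer can always see what the provider supplied next to the harvester's WGS84 version and any flags.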
I can imagine that the data uploaded to GBIF and other harvester services will be replete with typos and inconsistencies that will frustrate people trying to analyze or simply map the data. The harvester services could add value by minimizing these frustrations.
Originally, it seemed that a global service should standardize on a global Datum like WGS84. After all, we have standardized on meters? However, after discussing this with Arthur, I realize that this is not possible for a number of reasons. That said, I think the data would be much more valuable and less likely to be misinterpreted if a version of it was available in WGS84. This solution would eventually encourage data providers to understand what a Datum is and include it in their data. It would also help solve a number of other data integration problems.
Respectfully,
Pete
_______________________________________________
tdwg mailing list
tdwg@lists.tdwg.org
http://lists.tdwg.org/mailman/listinfo/tdwg
--
---------------------------------------------------------------
Pete DeVries
Department of Entomology
University of Wisconsin - Madison
445 Russell Laboratories
1630 Linden Drive
Madison, WI 53706
------------------------------------------------------------