Ideas on having Harvesters like GBIF clean, flag inconsistencies, and add additional value to the data
 
Arthur Chapman sent me some good comments regarding Datums etc.

The discussion made me realize that there may be a need for two types of formats: one for the providers and a second one that is output by the harvesting service.

This is because the needs and abilities of the data providers differ from the needs and abilities of those who would like to consume the data.

Consumers, who analyze and map the data, would like something that is easy to process, standardized, and as error free as possible.

It could work in the following way.

Data harvesters, like GBIF, collect the records and run them through cleaning algorithms that check attributes, including that the lat and long actually match the location described.

These harvesters would then expose this cleaned data via XML and RDF with tags that flag possible inconsistencies. The harvesters would also add a field for the lat and long in WGS84 if the original record contains a valid Datum. Those records without a Datum would still be exposed but the added geo:latitude and geo:longitude fields would be empty.

I can imagine that data uploaded to GBIF and other harvester services will be replete with typos and inconsistencies that will frustrate people trying to analyze or simply map the data. The harvester services could add value by minimizing these frustrations.

Originally, it seemed that a global service should standardize on a global Datum like WGS84. After all, we have standardized on meters. However, after discussing this with Arthur, I realize that this is not possible for a number of reasons. That said, I think the data would be much more valuable and less likely to be misinterpreted if a version of it was available in WGS84. This solution would eventually encourage data providers to understand what a Datum is and include it in their data. It would also help solve a number of other data integration problems.
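A minimal sketch of the harvester step described above, in Python. Everything named here is an assumption for illustration only: the record keys ("latitude", "longitude", "datum"), the flag names, the HarvestedRecord type, and the CONVERTERS registry are hypothetical and not part of any GBIF or TDWG interface. Only the trivial WGS84 case is filled in; a real service would presumably delegate other datums to a projection library.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical registry: datum name -> function returning (lat, lon) in WGS84.
# Only the identity case is provided; other datums are deliberately left out.
Converter = Callable[[float, float], Tuple[float, float]]
CONVERTERS: Dict[str, Converter] = {
    "WGS84": lambda lat, lon: (lat, lon),
}


@dataclass
class HarvestedRecord:
    """The provider's record plus the fields a harvester might add."""
    original: dict
    geo_latitude: Optional[float] = None   # stays empty unless the datum is usable
    geo_longitude: Optional[float] = None
    flags: List[str] = field(default_factory=list)


def harvest(record: dict) -> HarvestedRecord:
    """Expose the record unchanged, add WGS84 fields when possible, flag problems."""
    out = HarvestedRecord(original=dict(record))
    lat, lon = record.get("latitude"), record.get("longitude")
    datum = (record.get("datum") or "").strip().upper()

    if lat is None or lon is None:
        out.flags.append("missing_coordinates")
        return out
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        out.flags.append("coordinates_out_of_range")
        return out
    if not datum:
        # Record is still exposed, but the added WGS84 fields stay empty.
        out.flags.append("missing_datum")
        return out

    convert = CONVERTERS.get(datum)
    if convert is None:
        out.flags.append("unrecognized_datum")
        return out

    out.geo_latitude, out.geo_longitude = convert(lat, lon)
    return out
```

With this sketch, a record that carries a datum of "WGS84" comes back with both added fields filled and no flags, while the same record without a datum comes back with the added fields empty and a "missing_datum" flag.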
Respectfully,
Pete
---------------------------------------------------------------
Pete DeVries
Department of Entomology
University of Wisconsin - Madison
445 Russell Laboratories
1630 Linden Drive
Madison, WI 53706
------------------------------------------------------------
 
Peter - there is one additional issue here.

You imply that if the data is not in WGS84, the lat and long should be removed ("Those records without a Datum would still be exposed but the added geo:latitude and geo:longitude fields would be empty.").

However, the lat and long could still be included, with the Uncertainty increased as discussed in the Georeferencing Best Practices document (http://www.gbif.org/prog/digit/data_quality/BioGeomancerGuide), and as also calculated by the MaNIS Georeferencing Calculator (http://manisnet.org/gc.html) and the BioGeomancer toolkit (http://biogeomancer.org/).

Cheers
Arthur
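Arthur's alternative, keeping the coordinates but widening the uncertainty when no datum is given, might be sketched as follows. This is an illustration under stated assumptions, not the method of the Best Practices document or the MaNIS calculator: the function name, the simple addition of the two terms, and the placeholder value ASSUMED_UNKNOWN_DATUM_UNCERTAINTY_M are all hypothetical. The real extra uncertainty depends on where the point lies and should be taken from those sources.

```python
from typing import Optional

# Placeholder value only (an assumption for this sketch). The actual extra
# uncertainty for an unknown datum varies by location and should come from
# the georeferencing guidelines or calculator, not from this constant.
ASSUMED_UNKNOWN_DATUM_UNCERTAINTY_M = 1000.0


def exposed_uncertainty_m(
    reported_m: float,
    datum: Optional[str],
    unknown_datum_m: float = ASSUMED_UNKNOWN_DATUM_UNCERTAINTY_M,
) -> float:
    """Return the uncertainty to publish alongside the coordinates.

    If the provider supplied a datum, the reported uncertainty is passed
    through unchanged. If not, the coordinates are still kept, but the
    uncertainty is widened by an extra term for the unknown datum.
    """
    if datum and datum.strip():
        return reported_m
    return reported_m + unknown_datum_m
```

A consumer mapping the data can then decide for itself whether a point with a widened uncertainty is precise enough for the analysis at hand, rather than losing the point entirely.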
_______________________________________________ tdwg mailing list tdwg@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg