Re: [tdwg-tapir] Fwd: Tapir protocol - Harvest methods?[SEC=UNCLASSIFIED]
Hi guys,
sorry for the late reaction, but I put off reading all the mails until today.
Using CSV or tab-delimited files will cause problems when the dumps contain free-text data, e.g. locality descriptions or notes. When I moved our BioCASE cache (50 million occurrence records) between different DBMSs using tab-delimited files, I found that people are very eager to use tabs and newlines in free-text fields. Whatever character you choose as a delimiter, you will find it in free-text fields...
Cheers from Berlin, Jörg
-----Original Message----- From: tdwg-tapir-bounces@lists.tdwg.org [mailto:tdwg-tapir-bounces@lists.tdwg.org] On behalf of Markus Döring Sent: Wednesday, 14 May 2008 15:35 To: Aaron D. Steele Cc: TAPIR mailing list Subject: Re: [tdwg-tapir] Fwd: Tapir protocol - Harvest methods?[SEC=UNCLASSIFIED]
it would keep the relations, but we don't really want any relational structure to be served up. And SQLite binaries for the DwC star scheme would not be easier to work with than plain text files; text files can even be loaded into Excel straight away, versioned with svn, and so on. If there is a geospatial extension file with the GUID in the first column, applications might grab that directly and not even touch the central core file if they only want location data.
I'd prefer to stick with a CSV or tab-delimited file. The simpler the better. And it also can't get corrupted as easily.
Markus
On 14 May, 2008, at 15:25, Aaron D. Steele wrote:
for preserving relational data, we could also just dump tapirlink resources to an sqlite database file (http://www.sqlite.org), zip it up, and again make it available via the web service. we use sqlite internally for many projects, and it's both easy to use and well supported by jdbc, php, python, etc.
would something like this be a useful option?
thanks, aaron
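As a rough sketch of that option (this is not TapirLink code; the table, column and file names below are made up), a provider-side job could write the mapped records into an SQLite file and zip it:

    # Hypothetical sketch: dump DarwinCore-style rows into an SQLite file
    # and zip it so the web service can hand out a single archive.
    import sqlite3, zipfile

    records = [
        ("102", "Aster alpinus subsp. parviceps", "Kew Herbarium"),
        ("103", "Polygala vulgaris", "Reading Herbarium"),
    ]

    con = sqlite3.connect("resource_dump.sqlite")
    con.execute("CREATE TABLE IF NOT EXISTS darwincore "
                "(record_id TEXT PRIMARY KEY, scientific_name TEXT, collection TEXT)")
    con.executemany("INSERT OR REPLACE INTO darwincore VALUES (?, ?, ?)", records)
    con.commit()
    con.close()

    # Zip the database file so it can be served for download.
    with zipfile.ZipFile("resource_dump.zip", "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write("resource_dump.sqlite")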
On Wed, May 14, 2008 at 2:21 AM, Markus Döring mdoering@gbif.org wrote:
Interesting that we all come to the same conclusions... The trouble I had with just a simple flat CSV file is repeating properties like multiple image URLs. ABCD clients don't use ABCD just because it's complex, but because they want to transport this relational data.
We were considering two ways of extending the CSV approach. The first would be a single large denormalised CSV file with many rows for the same record; it would require knowledge about the related entities, though, and could grow in size rapidly. The second idea, which we intend to adopt, is to allow a single level of 1-to-many related entities. It is basically a "star" design with the core DwC table in the centre and any number of extension tables around it. Each "table", i.e. each CSV file, has the record id as its first column, so the files can be related easily and only a single identifier per record is needed, not one for every extension entity. This gives a lot of flexibility while keeping things pretty simple to deal with. It would even satisfy the ABCD needs, as I haven't yet seen anyone requiring two levels of related tables (other than lookup tables). An extension could even be a simple 1-to-1 relation, but it would keep things semantically together, just like an XML namespace. The Darwin Core extensions would be a good example.
So we could have a gzipped set of files, maybe with a simple metafile indicating the semantics of the columns for each file. An example could look like this:
# darwincore.csv
102   Aster alpinus subsp. parviceps   ...
103   Polygala vulgaris   ...

# curatorial.csv
102   Kew Herbarium
103   Reading Herbarium

# identification.csv
102   2003-05-04   Karl Marx     Aster alpinus L.
102   2007-01-11   Mark Twain    Aster korshinskyi Tamamsch.
102   2007-09-13   Roger Hyam    Aster alpinus subsp. parviceps Novopokr.
103   2001-02-21   Steve Bekow   Polygala vulgaris L.
I know this looks old fashioned, but it is just so simple and gives us so much flexibility. Markus
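To make the star layout concrete, here is a small reader sketch (my own illustration, not part of the proposal) that attaches extension rows to the core records via the identifier in the first column; it assumes the files above are tab-delimited despite the .csv names:

    # Sketch: join star-scheme extension files to the core file on the
    # record identifier in the first column (tab-delimited files assumed).
    import csv
    from collections import defaultdict

    def read_rows(path):
        with open(path, newline="", encoding="utf-8") as f:
            return list(csv.reader(f, delimiter="\t"))

    # Core records keyed by their identifier.
    core = {row[0]: {"core": row[1:], "extensions": defaultdict(list)}
            for row in read_rows("darwincore.csv")}

    # Any number of 1-to-many extension files share the same identifier column.
    for ext in ("curatorial.csv", "identification.csv"):
        for row in read_rows(ext):
            if row[0] in core:
                core[row[0]]["extensions"][ext].append(row[1:])

    print(core["102"]["extensions"]["identification.csv"])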
On 14 May, 2008, at 24:39, Greg Whitbread wrote:
We have used a very similar protocol to assemble the latest AVH cache. It should be noted that this is an as-well-as protocol, and that it only works because we have an established semantic standard (HISPID/ABCD).
greg
trobertson@gbif.org wrote:
Hi All,
This is very interesting to me, as I came to the same conclusion while harvesting for GBIF.
As a "harvester of all records", it is best described with an example:
- Complete inventory of ScientificNames: 7 minutes at the limit of 200 records per page
- Complete harvesting of records:
  - 260,000 records
  - 9 hours harvesting duration
  - 500 MB of TAPIR+DwC XML returned (DwC 1.4 with geospatial and curatorial extensions)
  - Extraction of DwC records from the harvested XML: <2 minutes
  - Resulting file size 32 MB, gzipped to <3 MB
I spun hard drives for 9 hours, and took up bandwidth that is paid for, to retrieve something that could have been generated provider-side in minutes and transferred in seconds (3 MB).
I sent a proposal to TDWG last year called "datamaps", which was effectively what you are describing; I based it on the Sitemaps protocol but got nowhere with it. With Markus we are now making more progress: I have spoken with several GBIF data providers about a proposed new standard for full-dataset harvesting, and it has been well received. So Markus and I have started a new proposal with the working name 'Localised DwC Index' file generation (it is an index if you hold more than DwC data, and plain DwC is still standards-compliant). It is really a gzipped tab-delimited dump of the data, and it is slightly extensible. The document is not ready to circulate yet, but the benefits section currently reads:
- Provider database load is reduced, allowing it to serve real distributed queries rather than "full datasource" harvesters
- Providers can choose to publish their index as it suits them, giving control back to the provider
- Localised index generation can be built into tools not yet capable of integrating with TDWG protocol networks such as GBIF
- Harvesters receive a full dataset view in one request, making it very easy to determine which records are eligible for deletion (see the sketch after this list)
- It becomes very simple to write clients that consume entire datasets, e.g. data cleansing tools that the provider can run:
  - "Give me ISO country codes for my dataset": the application pulls down the provider's index file, generates the ISO country codes and returns a simple table using the provider's own identifiers
  - "Check my names for spelling mistakes": the application skims over the records and returns a list of names that are not known to it
- Providers such as the UK NBN cannot serve 20 million records to the GBIF index efficiently using the existing protocols; they do, however, have the ability to generate a localised index
- Harvesters can very quickly build up searchable indexes, and it is easy to create large indices
- The Node Portal can easily aggregate index data files
- A true index to the data, not the illusion of a cache; more like Google Sitemaps
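As a sketch of the deletion point flagged in the list (my own illustration; the file names are hypothetical, and the index is assumed to be a gzipped tab-delimited file whose first column is the provider's record identifier), a harvester that keeps the previous index around can find deletions with a simple set difference:

    # Sketch: determine records eligible for deletion by comparing two
    # successive index dumps (gzipped, tab-delimited, record id in column 1).
    import gzip

    def record_ids(path):
        with gzip.open(path, "rt", encoding="utf-8") as f:
            return {line.split("\t", 1)[0] for line in f if line.strip()}

    previous = record_ids("index_2008-04-01.txt.gz")   # hypothetical file names
    current = record_ids("index_2008-05-01.txt.gz")

    deleted = previous - current   # present before, gone now
    added = current - previous

    print(len(deleted), "records to delete,", len(added), "new records")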
It is the ease with which one can offer tools to data providers that really interests me. The technical threshold for producing services that offer reporting tools on people's data is very low with this mechanism. Add to that the fact that large datasets become harvestable; we have even considered the likes of BitTorrent for the largest ones, although I think that is overkill.
As a consumer, therefore, I fully support this move as a valuable addition to the wrapper tools.
Cheers
Tim (wrote the GBIF harvesting, and new to this list)
Begin forwarded message:
From: "Aaron D. Steele" eightysteele@gmail.com Date: 13 de mayo de 2008 22:40:09 GMT+02:00 To: tdwg-tapir@lists.tdwg.org Cc: Aaron Steele asteele@berkeley.edu Subject: Re: [tdwg-tapir] Tapir protocol - Harvest methods?
at berkeley we've recently prototyped a simple php program that uses an existing tapirlink installation to periodically dump tapir resources into a csv file. the solution is totally generic and can dump darwin core (and technically abcd schema, although it's currently untested). the resulting csv files are zip archived and made accessible using a web service. it's a simple approach that has proven to be, at least internally, quite reliable and useful.
for example, several of our caching applications use the web service to harvest csv data from tapirlink resources using the following process:
- download latest csv dump for a resource using the web service.
- flush all locally cached records for the resource.
- bulk load the latest csv data into the cache.
in this way, cached data are always synchronized with the resource and there's no need to track new, deleted, or changed records. as an aside, each time these cached data are queried by the caching application or selected in the user interface, log-only search requests are sent back to the resource.
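A minimal sketch of that three-step cycle, assuming a hypothetical dump URL and an SQLite file as the local cache (the real caching applications may store things differently):

    # Sketch of the harvest-and-replace cycle: download the latest dump,
    # flush the locally cached records for the resource, bulk load the new data.
    import csv, io, sqlite3, urllib.request, zipfile

    DUMP_URL = "http://example.org/tapirlink/dumps/my_resource.zip"  # hypothetical
    RESOURCE = "my_resource"

    # 1) download the latest csv dump for the resource
    with urllib.request.urlopen(DUMP_URL) as resp:
        archive = zipfile.ZipFile(io.BytesIO(resp.read()))
    name = archive.namelist()[0]
    rows = list(csv.reader(io.TextIOWrapper(archive.open(name), encoding="utf-8")))

    con = sqlite3.connect("cache.sqlite")
    con.execute("CREATE TABLE IF NOT EXISTS cache "
                "(resource TEXT, record_id TEXT, scientific_name TEXT)")

    # 2) flush all locally cached records for the resource
    con.execute("DELETE FROM cache WHERE resource = ?", (RESOURCE,))

    # 3) bulk load the latest csv data into the cache (first two columns assumed)
    con.executemany("INSERT INTO cache VALUES (?, ?, ?)",
                    ((RESOURCE, r[0], r[1]) for r in rows if len(r) >= 2))
    con.commit()
    con.close()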
after discussion with renato giovanni and john wieczorek, we've decided that merging this functionality into the tapirlink codebase would benefit the broader community. csv generation support would be declared through capabilities. although incremental harvesting wouldn't be immediately implemented, we could certainly extend the service to include it later.
i'd like to pause here to gauge the consensus, thoughts, concerns, and ideas of others. anyone?
thanks, aaron
2008/5/5 Kevin Richards RichardsK@landcareresearch.co.nz:

I think I agree here.

The harvesting "procedure" is really defined outside the TAPIR protocol, is it not? So it is really an agreement between the harvester and the harvestees.

So what is really needed here is a standard procedure for maintaining a "harvestable" dataset and a standard procedure for harvesting that dataset. We have a general rule at Landcare that we never delete records in our datasets: they are either deprecated in favour of another record, so that the resolution of that record points to the new record, or they are set to a state of "deleted" but are still kept in the dataset and can be resolved (which would indicate a state of deleted).

Kevin

"Renato De Giovanni" renato@cria.org.br wrote on 6/05/2008 7:33 a.m.:

Hi Markus,

I would suggest creating new concepts for incremental harvesting, either in the data standards themselves or in some new extension. In the case of TAPIR, GBIF could easily check the mapped concepts before deciding between incremental or full harvesting.

Actually it could be just one new concept such as "recordStatus" or "deletionFlag". Or perhaps you would also want to create your own definition for dateLastModified, indicating which set of concepts should be considered to see whether something has changed or not, but I guess this level of granularity would be difficult to support.

Regards,
--
Renato

On 5 May 2008 at 11:24, Markus Döring wrote:

Phil,
incremental harvesting is not implemented on the GBIF side as far as I am aware, and I don't think it will be a simple thing to implement on the current system. Also, even if we can detect only the changed records since the last harvesting via dateLastModified, we still have no information about deletions. We could have an arrangement saying that you keep deleted records as empty records with just the ID and nothing else (I vaguely remember LSIDs were supposed to work like this too). But that also needs to be supported on your side then, never entirely removing any record. I will have a discussion with the others at GBIF about that.

Markus
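To illustrate the concepts Renato suggests, here is a harvester-side sketch under the assumption that a dump or search response carries hypothetical recordStatus and dateLastModified fields; neither is defined in any standard yet:

    # Sketch: incremental harvesting using the proposed (hypothetical)
    # "recordStatus" and "dateLastModified" concepts.
    from datetime import date

    last_harvest = date(2008, 5, 1)

    records = [   # stand-ins for rows parsed out of a provider response
        {"id": "102", "recordStatus": "active",  "dateLastModified": date(2008, 5, 10)},
        {"id": "103", "recordStatus": "deleted", "dateLastModified": date(2008, 5, 12)},
        {"id": "104", "recordStatus": "active",  "dateLastModified": date(2008, 3, 2)},
    ]

    changed = [r["id"] for r in records
               if r["recordStatus"] != "deleted" and r["dateLastModified"] > last_harvest]
    deleted = [r["id"] for r in records if r["recordStatus"] == "deleted"]

    print("re-harvest:", changed)
    print("remove locally:", deleted)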
--
Greg Whitbread
Australian Centre for Plant Biodiversity Research
Integrated Botanical Information System
National Botanic Gardens, GPO Box 1777, Canberra 2601
voice: +61 2 62509 482   fax: +61 2 62509 599
ghw@anbg.gov.au
that's right. So they need to be escaped if providers really want control characters in their dumps.
But this is no different from escaping XML or any other document. It would just be nice if the number of characters that need escaping is kept to a minimum. For this reason I personally prefer tab files, as escaping line returns and the delimiting tab is rather little work.
Markus
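A small sketch of such escaping, assuming a backslash convention for backslash, tab and newline (the thread has not fixed a convention, so this is only one possibility):

    # Sketch: escape/unescape free-text values for a tab-delimited dump so that
    # embedded tabs and line breaks cannot break the row structure.
    def escape(value):
        return (value.replace("\\", "\\\\")
                     .replace("\t", "\\t")
                     .replace("\r", "\\r")
                     .replace("\n", "\\n"))

    def unescape(value):
        out, i = [], 0
        mapping = {"\\": "\\", "t": "\t", "r": "\r", "n": "\n"}
        while i < len(value):
            if value[i] == "\\" and i + 1 < len(value):
                out.append(mapping.get(value[i + 1], value[i + 1]))
                i += 2
            else:
                out.append(value[i])
                i += 1
        return "".join(out)

    locality = "left bank of the river,\nnear the old\tbridge"
    row = "\t".join(["102", escape(locality)])
    assert unescape(row.split("\t")[1]) == locality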
On 15 May, 2008, at 13:40, Holetschek, Jörg wrote:
I second that.
On Thu, May 15, 2008 at 5:11 AM, Markus Döring mdoering@gbif.org wrote:
participants (3)
- Holetschek, Jörg
- John R. WIECZOREK
- Markus Döring