Re: [tdwg-tapir] Fwd: Tapir protocol - Harvest methods?
Hi All,
This is very interesting to me, as I came to the same conclusion while harvesting for GBIF.
As a "harvester of all records", the problem is best described with an example:
- Complete inventory of ScientificNames: 7 minutes at the limit of 200 records per page
- Complete harvesting of records:
  - 260,000 records
  - 9 hours harvesting duration
  - 500MB of TAPIR+DwC XML returned (DwC 1.4 with geospatial and curatorial extensions)
- Extraction of DwC records from the harvested XML: <2 minutes
- Resulting file size: 32MB, gzipped to <3MB
I spun hard drives for 9 hours, and took up paid-for bandwidth, to retrieve something that could have been generated provider-side in minutes and transferred in seconds (3MB).
I sent a proposal to TDWG last year termed "datamaps", which was effectively what you are describing; I based it on the Sitemaps protocol, but got nowhere with it. Markus and I are now making more progress: I have spoken with several GBIF data providers about a proposed new standard for full-dataset harvesting, and it has been well received. So Markus and I have started a new proposal with the working name of "Localised DwC Index" file generation (it is an index if you hold more than DwC data, and the DwC itself is still standards compliant), which is really a gzipped tab file dump of the data and is slightly extensible. The document is not ready to circulate yet, but the benefits section currently reads:
- Provider database load is reduced, allowing it to serve real distributed queries rather than "full datasource" harvesters
- Providers can choose to publish their index as it suits them, giving control back to the provider
- Localised index generation can be built into tools not yet capable of integrating with TDWG protocol networks such as GBIF
- Harvesters receive a full dataset view in one request, making it very easy to determine which records are eligible for deletion
- It becomes very simple to write clients that consume entire datasets, e.g. data-cleansing tools that the provider can run (see the sketch after this list):
  - "Give me ISO country codes for my dataset": the application pulls down the provider's index file, generates the ISO country codes, and returns a simple table using the provider's own identifiers
  - "Check my names for spelling mistakes": the application skims over the records and returns a list of names not known to the application
- Providers such as the UK NBN cannot serve 20 million records to the GBIF index efficiently using the existing protocols, but they do have the ability to generate a localised index
- Harvesters can very quickly build up searchable indexes, and it is easy to create large indices
- A Node Portal can easily aggregate index data files: a true index to the data, not an illusion of a cache; more like Google Sitemaps
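To make the data-cleansing idea concrete, here is a minimal sketch of such a consumer-side client, following the name-checking example above. It assumes a gzipped, tab-delimited index file whose first column is the provider's record identifier and whose second column is the scientific name; the URL, the column layout, and the list of "known" names are all invented for illustration and are not part of the proposal.

import csv
import gzip
import urllib.request

# Hypothetical index location and column layout; a real provider would
# advertise its own URL, and the columns would be described by the metafile.
INDEX_URL = "http://example.org/dwc_index.txt.gz"
KNOWN_NAMES = {"Aster alpinus subsp. parviceps", "Polygala vulgaris"}

def report_unknown_names(index_url=INDEX_URL, known_names=KNOWN_NAMES):
    """Pull down a provider's gzipped tab file and yield records whose
    scientific name is not known to this (toy) checking application."""
    with urllib.request.urlopen(index_url) as response:
        with gzip.open(response, mode="rt", encoding="utf-8", newline="") as handle:
            for row in csv.reader(handle, delimiter="\t"):
                record_id, scientific_name = row[0], row[1]  # assumed columns
                if scientific_name not in known_names:
                    # Report back using the provider's own identifier.
                    yield record_id, scientific_name

if __name__ == "__main__":
    for record_id, name in report_unknown_names():
        print(record_id + "\t" + name + "\tnot recognised")

The point is how little machinery the consumer needs: one HTTP request, a gzip stream, and a tab-delimited parser.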
It is the ease with which one can offer tools to data providers that really interests me. The technical threshold required to produce services that offer reporting tools on people's data is very low with this mechanism. Add to that the fact that large datasets become harvestable; we have even considered the likes of BitTorrent for the largest ones, although I think that is overkill.
As a consumer, therefore, I fully support this move as a valuable addition to the wrapper tools.
Cheers
Tim (wrote the GBIF harvesting, and new to this list)
Begin forwarded message:
From: "Aaron D. Steele" eightysteele@gmail.com Date: 13 de mayo de 2008 22:40:09 GMT+02:00 To: tdwg-tapir@lists.tdwg.org Cc: Aaron Steele asteele@berkeley.edu Subject: Re: [tdwg-tapir] Tapir protocol - Harvest methods?
At Berkeley we've recently prototyped a simple PHP program that uses an existing TapirLink installation to periodically dump TAPIR resources into a CSV file. The solution is totally generic and can dump Darwin Core (and technically the ABCD schema, although that is currently untested). The resulting CSV files are zip archived and made accessible through a web service. It's a simple approach that has proven to be, at least internally, quite reliable and useful.
For example, several of our caching applications use the web service to harvest CSV data from TapirLink resources using the following process:
- Download the latest CSV dump for a resource using the web service.
- Flush all locally cached records for the resource.
- Bulk load the latest CSV data into the cache.
In this way, cached data are always synchronized with the resource, and there's no need to track new, deleted, or changed records. As an aside, each time these cached data are queried by the caching application or selected in the user interface, log-only search requests are sent back to the resource.
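For illustration, that refresh cycle might look roughly like the following; the dump web service URL, the cache table, and its columns are invented for this sketch and are not part of the actual TapirLink implementation.

import csv
import io
import sqlite3
import urllib.request
import zipfile

def refresh_cache(resource_id, dump_url, db_path="cache.db"):
    """Download the latest zipped CSV dump for a resource, flush the locally
    cached records for that resource, and bulk load the new data."""
    # 1. Download the latest CSV dump for the resource via the web service.
    with urllib.request.urlopen(dump_url) as response:
        archive = zipfile.ZipFile(io.BytesIO(response.read()))
    csv_name = archive.namelist()[0]  # assume one CSV file per archive
    text = io.TextIOWrapper(archive.open(csv_name), encoding="utf-8")
    rows = list(csv.reader(text))

    conn = sqlite3.connect(db_path)
    with conn:
        # Toy cache table, purely for illustration.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS cache "
            "(resource_id TEXT, record_id TEXT, scientific_name TEXT)"
        )
        # 2. Flush all locally cached records for the resource.
        conn.execute("DELETE FROM cache WHERE resource_id = ?", (resource_id,))
        # 3. Bulk load the latest CSV data into the cache.
        conn.executemany(
            "INSERT INTO cache (resource_id, record_id, scientific_name) "
            "VALUES (?, ?, ?)",
            ((resource_id, row[0], row[1]) for row in rows),
        )
    conn.close()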
After discussion with Renato Giovanni and John Wieczorek, we've decided that merging this functionality into the TapirLink codebase would benefit the broader community. CSV generation support would be declared through capabilities. Although incremental harvesting wouldn't be immediately implemented, we could certainly extend the service to include it later.
I'd like to pause here to gauge the consensus, thoughts, concerns, and ideas of others. Anyone?
Thanks, Aaron
2008/5/5 Kevin Richards RichardsK@landcareresearch.co.nz:
I think I agree here.
The harvesting "procedure" is really defined outside the Tapir protocol, is it not? So it is really an agreement between the harvester and the harvestees.
So what is really needed here is a standard procedure for maintaining a "harvestable" dataset and a standard procedure for harvesting that dataset. We have a general rule at Landcare that we never delete records from our datasets: they are either deprecated in favour of another record, in which case resolving the old record points to the new one, or they are set to a state of "deleted" but are still kept in the dataset and can be resolved (which would indicate the deleted state).
Kevin
"Renato De Giovanni" renato@cria.org.br 6/05/2008 7:33 a.m. >>>
Hi Markus,
I would suggest creating new concepts for incremental harvesting, either in the data standards themselves or in some new extension. In the case of TAPIR, GBIF could easily check the mapped concepts before deciding between incremental or full harvesting.
Actually it could be just one new concept, such as "recordStatus" or "deletionFlag". Or perhaps you might also want to create your own definition for dateLastModified, indicating which set of concepts should be considered when deciding whether something has changed, but I guess this level of granularity would be difficult to support.
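As a rough sketch of how a harvester might act on that suggestion, assume the provider's capabilities response has already been reduced to a set of mapped concept names; the concept names are just the ones suggested above, and fetching and parsing the capabilities response is outside this sketch.

def plan_harvest(mapped_concepts, last_harvest=None):
    """Decide between incremental and full harvesting from the concepts a
    provider has mapped (mapped_concepts is a set of concept names)."""
    has_modified = "dateLastModified" in mapped_concepts
    has_deletions = bool({"recordStatus", "deletionFlag"} & set(mapped_concepts))
    if last_harvest and has_modified and has_deletions:
        # Changed records can be selected via dateLastModified, and deletions
        # can be recognised through the status/deletion concept.
        return "incremental"
    # Without a reliable way to detect deletions, only a full harvest is safe.
    return "full"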
Regards,
Renato
On 5 May 2008 at 11:24, Markus Döring wrote:
Phil, incremental harvesting is not implemented on the GBIF side as far as I am aware, and I don't think it will be a simple thing to implement on the current system. Also, even if we can detect only the records changed since the last harvesting via dateLastModified, we still have no information about deletions. We could have an arrangement saying that you keep deleted records as empty records with just the ID and nothing else (I vaguely remember LSIDs were supposed to work like this too). But that then also needs to be supported on your side, never entirely removing any record. I will have a discussion with the others at GBIF about that.
Markus
We have used a very similar protocol to assemble the latest AVH cache. It should be noted that this is an "as well as" protocol, and that it only works because we have an established semantic standard (HISPID/ABCD).
greg
Interesting that we all come to the same conclusions... The trouble I had with just a simple flat CSV file is repeating properties, such as multiple image URLs. ABCD clients don't use ABCD just because it's complex, but because they want to transport this relational data.

We were considering two solutions for extending the CSV approach. The first would be a single large denormalised CSV file with many rows for the same record. It would require knowledge about the related entities, though, and could grow in size rapidly. The second idea, which we are inclined to adopt, is to allow a single level of one-to-many related entities. It is basically a "star" design, with the core DwC table in the center and any number of extension tables around it. Each "table", i.e. CSV file, has the record id as its first column, so the files can be related easily and only a single identifier is needed per record, not one for each extension entity. This gives a lot of flexibility while keeping things pretty simple to deal with. It would even satisfy the ABCD needs, as I haven't yet seen anyone requiring two levels of related tables (other than lookup tables). An extension could even be a simple 1-1 relation, but it would keep things semantically together, just like an XML namespace. The Darwin Core extensions would be a good example.
So we could have a gzipped set of files, maybe with a simple metafile indicating the semantics of the columns for each file. An example could look like this:
# darwincore.csv
102  Aster alpinus subsp. parviceps  ...
103  Polygala vulgaris  ...

# curatorial.csv
102  Kew Herbarium
103  Reading Herbarium

# identification.csv
102  2003-05-04  Karl Marx  Aster alpinus L.
102  2007-01-11  Mark Twain  Aster korshinskyi Tamamsch.
102  2007-09-13  Roger Hyam  Aster alpinus subsp. parviceps Novopokr.
103  2001-02-21  Steve Bekow  Polygala vulgaris L.
I know this looks old-fashioned, but it is just so simple and gives us so much flexibility.
Markus
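As a sketch of how a consumer might stitch such a "star" archive back together, the following assumes tab-delimited files named as in the example above, each with the core record id in its first column; the directory layout and the absence of a metafile parser are simplifications of my own, not part of the proposal.

import csv
import glob
import os
from collections import defaultdict

def read_star_archive(directory):
    """Read a 'star' archive: darwincore.csv is the core table and every
    other *.csv file is an extension keyed on the core record id."""
    def rows(path):
        with open(path, newline="", encoding="utf-8") as handle:
            for row in csv.reader(handle, delimiter="\t"):
                yield row

    records = {}
    for row in rows(os.path.join(directory, "darwincore.csv")):
        records[row[0]] = {"core": row[1:], "extensions": defaultdict(list)}

    for path in glob.glob(os.path.join(directory, "*.csv")):
        name = os.path.basename(path)
        if name == "darwincore.csv":
            continue
        for row in rows(path):
            record_id, values = row[0], row[1:]
            if record_id in records:
                # A core record may own many extension rows (1-many), e.g.
                # the three identifications of record 102 above.
                records[record_id]["extensions"][name].append(values)
    return records

# e.g. read_star_archive("dwc_index")["102"]["extensions"]["identification.csv"]
# would return the three identification rows for record 102.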
Generally, if we are going to have CSV files for data transfer, we don't need software implementations, just some documentation on what the CSV files should contain. Something along the lines of:
1) Make a report from your database as a CSV file (or files) with the following columns...
2) Zip it up.
3) Either put it on a webserver and send us the URL, or upload it using this webform.
We don't need to bother with TAPIR etc. You could even produce a CSV file of only the records that have changed, so big datasets needn't be a problem.
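On the provider side, the first two of those steps could be as small as the following sketch; the table name, the columns, and the output filenames are placeholders, and publishing the file (step 3) is left to whatever webserver or upload form is agreed.

import csv
import sqlite3
import zipfile

def export_report(db_path, out_zip="dataset_report.zip"):
    """1) Report the dataset as a CSV file and 2) zip it up. The occurrences
    table and its columns are invented for illustration."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT record_id, scientific_name, country, date_collected "
        "FROM occurrences"
    )
    with open("dataset_report.csv", "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["record_id", "scientific_name", "country", "date_collected"])
        writer.writerows(rows)
    conn.close()
    with zipfile.ZipFile(out_zip, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.write("dataset_report.csv")
    return out_zip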
I worry that we are working out how to move data about quickly and forgetting that the real goal is to integrate data, and that will only come if people put GUIDs on the stuff they own and use other people's GUIDs in their data. Solutions based around CSV files do nothing to move people in that direction, and I suspect they would make matters worse.
Finding ourselves in a hole and digging quicker may not be the best option.
Roger
-------------------------------------------------------------
Roger Hyam
Roger@BiodiversityCollectionsIndex.org
http://www.BiodiversityCollectionsIndex.org
-------------------------------------------------------------
Royal Botanic Garden Edinburgh
20A Inverleith Row, Edinburgh, EH3 5LR, UK
Tel: +44 131 552 7171 ext 3015
Fax: +44 131 248 2901
http://www.rbge.org.uk/
-------------------------------------------------------------
Roger writes: "I worry that we are working out how to move data about quickly".
That is exactly what this is for, but why is it a worry (other than for the likes of GBIF, who really do worry about moving data around quickly, since everyone is shouting about latency problems)?
It is roughly 166 times (3MB versus 500MB) more efficient for transferring a data source when you want the whole thing. The document passed across is still standards compliant (DwC plus flat extension schemas), and incorporating its generation into tools like a TAPIR wrapper would ensure this. The reality is that many of the very large datasets have to come to GBIF like this; the existing transfer protocols just do not perform.
Furthermore, think how much easier it would be for someone like Catalogue of Life or ITIS to put up a service that says "hey, you give me the URL to your Locally generated DwC Index File and I'll give you back a report containing YOUR occurrence GUID, and MY LSID for your identification". Isn't that a good thing?
In my view these files are additional to any existing interfaces, only meet certain data-type requirements, and by no means detract from any of the important work (both the technical and the social aspects) on GUID assignment, document schemas, etc. Therefore, just as sitemaps became a requirement for large websites, I think our community needs a more efficient, standards-based approach than "dump your data and we'll handle it".
Hi Tim,
The thing about sitemaps is that they describe resources with URIs; they are not just a dump of an Excel file.
I will buy you a beer in Oz if any proposal that is put forward mandates the use of GUIDs as primary keys in the CSV files (other than perhaps in the additional one-to-many files Markus was proposing). I'd buy you several beers if you manage to get it accepted :)
All the best,
Roger
BTW: Another way to represent a graph of data (other than a series of linked CSV files) would be to do it in RDF as Turtle, then zipped. This does away with the need for a separate dictionary to describe what the columns mean, has to be UTF-8, can include data types, etc. A script to explode this back into tables probably wouldn't be too slow, but this is probably just fantasy on my part.
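As a rough illustration of that aside, record 102 from Markus's example could be expressed with rdflib and gzipped; the namespaces and property choices here are only illustrative, and the subject would ideally be a resolvable GUID rather than an example URI.

import gzip
from rdflib import Graph, Literal, Namespace

DWC = Namespace("http://rs.tdwg.org/dwc/terms/")  # illustrative vocabulary
EX = Namespace("http://example.org/records/")     # placeholder identifiers

g = Graph()
g.bind("dwc", DWC)
record = EX["102"]
g.add((record, DWC.scientificName, Literal("Aster alpinus subsp. parviceps")))
g.add((record, DWC.institutionCode, Literal("Kew Herbarium")))

turtle = g.serialize(format="turtle")  # returns a str in rdflib >= 6
with gzip.open("record.ttl.gz", "wt", encoding="utf-8") as handle:
    handle.write(turtle)

The Turtle output is self-describing (prefixes instead of a separate column dictionary) and UTF-8, which is exactly the appeal mentioned above.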
On 14 May 2008, at 11:21, Tim Robertson wrote:
Roegr writes "I worry that we are working out how to move data about quickly"
That is exactly what this is for, but why is it a worry (other than the likes of GBIF who really are worrying about moving data around quickly since everyone is shouting about latency problems)?
It is a 166 times (3meg versus 500meg) more efficient transfer of a data source for those wishing to transfer the whole thing. It is still standards compliant for the document passed across (DwC + flat extension schemas), and by incorporating it's generation into tools like a TAPIR wrapper, would ensure this. The reality is, many of the very large datasets have to come to GBIF like this - the transfer protocols existing just do not perform.
Furthermore, think how much easier it would be for someone like Catalogue of Life or ITIS to put up a service that says "hey, you give me the URL to your Locally generated DwC Index File and I'll give you back a report containing YOUR occurrence GUID, and MY LSID for your identification". Isn't that a good thing?
In my view these files are additional to any existing interfaces, only meet certain data-type requirements, and by no means detract from any of the important work (both technical and social) on GUID assignment, document schemas, etc. Therefore, just as sitemaps became a requirement for large web sites, I think a more efficient, standards-based approach (rather than "just dump your data and we'll handle it") is required for our community.
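As a rough illustration of the kind of index-file matching service described above, a sketch in Python (the URL, the column positions and the tiny checklist lookup are all invented for the example; a real service would query Catalogue of Life or ITIS):

import csv, gzip, io, urllib.request

# Hypothetical checklist: verbatim name -> LSID.
CHECKLIST = {"Aster alpinus subsp. parviceps": "urn:lsid:example.org:names:12345"}

def report(index_url):
    raw = urllib.request.urlopen(index_url).read()
    text = gzip.decompress(raw).decode("utf-8")
    # Assume a tab file with the provider's record GUID in column 1
    # and the scientific name in column 2.
    for guid, name, *rest in csv.reader(io.StringIO(text), delimiter="\t"):
        yield guid, CHECKLIST.get(name, "no match")

for guid, lsid in report("http://provider.example.org/dwc-index.txt.gz"):
    print(guid, lsid, sep="\t")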
-----Original Message----- From: tdwg-tapir-bounces@lists.tdwg.org [mailto:tdwg-tapir-bounces@lists.tdwg.org] On Behalf Of Roger Hyam (TDWG) Sent: Wednesday, May 14, 2008 11:57 AM To: Markus Döring Cc: Hiscom-L Mailing List ((E-mail)); tdwg-tapir@lists.tdwg.org Subject: Re: [tdwg-tapir] Fwd: Tapir protocol - Harvest methods?[SEC=UNCLASSIFIED]
Generally, if we are going to have csv files for data transfer we don't need software implementations, just some documentation on what the csv files should contain. Something along the lines of:
1) Make a report from your database as a csv file(s) with the following columns...
2) Zip it up.
3) Either put it on a webserver and send us the URL, or upload it using this webform.
We don't need to bother with TAPIR etc. You could even produce a csv file of only the records that have changed, so big data sets needn't be a problem.
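A minimal sketch of that recipe in Python, assuming a SQLite database with a date_last_modified column (the table and column names are illustrative only):

import csv, gzip, sqlite3

def dump_changed(db_path, since, out_path="report.csv.gz"):
    # Report only the records changed since the last harvest, gzipped.
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT record_id, scientific_name, country, date_last_modified "
        "FROM occurrence WHERE date_last_modified > ?", (since,))
    with gzip.open(out_path, "wt", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["record_id", "scientific_name", "country", "date_last_modified"])
        writer.writerows(rows)
    return out_path  # put this on a webserver and send the URL, or upload it

dump_changed("provider.db", "2008-05-01")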
I worry that we are working out how to move data about quickly and forgetting that the real goal is to integrate data, and that will only come if people have GUIDs on the stuff they own and use other people's GUIDs in their data. Solutions based around CSV files do nothing to move people in that direction and, I suspect, would make matters worse.
Finding ourselves in a hole, digging quicker may not be the best option.
Roger
Roger Hyam Roger@BiodiversityCollectionsIndex.org http://www.BiodiversityCollectionsIndex.org
Royal Botanic Garden Edinburgh 20A Inverleith Row, Edinburgh, EH3 5LR, UK Tel: +44 131 552 7171 ext 3015 Fax: +44 131 248 2901 http://www.rbge.org.uk/
On 14 May 2008, at 10:21, Markus Döring wrote:
Interesting that we all come to the same conclusions... The trouble I had with just a simple flat csv file is repeating properties like multiple image URLs. ABCD clients don't use ABCD just because it's complex, but because they want to transport this relational data. We were considering 2 solutions for extending this csv approach.
The first would be to have a single large denormalised csv file with many rows for the same record. It would require knowledge about the related entities though, and could grow in size rapidly.
The second idea, which we think to adopt, is allowing a single level of 1-to-many related entities. It is basically a "star" design with the core DwC table in the center and any number of extension tables around it. Each "table", aka csv file, will have the record id as the first column, so the files can be related easily and it only needs a single identifier per record and not one for the extension entities. This would give a lot of flexibility while keeping things pretty simple to deal with. It would even satisfy the ABCD needs, as I haven't yet seen anyone requiring 2 levels of related tables (other than lookup tables). Those extensions could even be a simple 1-1 relation, but would keep things semantically together just like an XML namespace. The Darwin Core extensions would be a good example.
So we could have a gzipped set of files, maybe with a simple metafile indicating the semantics of the columns for each file. An example could look like this:
# darwincore.csv
102 Aster alpinus subsp. parviceps ...
103 Polygala vulgaris ...

# curatorial.csv
102 Kew Herbarium
103 Reading Herbarium

# identification.csv
102 2003-05-04 Karl Marx Aster alpinus L.
102 2007-01-11 Mark Twain Aster korshinskyi Tamamsch.
102 2007-09-13 Roger Hyam Aster alpinus subsp. parviceps Novopokr.
103 2001-02-21 Steve Bekow Polygala vulgaris L.
I know this looks old fashioned, but it is just so simple and gives us so much flexibility. Markus
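A sketch of how a harvester might stitch such a gzipped file set back together, assuming tab-separated values with the record id in the first column (the file names follow the example above; everything else is an assumption):

import csv, gzip
from collections import defaultdict

def rows(path):
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            yield row[0], row[1:]  # record id, remaining columns

core = dict(rows("darwincore.csv.gz"))                 # the table at the center of the star
identifications = defaultdict(list)
for rec_id, fields in rows("identification.csv.gz"):   # a 1-to-many extension
    identifications[rec_id].append(fields)

for rec_id, dwc in core.items():
    print(rec_id, dwc, identifications.get(rec_id, []))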
On 14 May, 2008, at 24:39, Greg Whitbread wrote:
We have used a very similar protocol to assemble the latest AVH cache. It should be noted that this is an as-well-as protocol that only works because we have an established semantic standard (hispid/abcd).
greg
Hi Roger,
<Homer style>Hmmm free beer. Hang on, if worrying about trying to transfer large data is not a good incentive for standardising a transfer mechanism, is free beer any better? ;o)
But seriously,
If the proposal was along the lines of a Tab file:
- LSID kingdom phylum class order basis_of_record....
And then supporting files (star schema) with:
- LSID latitude longitude ....
If the wrappers generated these kinds of structures using the same configuration generated when a user installed them, would you feel happier? This is really what Markus and I are proposing, and we fully support all the GUID generation work - I for one am desperate for it, including the BCI "datasource"-level GUIDs.
The analogy to sitemaps is quite simple - these index files do not provide the full detail; they provide the means to build an index based on DwC concepts, which would then facilitate access to the full detail record - e.g. through LSID. The LSID/GUID part is the same as the sitemap URI - no?
Tim
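A sketch of what the producing side of that could look like, writing a core file and one extension file with the GUID as the first column of each (the LSIDs, columns and file names are invented for the example):

import csv, gzip

core = [
    ("urn:lsid:example.org:occ:102", "Plantae", "Asteraceae", "PreservedSpecimen"),
    ("urn:lsid:example.org:occ:103", "Plantae", "Polygalaceae", "PreservedSpecimen"),
]
geospatial = [
    ("urn:lsid:example.org:occ:102", "46.5", "9.8"),
    ("urn:lsid:example.org:occ:103", "51.4", "-0.9"),
]

def write(path, rows):
    with gzip.open(path, "wt", newline="", encoding="utf-8") as f:
        csv.writer(f, delimiter="\t").writerows(rows)

write("core.txt.gz", core)              # LSID, kingdom, family, basis_of_record
write("geospatial.txt.gz", geospatial)  # LSID, latitude, longitude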
Tim,
Ahh, you hit the nail on the head. If these sitemaps contain just the indexing fields for records (and there is potentially more information available from another source) then there needs to be an unambiguous mechanism to link the things in the sitemaps to the things available via other means, i.e. GUIDs (which could be URIs of various flavours, including LSIDs).
LSID Authority plus sitemap would be good.
So we must mandate the use of GUIDs - your beer is practically safe.
Roger
Hi Roger,
Right, so I think we were talking over each other and both agree that the GUIDs (record and 'data source') and the resolution mechanism are vital, along with the schemas etc. for the full record response document.
This is slightly cleverer than a sitemap - a sitemap says "hey, here are the URIs of interest", but then you must resolve each one and build your full-text index (if you are called Google). What we are proposing is the URI plus a local index (the DwC fields) that is enough in some cases (the GBIF portal in its current state) to avoid having to resolve each record afterwards. It would also act as a seed for OAI-PMH style crawlers.
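A sketch of that "seed" idea, where a crawler takes just the GUID column from the index and resolves each record for the full detail later (the index layout, resolver URL and file name are assumptions):

import csv, gzip, urllib.parse, urllib.request

def guids(index_path):
    # Column 1 of every row in the index file is assumed to hold the record GUID/LSID.
    with gzip.open(index_path, "rt", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            yield row[0]

def crawl(index_path, resolver="http://provider.example.org/resolve?guid="):
    # Resolve each GUID for the full record only when (and if) it is needed.
    for guid in guids(index_path):
        yield guid, urllib.request.urlopen(resolver + urllib.parse.quote(guid)).read()

for guid, record in crawl("dwc-index.txt.gz"):
    print(guid, len(record), "bytes")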
Of course this does not help with aggregators who cannot maintain GUIDs - but that is a separate problem independent of any transfer mechanism.
Do you still have strong objections to this kind of approach?
Thanks
Tim
-----Original Message----- From: Roger Hyam (TDWG) [mailto:rogerhyam@mac.com] Sent: Wednesday, May 14, 2008 1:39 PM To: Tim Robertson Cc: 'Markus Döring'; tdwg-tapir@lists.tdwg.org Subject: Re: [tdwg-tapir] Fwd: Tapir protocol - Harvest methods?[SEC=UNCLASSIFIED]
Tim,
Ahh you hit the nail on the head. If these sitemaps contain just the indexing fields for records (and there is potentially more information available from another source) then there needs to be an unambiguous mechanism to link the things in the sitemaps to the things available via another means. i.e. GUIDs (could be URIs of various flavours including LSIDs).
LSID Authority plus sitemap would be good.
So we must mandate the use of GUIDs - your beer is practically safe.
Roger
On 14 May 2008, at 12:12, Tim Robertson wrote:
Hi Roger,
<Homer style>Hmmm free beer. Hang on, if worrying about trying to transfer large data is not a good incentive for standardising a transfer mechanism, is free beer any better? ;o)
But seriously,
If the proposal was along the lines of a Tab file:
- LSID kingdom phylum class order basis_of_record....
And then supporting files (star schema) with:
- LSID latitude longitude ....
If the wrappers generated these kind of structures using the same configuration generated when a user installed it, would you feel happier? This is really what Markus and I are proposing, and we fully support all the GUID generation work and I for one am desperate for it, including the BCI "datasource" level GUIDs.
The analogy to sitemaps is quite simple - these index files do not provide the full detail - they provide the means to build an index based on DwC concepts, that would then facilitate the accession of the full detail record
- e.g. through LSID. The LSID/GUID part is the same as the sitemap
URI - no?
Tim
-----Original Message----- From: Roger Hyam (TDWG) [mailto:rogerhyam@mac.com] Sent: Wednesday, May 14, 2008 12:58 PM To: Tim Robertson Cc: 'Markus Döring'; 'Hiscom-L Mailing List ((E-mail))'; tdwg-tapir@lists.tdwg.org Subject: Re: [tdwg-tapir] Fwd: Tapir protocol - Harvest methods?[SEC=UNCLASSIFIED]
Hi Tim,
The thing about the sitemaps is that they describe resources with URIs they are not just a dump of an excel file.
I will buy you a beer in Oz if any proposal that is put forward mandates the use of GUIDs for primary keys in the CSV files (other than perhaps the additional files Markus was proposing of one to many relationships). I'd buy you several beers if you manage to get it accepted :)
All the best,
Roger
BTW: Another way to represent a graph of data (other than a series of linked csv files) would be to do it in RDF as Turtle then zipped. This does way with the need of a separate dictionary to describe what the columns mean, has to be UTF-8, can include data types etc ... A script to explode this back to tables probably wouldn't be too slow but this is probably just fantasy on my part.
On 14 May 2008, at 11:21, Tim Robertson wrote:
Roegr writes "I worry that we are working out how to move data about quickly"
That is exactly what this is for, but why is it a worry (other than the likes of GBIF who really are worrying about moving data around quickly since everyone is shouting about latency problems)?
It is a 166 times (3meg versus 500meg) more efficient transfer of a data source for those wishing to transfer the whole thing. It is still standards compliant for the document passed across (DwC + flat extension schemas), and by incorporating it's generation into tools like a TAPIR wrapper, would ensure this. The reality is, many of the very large datasets have to come to GBIF like this - the transfer protocols existing just do not perform.
Furthermore, think how much easier it would be for someone like Catalogue of Life or ITIS to put up a service that says "hey, you give me the URL to your Locally generated DwC Index File and I'll give you back a report containing YOUR occurrence GUID, and MY LSID for your identification". Isn't that a good thing?
In my view these files are additional to any existing interfaces, only meet certain data type requirements and by no means detract from any of the important work (both technical and social aspects) on GUID assigning, document schemas etc. Therefore, like sitemaps became a requirement for large web sites, I think a more efficient standards based (than just dump your data and we'll handle it) approach is required for our community.
-----Original Message----- From: tdwg-tapir-bounces@lists.tdwg.org [mailto:tdwg-tapir-bounces@lists.tdwg.org] On Behalf Of Roger Hyam (TDWG) Sent: Wednesday, May 14, 2008 11:57 AM To: Markus Döring Cc: Hiscom-L Mailing List ((E-mail)); tdwg-tapir@lists.tdwg.org Subject: Re: [tdwg-tapir] Fwd: Tapir protocol - Harvest methods?[SEC=UNCLASSIFIED]
Generally if we are going to have csv files for data transfer we don't need to have software implementations just some documentation on what the csv files should contain. Something along the lines of:
- Make a report from your database as a csv file(s) with the
following columns... 2) Zip it up. 3) Either put it on a webserver and send us the URL or upload it using this webform.
We don't need to bother with TAPIR etc. You could even only produce a CSV file of the records that have changed so big data sets needn't be a problem.
I worry that we are working out how to move data about quickly and forgetting that the real goal is to integrate data and that will only come if people have GUIDs on the stuff they own and use other peoples GUIDs in their data. Solutions based around CSV files do nothing to move people in that direction and I would suspect lead to making matters worse.
Finding ourselves in hole digging quicker may not be the best option.
Roger
Roger Hyam Roger@BiodiversityCollectionsIndex.org http://www.BiodiversityCollectionsIndex.org
Royal Botanic Garden Edinburgh 20A Inverleith Row, Edinburgh, EH3 5LR, UK Tel: +44 131 552 7171 ext 3015 Fax: +44 131 248 2901 http://www.rbge.org.uk/
On 14 May 2008, at 10:21, Markus Döring wrote:
Interesting that we all come to the same conclusions... The trouble I had with just a simple flat csv file is repeating properties like multiple image urls. ABCD clients dont use ABCD just because its complex, but because they want to transport this relational data. We were considering 2 solutions to extending this csv approach. The first would be to have a single large denormalised csv file with many rows for the same record. It would require knowledge about the related entities though and could grow in size rapidly. The second idea which we think to adopt is allowing a single level of 1- many related entities. It is basically a "star" design with the core dwc table in the center and any number of extension tables around it. Each "table" aka csv file will have the record id as the first column, so the files can be related easily and it only needs a single identifier per record and not for the extension entities. This would give a lot of flexibility while keeping things pretty simple to deal with. It would even satisfy the ABCD needs as I havent yet seen anyone requiring 2 levels of related tables (other than lookup tables). Those extensions could even be a simple 1-1 relation, but would keep things semantically together just like a xml namespace. The darwin core extensions would be good for example.
So we could have a gzipped set of files, maybe with a simple metafile indicating the semantics of the columns for each file. An example could look like this:
# darwincore.csv
102    Aster alpinus subsp. parviceps    ...
103    Polygala vulgaris    ...

# curatorial.csv
102    Kew Herbarium
103    Reading Herbarium

# identification.csv
102    2003-05-04    Karl Marx    Aster alpinus L.
102    2007-01-11    Mark Twain    Aster korshinskyi Tamamsch.
102    2007-09-13    Roger Hyam    Aster alpinus subsp. parviceps Novopokr.
103    2001-02-21    Steve Bekow    Polygala vulgaris L.
I know this looks old fashioned, but it is just so simple and gives us so much flexibility. Markus
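A minimal sketch of a consumer for the star layout above, assuming tab-delimited files named as in the example; the record id in the first column ties each extension row back to its core record.

```python
# Load the core file, then attach extension rows (e.g. identifications) to the
# matching core record via the shared record id in column one.
import csv
from collections import defaultdict

def read_star(core_path="darwincore.csv", extension_paths=("identification.csv",)):
    records = {}
    with open(core_path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):  # tab-delimited is assumed
            records[row[0]] = {"core": row[1:], "extensions": defaultdict(list)}
    for path in extension_paths:
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.reader(f, delimiter="\t"):
                if row[0] in records:  # ignore orphan extension rows
                    records[row[0]]["extensions"][path].append(row[1:])
    return records

# records["102"]["extensions"]["identification.csv"] would then hold the three
# identification rows for record 102 in the example above.
```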
On 14 May, 2008, at 24:39, Greg Whitbread wrote:
We have used a very similar protocol to assemble the latest AVH cache. It should be noted that this is an as-well-as protocol that only works because we have an established semantic standard (hispid/abcd).
greg
Tim,
I have no problems at all provided GUIDs are in the sitemaps.
In my evil way I am just hoping that this is another mechanism to push all suppliers towards using GUIDs.
It would be easy for people to just say "All my data is in the sitemap" and not put up any other service or bother labeling the data with appropriate GUIDs, and it would be easy for an indexer to say "we will just dump and restore the data each time we process that file" and not track individual records through time.
If the sitemaps don't require GUIDs then it would be bad news from a specimen collections point of view.
Do Australians do real beer?
Roger
On 14 May 2008, at 13:12, Tim Robertson wrote:
Hi Roger,
Right, so I think we were talking over each other and both agree that the GUIDs (record and 'data source') and the resolution mechanism are vital, along with the schemas etc. for the full record response document.
This is slightly cleverer than a sitemap - a sitemap says "hey, here are the URIs of interest", but then you must resolve each one and build your full text index (if you are called Google). What we are proposing is the URI plus a local index (the DwC fields), which is enough for some instances (the GBIF portal in its current state) to avoid resolving each record afterwards. It would also act as a seed for OAI-PMH style crawlers.
Of course this does not help with aggregators who cannot maintain GUIDs - but that is a separate problem independent of any transfer mechanism.
Do you still have strong objections to this kind of approach?
Thanks
Tim
-----Original Message----- From: Roger Hyam (TDWG) [mailto:rogerhyam@mac.com] Sent: Wednesday, May 14, 2008 1:39 PM To: Tim Robertson Cc: 'Markus Döring'; tdwg-tapir@lists.tdwg.org Subject: Re: [tdwg-tapir] Fwd: Tapir protocol - Harvest methods?[SEC=UNCLASSIFIED]
Tim,
Ahh, you hit the nail on the head. If these sitemaps contain just the indexing fields for records (and there is potentially more information available from another source), then there needs to be an unambiguous mechanism to link the things in the sitemaps to the things available via other means, i.e. GUIDs (which could be URIs of various flavours, including LSIDs).
LSID Authority plus sitemap would be good.
So we must mandate the use of GUIDs - your beer is practically safe.
Roger
On 14 May 2008, at 12:12, Tim Robertson wrote:
Hi Roger,
<Homer style>Hmmm free beer. Hang on, if worrying about trying to transfer large data is not a good incentive for standardising a transfer mechanism, is free beer any better? ;o)
But seriously,
If the proposal was along the lines of a Tab file:
- LSID kingdom phylum class order basis_of_record....
And then supporting files (star schema) with:
- LSID latitude longitude ....
If the wrappers generated these kinds of structures using the same configuration created when a user installed them, would you feel happier? This is really what Markus and I are proposing, and we fully support all the GUID generation work - I for one am desperate for it - including the BCI "datasource" level GUIDs.
The analogy to sitemaps is quite simple - these index files do not provide the full detail; they provide the means to build an index based on DwC concepts, which would then facilitate retrieval of the full detail record, e.g. through LSID. The LSID/GUID part is the same as the sitemap URI - no?
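A minimal sketch of the tab files Tim describes, with the LSID as the shared key between the core file and a geospatial extension; the LSIDs, values, and file names are purely illustrative.

```python
# Write a core DwC tab file plus a star-schema extension file, both keyed by
# LSID; a real wrapper would pull the rows from its existing column mapping.
CORE_HEADER = ["lsid", "kingdom", "phylum", "class", "order", "basis_of_record"]
GEO_HEADER = ["lsid", "latitude", "longitude"]

core_rows = [
    ("urn:lsid:example.org:occ:102", "Plantae", "Tracheophyta",
     "Magnoliopsida", "Asterales", "PreservedSpecimen"),
]
geo_rows = [
    ("urn:lsid:example.org:occ:102", "55.95", "-3.21"),
]

def write_tab(path, header, rows):
    with open(path, "w", encoding="utf-8") as f:
        f.write("\t".join(header) + "\n")
        for row in rows:
            f.write("\t".join(row) + "\n")

write_tab("dwc_core.txt", CORE_HEADER, core_rows)
write_tab("dwc_geospatial.txt", GEO_HEADER, geo_rows)
```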
Tim
-----Original Message----- From: Roger Hyam (TDWG) [mailto:rogerhyam@mac.com] Sent: Wednesday, May 14, 2008 12:58 PM To: Tim Robertson Cc: 'Markus Döring'; 'Hiscom-L Mailing List ((E-mail))'; tdwg-tapir@lists.tdwg.org Subject: Re: [tdwg-tapir] Fwd: Tapir protocol - Harvest methods?[SEC=UNCLASSIFIED]
Hi Tim,
The thing about sitemaps is that they describe resources with URIs; they are not just a dump of an Excel file.
I will buy you a beer in Oz if any proposal that is put forward mandates the use of GUIDs as primary keys in the CSV files (other than perhaps in the additional one-to-many files Markus was proposing). I'd buy you several beers if you manage to get it accepted :)
All the best,
Roger
BTW: Another way to represent a graph of data (other than a series of linked csv files) would be to do it in RDF as Turtle, then zipped. This does away with the need for a separate dictionary to describe what the columns mean, has to be UTF-8, can include data types etc... A script to explode this back into tables probably wouldn't be too slow, but this is probably just fantasy on my part.
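A minimal sketch of the Turtle idea, assuming the third-party rdflib library (not mentioned in the thread) and using a Darwin Core terms namespace purely as an illustration.

```python
# One record as RDF triples, serialised to Turtle, then "exploded" back into
# flat subject / predicate / value rows.
from rdflib import Graph, Literal, Namespace, URIRef

DWC = Namespace("http://rs.tdwg.org/dwc/terms/")  # illustrative namespace URI

g = Graph()
g.bind("dwc", DWC)
record = URIRef("urn:lsid:example.org:occ:102")
g.add((record, DWC.scientificName, Literal("Aster alpinus subsp. parviceps")))
g.add((record, DWC.institutionCode, Literal("Kew Herbarium")))

print(g.serialize(format="turtle"))  # str in recent rdflib, bytes in older ones

# Explode the graph back into table-like rows.
for subject in set(g.subjects()):
    for predicate, value in g.predicate_objects(subject):
        print(subject, predicate, value, sep="\t")
```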
Markus,
We have used the star technique with ABCD with some success. The core is pretty close to DwC if the current determination goes there too. It can then be delivered with headers appropriate to the harvester's preference. It gets harder when we try to do TCS.
for preserving relational data, we could also just dump tapirlink resources to an sqlite database file (http://www.sqlite.org), zip it up, and again make it available via the web service. we use sqlite internally for many projects, and it's both easy to use and well supported by jdbc, php, python, etc.
would something like this be a useful option?
thanks, aaron
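A minimal sketch of the SQLite dump suggested above, using Python's built-in sqlite3 and zipfile modules; the table and column names are illustrative, not part of any agreed schema.

```python
# Write the core and extension tables into a single SQLite file, then zip it
# so it can be exposed by the same web service as the CSV dumps.
import sqlite3
import zipfile

conn = sqlite3.connect("resource_dump.sqlite")
conn.execute("CREATE TABLE IF NOT EXISTS darwincore "
             "(record_id TEXT PRIMARY KEY, scientific_name TEXT)")
conn.execute("CREATE TABLE IF NOT EXISTS identification "
             "(record_id TEXT, date_identified TEXT, identified_by TEXT)")
conn.execute("INSERT OR REPLACE INTO darwincore VALUES (?, ?)",
             ("102", "Aster alpinus subsp. parviceps"))
conn.execute("INSERT INTO identification VALUES (?, ?, ?)",
             ("102", "2007-09-13", "Roger Hyam"))
conn.commit()
conn.close()

with zipfile.ZipFile("resource_dump.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write("resource_dump.sqlite")
```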
It would keep the relations, but we don't really want any relational structure to be served up. And using SQLite binaries for the DwC star scheme would not be easier to work with than plain text files: they can even be loaded into Excel straight away, versioned with SVN, and so on. If there is a geospatial extension file which has the GUID in the first column, applications might grab that directly and not even touch the central core file if they only want location data.
I'd prefer to stick with a csv or tab-delimited file. The simpler the better. And it also can't get corrupted as easily.
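A minimal sketch of the access pattern Markus mentions: an application that only wants coordinates reads the geospatial extension file directly and never touches the core file. The file name, tab delimiter, header row, and column order (GUID, latitude, longitude) are assumptions.

```python
# Build a GUID -> (lat, lon) lookup from the geospatial extension file alone.
import csv

def load_coordinates(path="dwc_geospatial.txt"):
    coords = {}
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="\t")
        next(reader, None)  # skip the header row
        for row in reader:
            guid, lat, lon = row[0], row[1], row[2]
            coords[guid] = (float(lat), float(lon))
    return coords
```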
Markus
On 14 May, 2008, at 15:25, Aaron D. Steele wrote:
for preserving relational data, we could also just dump tapirlink resources to an sqlite database file (http://www.sqlite.org), zip it up, and again make it available via the web service. we use sqlite internally for many projects, and it's both easy to use and well supported by jdbc, php, python, etc.
would something like this be a useful option?
thanks, aaron
On Wed, May 14, 2008 at 2:21 AM, Markus Döring mdoering@gbif.org wrote:
Interesting that we all come to the same conclusions... The trouble I had with just a simple flat csv file is repeating properties like multiple image urls. ABCD clients dont use ABCD just because its complex, but because they want to transport this relational data. We were considering 2 solutions to extending this csv approach. The first would be to have a single large denormalised csv file with many rows for the same record. It would require knowledge about the related entities though and could grow in size rapidly. The second idea which we think to adopt is allowing a single level of 1- many related entities. It is basically a "star" design with the core dwc table in the center and any number of extension tables around it. Each "table" aka csv file will have the record id as the first column, so the files can be related easily and it only needs a single identifier per record and not for the extension entities. This would give a lot of flexibility while keeping things pretty simple to deal with. It would even satisfy the ABCD needs as I havent yet seen anyone requiring 2 levels of related tables (other than lookup tables). Those extensions could even be a simple 1-1 relation, but would keep things semantically together just like a xml namespace. The darwin core extensions would be good for example.
So we could have a gzipped set of files, maybe with a simple metafile indicating the semantics of the columns for each file. An example could look like this:
# darwincore.csv 102 Aster alpinus subsp. parviceps ... 103 Polygala vulgaris ...
# curatorial.csv 102 Kew Herbarium 103 Reading Herbarium
# identification.csv 102 2003-05-04 Karl Marx Aster alpinus L. 102 2007-01-11 Mark Twain Aster korshinskyi Tamamsch. 102 2007-09-13 Roger Hyam Aster alpinus subsp. parviceps Novopokr. 103 2001-02-21 Steve Bekow Polygala vulgaris L.
I know this looks old fashioned, but it is just so simple and gives us so much flexibility. Markus
On 14 May, 2008, at 24:39, Greg Whitbread wrote:
We have used a very similar protocol to assemble the latest AVH cache. It should be noted that this is an as-well-as protocol that only works because we have an established semantic standard (hispid/abcd).
greg
trobertson@gbif.org wrote:
Begin forwarded message:
From: "Aaron D. Steele" eightysteele@gmail.com Date: 13 de mayo de 2008 22:40:09 GMT+02:00 To: tdwg-tapir@lists.tdwg.org Cc: Aaron Steele asteele@berkeley.edu Subject: Re: [tdwg-tapir] Tapir protocol - Harvest methods?
at berkeley we've recently prototyped a simple php program that uses an existing tapirlink installation to periodically dump tapir resources into a csv file. the solution is totally generic and can dump darwin core (and technically abcd schema, although it's currently untested). the resulting csv files are zip archived and made accessible using a web service. it's a simple approach that has proven to be, at least internally, quite reliable and useful.
for example, several of our caching applications use the web service to harvest csv data from tapirlink resources using the following process:
- download latest csv dump for a resource using the web service.
- flush all locally cached records for the resource.
- bulk load the latest csv data into the cache.
in this way, cached data are always synchronized with the resource and there's no need to track new, deleted, or changed records. as an aside, each time these cached data are queried by the caching application or selected in the user interface, log-only search requests are sent back to the resource.
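A minimal sketch of that refresh cycle in Python, using a local sqlite cache purely for illustration; the dump URL, archive layout and cache table are hypothetical, not part of the prototype described above:

import csv, io, sqlite3, urllib.request, zipfile

# hypothetical dump endpoint for one resource
DUMP_URL = "http://example.org/tapirlink/dump.php?resource=specimens"

def refresh_cache(db_path="cache.sqlite"):
    # 1. download the latest zipped csv dump for the resource
    with urllib.request.urlopen(DUMP_URL) as resp:
        archive = zipfile.ZipFile(io.BytesIO(resp.read()))
    csv_name = archive.namelist()[0]
    text = io.TextIOWrapper(archive.open(csv_name), encoding="utf-8")
    rows = list(csv.reader(text))
    header, data = rows[0], rows[1:]

    conn = sqlite3.connect(db_path)
    cols = ", ".join('"%s"' % c for c in header)
    placeholders = ", ".join("?" for _ in header)
    # 2. flush all locally cached records for the resource
    conn.execute("DROP TABLE IF EXISTS cached_records")
    conn.execute("CREATE TABLE cached_records (%s)" % cols)
    # 3. bulk load the latest csv data into the cache
    conn.executemany("INSERT INTO cached_records VALUES (%s)" % placeholders, data)
    conn.commit()
    conn.close()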
after discussion with renato giovanni and john wieczorek, we've decided that merging this functionality into the tapirlink codebase would benefit the broader community. csv generation support would be declared through capabilities. although incremental harvesting wouldn't be immediately implemented, we could certainly extend the service to include it later.
i'd like to pause here to gauge the consensus, thoughts, concerns, and ideas of others. anyone?
thanks, aaron
2008/5/5 Kevin Richards RichardsK@landcareresearch.co.nz:

I think I agree here. The harvesting "procedure" is really defined outside the Tapir protocol, is it not? So it is really an agreement between the harvester and the harvestees. So what is really needed here is the standard procedure for maintaining a "harvestable" dataset and the standard procedure for harvesting that dataset. We have a general rule at Landcare that we never delete records in our datasets - they are either deprecated in favour of another record, and so the resolution of that record would point to the new record, or they are set to a state of "deleted", but are still kept in the dataset, and can be resolved (which would indicate a state of deleted).

Kevin

"Renato De Giovanni" renato@cria.org.br wrote on 6/05/2008 7:33 a.m.:

Hi Markus,

I would suggest creating new concepts for incremental harvesting, either in the data standards themselves or in some new extension. In the case of TAPIR, GBIF could easily check the mapped concepts before deciding between incremental or full harvesting.

Actually it could be just one new concept such as "recordStatus" or "deletionFlag". Or perhaps you could also want to create your own definition for dateLastModified indicating which set of concepts should be considered to see if something has changed or not, but I guess this level of granularity would be difficult to support.

Regards,
-- Renato

On 5 May 2008 at 11:24, Markus Döring wrote:

Phil, incremental harvesting is not implemented on the GBIF side as far as I am aware. And I don't think that will be a simple thing to implement on the current system. Also, even if we can detect only the changed records since the last harvesting via dateLastModified, we still have no information about deletions. We could have an arrangement saying that you keep deleted records as empty records with just the ID and nothing else (I vaguely remember LSIDs were supposed to work like this too). But that also needs to be supported on your side then, never entirely removing any record. I will have a discussion with the others at GBIF about that.

Markus
--
greg whitbread, Australian Centre for Plant Biodiversity Research / Integrated Botanical Information System, Australian National Botanic Gardens, GPO Box 1777 Canberra 2601; voice: +61 2 62509 482; fax: +61 2 62509 599; ghw@anbg.gov.au
I agree with Markus about using a simple data format. Relational database dumps would require standard database structures or would expose specific things that are already encapsulated by abstraction layers (conceptual schemas).
I'm not sure about the best way to represent complex data structures like ABCD, but for simpler providers such as TapirLink/Dwc, the idea was to create a new script responsible for dumping all mapped concepts of a specific data source into a single file. Providers could periodically call this script from a cron job to regenerate the dump. The first line in the dump file would indicate the concept identifiers (GUIDs) associated with each column to make it a generic solution (and more compatible with existing applications). Content could be tab-delimited and in the end compressed.
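A minimal sketch of such a dump script in Python; the column-to-concept mapping, the source table and the file names are hypothetical stand-ins for whatever the wrapper's configuration would provide:

import gzip, sqlite3

# hypothetical mapping from local columns to concept identifiers (GUIDs);
# a real installation would take this from its configured concept mapping
CONCEPTS = {
    "sci_name": "http://rs.tdwg.org/dwc/dwcore/ScientificName",
    "country":  "http://rs.tdwg.org/dwc/dwcore/Country",
}

def write_dump(db_path="provider.sqlite", out_path="dump.txt.gz"):
    conn = sqlite3.connect(db_path)
    cur = conn.execute("SELECT %s FROM occurrences" % ", ".join(CONCEPTS))
    with gzip.open(out_path, "wt", encoding="utf-8", newline="") as out:
        # first line: the concept GUIDs associated with each column
        out.write("\t".join(CONCEPTS.values()) + "\n")
        for row in cur:
            out.write("\t".join("" if v is None else str(v) for v in row) + "\n")
    conn.close()

Run periodically from a cron job, this regenerates the compressed, tab-delimited dump described above.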
Harvesters could use this "seed" file for the initial data import, and then potentially use incremental harvesting to update the cache. But in this case it would be necessary to know when the dump file was generated.
To use the existing TAPIR infrastructure, we would also need to know which providers support the dump files. Aaron's idea, when he first discussed it with me, was to use a new custom operation. This makes sense to me, but would require a small change in the protocol to add a custom slot in the operations section of capabilities responses. Curiously, this approach would allow the existence of TAPIR "static providers" - the simplest possible category, even simpler than TapirLite. They would not support inventories, searches or query templates, but would make the dump file available through the new custom operation. Metadata, capabilities and ping could be just static files served by a very simple script.
If this approach makes sense, I think these are the points that still need to be addressed:
1) Decide about how to indicate the timestamp associated with the dump file.
2) Change the TAPIR schema (or figure out another solution to advertise the new capability, but always remembering that in the TAPIR context a single provider instance can host multiple data sources that are usually distinguished by a query parameter in the URL, so I'm not sure how a sitemaps approach could be used).
3) Decide about how to represent complex data such as ABCD (if using multiple files, I would suggest compressing them together and serving them as a single file).
4) Write a short specification to describe the new custom operation and the data format.
I'm happy to change the schema if there's consensus about this.
Best Regards, -- Renato
Hi Renato,
Do you think this really goes under the TAPIR spec?
Sure we want the wrappers to produce it but it's just a document on a URL and can be described in such a simple way that loads of other people could incorporate it without getting into TAPIR specs, nor can they claim any TAPIR compliance just because they can do a 'select to outfile'.
I would also request that the headers aren't in the data file but in the metafile. It is way easier to dump a big DB to this 'document standard' without needing to worry about how to get headers into a 20 GB file.
Just some more thoughts
Cheers
Tim
I agree with Tim that it would be better to keep this proposal/specification separate from TAPIR. That said, it could still be included in the TAPIR capabilities to indicate this feature. But an important reason to have these files is to get more providers on board. So they should also be able to implement this without the TAPIR overhead.
A separate metafile would certainly also hold the timestamp of the last generation of the file, so keeping that separate has additional advantages.
Markus
I imagine a lot of these CSV files (we need a name for them) will be generated by an SQL query run on a scheduled task or a cron job. This is good and pretty easy to automate.
It increases the complexity of the dump process greatly if it also needs to update a metadata file with the new modified date every time. In fact it moves the setup of the process from just being a configuration job in most RDBMSs to needing actual scripts that run and change the metadata files. The structure of the CSV file is constant, so the metadata file should really only be created once when the process is set up.
Could we use the modified/created dates in the HTTP headers for the files instead? The client just has to call a HEAD to see if the file has changed and get its size before deciding to download it. (It is amazing what you can do with good old HTTP.)
The only thing that is lost doing it this way is we don't know the number of rows in the file, but we do know its size in bytes. What we gain is the ability for non-script-writing system admins to set up the system.
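A minimal sketch of that HEAD check in Python; the dump URL is hypothetical:

import urllib.request
from email.utils import parsedate_to_datetime

DUMP_URL = "http://example.org/dumps/occurrences.csv.gz"  # hypothetical

def dump_info(url=DUMP_URL):
    # HEAD request: only headers are transferred, not the (possibly huge) file
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        last_modified = resp.headers.get("Last-Modified")
        size = resp.headers.get("Content-Length")
    modified = parsedate_to_datetime(last_modified) if last_modified else None
    return modified, int(size) if size else None

# the harvester compares the date and size against what it saw last time
# before deciding whether to download the full file
modified, size = dump_info()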
Just a thought,
Roger
Locally generated / localised DwC index files? (if you have rich data behind LSID, then this file is an index that allows searching of those rich data using DwC fields)
I would like to see the data file accompanied by a compulsory metafile that details rights, citation, contacts, etc. Whether this file needs the data generation timestamp I am not so sure either, and the HTTP header approach does sound good. It means you can do a one-time metafile crafting and then just CRON the dump generation... This would be for institutions with IT resources - e.g. UK NBN with 20M records.
For Joe Bloggs with a data set, if we included it in the wrapper tools, then it is easy to rewrite the metafile seamlessly anyway so they don't care.
Cheers,
Tim
I am worried about duplication and maintenance of multiple metadata files and also formats. TAPIR has one, NCD exists, FGDC, EML, Dublin Core and many more. So maybe we should just add a URL to the metadata and not even specify the format, just recommend it should be compatible with Dublin Core? It could resolve into an RDF document, a TAPIR metadata response or an html page with embedded Dublin Core data. Then the dwc index metafile is a true static technical description and could be created once if we settle on the http approach.
Btw, with http you can even specify "If-Modified-Since" in a request header to get a "304 Not Modified" returned for files that haven't changed since. The http 1.1 specs require webservers to support this. So the http response could always indicate the date-last-modified and the index file will only be returned if it was modified since the last request. That's pretty much all we want, isn't it?
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.25
Markus
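A minimal sketch of the conditional request Markus describes, in Python; the URL and the stored timestamp are hypothetical:

import urllib.error, urllib.request

DUMP_URL = "http://example.org/dumps/occurrences.csv.gz"  # hypothetical

def fetch_if_newer(url, last_seen):
    # last_seen: HTTP date string taken from a previous Last-Modified header
    req = urllib.request.Request(url, headers={"If-Modified-Since": last_seen})
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.read(), resp.headers.get("Last-Modified")
    except urllib.error.HTTPError as e:
        if e.code == 304:  # not modified since the last harvest, nothing to do
            return None, last_seen
        raise

data, stamp = fetch_if_newer(DUMP_URL, "Thu, 15 May 2008 00:00:00 GMT")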
Just a quick observation. I don't think we want or should expect the structure of the csv file to remain constant. The method Aaron proposed (and implemented) doesn't care. It looks for known-good DwC concept labels as column headers in whatever order, and with whatever other nonsense you want to put in the file, but only responds on the caching side to those concepts needed cache-side. I wouldn't want to lose that flexibility.
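A minimal sketch of that header-driven mapping in Python; the concept labels listed here are only examples of what a cache might ask for:

import csv

# concepts the caching side cares about; any other columns are ignored
WANTED = ["ScientificName", "Country", "DecimalLatitude", "DecimalLongitude"]

def records_of_interest(path):
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="\t")
        header = next(reader)
        # map each wanted concept label to its column position, wherever it is
        positions = {name: header.index(name) for name in WANTED if name in header}
        for row in reader:
            yield {name: row[i] for name, i in positions.items()}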
Hi Tim,
Just wondering: In the case of NBN, after the initial import from a dump file and if they just change a few records, are you planning to download everything again and then overwrite all 20 million records in GBIF's database? Or are you also interested in using some sort of incremental harvesting?
Best Regards, -- Renato
I am interested in it for anyone who can provide date last modified - so the dump really becomes the seed file for OAI-PMH (or other).
Off the top of my head I *think* (but don't quote me!) that UK NBN allows people to basically delete their collection and upload a new version. So the collection code increments each time and OAI-PMH is out of the question, as it is a reharvest each time anyway. If it is not NBN then it is some other large aggregator that does this.
Right. I agree there's no particular reason to expose the dump file through a typical TAPIR URL. Headers could also be in a separate file. However, from a TAPIR service perspective, I think it's still important to somehow advertise the availability of a dump file in capabilities (even if GBIF doesn't use this). There's a slot at the end of a capabilities response that could be used for this purpose:
...
<custom>
  <ext:dump baseurl="http://somehost/somepath/"/>
</custom>
...
Providers that only want to see their data being served through GBIF could simply make the dump files available somewhere, without the need to install and maintain a web service. TAPIR providers that have other reasons to exist could decide if they want to register the TAPIR endpoint or just the base URL of the dump file in GBIF's registry.
HTTP headers ("If-Modified-Since" and "Last-Modified") seem to solve the timestamp issue in an elegant way.
Regarding complex data, I would be inclined to propose some compact XML representation compatible with TAPIR so that existing wrapper functionalities could be used to generate the dump file. I suppose this could save considerable time. Another advantage is that it would be a generic solution, not restricted to one-level relationships. Since TAPIR output models can map XML nodes to a concatenation of concepts and literals, it's also possible to have a single record element with some sort of csv content inside. I'm just not sure how to escape any separators that might be present in real content.
We could also provide more information about the format in the new dump element:
<ext:dump baseurl="http://somehost/somepath/" format="csv"/>
or
<ext:dump baseurl="http://somehost/somepath/" format="xml" outputModel="some_url"/>
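And a rough sketch of how a client could pick this up from the capabilities response (the namespace URI for "ext" is invented here, since nothing is defined yet):

import xml.etree.ElementTree as ET

EXT_NS = "http://example.org/tapir/ext"  # hypothetical namespace for the dump element

capabilities = """<response xmlns:ext="%s">
  <custom>
    <ext:dump baseurl="http://somehost/somepath/" format="csv"/>
  </custom>
</response>""" % EXT_NS

root = ET.fromstring(capabilities)
dump = root.find("custom/{%s}dump" % EXT_NS)
if dump is not None:
    print("dump advertised at", dump.get("baseurl"), "in format", dump.get("format"))
else:
    print("this provider does not advertise a dump file")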
Regards, -- Renato
Hi Renato,
Do you think this really goes under the TAPIR spec?
Sure, we want the wrappers to produce it, but it's just a document on a URL and can be described in such a simple way that loads of other people could incorporate it without getting into the TAPIR specs; nor can they claim any TAPIR compliance just because they can do a 'select to outfile'.
I would also request that the headers aren't in the data file but in the metafile. It is way easier to dump a big DB to this 'document standard' without needing to worry about how to get headers into a 20 GB file.
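To make that concrete, a possible metafile shape (the field names are only a suggestion, nothing is agreed yet) sitting next to the headerless, gzipped data file:

# index_meta.yaml - crafted once by hand, while the data file is regenerated by CRON
title: My Herbarium specimen index
rights: Creative Commons Attribution
citation: My Herbarium (2008). Localised DwC index file.
contact: curator@myherbarium.example.org
files:
  darwincore.txt.gz:
    - http://rs.tdwg.org/dwc/dwcore/GlobalUniqueIdentifier
    - http://rs.tdwg.org/dwc/dwcore/ScientificName
    - http://rs.tdwg.org/dwc/dwcore/Country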
Just some more thoughts
Cheers
Tim
Renato, I was thinking along those lines too. It would be nice for TAPIRs to announce the availability of the index files. I wouldn't mind adding it even to the regular TAPIR schema once it has proven to work with the custom slot approach you have given.
Regarding star-shaped data, I would prefer to agree on one format instead of allowing different ones, to save consumers from this pain. There is a straightforward XML serialisation for this scheme that we could use instead of tab files:
<record uri=""> <dwc:property1 /> <dwc:property2 /> extA:record <extA:property1 /> <extA:property2 /> </extA:record> extB:record <extB:property1 /> <extB:property2 /> extB:record <record>
The advantage is that it can be produced by TAPIR software, and XML serialisation is required for many services, e.g. RSS, anyway. But then again, the whole point of the index files is that they are easy to generate and consume. On the other hand, this XML structure is pretty simple to process and can be generated straight away from databases like SQL Server that have XML output, without the need for scripting.
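As a rough Python sketch (the extension namespace URI and the example values are made up), producing that structure from a core row plus its extension rows takes only a few lines:

import xml.etree.ElementTree as ET

DWC_NS = "http://rs.tdwg.org/dwc/dwcore/"
IDENT_NS = "http://example.org/ext/identification"  # hypothetical extension namespace

core = {"ScientificName": "Aster alpinus subsp. parviceps", "Family": "Asteraceae"}
identifications = [
    {"ScientificName": "Aster alpinus", "AuthorYearOfScientificName": "L."},
]

record = ET.Element("record", uri="http://example.org/specimen/102")
for concept, value in core.items():
    ET.SubElement(record, "{%s}%s" % (DWC_NS, concept)).text = value
for ident in identifications:
    ext = ET.SubElement(record, "{%s}record" % IDENT_NS)
    for concept, value in ident.items():
        # Darwin Core concepts reused inside the extension record
        ET.SubElement(ext, "{%s}%s" % (DWC_NS, concept)).text = value

print(ET.tostring(record, encoding="unicode"))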
That touches a different issue I am facing with the star scheme, by the way. I have created an identification extension for Darwin Core that holds the historical list of identification events and their outcome. This is a YAML section of the metafile describing the columns for this extension through fully qualified concepts a la TAPIR:
identification:
  - http://rs.tdwg.org/dwc/dwcore/ScientificName
  - http://rs.tdwg.org/dwc/dwcore/AuthorYearOfScientificName
  - http://rs.tdwg.org/dwc/dwcore/Family
  - http://rs.tdwg.org/dwc/dwcore/IdentificationQualifier
  - http://rs.tdwg.org/dwc/curatorial/DateIdentified
  - http://rs.tdwg.org/dwc/curatorial/IdentifiedBy
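A small sketch of how a consumer could use that section, assuming the extension rows live in a tab file whose first column is the record id (one row is inlined here; PyYAML assumed to be available):

import yaml  # PyYAML

meta = yaml.safe_load("""
identification:
  - http://rs.tdwg.org/dwc/dwcore/ScientificName
  - http://rs.tdwg.org/dwc/curatorial/DateIdentified
  - http://rs.tdwg.org/dwc/curatorial/IdentifiedBy
""")

columns = meta["identification"]
line = "102\tAster alpinus\t1913-03-12\tKarl Marx"  # one row of identification.txt
record_id, *values = line.split("\t")
# map each fully qualified concept to its value for this record
print(record_id, dict(zip(columns, values)))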
When creating this I realised that pretty much all the concepts I was interested in already existed in Darwin Core or the curatorial extension. Wouldn't it be wise to reuse those concepts? Or are they strictly tied to the idea of a current identification and therefore can't be used for historical ones? This is probably more of a Darwin Core question than TAPIR, but we are all on this list anyway ...
The XML in that case would look something like this:
<record uri="http://mygarden.com/specimen/plants/54321-423-43-54-6-3-24-44 "> dwc:ScientificNameAster alpinus subsp. parvicepsdwc:ScientificName ... ident:record dwc:ScientificNameAster alpinusdwc:ScientificName dwc:AuthorYearOfScientificNameL.</dwc:AuthorYearOfScientificName> dwc:FamilyAsteraceaedwc:Family cur:DateIdentified1913-03-12</cur:DateIdentified> cur:IdentifiedByKarl Marx</cur:IdentifiedBy> </ident:record> ident:record dwc:ScientificNameAster alpinus subsp. parvicepsdwc:ScientificName dwc:AuthorYearOfScientificNameNovopokr.</ dwc:AuthorYearOfScientificName> dwc:FamilyAsteraceaedwc:Family cur:DateIdentified2003-09-07</cur:DateIdentified> cur:IdentifiedByKeith Richards</cur:IdentifiedBy> </ident:record> <record>
Markus
Hi, I welcome the idea of creating dump files for more efficient harvesting. We've created such an option in the NCD toolkit and are working on dump files also in the CPT (Checklist Provider Tool). The best format depends on the usage; probably an option to create dump files in either csv, json or xml would be the most flexible. And maybe an option to dump only the information that is useful for caching. Is it perhaps a good idea to discuss this also with a wider audience at the next TDWG annual meeting? I agree with Renato that it could be nice to show this option in the TAPIR capabilities.
Wouter
Hi Markus,
Since DarwinCore is a generic list of elements that can be used by any application schema, I think it's OK to use them in the new schema that you're suggesting.
I agree that ideally we should try to define and use a common format for index files, although it seems that we will have at least two: csv for simple data and probably another one in XML for complex data, right?
Regarding the XML for complex data, if you manage to find a generic schema that can be used in different contexts (not only biodiversity data) then I agree we could avoid extra attributes in the respective capabilities element. Otherwise, I would prefer to see some extra attribute (such as "outputModel") giving more information about the XML. Since TAPIR was designed to be generic, this should not be a problem because clients and networks are already free to decide and to mandate specific TAPIR capabilities. This doesn't mean that there will be lots of formats for index files. It's a matter of agreeing on a common format but still keeping the protocol generic to allow different uses by other communities.
I also agree we could advertise the index file through some new TAPIR element instead of using the custom slot.
Best Regards, -- Renato
Renato, complex data can also be represented by tab files, with a file for each extension that has a pointer in the first column. That is what we originally had in mind with the star scheme.
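A minimal consumer-side sketch of that layout, assuming gzipped tab files called darwincore.txt.gz and identification.txt.gz with the record id in the first column (the file names are just placeholders):

import gzip
from collections import defaultdict

def rows(path):
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n").split("\t")

# one core row per record id
core = {row[0]: row[1:] for row in rows("darwincore.txt.gz")}

# any number of extension rows per record id, the pointer being the first column
identifications = defaultdict(list)
for row in rows("identification.txt.gz"):
    identifications[row[0]].append(row[1:])

for record_id, fields in core.items():
    print(record_id, fields, identifications.get(record_id, []))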
Markus
The notion of star schemas fits very nicely with what I had in mind for the RDF vocabularies. It would be good if each of the CSV files in the star corresponded to a class in the vocabulary and the columns in the CSV files mapped to properties in the vocabulary (or some other common vocabulary such as VCARD or DC etc.). It would then be trivial to map the star to a semantic representation (such as the RDF returned from an LSID) or vice versa.
We can evolve the vocabularies to help this along.
This is probably all obvious but worth stating.
All the best,
Roger
On 20 May 2008, at 16:36, Markus Döring wrote:
Renato, complex data can also be represented by tab files, with a file for each extension that has a pointer in the first column. That is what we originally had in mind with the star scheme.
Markus
Markus,
If we want to ensure the lowest possible barrier for providers, then I think zipped csv files need to be supported. If we really want to handle complex data using the same format, then we need something like the csv star scheme you mentioned (with well-defined rules about all files and how the records are related).
The limitation in this case is that we would only handle one-level relationships (not a generic solution) and providers with complex data would probably need to write some code to generate the dumps (not sure how many providers would do it) - unless wrappers that can handle complex data implement additional functionality to produce these dumps.
On the other hand, if we allow more than one format, complex data could be handled with compact XML representations (in a generic way) which could be automatically produced by existing wrappers.
So my understanding is that the biggest decision is: Use a single format (csv) with additional rules for complex data, or allow different formats (one for simple and another for complex data).
Although I know it's usually much better for clients to deal with a single format, my *feeling* in this case is that it would be more effective to allow different formats. I'm also not sure if it would be easier for clients to handle additional star scheme rules when importing complex data than it would be to parse a single XML file encoded in some compact structure.
Just some thoughts...
Best Regards, -- Renato
On 20 May 2008 at 17:36, Markus Döring wrote:
Renato, complex data can also be represented by tab files, with a file for each extension that has a pointer in the first column. That is what we originally had in mind with the star scheme.
Markus
it's my intuition that harvesting data in *different formats* is going to become a dominant use case handled by data providers worldwide. for example, some clients will want csv or star, while others will want xml or sqlite. i'd like to explore adding a simple plug-in architecture to tapirlink that, given a format plug-in (for example, csv_plugin.php), creates a resource data dump in that format which can be zip archived (along with any other metadata files required by the format) and downloaded by clients. in this way, as new formats are requested by the community, new format plug-ins can be added. it's a simple approach that's scalable, improves interoperability with clients, and avoids the need to agree on a single format to support.
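tapirlink itself is php, but just to sketch the shape of the idea in a few lines of python (the plug-in names and record values are made up):

import csv
import json

def csv_plugin(records, path):
    # one record per row, no header (headers would live in a metadata file)
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(records)

def json_plugin(records, path):
    with open(path, "w", encoding="utf-8") as f:
        json.dump([list(r) for r in records], f)

PLUGINS = {"csv": csv_plugin, "json": json_plugin}  # new formats just register here

def dump_resource(records, fmt, path):
    if fmt not in PLUGINS:
        raise ValueError("no plug-in installed for format %r" % fmt)
    PLUGINS[fmt](records, path)

dump_resource([("102", "Aster alpinus"), ("103", "Polygala vulgaris")], "csv", "myresource.csv")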
i'd also like to explore using a new 'harvest' tapir operation to facilitate harvest requests. for example:
tapir.php/myresource?op=harvest&format=csv&sbn=604800
the optional sbn parameter above stands for seconds before now. you can interpret the above request as:
"i want to download a csv dump of myresource only if it has been created within the last week (604,800 seconds)."
this approach might be somewhat controversial since it involves potential changes in the tapir protocol that not everyone agrees with. on the other hand, after consulting with renato and john, i don't see any harm with prototyping these new features, and giving the community the opportunity to experiment with concrete harvesting functionality before coming to a general consensus.
if you're keen on collaborating, i've created a new branch to prototype these ideas in: https://digir.svn.sourceforge.net/svnroot/digir/tapirlink/branches/harvest
thoughts? concerns?
thanks, aaron
I agree, and I'd suggest there are a couple of other useful formats. JSON and serialized PHP are commonly implemented in other web services, such as those offered by Yahoo and Google.
Both of these are immediately useful in most programming languages, which would make it very easy to digest and display biodiversity information without the overhead of, say, ABCD or our other XML structures. It would be interesting to see whether JSON and/or serialized PHP were easier or faster to consume than CSV. (I'm still looking for a reference for the term "star CSV", anyone want to explain?)
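Just to show the scale of the difference, the same record in both forms takes only a few lines of Python to produce and read back (the values are made up):

import csv
import io
import json

record = ["102", "Aster alpinus subsp. parviceps", "Asteraceae"]

# csv: compact, but the consumer must know the column order from somewhere
buffer = io.StringIO()
csv.writer(buffer).writerow(record)
print(buffer.getvalue().strip())
print(next(csv.reader(io.StringIO(buffer.getvalue()))))

# json: field names travel with the data, trivially loaded in most languages
encoded = json.dumps({"id": record[0], "scientificName": record[1], "family": record[2]})
print(encoded)
print(json.loads(encoded)["scientificName"])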
We need to make it as easy as possible to be involved at both ends of the data connection.
Cheers, Ben
-- Ben Richardson w=http://science.dec.wa.gov.au/people/?sid=98 e=ben.richardson@dec.wa.gov.au tz=ADST (UTC+9) t=+61 8 9334 0511 f=+61 8 9334 0515
I am being far too talkative today but can't resist.
Something like
mysqldump -u [username] -p[password] [databasename] | gzip > databasebackup.sql.gz
where databasebackup.sql.gz is in web accessible directory. Then on the other server
curl http://some/place/we/read/databasebackup.sql.gz | gunzip | mysql -u [username] -p[password] [databasename]
(probably got my pipes and redirects mixed up there but you can see what I mean)
Pop these two in cron jobs on either machine and the db on the second will be a read only mirror of the first. Curl can handle all the authentication etc if you need to hide the backup file behind protection. Not bad for two lines!
May not work with very big files though ...
Roger
------------------------------------------------------------- Roger Hyam Roger@BiodiversityCollectionsIndex.org http://www.BiodiversityCollectionsIndex.org ------------------------------------------------------------- Royal Botanic Garden Edinburgh 20A Inverleith Row, Edinburgh, EH3 5LR, UK Tel: +44 131 552 7171 ext 3015 Fax: +44 131 248 2901 http://www.rbge.org.uk/ -------------------------------------------------------------
On 14 May 2008, at 14:25, Aaron D. Steele wrote:
for preserving relational data, we could also just dump tapirlink resources to an sqlite database file (http://www.sqlite.org), zip it up, and again make it available via the web service. we use sqlite internally for many projects, and it's both easy to use and well supported by jdbc, php, python, etc.
would something like this be a useful option?
thanks, aaron
On Wed, May 14, 2008 at 2:21 AM, Markus Döring mdoering@gbif.org wrote:
Interesting that we all come to the same conclusions... The trouble I had with just a simple flat csv file is repeating properties like multiple image URLs. ABCD clients don't use ABCD just because it's complex, but because they want to transport this relational data. We were considering 2 solutions for extending this csv approach. The first would be to have a single large denormalised csv file with many rows for the same record. It would require knowledge about the related entities though, and could grow in size rapidly. The second idea, which we think to adopt, is allowing a single level of 1-many related entities. It is basically a "star" design with the core dwc table in the center and any number of extension tables around it. Each "table", aka csv file, will have the record id as the first column, so the files can be related easily and it only needs a single identifier per record and not for the extension entities. This would give a lot of flexibility while keeping things pretty simple to deal with. It would even satisfy the ABCD needs, as I haven't yet seen anyone requiring 2 levels of related tables (other than lookup tables). Those extensions could even be a simple 1-1 relation, but would keep things semantically together, just like an xml namespace. The darwin core extensions would be a good example.
So we could have a gzipped set of files, maybe with a simple metafile indicating the semantics of the columns for each file. An example could look like this:
# darwincore.csv 102 Aster alpinus subsp. parviceps ... 103 Polygala vulgaris ...
# curatorial.csv 102 Kew Herbarium 103 Reading Herbarium
# identification.csv 102 2003-05-04 Karl Marx Aster alpinus L. 102 2007-01-11 Mark Twain Aster korshinskyi Tamamsch. 102 2007-09-13 Roger Hyam Aster alpinus subsp. parviceps Novopokr. 103 2001-02-21 Steve Bekow Polygala vulgaris L.
I know this looks old fashioned, but it is just so simple and gives us so much flexibility. Markus
On 14 May, 2008, at 24:39, Greg Whitbread wrote:
We have used a very similar protocol to assemble the latest AVH cache. It should be noted that this is an as-well-as protocol that only works because we have an established semantic standard (hispid/abcd).
greg
Perhaps it could be put into some form of xml to preserve the relational model? Maybe a mechanism could be developed so that others could access the xml as well. How about even adding some sort of subsetting mechanism so that entire data sets need not be retrieved?
just a thought...
This discussion is starting to remind me of another one in the Google App Engine discussion group. They talk about different ways to bulk upload data to their BigTable database.
http://groups.google.com/group/google-appengine/browse_thread/thread/18d246b...
I have read so far:
- XML
- CSV
- RDF
- JSON
- AMF
- SQL
- OOXML
- TSV
Uff so many ideas...
I would take whatever Google finally decides, as it will probably become a de facto standard :D
The discussion is funny :D
Cheers.
On Wed, May 14, 2008 at 4:04 PM, Dave Vieglais vieglais@ku.edu wrote:
Perhaps it could be put into some form of xml to preserve the relational model? Maybe a mechanism could be developed so that others could access the xml as well. How about even putting some sort of subsetting mechanism so that entire data sets need not be retrieved.
just a thought...
On Wed, May 14, 2008 at 9:25 AM, Aaron D. Steele eightysteele@gmail.com wrote:
for preserving relational data, we could also just dump tapirlink resources to an sqlite database file (http://www.sqlite.org), zip it up, and again make it available via the web service. we use sqlite internally for many projects, and it's both easy to use and well supported by jdbc, php, python, etc.
would something like this be a useful option?
thanks, aaron
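As a rough illustration of the SQLite idea above (not the actual TapirLink code): dump mapped rows into an SQLite file and zip it so a web service can expose it as a single download. The table name, columns, and sample data are invented for the example.

# Illustrative only: write rows into an SQLite file and zip the result so a
# web service can offer it for download.
import sqlite3
import zipfile

def dump_to_sqlite(rows, dump_path="resource_dump.sqlite"):
    # rows: iterable of (record_id, scientific_name, country) tuples
    con = sqlite3.connect(dump_path)
    con.execute("DROP TABLE IF EXISTS darwincore")
    con.execute("CREATE TABLE darwincore ("
                "record_id TEXT PRIMARY KEY, scientific_name TEXT, country TEXT)")
    con.executemany("INSERT INTO darwincore VALUES (?, ?, ?)", rows)
    con.commit()
    con.close()

def publish(dump_path="resource_dump.sqlite", archive_path="resource_dump.zip"):
    # the zip archive is what would be made available via the web service
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write(dump_path)

if __name__ == "__main__":
    dump_to_sqlite([("102", "Aster alpinus subsp. parviceps", "CH"),
                    ("103", "Polygala vulgaris", "GB")])
    publish()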
On Wed, May 14, 2008 at 2:21 AM, Markus Döring mdoering@gbif.org wrote:
Interesting that we all come to the same conclusions... The trouble I had with just a simple flat CSV file is repeating properties like multiple image URLs. ABCD clients don't use ABCD just because it's complex, but because they want to transport this relational data.
We were considering two solutions for extending this CSV approach. The first would be to have a single large denormalised CSV file with many rows for the same record. It would require knowledge about the related entities, though, and could grow in size rapidly. The second idea, which we are thinking of adopting, is to allow a single level of 1-to-many related entities. It is basically a "star" design with the core DwC table in the center and any number of extension tables around it. Each "table", aka CSV file, will have the record id as the first column, so the files can be related easily, and it only needs a single identifier per record and not for the extension entities. This would give a lot of flexibility while keeping things pretty simple to deal with. It would even satisfy the ABCD needs, as I haven't yet seen anyone requiring two levels of related tables (other than lookup tables). Those extensions could even be a simple 1-1 relation, but would keep things semantically together, just like an XML namespace. The Darwin Core extensions would be a good example.
So we could have a gzipped set of files, maybe with a simple metafile indicating the semantics of the columns for each file. An example could look like this:
# darwincore.csv
102    Aster alpinus subsp. parviceps    ...
103    Polygala vulgaris    ...

# curatorial.csv
102    Kew Herbarium
103    Reading Herbarium

# identification.csv
102    2003-05-04    Karl Marx    Aster alpinus L.
102    2007-01-11    Mark Twain    Aster korshinskyi Tamamsch.
102    2007-09-13    Roger Hyam    Aster alpinus subsp. parviceps Novopokr.
103    2001-02-21    Steve Bekow    Polygala vulgaris L.
I know this looks old fashioned, but it is just so simple and gives us so much flexibility. Markus
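A minimal consumer-side sketch of the "star" layout just described, assuming tab-separated files named as in the example above and already downloaded locally; beyond the file names and the id-first-column rule, everything here is illustrative.

# Illustrative only: join core Darwin Core rows with their 1-to-many extension
# rows via the record id carried in the first column of every file.
import csv
from collections import defaultdict

def read_rows(path):
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.reader(f, delimiter="\t"))

def load_star(core_path="darwincore.csv", extension_paths=("identification.csv",)):
    records = {row[0]: {"core": row[1:], "extensions": defaultdict(list)}
               for row in read_rows(core_path)}
    for path in extension_paths:
        for row in read_rows(path):
            record_id, values = row[0], row[1:]
            if record_id in records:
                records[record_id]["extensions"][path].append(values)
    return records

# Usage (given the files from the example exist locally):
#   for record_id, data in load_star().items():
#       print(record_id, data["core"], dict(data["extensions"]))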
On 14 May, 2008, at 24:39, Greg Whitbread wrote:
We have used a very similar protocol to assemble the latest AVH cache. It should be noted that this is an as-well-as protocol that only works because we have an established semantic standard (hispid/abcd).
greg
... and because of App Engine we were considering using YAML for a very simple metafile for the conceptual binding, instead of having column header rows. http://code.google.com/appengine/docs/configuringanapp.html
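A sketch of what such a YAML metafile could look like, parsed here with PyYAML (assumed installed); the file layout and the Darwin Core term URIs are purely illustrative, since no binding format has been agreed.

# Illustrative only: a hypothetical meta.yaml mapping the columns of each tab
# file to concept identifiers, instead of putting header rows in the data files.
import yaml  # PyYAML, assumed installed

META_YAML = """
darwincore.csv:
  - http://rs.tdwg.org/dwc/terms/catalogNumber
  - http://rs.tdwg.org/dwc/terms/scientificName
identification.csv:
  - http://rs.tdwg.org/dwc/terms/catalogNumber
  - http://rs.tdwg.org/dwc/terms/dateIdentified
  - http://rs.tdwg.org/dwc/terms/identifiedBy
  - http://rs.tdwg.org/dwc/terms/scientificName
"""

bindings = yaml.safe_load(META_YAML)
for filename, concepts in bindings.items():
    print(filename, "->", concepts)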
Another interesting problem you touch on...
Take the GBIF Index. People want a country "slice" of the data. The SQL to slice up the data on occurrences is fine, but then what about the taxonomy stuff - do you throw out the stuff that is not relevant to the sliced region? What about sub-selecting only the regional common names, etc.?
I think you will be unlikely to come up with subsets of DB dumps generically without specific model knowledge, but I'd be interested to hear if you do!!! I think you'd basically have to do an interceptor that does a pre-select - probably also a chained-up sequence of post-SQLs - no?
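A rough sketch of the pre-select/post-select chain hinted at above, against a hypothetical SQLite dump with occurrence and taxon tables; the schema and column names are invented for illustration and are not the GBIF index model.

# Illustrative only: slice a hypothetical dump by country, then fetch just the
# taxonomy rows that the sliced occurrences actually reference.
import sqlite3

def slice_by_country(dump_path, iso_country_code):
    con = sqlite3.connect(dump_path)
    # pre-select: the occurrence slice itself
    occurrences = con.execute(
        "SELECT record_id, taxon_id, scientific_name "
        "FROM occurrence WHERE iso_country_code = ?",
        (iso_country_code,)).fetchall()
    # post-select: only the taxa referenced by that slice
    taxon_ids = sorted({row[1] for row in occurrences})
    placeholders = ",".join("?" for _ in taxon_ids) or "NULL"
    taxa = con.execute(
        "SELECT taxon_id, name, taxon_rank FROM taxon "
        "WHERE taxon_id IN (" + placeholders + ")", taxon_ids).fetchall()
    con.close()
    return occurrences, taxa

# Usage (given such a dump file exists):
#   occurrences, taxa = slice_by_country("provider_dump.sqlite", "AU")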
While getting everyone to use the same standard is ideal, we can at least shoot to provide a real, sustainable standard in the way sitemaps are now standard for websites. Go to http://somesite.com/sitemap.xml and/or http://somesite.com/sitemap.xml.gz and, if it's a relatively current site where the producers have considered search engine optimization (SEO), the chances are it'll be there.
During my first few months here at Mobot I've brought up DiGIR (the last implementation was dying hourly) so that it's now available full time, and I am working with Renato and Kevin to get Tapir running the same way. In speaking with Tim and others it sounds as if some will still be using the older protocol while others will use Tapir, so there is value in continuing to provide both. In this way I would see something like a sitemap being "yet another standard"; if it were accepted as a best practice, newer/more current sites could adopt it and advertise it as something dead simple for others to implement. When a harvester visits a site it would check for that file first, before spending the time and bandwidth (per Tim's example) to rebuild what the server should already have available - much like a spider checking for sitemap.xml before it starts randomly following href links on web pages to spider it the 'old way'. In this way I could also see the idea of providers lowering their bandwidth as an added incentive to get on the bus.
I too have thought of things like BitTorrent (with updates announced and tripped by RSS feeds), simple rsync deltas, and even sending XML over XMPP (Jabber) to keep things in sync - but at the end of the day we want something that is just there, made available via the simplest method. Expecting others to install something special to do something extra is going to be difficult; but if we say "if you create this file off the root we can use it, and it will benefit your site as well", that sounds much easier to achieve.
New to the list (and bioinformatics in general)
Phil
also - Nomina was an incredible time for me, so when I'm in Australia I plan on buying beers for anyone within earshot
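A minimal harvester-side sketch of the "check for a file off the root first" approach from the message above; the well-known file name is hypothetical, since no such convention has been agreed.

# Illustrative only: look for a provider-published index file at a well-known
# location before falling back to record-by-record protocol harvesting.
# The file name "dwc_index.txt.gz" is hypothetical.
import urllib.error
import urllib.request

def fetch_index_if_available(base_url, index_name="dwc_index.txt.gz", timeout=30):
    url = base_url.rstrip("/") + "/" + index_name
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()  # the gzipped dump, ready to load locally
    except urllib.error.URLError:
        return None  # nothing published there; harvest via TAPIR/DiGIR paging

# Usage:
#   data = fetch_index_if_available("http://somesite.com")
#   if data is None, fall back to the existing harvester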
participants (15)
- Aaron D. Steele
- Aaron D. Steele
- Dave Vieglais
- Greg Whitbread
- Javier de la Torre
- John R. WIECZOREK
- Markus Döring
- Phil Cryer
- Renato De Giovanni
- Richardson, Ben
- Roger Hyam
- Roger Hyam (TDWG)
- Tim Robertson
- trobertson@gbif.org
- Wouter Addink