Re: [tdwg-tapir] Fwd: Tapir protocol - Harvest methods?
Hi All,
This is very interesting to me, as I came to the same conclusion while harvesting for GBIF.
As a "harvester of all records", the problem is best described with an example:
- Complete inventory of ScientificNames: 7 minutes at the limit of 200 records per page
- Complete harvesting of records:
  - 260,000 records
  - 9 hours harvesting duration
  - 500MB of TAPIR+DwC XML returned (DwC 1.4 with geospatial and curatorial extensions)
- Extraction of DwC records from the harvested XML: <2 minutes
- Resulting file size: 32MB, gzipped to <3MB
I spun hard drives for 9 hours, and took up paid-for bandwidth, to retrieve something that could have been generated provider-side in minutes and transferred in seconds (3MB).
I sent a proposal to TDWG last year termed "datamaps", which was effectively what you are describing; I based it on the Sitemaps protocol, but got nowhere with it. Markus and I are now making more progress: I have spoken with several GBIF data providers about a proposed new standard for full-dataset harvesting, and it has been well received. So Markus and I have started a new proposal with the working name of "Localised DwC Index" file generation (it is an index if you hold more than DwC data, and the DwC itself is still standards compliant), which is really a gzipped tab file dump of the data and is slightly extensible. The document is not ready to circulate yet, but the benefits section currently reads:
- Provider database load is reduced, allowing it to serve real distributed queries rather than "full datasource" harvesters
- Providers can choose to publish their index as it suits them, giving control back to the provider
- Localised index generation can be built into tools not yet capable of integrating with TDWG protocol networks such as GBIF
- Harvesters receive a full dataset view in one request, making it very easy to determine which records are eligible for deletion
- It becomes very simple to write clients that consume entire datasets, e.g. data-cleansing tools that the provider can run (see the sketch after this list):
  - "Give me ISO country codes for my dataset": the application pulls down the provider's index file, generates the ISO country codes, and returns a simple table using the provider's own identifiers
  - "Check my names for spelling mistakes": the application skims over the records and returns a list of names not known to the application
- Providers such as the UK NBN cannot serve 20 million records to the GBIF index efficiently using the existing protocols, but they do have the ability to generate a localised index
- Harvesters can very quickly build up searchable indexes, and it is easy to create large indices
- A Node Portal can easily aggregate index data files: a true index to the data, not an illusion of a cache; more like Google Sitemaps
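To make the data-cleansing idea concrete, here is a minimal sketch of such a consumer-side client, following the name-checking example above. It assumes a gzipped, tab-delimited index file whose first column is the provider's record identifier and whose second column is the scientific name; the URL, the column layout, and the list of "known" names are all invented for illustration and are not part of the proposal.

import csv
import gzip
import urllib.request

# Hypothetical index location and column layout; a real provider would
# advertise its own URL, and the columns would be described by the metafile.
INDEX_URL = "http://example.org/dwc_index.txt.gz"
KNOWN_NAMES = {"Aster alpinus subsp. parviceps", "Polygala vulgaris"}

def report_unknown_names(index_url=INDEX_URL, known_names=KNOWN_NAMES):
    """Pull down a provider's gzipped tab file and yield records whose
    scientific name is not known to this (toy) checking application."""
    with urllib.request.urlopen(index_url) as response:
        with gzip.open(response, mode="rt", encoding="utf-8", newline="") as handle:
            for row in csv.reader(handle, delimiter="\t"):
                record_id, scientific_name = row[0], row[1]  # assumed columns
                if scientific_name not in known_names:
                    # Report back using the provider's own identifier.
                    yield record_id, scientific_name

if __name__ == "__main__":
    for record_id, name in report_unknown_names():
        print(record_id + "\t" + name + "\tnot recognised")

The point is how little machinery the consumer needs: one HTTP request, a gzip stream, and a tab-delimited parser.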
It is the ease with which one can offer tools to data providers that really interests me. The technical threshold required to produce services that offer reporting tools on people's data is very low with this mechanism. Add to that the fact that large datasets become harvestable; we have even considered the likes of BitTorrent for the largest ones, although I think that is overkill.
As a consumer, therefore, I fully support this move as a valuable addition to the wrapper tools.
Cheers
Tim (wrote the GBIF harvesting, and new to this list)
Begin forwarded message:
From: "Aaron D. Steele" eightysteele@gmail.com Date: 13 de mayo de 2008 22:40:09 GMT+02:00 To: tdwg-tapir@lists.tdwg.org Cc: Aaron Steele asteele@berkeley.edu Subject: Re: [tdwg-tapir] Tapir protocol - Harvest methods?
At Berkeley we've recently prototyped a simple PHP program that uses an existing TapirLink installation to periodically dump TAPIR resources into a CSV file. The solution is totally generic and can dump Darwin Core (and technically the ABCD schema, although that is currently untested). The resulting CSV files are zip archived and made accessible through a web service. It's a simple approach that has proven to be, at least internally, quite reliable and useful.
For example, several of our caching applications use the web service to harvest CSV data from TapirLink resources using the following process:
- Download the latest CSV dump for a resource using the web service.
- Flush all locally cached records for the resource.
- Bulk load the latest CSV data into the cache.
In this way, cached data are always synchronized with the resource, and there's no need to track new, deleted, or changed records. As an aside, each time these cached data are queried by the caching application or selected in the user interface, log-only search requests are sent back to the resource.
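For illustration, that refresh cycle might look roughly like the following; the dump web service URL, the cache table, and its columns are invented for this sketch and are not part of the actual TapirLink implementation.

import csv
import io
import sqlite3
import urllib.request
import zipfile

def refresh_cache(resource_id, dump_url, db_path="cache.db"):
    """Download the latest zipped CSV dump for a resource, flush the locally
    cached records for that resource, and bulk load the new data."""
    # 1. Download the latest CSV dump for the resource via the web service.
    with urllib.request.urlopen(dump_url) as response:
        archive = zipfile.ZipFile(io.BytesIO(response.read()))
    csv_name = archive.namelist()[0]  # assume one CSV file per archive
    text = io.TextIOWrapper(archive.open(csv_name), encoding="utf-8")
    rows = list(csv.reader(text))

    conn = sqlite3.connect(db_path)
    with conn:
        # Toy cache table, purely for illustration.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS cache "
            "(resource_id TEXT, record_id TEXT, scientific_name TEXT)"
        )
        # 2. Flush all locally cached records for the resource.
        conn.execute("DELETE FROM cache WHERE resource_id = ?", (resource_id,))
        # 3. Bulk load the latest CSV data into the cache.
        conn.executemany(
            "INSERT INTO cache (resource_id, record_id, scientific_name) "
            "VALUES (?, ?, ?)",
            ((resource_id, row[0], row[1]) for row in rows),
        )
    conn.close()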
After discussion with Renato Giovanni and John Wieczorek, we've decided that merging this functionality into the TapirLink codebase would benefit the broader community. CSV generation support would be declared through capabilities. Although incremental harvesting wouldn't be immediately implemented, we could certainly extend the service to include it later.
I'd like to pause here to gauge the consensus, thoughts, concerns, and ideas of others. Anyone?
Thanks, Aaron
2008/5/5 Kevin Richards RichardsK@landcareresearch.co.nz:
I think I agree here.
The harvesting "procedure" is really defined outside the Tapir protocol, is it not? So it is really an agreement between the harvester and the harvestees.
So what is really needed here is a standard procedure for maintaining a "harvestable" dataset and a standard procedure for harvesting that dataset. We have a general rule at Landcare that we never delete records from our datasets: they are either deprecated in favour of another record, in which case resolving the old record points to the new one, or they are set to a state of "deleted" but are still kept in the dataset and can be resolved (which would indicate the deleted state).
Kevin
"Renato De Giovanni" renato@cria.org.br 6/05/2008 7:33 a.m. >>>
Hi Markus,
I would suggest creating new concepts for incremental harvesting, either in the data standards themselves or in some new extension. In the case of TAPIR, GBIF could easily check the mapped concepts before deciding between incremental or full harvesting.
Actually it could be just one new concept, such as "recordStatus" or "deletionFlag". Or perhaps you might also want to create your own definition for dateLastModified, indicating which set of concepts should be considered when deciding whether something has changed, but I guess this level of granularity would be difficult to support.
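As a rough sketch of how a harvester might act on that suggestion, assume the provider's capabilities response has already been reduced to a set of mapped concept names; the concept names are just the ones suggested above, and fetching and parsing the capabilities response is outside this sketch.

def plan_harvest(mapped_concepts, last_harvest=None):
    """Decide between incremental and full harvesting from the concepts a
    provider has mapped (mapped_concepts is a set of concept names)."""
    has_modified = "dateLastModified" in mapped_concepts
    has_deletions = bool({"recordStatus", "deletionFlag"} & set(mapped_concepts))
    if last_harvest and has_modified and has_deletions:
        # Changed records can be selected via dateLastModified, and deletions
        # can be recognised through the status/deletion concept.
        return "incremental"
    # Without a reliable way to detect deletions, only a full harvest is safe.
    return "full"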
Regards,
Renato
On 5 May 2008 at 11:24, Markus Döring wrote:
Phil, incremental harvesting is not implemented on the GBIF side as far as I am aware, and I don't think it will be a simple thing to implement on the current system. Also, even if we can detect only the records changed since the last harvesting via dateLastModified, we still have no information about deletions. We could have an arrangement saying that you keep deleted records as empty records with just the ID and nothing else (I vaguely remember LSIDs were supposed to work like this too). But that then also needs to be supported on your side, never entirely removing any record. I will have a discussion with the others at GBIF about that.
Markus
We have used a very similar protocol to assemble the latest AVH cache. It should be noted that this is an "as well as" protocol, and that it only works because we have an established semantic standard (HISPID/ABCD).
greg
Interesting that we all come to the same conclusions... The trouble I had with just a simple flat CSV file is repeating properties, such as multiple image URLs. ABCD clients don't use ABCD just because it's complex, but because they want to transport this relational data.

We were considering two solutions for extending the CSV approach. The first would be a single large denormalised CSV file with many rows for the same record. It would require knowledge about the related entities, though, and could grow in size rapidly. The second idea, which we are inclined to adopt, is to allow a single level of one-to-many related entities. It is basically a "star" design, with the core DwC table in the center and any number of extension tables around it. Each "table", i.e. CSV file, has the record id as its first column, so the files can be related easily and only a single identifier is needed per record, not one for each extension entity. This gives a lot of flexibility while keeping things pretty simple to deal with. It would even satisfy the ABCD needs, as I haven't yet seen anyone requiring two levels of related tables (other than lookup tables). An extension could even be a simple 1-1 relation, but it would keep things semantically together, just like an XML namespace. The Darwin Core extensions would be a good example.
So we could have a gzipped set of files, maybe with a simple metafile indicating the semantics of the columns for each file. An example could look like this:
# darwincore.csv
102  Aster alpinus subsp. parviceps  ...
103  Polygala vulgaris  ...

# curatorial.csv
102  Kew Herbarium
103  Reading Herbarium

# identification.csv
102  2003-05-04  Karl Marx  Aster alpinus L.
102  2007-01-11  Mark Twain  Aster korshinskyi Tamamsch.
102  2007-09-13  Roger Hyam  Aster alpinus subsp. parviceps Novopokr.
103  2001-02-21  Steve Bekow  Polygala vulgaris L.
I know this looks old-fashioned, but it is just so simple and gives us so much flexibility.
Markus
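As a sketch of how a consumer might stitch such a "star" archive back together, the following assumes tab-delimited files named as in the example above, each with the core record id in its first column; the directory layout and the absence of a metafile parser are simplifications of my own, not part of the proposal.

import csv
import glob
import os
from collections import defaultdict

def read_star_archive(directory):
    """Read a 'star' archive: darwincore.csv is the core table and every
    other *.csv file is an extension keyed on the core record id."""
    def rows(path):
        with open(path, newline="", encoding="utf-8") as handle:
            for row in csv.reader(handle, delimiter="\t"):
                yield row

    records = {}
    for row in rows(os.path.join(directory, "darwincore.csv")):
        records[row[0]] = {"core": row[1:], "extensions": defaultdict(list)}

    for path in glob.glob(os.path.join(directory, "*.csv")):
        name = os.path.basename(path)
        if name == "darwincore.csv":
            continue
        for row in rows(path):
            record_id, values = row[0], row[1:]
            if record_id in records:
                # A core record may own many extension rows (1-many), e.g.
                # the three identifications of record 102 above.
                records[record_id]["extensions"][name].append(values)
    return records

# e.g. read_star_archive("dwc_index")["102"]["extensions"]["identification.csv"]
# would return the three identification rows for record 102.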
Generally, if we are going to have CSV files for data transfer, we don't need software implementations, just some documentation on what the CSV files should contain. Something along the lines of:
1) Make a report from your database as a CSV file (or files) with the following columns...
2) Zip it up.
3) Either put it on a webserver and send us the URL, or upload it using this webform.
We don't need to bother with TAPIR etc. You could even produce a CSV file of only the records that have changed, so big datasets needn't be a problem.
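On the provider side, the first two of those steps could be as small as the following sketch; the table name, the columns, and the output filenames are placeholders, and publishing the file (step 3) is left to whatever webserver or upload form is agreed.

import csv
import sqlite3
import zipfile

def export_report(db_path, out_zip="dataset_report.zip"):
    """1) Report the dataset as a CSV file and 2) zip it up. The occurrences
    table and its columns are invented for illustration."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT record_id, scientific_name, country, date_collected "
        "FROM occurrences"
    )
    with open("dataset_report.csv", "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["record_id", "scientific_name", "country", "date_collected"])
        writer.writerows(rows)
    conn.close()
    with zipfile.ZipFile(out_zip, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.write("dataset_report.csv")
    return out_zip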
I worry that we are working out how to move data about quickly and forgetting that the real goal is to integrate data, and that will only come if people put GUIDs on the stuff they own and use other people's GUIDs in their data. Solutions based around CSV files do nothing to move people in that direction, and I suspect they would make matters worse.
Finding ourselves in a hole and digging quicker may not be the best option.
Roger
-------------------------------------------------------------
Roger Hyam
Roger@BiodiversityCollectionsIndex.org
http://www.BiodiversityCollectionsIndex.org
-------------------------------------------------------------
Royal Botanic Garden Edinburgh
20A Inverleith Row, Edinburgh, EH3 5LR, UK
Tel: +44 131 552 7171 ext 3015
Fax: +44 131 248 2901
http://www.rbge.org.uk/
-------------------------------------------------------------
Roger writes: "I worry that we are working out how to move data about quickly".
That is exactly what this is for, but why is it a worry (other than for the likes of GBIF, who really do worry about moving data around quickly, since everyone is shouting about latency problems)?
It is roughly 166 times (3MB versus 500MB) more efficient for transferring a data source when you want the whole thing. The document passed across is still standards compliant (DwC plus flat extension schemas), and incorporating its generation into tools like a TAPIR wrapper would ensure this. The reality is that many of the very large datasets have to come to GBIF like this; the existing transfer protocols just do not perform.
Furthermore, think how much easier it would be for someone like Catalogue of Life or ITIS to put up a service that says "hey, you give me the URL to your Locally generated DwC Index File and I'll give you back a report containing YOUR occurrence GUID, and MY LSID for your identification". Isn't that a good thing?
In my view these files are additional to any existing interfaces, only meet certain data-type requirements, and by no means detract from any of the important work (both the technical and the social aspects) on GUID assignment, document schemas, etc. Therefore, just as sitemaps became a requirement for large websites, I think our community needs a more efficient, standards-based approach than "dump your data and we'll handle it".
Hi Tim,
The thing about sitemaps is that they describe resources with URIs; they are not just a dump of an Excel file.
I will buy you a beer in Oz if any proposal that is put forward mandates the use of GUIDs as primary keys in the CSV files (other than perhaps in the additional one-to-many files Markus was proposing). I'd buy you several beers if you manage to get it accepted :)
All the best,
Roger
BTW: Another way to represent a graph of data (other than a series of linked CSV files) would be to do it in RDF as Turtle, then zipped. This does away with the need for a separate dictionary to describe what the columns mean, has to be UTF-8, can include data types, etc. A script to explode this back into tables probably wouldn't be too slow, but this is probably just fantasy on my part.
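As a rough illustration of that aside, record 102 from Markus's example could be expressed with rdflib and gzipped; the namespaces and property choices here are only illustrative, and the subject would ideally be a resolvable GUID rather than an example URI.

import gzip
from rdflib import Graph, Literal, Namespace

DWC = Namespace("http://rs.tdwg.org/dwc/terms/")  # illustrative vocabulary
EX = Namespace("http://example.org/records/")     # placeholder identifiers

g = Graph()
g.bind("dwc", DWC)
record = EX["102"]
g.add((record, DWC.scientificName, Literal("Aster alpinus subsp. parviceps")))
g.add((record, DWC.institutionCode, Literal("Kew Herbarium")))

turtle = g.serialize(format="turtle")  # returns a str in rdflib >= 6
with gzip.open("record.ttl.gz", "wt", encoding="utf-8") as handle:
    handle.write(turtle)

The Turtle output is self-describing (prefixes instead of a separate column dictionary) and UTF-8, which is exactly the appeal mentioned above.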
On 14 May 2008, at 11:21, Tim Robertson wrote:
Roegr writes "I worry that we are working out how to move data about quickly"
That is exactly what this is for, but why is it a worry (other than the likes of GBIF who really are worrying about moving data around quickly since everyone is shouting about latency problems)?
It is a 166 times (3meg versus 500meg) more efficient transfer of a data source for those wishing to transfer the whole thing. It is still standards compliant for the document passed across (DwC + flat extension schemas), and by incorporating it's generation into tools like a TAPIR wrapper, would ensure this. The reality is, many of the very large datasets have to come to GBIF like this - the transfer protocols existing just do not perform.
Furthermore, think how much easier it would be for someone like Catalogue of Life or ITIS to put up a service that says "hey, you give me the URL to your Locally generated DwC Index File and I'll give you back a report containing YOUR occurrence GUID, and MY LSID for your identification". Isn't that a good thing?
In my view these files are additional to any existing interfaces, only meet certain data-type requirements, and by no means detract from any of the important work (both technical and social) on GUID assignment, document schemas, etc. Therefore, just as sitemaps became a requirement for large web sites, I think a more efficient, standards-based approach (rather than "just dump your data and we'll handle it") is required for our community.
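As a rough illustration of the kind of index-file matching service described above, a sketch in Python (the URL, the column positions and the tiny checklist lookup are all invented for the example; a real service would query Catalogue of Life or ITIS):

import csv, gzip, io, urllib.request

# Hypothetical checklist: verbatim name -> LSID.
CHECKLIST = {"Aster alpinus subsp. parviceps": "urn:lsid:example.org:names:12345"}

def report(index_url):
    raw = urllib.request.urlopen(index_url).read()
    text = gzip.decompress(raw).decode("utf-8")
    # Assume a tab file with the provider's record GUID in column 1
    # and the scientific name in column 2.
    for guid, name, *rest in csv.reader(io.StringIO(text), delimiter="\t"):
        yield guid, CHECKLIST.get(name, "no match")

for guid, lsid in report("http://provider.example.org/dwc-index.txt.gz"):
    print(guid, lsid, sep="\t")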
-----Original Message----- From: tdwg-tapir-bounces@lists.tdwg.org [mailto:tdwg-tapir-bounces@lists.tdwg.org] On Behalf Of Roger Hyam (TDWG) Sent: Wednesday, May 14, 2008 11:57 AM To: Markus Döring Cc: Hiscom-L Mailing List ((E-mail)); tdwg-tapir@lists.tdwg.org Subject: Re: [tdwg-tapir] Fwd: Tapir protocol - Harvest methods?[SEC=UNCLASSIFIED]
Generally, if we are going to have csv files for data transfer we don't need software implementations, just some documentation on what the csv files should contain. Something along the lines of:
1) Make a report from your database as a csv file(s) with the following columns...
2) Zip it up.
3) Either put it on a webserver and send us the URL, or upload it using this webform.
We don't need to bother with TAPIR etc. You could even produce a csv file of only the records that have changed, so big data sets needn't be a problem.
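A minimal sketch of that recipe in Python, assuming a SQLite database with a date_last_modified column (the table and column names are illustrative only):

import csv, gzip, sqlite3

def dump_changed(db_path, since, out_path="report.csv.gz"):
    # Report only the records changed since the last harvest, gzipped.
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT record_id, scientific_name, country, date_last_modified "
        "FROM occurrence WHERE date_last_modified > ?", (since,))
    with gzip.open(out_path, "wt", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["record_id", "scientific_name", "country", "date_last_modified"])
        writer.writerows(rows)
    return out_path  # put this on a webserver and send the URL, or upload it

dump_changed("provider.db", "2008-05-01")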
I worry that we are working out how to move data about quickly and forgetting that the real goal is to integrate data, and that will only come if people have GUIDs on the stuff they own and use other people's GUIDs in their data. Solutions based around CSV files do nothing to move people in that direction and, I suspect, would make matters worse.
Finding ourselves in a hole, digging quicker may not be the best option.
Roger
Roger Hyam Roger@BiodiversityCollectionsIndex.org http://www.BiodiversityCollectionsIndex.org
Royal Botanic Garden Edinburgh 20A Inverleith Row, Edinburgh, EH3 5LR, UK Tel: +44 131 552 7171 ext 3015 Fax: +44 131 248 2901 http://www.rbge.org.uk/
On 14 May 2008, at 10:21, Markus Döring wrote:
Interesting that we all come to the same conclusions... The trouble I had with just a simple flat csv file is repeating properties like multiple image URLs. ABCD clients don't use ABCD just because it's complex, but because they want to transport this relational data. We were considering 2 solutions for extending this csv approach.
The first would be to have a single large denormalised csv file with many rows for the same record. It would require knowledge about the related entities though, and could grow in size rapidly.
The second idea, which we think to adopt, is allowing a single level of 1-to-many related entities. It is basically a "star" design with the core DwC table in the center and any number of extension tables around it. Each "table", aka csv file, will have the record id as the first column, so the files can be related easily and it only needs a single identifier per record and not one for the extension entities. This would give a lot of flexibility while keeping things pretty simple to deal with. It would even satisfy the ABCD needs, as I haven't yet seen anyone requiring 2 levels of related tables (other than lookup tables). Those extensions could even be a simple 1-1 relation, but would keep things semantically together just like an XML namespace. The Darwin Core extensions would be a good example.
So we could have a gzipped set of files, maybe with a simple metafile indicating the semantics of the columns for each file. An example could look like this:
# darwincore.csv
102 Aster alpinus subsp. parviceps ...
103 Polygala vulgaris ...

# curatorial.csv
102 Kew Herbarium
103 Reading Herbarium

# identification.csv
102 2003-05-04 Karl Marx Aster alpinus L.
102 2007-01-11 Mark Twain Aster korshinskyi Tamamsch.
102 2007-09-13 Roger Hyam Aster alpinus subsp. parviceps Novopokr.
103 2001-02-21 Steve Bekow Polygala vulgaris L.
I know this looks old fashioned, but it is just so simple and gives us so much flexibility. Markus
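A sketch of how a harvester might stitch such a gzipped file set back together, assuming tab-separated values with the record id in the first column (the file names follow the example above; everything else is an assumption):

import csv, gzip
from collections import defaultdict

def rows(path):
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            yield row[0], row[1:]  # record id, remaining columns

core = dict(rows("darwincore.csv.gz"))                 # the table at the center of the star
identifications = defaultdict(list)
for rec_id, fields in rows("identification.csv.gz"):   # a 1-to-many extension
    identifications[rec_id].append(fields)

for rec_id, dwc in core.items():
    print(rec_id, dwc, identifications.get(rec_id, []))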
On 14 May, 2008, at 24:39, Greg Whitbread wrote:
We have used a very similar protocol to assemble the latest AVH cache. It should be noted that this is an as-well-as protocol that only works because we have an established semantic standard (hispid/abcd).
greg
Hi Roger,
<Homer style>Hmmm free beer. Hang on, if worrying about trying to transfer large data is not a good incentive for standardising a transfer mechanism, is free beer any better? ;o)
But seriously,
If the proposal was along the lines of a Tab file:
- LSID kingdom phylum class order basis_of_record....
And then supporting files (star schema) with:
- LSID latitude longitude ....
If the wrappers generated these kinds of structures using the same configuration generated when a user installed them, would you feel happier? This is really what Markus and I are proposing, and we fully support all the GUID generation work - I for one am desperate for it, including the BCI "datasource"-level GUIDs.
The analogy to sitemaps is quite simple - these index files do not provide the full detail; they provide the means to build an index based on DwC concepts, which would then facilitate access to the full detail record - e.g. through LSID. The LSID/GUID part is the same as the sitemap URI - no?
Tim
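A sketch of what the producing side of that could look like, writing a core file and one extension file with the GUID as the first column of each (the LSIDs, columns and file names are invented for the example):

import csv, gzip

core = [
    ("urn:lsid:example.org:occ:102", "Plantae", "Asteraceae", "PreservedSpecimen"),
    ("urn:lsid:example.org:occ:103", "Plantae", "Polygalaceae", "PreservedSpecimen"),
]
geospatial = [
    ("urn:lsid:example.org:occ:102", "46.5", "9.8"),
    ("urn:lsid:example.org:occ:103", "51.4", "-0.9"),
]

def write(path, rows):
    with gzip.open(path, "wt", newline="", encoding="utf-8") as f:
        csv.writer(f, delimiter="\t").writerows(rows)

write("core.txt.gz", core)              # LSID, kingdom, family, basis_of_record
write("geospatial.txt.gz", geospatial)  # LSID, latitude, longitude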
Tim,
Ahh, you hit the nail on the head. If these sitemaps contain just the indexing fields for records (and there is potentially more information available from another source) then there needs to be an unambiguous mechanism to link the things in the sitemaps to the things available via other means, i.e. GUIDs (which could be URIs of various flavours, including LSIDs).
LSID Authority plus sitemap would be good.
So we must mandate the use of GUIDs - your beer is practically safe.
Roger
Hi Roger,
Right, so I think we were talking over each other and both agree that the GUIDs (record and 'data source') and the resolution mechanism are vital, along with the schemas etc. for the full record response document.
This is slightly cleverer than a sitemap - a sitemap says "hey, here are the URIs of interest", but then you must resolve each one and build your full-text index (if you are called Google). What we are proposing is the URI plus a local index (the DwC fields) that is enough in some cases (the GBIF portal in its current state) to avoid having to resolve each record afterwards. It would also act as a seed for OAI-PMH style crawlers.
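A sketch of that "seed" idea, where a crawler takes just the GUID column from the index and resolves each record for the full detail later (the index layout, resolver URL and file name are assumptions):

import csv, gzip, urllib.parse, urllib.request

def guids(index_path):
    # Column 1 of every row in the index file is assumed to hold the record GUID/LSID.
    with gzip.open(index_path, "rt", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            yield row[0]

def crawl(index_path, resolver="http://provider.example.org/resolve?guid="):
    # Resolve each GUID for the full record only when (and if) it is needed.
    for guid in guids(index_path):
        yield guid, urllib.request.urlopen(resolver + urllib.parse.quote(guid)).read()

for guid, record in crawl("dwc-index.txt.gz"):
    print(guid, len(record), "bytes")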
Of course this does not help with aggregators who cannot maintain GUIDs - but that is a separate problem independent of any transfer mechanism.
Do you still have strong objections to this kind of approach?
Thanks
Tim
-----Original Message----- From: Roger Hyam (TDWG) [mailto:rogerhyam@mac.com] Sent: Wednesday, May 14, 2008 1:39 PM To: Tim Robertson Cc: 'Markus Döring'; tdwg-tapir@lists.tdwg.org Subject: Re: [tdwg-tapir] Fwd: Tapir protocol - Harvest methods?[SEC=UNCLASSIFIED]
Tim,
Ahh you hit the nail on the head. If these sitemaps contain just the indexing fields for records (and there is potentially more information available from another source) then there needs to be an unambiguous mechanism to link the things in the sitemaps to the things available via another means. i.e. GUIDs (could be URIs of various flavours including LSIDs).
LSID Authority plus sitemap would be good.
So we must mandate the use of GUIDs - your beer is practically safe.
Roger
On 14 May 2008, at 12:12, Tim Robertson wrote:
Hi Roger,
<Homer style>Hmmm free beer. Hang on, if worrying about trying to transfer large data is not a good incentive for standardising a transfer mechanism, is free beer any better? ;o)
But seriously,
If the proposal was along the lines of a Tab file:
- LSID kingdom phylum class order basis_of_record....
And then supporting files (star schema) with:
- LSID latitude longitude ....
If the wrappers generated these kind of structures using the same configuration generated when a user installed it, would you feel happier? This is really what Markus and I are proposing, and we fully support all the GUID generation work and I for one am desperate for it, including the BCI "datasource" level GUIDs.
The analogy to sitemaps is quite simple - these index files do not provide the full detail - they provide the means to build an index based on DwC concepts, that would then facilitate the accession of the full detail record
- e.g. through LSID. The LSID/GUID part is the same as the sitemap
URI - no?
Tim
-----Original Message----- From: Roger Hyam (TDWG) [mailto:rogerhyam@mac.com] Sent: Wednesday, May 14, 2008 12:58 PM To: Tim Robertson Cc: 'Markus Döring'; 'Hiscom-L Mailing List ((E-mail))'; tdwg-tapir@lists.tdwg.org Subject: Re: [tdwg-tapir] Fwd: Tapir protocol - Harvest methods?[SEC=UNCLASSIFIED]
Hi Tim,
The thing about the sitemaps is that they describe resources with URIs they are not just a dump of an excel file.
I will buy you a beer in Oz if any proposal that is put forward mandates the use of GUIDs for primary keys in the CSV files (other than perhaps the additional files Markus was proposing of one to many relationships). I'd buy you several beers if you manage to get it accepted :)
All the best,
Roger
BTW: Another way to represent a graph of data (other than a series of linked csv files) would be to do it in RDF as Turtle then zipped. This does way with the need of a separate dictionary to describe what the columns mean, has to be UTF-8, can include data types etc ... A script to explode this back to tables probably wouldn't be too slow but this is probably just fantasy on my part.
On 14 May 2008, at 11:21, Tim Robertson wrote:
Roegr writes "I worry that we are working out how to move data about quickly"
That is exactly what this is for, but why is it a worry (other than the likes of GBIF who really are worrying about moving data around quickly since everyone is shouting about latency problems)?
It is a 166 times (3meg versus 500meg) more efficient transfer of a data source for those wishing to transfer the whole thing. It is still standards compliant for the document passed across (DwC + flat extension schemas), and by incorporating it's generation into tools like a TAPIR wrapper, would ensure this. The reality is, many of the very large datasets have to come to GBIF like this - the transfer protocols existing just do not perform.
Furthermore, think how much easier it would be for someone like Catalogue of Life or ITIS to put up a service that says "hey, you give me the URL to your Locally generated DwC Index File and I'll give you back a report containing YOUR occurrence GUID, and MY LSID for your identification". Isn't that a good thing?
In my view these files are additional to any existing interfaces, only meet certain data type requirements and by no means detract from any of the important work (both technical and social aspects) on GUID assigning, document schemas etc. Therefore, like sitemaps became a requirement for large web sites, I think a more efficient standards based (than just dump your data and we'll handle it) approach is required for our community.
-----Original Message----- From: tdwg-tapir-bounces@lists.tdwg.org [mailto:tdwg-tapir-bounces@lists.tdwg.org] On Behalf Of Roger Hyam (TDWG) Sent: Wednesday, May 14, 2008 11:57 AM To: Markus Döring Cc: Hiscom-L Mailing List ((E-mail)); tdwg-tapir@lists.tdwg.org Subject: Re: [tdwg-tapir] Fwd: Tapir protocol - Harvest methods?[SEC=UNCLASSIFIED]
Generally if we are going to have csv files for data transfer we don't need to have software implementations just some documentation on what the csv files should contain. Something along the lines of:
- Make a report from your database as a csv file(s) with the
following columns... 2) Zip it up. 3) Either put it on a webserver and send us the URL or upload it using this webform.
We don't need to bother with TAPIR etc. You could even only produce a CSV file of the records that have changed so big data sets needn't be a problem.
I worry that we are working out how to move data about quickly and forgetting that the real goal is to integrate data and that will only come if people have GUIDs on the stuff they own and use other peoples GUIDs in their data. Solutions based around CSV files do nothing to move people in that direction and I would suspect lead to making matters worse.
Finding ourselves in hole digging quicker may not be the best option.
Roger
Roger Hyam Roger@BiodiversityCollectionsIndex.org http://www.BiodiversityCollectionsIndex.org
Royal Botanic Garden Edinburgh 20A Inverleith Row, Edinburgh, EH3 5LR, UK Tel: +44 131 552 7171 ext 3015 Fax: +44 131 248 2901 http://www.rbge.org.uk/
On 14 May 2008, at 10:21, Markus Döring wrote:
Interesting that we all come to the same conclusions... The trouble I had with just a simple flat csv file is repeating properties like multiple image urls. ABCD clients dont use ABCD just because its complex, but because they want to transport this relational data. We were considering 2 solutions to extending this csv approach. The first would be to have a single large denormalised csv file with many rows for the same record. It would require knowledge about the related entities though and could grow in size rapidly. The second idea which we think to adopt is allowing a single level of 1- many related entities. It is basically a "star" design with the core dwc table in the center and any number of extension tables around it. Each "table" aka csv file will have the record id as the first column, so the files can be related easily and it only needs a single identifier per record and not for the extension entities. This would give a lot of flexibility while keeping things pretty simple to deal with. It would even satisfy the ABCD needs as I havent yet seen anyone requiring 2 levels of related tables (other than lookup tables). Those extensions could even be a simple 1-1 relation, but would keep things semantically together just like a xml namespace. The darwin core extensions would be good for example.
So we could have a gzipped set of files, maybe with a simple metafile indicating the semantics of the columns for each file. An example could look like this:
# darwincore.csv
102    Aster alpinus subsp. parviceps    ...
103    Polygala vulgaris    ...

# curatorial.csv
102    Kew Herbarium
103    Reading Herbarium

# identification.csv
102    2003-05-04    Karl Marx    Aster alpinus L.
102    2007-01-11    Mark Twain    Aster korshinskyi Tamamsch.
102    2007-09-13    Roger Hyam    Aster alpinus subsp. parviceps Novopokr.
103    2001-02-21    Steve Bekow    Polygala vulgaris L.
I know this looks old fashioned, but it is just so simple and gives us so much flexibility. Markus
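A minimal sketch of a consumer for the star layout above, assuming tab-delimited files named as in the example; the record id in the first column ties each extension row back to its core record.

```python
# Load the core file, then attach extension rows (e.g. identifications) to the
# matching core record via the shared record id in column one.
import csv
from collections import defaultdict

def read_star(core_path="darwincore.csv", extension_paths=("identification.csv",)):
    records = {}
    with open(core_path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):  # tab-delimited is assumed
            records[row[0]] = {"core": row[1:], "extensions": defaultdict(list)}
    for path in extension_paths:
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.reader(f, delimiter="\t"):
                if row[0] in records:  # ignore orphan extension rows
                    records[row[0]]["extensions"][path].append(row[1:])
    return records

# records["102"]["extensions"]["identification.csv"] would then hold the three
# identification rows for record 102 in the example above.
```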
On 14 May, 2008, at 24:39, Greg Whitbread wrote:
We have used a very similar protocol to assemble the latest AVH cache. It should be noted that this is an as-well-as protocol that only works because we have an established semantic standard (hispid/abcd).
greg
Tim,
I have no problems at all provided GUIDs are in the sitemaps.
In my evil way I am just hoping that this is another mechanism to push all suppliers towards using GUIDs.
It would be easy for people to just say "All my data is in the sitemap" and not put up any other service or bother labeling the data with appropriate GUIDs, and it would be easy for an indexer to say "we will just dump and restore the data each time we process that file" and not track individual records through time.
If the sitemaps don't require GUIDs then it would be bad news from a specimen collections point of view.
Do Australians do real beer?
Roger
On 14 May 2008, at 13:12, Tim Robertson wrote:
Hi Roger,
Right, so I think we were talking over each other and both agree that the GUIDs (record and 'data source') and the resolution mechanism are vital, along with the schemas etc. for the full record response document.
This is slightly cleverer than a sitemap - a sitemap says "hey, here are the URIs of interest", but then you must resolve each one and build your full text index (if you are called Google). What we are proposing is the URI plus a local index (the DwC fields), which is enough for some instances (the GBIF portal in its current state) to avoid resolving each record afterwards. It would also act as a seed for OAI-PMH style crawlers.
Of course this does not help with aggregators who cannot maintain GUIDs - but that is a separate problem independent of any transfer mechanism.
Do you still have strong objections to this kind of approach?
Thanks
Tim
-----Original Message----- From: Roger Hyam (TDWG) [mailto:rogerhyam@mac.com] Sent: Wednesday, May 14, 2008 1:39 PM To: Tim Robertson Cc: 'Markus Döring'; tdwg-tapir@lists.tdwg.org Subject: Re: [tdwg-tapir] Fwd: Tapir protocol - Harvest methods?[SEC=UNCLASSIFIED]
Tim,
Ahh, you hit the nail on the head. If these sitemaps contain just the indexing fields for records (and there is potentially more information available from another source), then there needs to be an unambiguous mechanism to link the things in the sitemaps to the things available via other means, i.e. GUIDs (which could be URIs of various flavours, including LSIDs).
LSID Authority plus sitemap would be good.
So we must mandate the use of GUIDs - your beer is practically safe.
Roger
On 14 May 2008, at 12:12, Tim Robertson wrote:
Hi Roger,
<Homer style>Hmmm free beer. Hang on, if worrying about trying to transfer large data is not a good incentive for standardising a transfer mechanism, is free beer any better? ;o)
But seriously,
If the proposal was along the lines of a Tab file:
- LSID kingdom phylum class order basis_of_record....
And then supporting files (star schema) with:
- LSID latitude longitude ....
If the wrappers generated these kinds of structures using the same configuration created when a user installed them, would you feel happier? This is really what Markus and I are proposing, and we fully support all the GUID generation work - I for one am desperate for it - including the BCI "datasource" level GUIDs.
The analogy to sitemaps is quite simple - these index files do not provide the full detail; they provide the means to build an index based on DwC concepts, which would then facilitate retrieval of the full detail record, e.g. through LSID. The LSID/GUID part is the same as the sitemap URI - no?
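A minimal sketch of the tab files Tim describes, with the LSID as the shared key between the core file and a geospatial extension; the LSIDs, values, and file names are purely illustrative.

```python
# Write a core DwC tab file plus a star-schema extension file, both keyed by
# LSID; a real wrapper would pull the rows from its existing column mapping.
CORE_HEADER = ["lsid", "kingdom", "phylum", "class", "order", "basis_of_record"]
GEO_HEADER = ["lsid", "latitude", "longitude"]

core_rows = [
    ("urn:lsid:example.org:occ:102", "Plantae", "Tracheophyta",
     "Magnoliopsida", "Asterales", "PreservedSpecimen"),
]
geo_rows = [
    ("urn:lsid:example.org:occ:102", "55.95", "-3.21"),
]

def write_tab(path, header, rows):
    with open(path, "w", encoding="utf-8") as f:
        f.write("\t".join(header) + "\n")
        for row in rows:
            f.write("\t".join(row) + "\n")

write_tab("dwc_core.txt", CORE_HEADER, core_rows)
write_tab("dwc_geospatial.txt", GEO_HEADER, geo_rows)
```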
Tim
-----Original Message----- From: Roger Hyam (TDWG) [mailto:rogerhyam@mac.com] Sent: Wednesday, May 14, 2008 12:58 PM To: Tim Robertson Cc: 'Markus Döring'; 'Hiscom-L Mailing List ((E-mail))'; tdwg-tapir@lists.tdwg.org Subject: Re: [tdwg-tapir] Fwd: Tapir protocol - Harvest methods?[SEC=UNCLASSIFIED]
Hi Tim,
The thing about sitemaps is that they describe resources with URIs; they are not just a dump of an Excel file.
I will buy you a beer in Oz if any proposal that is put forward mandates the use of GUIDs as primary keys in the CSV files (other than perhaps in the additional one-to-many files Markus was proposing). I'd buy you several beers if you manage to get it accepted :)
All the best,
Roger
BTW: Another way to represent a graph of data (other than a series of linked csv files) would be to do it in RDF as Turtle, then zipped. This does away with the need for a separate dictionary to describe what the columns mean, has to be UTF-8, can include data types etc... A script to explode this back into tables probably wouldn't be too slow, but this is probably just fantasy on my part.
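A minimal sketch of the Turtle idea, assuming the third-party rdflib library (not mentioned in the thread) and using a Darwin Core terms namespace purely as an illustration.

```python
# One record as RDF triples, serialised to Turtle, then "exploded" back into
# flat subject / predicate / value rows.
from rdflib import Graph, Literal, Namespace, URIRef

DWC = Namespace("http://rs.tdwg.org/dwc/terms/")  # illustrative namespace URI

g = Graph()
g.bind("dwc", DWC)
record = URIRef("urn:lsid:example.org:occ:102")
g.add((record, DWC.scientificName, Literal("Aster alpinus subsp. parviceps")))
g.add((record, DWC.institutionCode, Literal("Kew Herbarium")))

print(g.serialize(format="turtle"))  # str in recent rdflib, bytes in older ones

# Explode the graph back into table-like rows.
for subject in set(g.subjects()):
    for predicate, value in g.predicate_objects(subject):
        print(subject, predicate, value, sep="\t")
```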
Markus,
We have used the star technique with ABCD with some success. The core is pretty close to DwC if the current determination goes there too. It can then be delivered with headers appropriate to the harvester's preference. It gets harder when we try to do TCS.
for preserving relational data, we could also just dump tapirlink resources to an sqlite database file (http://www.sqlite.org), zip it up, and again make it available via the web service. we use sqlite internally for many projects, and it's both easy to use and well supported by jdbc, php, python, etc.
would something like this be a useful option?
thanks, aaron
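A minimal sketch of the SQLite dump suggested above, using Python's built-in sqlite3 and zipfile modules; the table and column names are illustrative, not part of any agreed schema.

```python
# Write the core and extension tables into a single SQLite file, then zip it
# so it can be exposed by the same web service as the CSV dumps.
import sqlite3
import zipfile

conn = sqlite3.connect("resource_dump.sqlite")
conn.execute("CREATE TABLE IF NOT EXISTS darwincore "
             "(record_id TEXT PRIMARY KEY, scientific_name TEXT)")
conn.execute("CREATE TABLE IF NOT EXISTS identification "
             "(record_id TEXT, date_identified TEXT, identified_by TEXT)")
conn.execute("INSERT OR REPLACE INTO darwincore VALUES (?, ?)",
             ("102", "Aster alpinus subsp. parviceps"))
conn.execute("INSERT INTO identification VALUES (?, ?, ?)",
             ("102", "2007-09-13", "Roger Hyam"))
conn.commit()
conn.close()

with zipfile.ZipFile("resource_dump.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write("resource_dump.sqlite")
```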
It would keep the relations, but we don't really want any relational structure to be served up. And using SQLite binaries for the DwC star scheme would not be easier to work with than plain text files: they can even be loaded into Excel straight away, versioned with SVN, and so on. If there is a geospatial extension file which has the GUID in the first column, applications might grab that directly and not even touch the central core file if they only want location data.
I'd prefer to stick with a csv or tab-delimited file. The simpler the better. And it also can't get corrupted as easily.
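A minimal sketch of the access pattern Markus mentions: an application that only wants coordinates reads the geospatial extension file directly and never touches the core file. The file name, tab delimiter, header row, and column order (GUID, latitude, longitude) are assumptions.

```python
# Build a GUID -> (lat, lon) lookup from the geospatial extension file alone.
import csv

def load_coordinates(path="dwc_geospatial.txt"):
    coords = {}
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="\t")
        next(reader, None)  # skip the header row
        for row in reader:
            guid, lat, lon = row[0], row[1], row[2]
            coords[guid] = (float(lat), float(lon))
    return coords
```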
Markus
On 14 May, 2008, at 15:25, Aaron D. Steele wrote:
for preserving relational data, we could also just dump tapirlink resources to an sqlite database file (http://www.sqlite.org), zip it up, and again make it available via the web service. we use sqlite internally for many projects, and it's both easy to use and well supported by jdbc, php, python, etc.
would something like this be a useful option?
thanks, aaron
On Wed, May 14, 2008 at 2:21 AM, Markus Döring mdoering@gbif.org wrote:
Interesting that we all come to the same conclusions... The trouble I had with just a simple flat csv file is repeating properties like multiple image urls. ABCD clients dont use ABCD just because its complex, but because they want to transport this relational data. We were considering 2 solutions to extending this csv approach. The first would be to have a single large denormalised csv file with many rows for the same record. It would require knowledge about the related entities though and could grow in size rapidly. The second idea which we think to adopt is allowing a single level of 1- many related entities. It is basically a "star" design with the core dwc table in the center and any number of extension tables around it. Each "table" aka csv file will have the record id as the first column, so the files can be related easily and it only needs a single identifier per record and not for the extension entities. This would give a lot of flexibility while keeping things pretty simple to deal with. It would even satisfy the ABCD needs as I havent yet seen anyone requiring 2 levels of related tables (other than lookup tables). Those extensions could even be a simple 1-1 relation, but would keep things semantically together just like a xml namespace. The darwin core extensions would be good for example.
So we could have a gzipped set of files, maybe with a simple metafile indicating the semantics of the columns for each file. An example could look like this:
# darwincore.csv 102 Aster alpinus subsp. parviceps ... 103 Polygala vulgaris ...
# curatorial.csv 102 Kew Herbarium 103 Reading Herbarium
# identification.csv 102 2003-05-04 Karl Marx Aster alpinus L. 102 2007-01-11 Mark Twain Aster korshinskyi Tamamsch. 102 2007-09-13 Roger Hyam Aster alpinus subsp. parviceps Novopokr. 103 2001-02-21 Steve Bekow Polygala vulgaris L.
I know this looks old fashioned, but it is just so simple and gives us so much flexibility. Markus
On 14 May, 2008, at 24:39, Greg Whitbread wrote:
We have used a very similar protocol to assemble the latest AVH cache. It should be noted that this is an as-well-as protocol that only works because we have an established semantic standard (hispid/abcd).
greg
trobertson@gbif.org wrote:
Begin forwarded message:
From: "Aaron D. Steele" eightysteele@gmail.com Date: 13 de mayo de 2008 22:40:09 GMT+02:00 To: tdwg-tapir@lists.tdwg.org Cc: Aaron Steele asteele@berkeley.edu Subject: Re: [tdwg-tapir] Tapir protocol - Harvest methods?
at berkeley we've recently prototyped a simple php program that uses an existing tapirlink installation to periodically dump tapir resources into a csv file. the solution is totally generic and can dump darwin core (and technically abcd schema, although it's currently untested). the resulting csv files are zip archived and made accessible using a web service. it's a simple approach that has proven to be, at least internally, quite reliable and useful.
for example, several of our caching applications use the web service to harvest csv data from tapirlink resources using the following process:
- download latest csv dump for a resource using the web service.
- flush all locally cached records for the resource.
- bulk load the latest csv data into the cache.
in this way, cached data are always synchronized with the resource and there's no need to track new, deleted, or changed records. as an aside, each time these cached data are queried by the caching application or selected in the user interface, log-only search requests are sent back to the resource.
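A minimal sketch of that refresh cycle in Python, using a local sqlite cache purely for illustration; the dump URL, archive layout and cache table are hypothetical, not part of the prototype described above:

import csv, io, sqlite3, urllib.request, zipfile

# hypothetical dump endpoint for one resource
DUMP_URL = "http://example.org/tapirlink/dump.php?resource=specimens"

def refresh_cache(db_path="cache.sqlite"):
    # 1. download the latest zipped csv dump for the resource
    with urllib.request.urlopen(DUMP_URL) as resp:
        archive = zipfile.ZipFile(io.BytesIO(resp.read()))
    csv_name = archive.namelist()[0]
    text = io.TextIOWrapper(archive.open(csv_name), encoding="utf-8")
    rows = list(csv.reader(text))
    header, data = rows[0], rows[1:]

    conn = sqlite3.connect(db_path)
    cols = ", ".join('"%s"' % c for c in header)
    placeholders = ", ".join("?" for _ in header)
    # 2. flush all locally cached records for the resource
    conn.execute("DROP TABLE IF EXISTS cached_records")
    conn.execute("CREATE TABLE cached_records (%s)" % cols)
    # 3. bulk load the latest csv data into the cache
    conn.executemany("INSERT INTO cached_records VALUES (%s)" % placeholders, data)
    conn.commit()
    conn.close()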
after discussion with renato giovanni and john wieczorek, we've decided that merging this functionality into the tapirlink codebase would benefit the broader community. csv generation support would be declared through capabilities. although incremental harvesting wouldn't be immediately implemented, we could certainly extend the service to include it later.
i'd like to pause here to gauge the consensus, thoughts, concerns, and ideas of others. anyone?
thanks, aaron
2008/5/5 Kevin Richards RichardsK@landcareresearch.co.nz:

I think I agree here. The harvesting "procedure" is really defined outside the Tapir protocol, is it not? So it is really an agreement between the harvester and the harvestees. So what is really needed here is the standard procedure for maintaining a "harvestable" dataset and the standard procedure for harvesting that dataset. We have a general rule at Landcare that we never delete records in our datasets - they are either deprecated in favour of another record, and so the resolution of that record would point to the new record, or they are set to a state of "deleted", but are still kept in the dataset, and can be resolved (which would indicate a state of deleted).

Kevin

"Renato De Giovanni" renato@cria.org.br wrote on 6/05/2008 7:33 a.m.:

Hi Markus,

I would suggest creating new concepts for incremental harvesting, either in the data standards themselves or in some new extension. In the case of TAPIR, GBIF could easily check the mapped concepts before deciding between incremental or full harvesting.

Actually it could be just one new concept such as "recordStatus" or "deletionFlag". Or perhaps you could also want to create your own definition for dateLastModified indicating which set of concepts should be considered to see if something has changed or not, but I guess this level of granularity would be difficult to support.

Regards,
-- Renato

On 5 May 2008 at 11:24, Markus Döring wrote:

Phil, incremental harvesting is not implemented on the GBIF side as far as I am aware. And I don't think that will be a simple thing to implement on the current system. Also, even if we can detect only the changed records since the last harvesting via dateLastModified, we still have no information about deletions. We could have an arrangement saying that you keep deleted records as empty records with just the ID and nothing else (I vaguely remember LSIDs were supposed to work like this too). But that also needs to be supported on your side then, never entirely removing any record. I will have a discussion with the others at GBIF about that.

Markus
--
greg whitbread, Australian Centre for Plant Biodiversity Research / Integrated Botanical Information System, Australian National Botanic Gardens, GPO Box 1777 Canberra 2601; voice: +61 2 62509 482; fax: +61 2 62509 599; ghw@anbg.gov.au
I agree with Markus about using a simple data format. Relational database dumps would require standard database structures or would expose specific things that are already encapsulated by abstraction layers (conceptual schemas).
I'm not sure about the best way to represent complex data structures like ABCD, but for simpler providers such as TapirLink/Dwc, the idea was to create a new script responsible for dumping all mapped concepts of a specific data source into a single file. Providers could periodically call this script from a cron job to regenerate the dump. The first line in the dump file would indicate the concept identifiers (GUIDs) associated with each column to make it a generic solution (and more compatible with existing applications). Content could be tab-delimited and in the end compressed.
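A minimal sketch of such a dump script in Python; the column-to-concept mapping, the source table and the file names are hypothetical stand-ins for whatever the wrapper's configuration would provide:

import gzip, sqlite3

# hypothetical mapping from local columns to concept identifiers (GUIDs);
# a real installation would take this from its configured concept mapping
CONCEPTS = {
    "sci_name": "http://rs.tdwg.org/dwc/dwcore/ScientificName",
    "country":  "http://rs.tdwg.org/dwc/dwcore/Country",
}

def write_dump(db_path="provider.sqlite", out_path="dump.txt.gz"):
    conn = sqlite3.connect(db_path)
    cur = conn.execute("SELECT %s FROM occurrences" % ", ".join(CONCEPTS))
    with gzip.open(out_path, "wt", encoding="utf-8", newline="") as out:
        # first line: the concept GUIDs associated with each column
        out.write("\t".join(CONCEPTS.values()) + "\n")
        for row in cur:
            out.write("\t".join("" if v is None else str(v) for v in row) + "\n")
    conn.close()

Run periodically from a cron job, this regenerates the compressed, tab-delimited dump described above.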
Harvesters could use this "seed" file for the initial data import, and then potentially use incremental harvesting to update the cache. But in this case it would be necessary to know when the dump file was generated.
To use the existing TAPIR infrastructure, we would also need to know which providers support the dump files. Aaron's idea, when he first discussed it with me, was to use a new custom operation. This makes sense to me, but would require a small change in the protocol to add a custom slot in the operations section of capabilities responses. Curiously, this approach would allow the existence of TAPIR "static providers" - the simplest possible category, even simpler than TapirLite. They would not support inventories, searches or query templates, but would make the dump file available through the new custom operation. Metadata, capabilities and ping could be just static files served by a very simple script.
If this approach makes sense, I think these are the points that still need to be addressed:
1) Decide about how to indicate the timestamp associated with the dump file.
2) Change the TAPIR schema (or figure out another solution to advertise the new capability, but always remembering that in the TAPIR context a single provider instance can host multiple data sources that are usually distinguished by a query parameter in the URL, so I'm not sure how a sitemaps approach could be used).
3) Decide about how to represent complex data such as ABCD (if using multiple files, I would suggest compressing them together and serving them as a single file).
4) Write a short specification to describe the new custom operation and the data format.
I'm happy to change the schema if there's consensus about this.
Best Regards, -- Renato
Hi Renato,
Do you think this really goes under the TAPIR spec?
Sure we want the wrappers to produce it but it's just a document on a URL and can be described in such a simple way that loads of other people could incorporate it without getting into TAPIR specs, nor can they claim any TAPIR compliance just because they can do a 'select to outfile'.
I would also request that the headers aren't in the data file but in the metafile. It is way easier to dump a big DB to this 'document standard' without needing to worry about how to get headers into a 20 GB file.
Just some more thoughts
Cheers
Tim
I agree with Tim that it would be better to keep this proposal/specification separate from TAPIR. That said, it could still be included in the TAPIR capabilities to indicate this feature. But an important reason to have these files is to get more providers on board. So they should also be able to implement this without the TAPIR overhead.
A separate metafile would certainly also hold the timestamp of the last generation of the file, so keeping that separate has additional advantages.
Markus
I imagine a lot of these CSV files (we need a name for them) will be generated by an SQL query run on a scheduled task or a cron job. This is good and pretty easy to automate.
It increases the complexity of the dump process greatly if it also needs to update a metadata file with the new modified date every time. In fact it moves the setup of the process from just being a configuration job in most RDBMSs to needing actual scripts that run and change the metadata files. The structure of the CSV file is constant, so the metadata file should really only be created once when the process is set up.
Could we use the modified/created dates in the HTTP headers for the files instead? The client just has to call a HEAD to see if the file has changed and get its size before deciding to download it. (It is amazing what you can do with good old HTTP.)
The only thing that is lost doing it this way is we don't know the number of rows in the file, but we do know its size in bytes. What we gain is the ability for non-script-writing system admins to set up the system.
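A minimal sketch of that HEAD check in Python; the dump URL is hypothetical:

import urllib.request
from email.utils import parsedate_to_datetime

DUMP_URL = "http://example.org/dumps/occurrences.csv.gz"  # hypothetical

def dump_info(url=DUMP_URL):
    # HEAD request: only headers are transferred, not the (possibly huge) file
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        last_modified = resp.headers.get("Last-Modified")
        size = resp.headers.get("Content-Length")
    modified = parsedate_to_datetime(last_modified) if last_modified else None
    return modified, int(size) if size else None

# the harvester compares the date and size against what it saw last time
# before deciding whether to download the full file
modified, size = dump_info()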
Just a thought,
Roger
Locally generated / localised DwC index files? (if you have rich data behind LSID, then this file is an index that allows searching of those rich data using DwC fields)
I would like to see the data file accompanied by a compulsory metafile that details rights, citation, contacts, etc. Whether this file needs the data generation timestamp I am not so sure either, and the HTTP header approach does sound good. It means you can do a one-time metafile crafting and then just CRON the dump generation... This would be for institutions with IT resources - e.g. UK NBN with 20M records.
For Joe Bloggs with a data set, if we included it in the wrapper tools, then it is easy to rewrite the metafile seamlessly anyway so they don't care.
Cheers,
Tim
I am worried about duplication and maintenance of multiple metadata files and also formats. TAPIR has one, NCD exists, FGDC, EML, Dublin Core and many more. So maybe we should just add a URL to the metadata and not even specify the format, just recommend it should be compatible with Dublin Core? It could resolve into an RDF document, a TAPIR metadata response or an html page with embedded Dublin Core data. Then the dwc index metafile is a true static technical description and could be created once if we settle on the http approach.
Btw, with http you can even specify "If-Modified-Since" in a request header to get a "304 Not Modified" returned for files that haven't changed since. The http 1.1 specs require webservers to support this. So the http response could always indicate the date-last-modified and the index file will only be returned if it was modified since the last request. That's pretty much all we want, isn't it?
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.25
Markus
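A minimal sketch of the conditional request Markus describes, in Python; the URL and the stored timestamp are hypothetical:

import urllib.error, urllib.request

DUMP_URL = "http://example.org/dumps/occurrences.csv.gz"  # hypothetical

def fetch_if_newer(url, last_seen):
    # last_seen: HTTP date string taken from a previous Last-Modified header
    req = urllib.request.Request(url, headers={"If-Modified-Since": last_seen})
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.read(), resp.headers.get("Last-Modified")
    except urllib.error.HTTPError as e:
        if e.code == 304:  # not modified since the last harvest, nothing to do
            return None, last_seen
        raise

data, stamp = fetch_if_newer(DUMP_URL, "Thu, 15 May 2008 00:00:00 GMT")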
Just a quick observation. I don't think we want or should expect the structure of the csv file to remain constant. The method Aaron proposed (and implemented) doesn't care. It looks for known-good DwC concept labels as column headers in whatever order, and with whatever other nonsense you want to put in the file, but only responds on the caching side to those concepts needed cache-side. I wouldn't want to lose that flexibility.
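A minimal sketch of that header-driven mapping in Python; the concept labels listed here are only examples of what a cache might ask for:

import csv

# concepts the caching side cares about; any other columns are ignored
WANTED = ["ScientificName", "Country", "DecimalLatitude", "DecimalLongitude"]

def records_of_interest(path):
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="\t")
        header = next(reader)
        # map each wanted concept label to its column position, wherever it is
        positions = {name: header.index(name) for name in WANTED if name in header}
        for row in reader:
            yield {name: row[i] for name, i in positions.items()}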
Hi Tim,
Just wondering: In the case of NBN, after the initial import from a dump file and if they just change a few records, are you planning to download everything again and then overwrite all 20 million records in GBIF's database? Or are you also interested in using some sort of incremental harvesting?
Best Regards, -- Renato
I am interested in it for anyone who can provide date last modified - so the dump really becomes the seed file for OAI-PMH (or other).
Off the top of my head I *think* (but don't quote me!) that UK NBN allows people to basically delete their collection and upload a new version. So the collection code increments each time and OAI-PMH is out of the question, as it is a reharvest each time anyway. If it is not NBN then it is some other large aggregator that does this.
Right. I agree there's no particular reason to expose the dump file through a typical TAPIR URL. Headers could also be in a separate file. However, from a TAPIR service perspective, I think it's still important to somehow advertise the availability of a dump file in capabilities (even if GBIF doesn't use this). There's a slot at the end of a capabilities response that could be used for this purpose:
...
<custom>
  <ext:dump baseurl="http://somehost/somepath/"/>
</custom>
...
Providers that only want to see their data being served through GBIF could simply make the dump files available somewhere, without the need to install and maintain a web service. TAPIR providers that have other reasons to exist could decide if they want to register the TAPIR endpoint or just the base URL of the dump file in GBIF's registry.
HTTP headers ("If-Modified-Since" and "Last-Modified") seem to solve the timestamp issue in an elegant way.
Regarding complex data, I would be inclined to propose some compact XML representation compatible with TAPIR so that existing wrapper functionalities could be used to generate the dump file. I suppose this could save considerable time. Another advantage is that it would be a generic solution, not restricted to one-level relationships. Since TAPIR output models can map XML nodes to a concatenation of concepts and literals, it's also possible to have a single record element with some sort of csv content inside. I'm just not sure how to escape any separators that might be present in real content.
We could also provide more information about the format in the new dump element:
<ext:dump baseurl="http://somehost/somepath/" format="csv"/>
or
<ext:dump baseurl="http://somehost/somepath/" format="xml" outputModel="some_url"/>
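And a rough sketch of how a client could pick this up from the capabilities response (the namespace URI for "ext" is invented here, since nothing is defined yet):

import xml.etree.ElementTree as ET

EXT_NS = "http://example.org/tapir/ext"  # hypothetical namespace for the dump element

capabilities = """<response xmlns:ext="%s">
  <custom>
    <ext:dump baseurl="http://somehost/somepath/" format="csv"/>
  </custom>
</response>""" % EXT_NS

root = ET.fromstring(capabilities)
dump = root.find("custom/{%s}dump" % EXT_NS)
if dump is not None:
    print("dump advertised at", dump.get("baseurl"), "in format", dump.get("format"))
else:
    print("this provider does not advertise a dump file")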
Regards, -- Renato
Hi Renato,
Do you think this really goes under the TAPIR spec?
Sure, we want the wrappers to produce it, but it's just a document on a URL and can be described in such a simple way that loads of other people could incorporate it without getting into the TAPIR specs; nor can they claim any TAPIR compliance just because they can do a 'select to outfile'.
I would also request that the headers aren't in the data file but in the metafile. It is way easier to dump a big DB to this 'document standard' without needing to worry about how to get headers into a 20 GB file.
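To make that concrete, a possible metafile shape (the field names are only a suggestion, nothing is agreed yet) sitting next to the headerless, gzipped data file:

# index_meta.yaml - crafted once by hand, while the data file is regenerated by CRON
title: My Herbarium specimen index
rights: Creative Commons Attribution
citation: My Herbarium (2008). Localised DwC index file.
contact: curator@myherbarium.example.org
files:
  darwincore.txt.gz:
    - http://rs.tdwg.org/dwc/dwcore/GlobalUniqueIdentifier
    - http://rs.tdwg.org/dwc/dwcore/ScientificName
    - http://rs.tdwg.org/dwc/dwcore/Country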
Just some more thoughts
Cheers
Tim
Renato, I was thinking along those lines too. It would be nice for TAPIRs to announce the availability of the index files. I wouldn't mind adding it even to the regular TAPIR schema once it has proven to work with the custom slot approach you have given.
Regarding star-shaped data, I would prefer to agree on one format instead of allowing different ones, to save consumers from this pain. There is a straightforward XML serialisation for this scheme that we could use instead of tab files:
<record uri=""> <dwc:property1 /> <dwc:property2 /> extA:record <extA:property1 /> <extA:property2 /> </extA:record> extB:record <extB:property1 /> <extB:property2 /> extB:record <record>
The advantage is that it can be produced by TAPIR software, and XML serialisation is required for many services, e.g. RSS, anyway. But then again, the whole point of the index files is that they are easy to generate and consume. On the other hand, this XML structure is pretty simple to process and can be generated straight away from databases like SQL Server that have XML output, without the need for scripting.
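As a rough Python sketch (the extension namespace URI and the example values are made up), producing that structure from a core row plus its extension rows takes only a few lines:

import xml.etree.ElementTree as ET

DWC_NS = "http://rs.tdwg.org/dwc/dwcore/"
IDENT_NS = "http://example.org/ext/identification"  # hypothetical extension namespace

core = {"ScientificName": "Aster alpinus subsp. parviceps", "Family": "Asteraceae"}
identifications = [
    {"ScientificName": "Aster alpinus", "AuthorYearOfScientificName": "L."},
]

record = ET.Element("record", uri="http://example.org/specimen/102")
for concept, value in core.items():
    ET.SubElement(record, "{%s}%s" % (DWC_NS, concept)).text = value
for ident in identifications:
    ext = ET.SubElement(record, "{%s}record" % IDENT_NS)
    for concept, value in ident.items():
        # Darwin Core concepts reused inside the extension record
        ET.SubElement(ext, "{%s}%s" % (DWC_NS, concept)).text = value

print(ET.tostring(record, encoding="unicode"))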
That touches a different issue I am facing with the star scheme, by the way. I have created an identification extension for Darwin Core that holds the historical list of identification events and their outcome. This is a YAML section of the metafile describing the columns for this extension through fully qualified concepts a la TAPIR:
identification:
  - http://rs.tdwg.org/dwc/dwcore/ScientificName
  - http://rs.tdwg.org/dwc/dwcore/AuthorYearOfScientificName
  - http://rs.tdwg.org/dwc/dwcore/Family
  - http://rs.tdwg.org/dwc/dwcore/IdentificationQualifier
  - http://rs.tdwg.org/dwc/curatorial/DateIdentified
  - http://rs.tdwg.org/dwc/curatorial/IdentifiedBy
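A small sketch of how a consumer could use that section, assuming the extension rows live in a tab file whose first column is the record id (one row is inlined here; PyYAML assumed to be available):

import yaml  # PyYAML

meta = yaml.safe_load("""
identification:
  - http://rs.tdwg.org/dwc/dwcore/ScientificName
  - http://rs.tdwg.org/dwc/curatorial/DateIdentified
  - http://rs.tdwg.org/dwc/curatorial/IdentifiedBy
""")

columns = meta["identification"]
line = "102\tAster alpinus\t1913-03-12\tKarl Marx"  # one row of identification.txt
record_id, *values = line.split("\t")
# map each fully qualified concept to its value for this record
print(record_id, dict(zip(columns, values)))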
When creating this I realised that pretty much all the concepts I was interested in already existed in Darwin Core or the curatorial extension. Wouldn't it be wise to reuse those concepts? Or are they strictly tied to the idea of a current identification and therefore can't be used for historical ones? This is probably more of a Darwin Core question than TAPIR, but we are all on this list anyway ...
The XML in that case would look something like this:
<record uri="http://mygarden.com/specimen/plants/54321-423-43-54-6-3-24-44 "> dwc:ScientificNameAster alpinus subsp. parvicepsdwc:ScientificName ... ident:record dwc:ScientificNameAster alpinusdwc:ScientificName dwc:AuthorYearOfScientificNameL.</dwc:AuthorYearOfScientificName> dwc:FamilyAsteraceaedwc:Family cur:DateIdentified1913-03-12</cur:DateIdentified> cur:IdentifiedByKarl Marx</cur:IdentifiedBy> </ident:record> ident:record dwc:ScientificNameAster alpinus subsp. parvicepsdwc:ScientificName dwc:AuthorYearOfScientificNameNovopokr.</ dwc:AuthorYearOfScientificName> dwc:FamilyAsteraceaedwc:Family cur:DateIdentified2003-09-07</cur:DateIdentified> cur:IdentifiedByKeith Richards</cur:IdentifiedBy> </ident:record> <record>
Markus
Hi, I welcome the idea of creating dump files for more efficient harvesting. We've created such an option in the NCD toolkit and are working on dump files also in the CPT (Checklist Provider Tool). The best format depends on the usage; probably an option to create dump files in either csv, json or xml would be the most flexible. And maybe an option to dump only the information that is useful for caching. Is it perhaps a good idea to discuss this also with a wider audience at the next TDWG annual meeting? I agree with Renato that it could be nice to show this option in the TAPIR capabilities.
Wouter
Hi Markus,
Since DarwinCore is a generic list of elements that can be used by any application schema, I think it's OK to use them in the new schema that you're suggesting.
I agree that ideally we should try to define and use a common format for index files, although it seems that we will have at least two: csv for simple data and probably another one in XML for complex data, right?
Regarding the XML for complex data, if you manage to find a generic schema that can be used in different contexts (not only biodiversity data) then I agree we could avoid extra attributes in the respective capabilities element. Otherwise, I would prefer to see some extra attribute (such as "outputModel") giving more information about the XML. Since TAPIR was designed to be generic, this should not be a problem because clients and networks are already free to decide and to mandate specific TAPIR capabilities. This doesn't mean that there will be lots of formats for index files. It's a matter of agreeing on a common format but still keeping the protocol generic to allow different uses by other communities.
I also agree we could advertise the index file through some new TAPIR element instead of using the custom slot.
Best Regards, -- Renato
Renato, complex data can also be represented by tab files, with a file for each extension that has a pointer in the first column. That is what we originally had in mind with the star scheme.
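A minimal consumer-side sketch of that layout, assuming gzipped tab files called darwincore.txt.gz and identification.txt.gz with the record id in the first column (the file names are just placeholders):

import gzip
from collections import defaultdict

def rows(path):
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n").split("\t")

# one core row per record id
core = {row[0]: row[1:] for row in rows("darwincore.txt.gz")}

# any number of extension rows per record id, the pointer being the first column
identifications = defaultdict(list)
for row in rows("identification.txt.gz"):
    identifications[row[0]].append(row[1:])

for record_id, fields in core.items():
    print(record_id, fields, identifications.get(record_id, []))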
Markus
The notion of star schemas fits very nicely with what I had in mind for the RDF vocabularies. It would be good if each of the CSV files in the star corresponded to a class in the vocabulary and the columns in the CSV files mapped to properties in the vocabulary (or some other common vocabulary such as VCARD or DC etc.). It would then be trivial to map the star to a semantic representation (such as the RDF returned from an LSID) or vice versa.
We can evolve the vocabularies to help this along.
This is probably all obvious but worth stating.
All the best,
Roger
On 20 May 2008, at 16:36, Markus Döring wrote:
Renato, complex data can also be represented by tab files, with a file for each extension that has a pointer in the first column. That is what we originally had in mind with the star scheme.
Markus
Markus,
If we want to ensure the lowest possible barrier for providers, then I think zipped csv files need to be supported. If we really want to handle complex data using the same format, then we need something like the csv star scheme you mentioned (with well-defined rules about all files and how the records are related).
The limitation in this case is that we would only handle one-level relationships (not a generic solution) and providers with complex data would probably need to write some code to generate the dumps (not sure how many providers would do it) - unless wrappers that can handle complex data implement additional functionality to produce these dumps.
On the other hand, if we allow more than one format, complex data could be handled with compact XML representations (in a generic way) which could be automatically produced by existing wrappers.
So my understanding is that the biggest decision is: Use a single format (csv) with additional rules for complex data, or allow different formats (one for simple and another for complex data).
Although I know it's usually much better for clients to deal with a single format, my *feeling* in this case is that it would be more effective to allow different formats. I'm also not sure if it would be easier for clients to handle additional star scheme rules when importing complex data than it would be to parse a single XML file encoded in some compact structure.
Just some thoughts...
Best Regards, -- Renato
On 20 May 2008 at 17:36, Markus Döring wrote:
Renato, complex data can also be represented by tab files, with a file for each extension that has a pointer in the first column. That is what we originally had in mind with the star scheme.
Markus
it's my intuition that harvesting data in *different formats* is going to become a dominant use case handled by data providers worldwide. for example, some clients will want csv or star, while others will want xml or sqlite. i'd like to explore adding a simple plug-in architecture to tapirlink that, given a format plug-in (for example, csv_plugin.php), creates a resource data dump in that format which can be zip archived (along with any other metadata files required by the format) and downloaded by clients. in this way, as new formats are requested by the community, new format plug-ins can be added. it's a simple approach that's scalable, improves interoperability with clients, and avoids the need to agree on a single format to support.
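tapirlink itself is php, but just to sketch the shape of the idea in a few lines of python (the plug-in names and record values are made up):

import csv
import json

def csv_plugin(records, path):
    # one record per row, no header (headers would live in a metadata file)
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(records)

def json_plugin(records, path):
    with open(path, "w", encoding="utf-8") as f:
        json.dump([list(r) for r in records], f)

PLUGINS = {"csv": csv_plugin, "json": json_plugin}  # new formats just register here

def dump_resource(records, fmt, path):
    if fmt not in PLUGINS:
        raise ValueError("no plug-in installed for format %r" % fmt)
    PLUGINS[fmt](records, path)

dump_resource([("102", "Aster alpinus"), ("103", "Polygala vulgaris")], "csv", "myresource.csv")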
i'd also like to explore using a new 'harvest' tapir operation to facilitate harvest requests. for example:
tapir.php/myresource?op=harvest&format=csv&sbn=604800
the optional sbn parameter above stands for seconds before now. you can interpret the above request as:
"i want to download a csv dump of myresource only if it has been created within the last week (604,800 seconds)."
this approach might be somewhat controversial since it involves potential changes in the tapir protocol that not everyone agrees with. on the other hand, after consulting with renato and john, i don't see any harm with prototyping these new features, and giving the community the opportunity to experiment with concrete harvesting functionality before coming to a general consensus.
if you're keen on collaborating, i've created a new branch to prototype these ideas in: https://digir.svn.sourceforge.net/svnroot/digir/tapirlink/branches/harvest
thoughts? concerns?
thanks, aaron
I agree, and I'd suggest there are a couple of other useful formats. JSON and serialized PHP are commonly implemented in other web services, such as those offered by Yahoo and Google.
Both of these are immediately useful in most programming languages, which would make it very easy to digest and display biodiversity information without the overhead of, say, ABCD or our other XML structures. It would be interesting to see whether JSON and/or serialized PHP were easier or faster to consume than CSV. (I'm still looking for a reference for the term "star CSV", anyone want to explain?)
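Just to show the scale of the difference, the same record in both forms takes only a few lines of Python to produce and read back (the values are made up):

import csv
import io
import json

record = ["102", "Aster alpinus subsp. parviceps", "Asteraceae"]

# csv: compact, but the consumer must know the column order from somewhere
buffer = io.StringIO()
csv.writer(buffer).writerow(record)
print(buffer.getvalue().strip())
print(next(csv.reader(io.StringIO(buffer.getvalue()))))

# json: field names travel with the data, trivially loaded in most languages
encoded = json.dumps({"id": record[0], "scientificName": record[1], "family": record[2]})
print(encoded)
print(json.loads(encoded)["scientificName"])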
We need to make it as easy as possible to be involved at both ends of the data connection.
Cheers, Ben
-- Ben Richardson w=http://science.dec.wa.gov.au/people/?sid=98 e=ben.richardson@dec.wa.gov.au tz=ADST (UTC+9) t=+61 8 9334 0511 f=+61 8 9334 0515
I am being far too talkative today but can't resist.
Something like
mysqldump -u [username] -p[password] [databasename] | gzip > databasebackup.sql.gz
where databasebackup.sql.gz is in web accessible directory. Then on the other server
curl http://some/place/we/read/databasebackup.sql.gz | gunzip | mysql -u [username] -p[password] [databasename]
(probably got my pipes and redirects mixed up there but you can see what I mean)
Pop these two in cron jobs on either machine and the db on the second will be a read only mirror of the first. Curl can handle all the authentication etc if you need to hide the backup file behind protection. Not bad for two lines!
May not work with very big files though ...
Roger
------------------------------------------------------------- Roger Hyam Roger@BiodiversityCollectionsIndex.org http://www.BiodiversityCollectionsIndex.org ------------------------------------------------------------- Royal Botanic Garden Edinburgh 20A Inverleith Row, Edinburgh, EH3 5LR, UK Tel: +44 131 552 7171 ext 3015 Fax: +44 131 248 2901 http://www.rbge.org.uk/ -------------------------------------------------------------
On 14 May 2008, at 14:25, Aaron D. Steele wrote:
for preserving relational data, we could also just dump tapirlink resources to an sqlite database file (http://www.sqlite.org), zip it up, and again make it available via the web service. we use sqlite internally for many projects, and it's both easy to use and well supported by jdbc, php, python, etc.
would something like this be a useful option?
thanks, aaron
On Wed, May 14, 2008 at 2:21 AM, Markus Döring mdoering@gbif.org wrote:
Interesting that we all come to the same conclusions... The trouble I had with just a simple flat csv file is repeating properties like multiple image URLs. ABCD clients don't use ABCD just because it's complex, but because they want to transport this relational data. We were considering 2 solutions for extending this csv approach. The first would be to have a single large denormalised csv file with many rows for the same record. It would require knowledge about the related entities though, and could grow in size rapidly. The second idea, which we think to adopt, is allowing a single level of 1-many related entities. It is basically a "star" design with the core dwc table in the center and any number of extension tables around it. Each "table", aka csv file, will have the record id as the first column, so the files can be related easily and it only needs a single identifier per record and not for the extension entities. This would give a lot of flexibility while keeping things pretty simple to deal with. It would even satisfy the ABCD needs, as I haven't yet seen anyone requiring 2 levels of related tables (other than lookup tables). Those extensions could even be a simple 1-1 relation, but would keep things semantically together, just like an xml namespace. The darwin core extensions would be a good example.
So we could have a gzipped set of files, maybe with a simple metafile indicating the semantics of the columns for each file. An example could look like this:
# darwincore.csv 102 Aster alpinus subsp. parviceps ... 103 Polygala vulgaris ...
# curatorial.csv 102 Kew Herbarium 103 Reading Herbarium
# identification.csv 102 2003-05-04 Karl Marx Aster alpinus L. 102 2007-01-11 Mark Twain Aster korshinskyi Tamamsch. 102 2007-09-13 Roger Hyam Aster alpinus subsp. parviceps Novopokr. 103 2001-02-21 Steve Bekow Polygala vulgaris L.
I know this looks old fashioned, but it is just so simple and gives us so much flexibility. Markus
On 14 May, 2008, at 24:39, Greg Whitbread wrote:
We have used a very similar protocol to assemble the latest AVH cache. It should be noted that this is an as-well-as protocol that only works because we have an established semantic standard (hispid/abcd).
greg
Perhaps it could be put into some form of xml to preserve the relational model? Maybe a mechanism could be developed so that others could access the xml as well. How about even adding some sort of subsetting mechanism so that entire data sets need not be retrieved?
just a thought...
This discussion is starting to remind me of another one in the Google App Engine discussion group. They talk about different ways to bulk upload data to their BigTable database.
http://groups.google.com/group/google-appengine/browse_thread/thread/18d246b...
I have read so far:
- XML
- CSV
- RDF
- JSON
- AMF
- SQL
- OOXML
- TSV
Uff so many ideas...
I would take whatever Google finally decides, as it will probably become a de facto standard :D
The discussion is funny :D
Cheers.
On Wed, May 14, 2008 at 4:04 PM, Dave Vieglais vieglais@ku.edu wrote:
Perhaps it could be put into some form of xml to preserve the relational model? Maybe a mechanism could be developed so that others could access the xml as well. How about even putting some sort of subsetting mechanism so that entire data sets need not be retrieved.
just a thought...
On Wed, May 14, 2008 at 9:25 AM, Aaron D. Steele eightysteele@gmail.com wrote:
for preserving relational data, we could also just dump tapirlink resources to an sqlite database file (http://www.sqlite.org), zip it up, and again make it available via the web service. we use sqlite internally for many projects, and it's both easy to use and well supported by jdbc, php, python, etc.
would something like this be a useful option?
thanks, aaron
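As a rough illustration of the SQLite idea above (not the actual TapirLink code): dump mapped rows into an SQLite file and zip it so a web service can expose it as a single download. The table name, columns, and sample data are invented for the example.

# Illustrative only: write rows into an SQLite file and zip the result so a
# web service can offer it for download.
import sqlite3
import zipfile

def dump_to_sqlite(rows, dump_path="resource_dump.sqlite"):
    # rows: iterable of (record_id, scientific_name, country) tuples
    con = sqlite3.connect(dump_path)
    con.execute("DROP TABLE IF EXISTS darwincore")
    con.execute("CREATE TABLE darwincore ("
                "record_id TEXT PRIMARY KEY, scientific_name TEXT, country TEXT)")
    con.executemany("INSERT INTO darwincore VALUES (?, ?, ?)", rows)
    con.commit()
    con.close()

def publish(dump_path="resource_dump.sqlite", archive_path="resource_dump.zip"):
    # the zip archive is what would be made available via the web service
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write(dump_path)

if __name__ == "__main__":
    dump_to_sqlite([("102", "Aster alpinus subsp. parviceps", "CH"),
                    ("103", "Polygala vulgaris", "GB")])
    publish()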
On Wed, May 14, 2008 at 2:21 AM, Markus Döring mdoering@gbif.org wrote:
Interesting that we all come to the same conclusions... The trouble I had with just a simple flat CSV file is repeating properties like multiple image URLs. ABCD clients don't use ABCD just because it's complex, but because they want to transport this relational data.
We were considering two solutions for extending this CSV approach. The first would be to have a single large denormalised CSV file with many rows for the same record. It would require knowledge about the related entities, though, and could grow in size rapidly. The second idea, which we are thinking of adopting, is to allow a single level of 1-to-many related entities. It is basically a "star" design with the core DwC table in the center and any number of extension tables around it. Each "table", aka CSV file, will have the record id as the first column, so the files can be related easily, and it only needs a single identifier per record and not for the extension entities. This would give a lot of flexibility while keeping things pretty simple to deal with. It would even satisfy the ABCD needs, as I haven't yet seen anyone requiring two levels of related tables (other than lookup tables). Those extensions could even be a simple 1-1 relation, but would keep things semantically together, just like an XML namespace. The Darwin Core extensions would be a good example.
So we could have a gzipped set of files, maybe with a simple metafile indicating the semantics of the columns for each file. An example could look like this:
# darwincore.csv
102    Aster alpinus subsp. parviceps    ...
103    Polygala vulgaris    ...

# curatorial.csv
102    Kew Herbarium
103    Reading Herbarium

# identification.csv
102    2003-05-04    Karl Marx    Aster alpinus L.
102    2007-01-11    Mark Twain    Aster korshinskyi Tamamsch.
102    2007-09-13    Roger Hyam    Aster alpinus subsp. parviceps Novopokr.
103    2001-02-21    Steve Bekow    Polygala vulgaris L.
I know this looks old fashioned, but it is just so simple and gives us so much flexibility. Markus
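A minimal consumer-side sketch of the "star" layout just described, assuming tab-separated files named as in the example above and already downloaded locally; beyond the file names and the id-first-column rule, everything here is illustrative.

# Illustrative only: join core Darwin Core rows with their 1-to-many extension
# rows via the record id carried in the first column of every file.
import csv
from collections import defaultdict

def read_rows(path):
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.reader(f, delimiter="\t"))

def load_star(core_path="darwincore.csv", extension_paths=("identification.csv",)):
    records = {row[0]: {"core": row[1:], "extensions": defaultdict(list)}
               for row in read_rows(core_path)}
    for path in extension_paths:
        for row in read_rows(path):
            record_id, values = row[0], row[1:]
            if record_id in records:
                records[record_id]["extensions"][path].append(values)
    return records

# Usage (given the files from the example exist locally):
#   for record_id, data in load_star().items():
#       print(record_id, data["core"], dict(data["extensions"]))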
On 14 May, 2008, at 24:39, Greg Whitbread wrote:
We have used a very similar protocol to assemble the latest AVH cache. It should be noted that this is an as-well-as protocol that only works because we have an established semantic standard (hispid/abcd).
greg
... and because of App Engine we were considering using YAML for a very simple metafile for the conceptual binding, instead of having column header rows. http://code.google.com/appengine/docs/configuringanapp.html
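A sketch of what such a YAML metafile could look like, parsed here with PyYAML (assumed installed); the file layout and the Darwin Core term URIs are purely illustrative, since no binding format has been agreed.

# Illustrative only: a hypothetical meta.yaml mapping the columns of each tab
# file to concept identifiers, instead of putting header rows in the data files.
import yaml  # PyYAML, assumed installed

META_YAML = """
darwincore.csv:
  - http://rs.tdwg.org/dwc/terms/catalogNumber
  - http://rs.tdwg.org/dwc/terms/scientificName
identification.csv:
  - http://rs.tdwg.org/dwc/terms/catalogNumber
  - http://rs.tdwg.org/dwc/terms/dateIdentified
  - http://rs.tdwg.org/dwc/terms/identifiedBy
  - http://rs.tdwg.org/dwc/terms/scientificName
"""

bindings = yaml.safe_load(META_YAML)
for filename, concepts in bindings.items():
    print(filename, "->", concepts)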
Another interesting problem you touch on...
Take the GBIF Index. People want a country "slice" of the data. The SQL to slice up the data on occurrences is fine, but then what about the taxonomy stuff - do you throw out the stuff that is not relevant to the sliced region? What about sub-selecting only the regional common names, etc.?
I think you will be unlikely to come up with subsets of DB dumps generically without specific model knowledge, but I'd be interested to hear if you do!!! I think you'd basically have to do an interceptor that does a pre-select - probably also a chained-up sequence of post-SQLs - no?
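A rough sketch of the pre-select/post-select chain hinted at above, against a hypothetical SQLite dump with occurrence and taxon tables; the schema and column names are invented for illustration and are not the GBIF index model.

# Illustrative only: slice a hypothetical dump by country, then fetch just the
# taxonomy rows that the sliced occurrences actually reference.
import sqlite3

def slice_by_country(dump_path, iso_country_code):
    con = sqlite3.connect(dump_path)
    # pre-select: the occurrence slice itself
    occurrences = con.execute(
        "SELECT record_id, taxon_id, scientific_name "
        "FROM occurrence WHERE iso_country_code = ?",
        (iso_country_code,)).fetchall()
    # post-select: only the taxa referenced by that slice
    taxon_ids = sorted({row[1] for row in occurrences})
    placeholders = ",".join("?" for _ in taxon_ids) or "NULL"
    taxa = con.execute(
        "SELECT taxon_id, name, taxon_rank FROM taxon "
        "WHERE taxon_id IN (" + placeholders + ")", taxon_ids).fetchall()
    con.close()
    return occurrences, taxa

# Usage (given such a dump file exists):
#   occurrences, taxa = slice_by_country("provider_dump.sqlite", "AU")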
While getting everyone to use the same standard is ideal, we can at least shoot to provide a real, sustainable standard in the way sitemaps are now standard for websites. Go to http://somesite.com/sitemap.xml and/or http://somesite.com/sitemap.xml.gz and, if it's a relatively current site where the producers have considered search engine optimization (SEO), the chances are it'll be there.
During my first few months here at Mobot I've brought up DiGIR (the last implementation was dying hourly) so that it's now available full time, and I am working with Renato and Kevin to get Tapir running the same way. In speaking with Tim and others it sounds as if some will still be using the older protocol while others will use Tapir, so there is value in continuing to provide both. In this way I would see something like a sitemap being "yet another standard"; if it were accepted as a best practice, newer/more current sites could adopt it and advertise it as something dead simple for others to implement. When a harvester visits a site it would check for that file first, before spending the time and bandwidth (per Tim's example) to rebuild what the server should already have available - much like a spider checking for sitemap.xml before it starts randomly following href links on web pages to spider it the 'old way'. In this way I could also see the idea of providers lowering their bandwidth as an added incentive to get on the bus.
I too have thought of things like BitTorrent (with updates announced and tripped by RSS feeds), simple rsync deltas, and even sending XML over XMPP (Jabber) to keep things in sync - but at the end of the day we want something that is just there, made available via the simplest method. Expecting others to install something special to do something extra is going to be difficult; but if we say "if you create this file off the root we can use it, and it will benefit your site as well", that sounds much easier to achieve.
New to the list (and bioinformatics in general)
Phil
also - Nomina was an incredible time for me, so when I'm in Australia I plan on buying beers for anyone within earshot
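A minimal harvester-side sketch of the "check for a file off the root first" approach from the message above; the well-known file name is hypothetical, since no such convention has been agreed.

# Illustrative only: look for a provider-published index file at a well-known
# location before falling back to record-by-record protocol harvesting.
# The file name "dwc_index.txt.gz" is hypothetical.
import urllib.error
import urllib.request

def fetch_index_if_available(base_url, index_name="dwc_index.txt.gz", timeout=30):
    url = base_url.rstrip("/") + "/" + index_name
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()  # the gzipped dump, ready to load locally
    except urllib.error.URLError:
        return None  # nothing published there; harvest via TAPIR/DiGIR paging

# Usage:
#   data = fetch_index_if_available("http://somesite.com")
#   if data is None, fall back to the existing harvester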
participants (15)
- Aaron D. Steele
- Aaron D. Steele
- Dave Vieglais
- Greg Whitbread
- Javier de la Torre
- John R. WIECZOREK
- Markus Döring
- Phil Cryer
- Renato De Giovanni
- Richardson, Ben
- Roger Hyam
- Roger Hyam (TDWG)
- Tim Robertson
- trobertson@gbif.org
- Wouter Addink