[tdwg-tapir] Fwd: Tapir protocol - Harvest methods? [SEC=UNCLASSIFIED]

Wed May 14 00:39:05 CEST 2008

We have used a very similar protocol to assemble the latest AVH cache.
It should be noted that this is an as-well-as protocol that only works
because we have an established semantic standard (hispid/abcd).

greg

trobertson at gbif.org wrote:
> Hi All,
> 
> This is very interesting to me, as I came to the same conclusion
> while harvesting for GBIF.
> 
> As a "harvester of all records" it is best described with an example:
> 
> - Complete Inventory of ScientificNames: 7 minutes @ the limited 200
> records per page
> - Complete Harvesting of records:
>   - 260,000 records
>   - 9 hours harvesting duration
>   - 500MB TAPIR+DwC XML returned (DwC 1.4 with geospatial and curatorial
> extensions)
> - Extraction of DwC records from harvested XML: <2 minutes
>   - Resulting file size 32MB, Gzipped to <3MB
> 
> I spun hard drives for 9 hours, and took up bandwidth that is paid for, to
> retrieve something that could have been generated provider side in minutes
> and transferred in seconds (3MB).
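
Working the figures above through as plain arithmetic makes the gap
explicit. This is only a sketch: it assumes 1 MB means 10**6 bytes and
treats the "<3MB" gzipped size as an upper bound of 3 MB. The constant and
function names are illustrative, not part of any harvester.

```python
# Figures quoted in the message above; MB is assumed to mean 10**6 bytes.
RECORDS = 260_000
HARVEST_SECONDS = 9 * 3600
XML_BYTES = 500 * 10**6
TAB_BYTES = 32 * 10**6
GZIP_BYTES = 3 * 10**6  # quoted as "<3MB", so this is an upper bound


def harvest_summary():
    """Derive per-record rates and sizes from the quoted totals."""
    return {
        "records_per_second": RECORDS / HARVEST_SECONDS,
        "xml_bytes_per_record": XML_BYTES / RECORDS,
        "tab_bytes_per_record": TAB_BYTES / RECORDS,
        "xml_to_gzip_ratio": XML_BYTES / GZIP_BYTES,
    }


if __name__ == "__main__":
    for name, value in harvest_summary().items():
        print(f"{name}: {value:.1f}")
```

Under those assumptions the harvest ran at roughly 8 records per second,
and the XML was at least about 167 times larger than the gzipped dump.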
> 
> I sent a proposal to TDWG last year termed "datamaps" which was
> effectively what you are describing, and I based it on the Sitemaps
> protocol, but I got nowhere with it.  With Markus, we are making more
> progress and I have spoken with several GBIF data providers about a
> proposed new standard for full dataset harvesting and it has been received
> well.  So Markus and I have started a new proposal and have a working name
> of 'Localised DwC Index' file generation (it is an index if you have more
> than DwC data, and DwC is still standards compliant) which is really a
> GZipped Tab file dump of the data, which is slightly extensible.  The
> document is not ready to circulate yet but the benefits section reads
> currently:
> 
> - Provider database load reduced, allowing it to serve real distributed
> queries rather than "full datasource" harvesters
> - Providers can choose to publish their index as it suits them, giving
> control back to the provider
> - Localised index generation can be built into tools not yet capable of
> integrating with TDWG protocol networks such as GBIF
> - Harvesters receive a full dataset view in one request, making it very
> easy to determine what records are eligible for deletion
> - It becomes very simple to write clients that consume entire datasets.
> E.g. data cleansing tools that the provider can run:
>   - Give me ISO Country Codes for my dataset
>     - The application pulls down the provider's index file, generates ISO
> country codes, returns a simple table using the provider's own
> identifier
>   - Check my names for spelling mistakes
>     - Application skims over the records and provides a list of those
> that are not known to the application
> - Providers such as UK NBN cannot serve 20 million records to the GBIF
> index using the existing protocols efficiently.
>   - They can, however, generate a localised index
> - Harvesters can very quickly build up searchable indexes and it is easy
> to create large indices.
>   - Node Portal can easily aggregate index data files
> - A true index to the data, not an illusion of a cache; more like Google Sitemaps
> 
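
The deletion point in the list above can be sketched roughly as follows.
Since the proposal document is not yet circulated, everything about the
file layout here is an assumption: a gzipped, tab-delimited dump with a
header row and an identifier column called "id". The function names are
hypothetical too.

```python
import csv
import gzip
import io


def read_index_ids(gz_bytes, id_column="id"):
    """Return the set of record identifiers in a gzipped, tab-delimited dump.

    Assumes the first line is a header row naming the columns.
    """
    with gzip.open(io.BytesIO(gz_bytes), mode="rt",
                   encoding="utf-8", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {row[id_column] for row in reader}


def deleted_ids(cached_ids, gz_bytes, id_column="id"):
    """Identifiers held in the local cache but absent from the new dump."""
    return set(cached_ids) - read_index_ids(gz_bytes, id_column)


if __name__ == "__main__":
    dump = gzip.compress("id\tcountry\nA1\tAU\nA3\tNZ\n".encode("utf-8"))
    print(sorted(deleted_ids({"A1", "A2", "A3"}, dump)))
```

Because the dump is a full view of the dataset, a plain set difference is
enough; no deletion flags or timestamps are needed on the provider side.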
> It is the ease with which one can offer tools to data providers that really
> interests me.  The technical threshold required to produce services that
> offer reporting tools on people's data is really very low with this
> mechanism.  This and the fact that large datasets will be harvestable - we
> have even considered the likes of BitTorrent for the large ones although
> I think this is overkill.
> 
> As a consumer therefore I fully support this move as a valuable addition
> to the wrapper tools.
> 
> Cheers
> 
> Tim
> (wrote the GBIF harvesting, and new to this list)
> 
> 
>>
>> Begin forwarded message:
>>
>>> From: "Aaron D. Steele" <eightysteele at gmail.com>
>>> Date: 13 de mayo de 2008 22:40:09 GMT+02:00
>>> To: tdwg-tapir at lists.tdwg.org
>>> Cc: Aaron Steele <asteele at berkeley.edu>
>>> Subject: Re: [tdwg-tapir] Tapir protocol - Harvest methods?
>>>
>>> at berkeley we've recently prototyped a simple php program that uses
>>> an existing tapirlink installation to periodically dump tapir
>>> resources into a csv file. the solution is totally generic and can
>>> dump darwin core (and technically abcd schema, although it's currently
>>> untested). the resulting csv files are zip archived and made
>>> accessible using a web service. it's a simple approach that has proven
>>> to be, at least internally, quite reliable and useful.
>>>
>>> for example, several of our caching applications use the web service
>>> to harvest csv data from tapirlink resources using the following
>>> process:
>>> 1) download latest csv dump for a resource using the web service.
>>> 2) flush all locally cached records for the resource.
>>> 3) bulk load the latest csv data into the cache.
>>>
>>> in this way, cached data are always synchronized with the resource and
>>> there's no need to track new, deleted, or changed records. as an
>>> aside, each time these cached data are queried by the caching
>>> application or selected in the user interface, log-only search
>>> requests are sent back to the resource.
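
A minimal sketch of the three-step refresh described above, written in
Python rather than the PHP of the Berkeley prototype. The cache table, the
CSV column names ("id", "scientific_name"), the one-file-per-archive layout
and the function names are all assumptions made for illustration; the
actual web service and dump format are not described in this thread.

```python
import csv
import io
import sqlite3
import urllib.request
import zipfile


def create_cache_table(conn):
    """Create a hypothetical local cache table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cache ("
        "resource TEXT, record_id TEXT, scientific_name TEXT)"
    )


def download_dump(url):
    """Step 1: fetch the latest zipped CSV dump (url is supplied by the caller)."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def refresh_cache(conn, resource, zip_bytes):
    """Steps 2 and 3: flush the resource's cached rows, then bulk load the dump."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        name = archive.namelist()[0]  # assumes one CSV file per archive
        with archive.open(name) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            rows = [(resource, r["id"], r["scientific_name"])
                    for r in csv.DictReader(text)]
    with conn:  # one transaction: the flush and the reload commit together
        conn.execute("DELETE FROM cache WHERE resource = ?", (resource,))
        conn.executemany(
            "INSERT INTO cache (resource, record_id, scientific_name) "
            "VALUES (?, ?, ?)",
            rows,
        )
    return len(rows)
```

Running the flush and the load in one transaction means a failed load
leaves the previous cache contents in place instead of an empty table.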
>>>
>>> after discussion with renato giovanni and john wieczorek, we've
>>> decided that merging this functionality into the tapirlink codebase
>>> would benefit the broader community. csv generation support would be
>>> declared through capabilities. although incremental harvesting
>>> wouldn't be immediately implemented, we could certainly extend the
>>> service to include it later.
>>>
>>> i'd like to pause here to gauge the consensus, thoughts, concerns, and
>>> ideas of others. anyone?
>>>
>>> thanks,
>>> aaron
>>>
>>> 2008/5/5 Kevin Richards <RichardsK at landcareresearch.co.nz>:
>>>>
>>>> I think I agree here.
>>>>
>>>> The harvesting "procedure" is really defined outside the Tapir
>>>> protocol, is it not?  So it is really an agreement between the
>>>> harvester and the harvestees.
>>>>
>>>> So what is really needed here is the standard procedure for
>>>> maintaining a "harvestable" dataset and the standard procedure for
>>>> harvesting that dataset.
>>>> We have a general rule at Landcare, that we never delete records in
>>>> our datasets - they are either deprecated in favour of another
>>>> record, and so the resolution of that record would point to the new
>>>> record, or they are set to a state of "deleted", but are still kept
>>>> in the dataset, and can be resolved (which would indicate a state of
>>>> deleted).
>>>>
>>>> Kevin
>>>>
>>>>
>>>>>>> "Renato De Giovanni" <renato at cria.org.br> 6/05/2008 7:33 a.m. >>>
>>>>
>>>> Hi Markus,
>>>>
>>>> I would suggest creating new concepts for incremental harvesting,
>>>> either in the data standards themselves or in some new extension. In
>>>> the case of TAPIR, GBIF could easily check the mapped concepts before
>>>> deciding between incremental or full harvesting.
>>>>
>>>> Actually it could be just one new concept such as "recordStatus" or
>>>> "deletionFlag". Or perhaps you could also want to create your own
>>>> definition for dateLastModified indicating which set of concepts
>>>> should be considered to see if something has changed or not, but I
>>>> guess this level of granularity would be difficult to be supported.
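
One way the suggestion above might look on the harvester side, as a rough
sketch only. "recordStatus" and "dateLastModified" are the concept names
floated in this thread; the "deleted" value, the "id" key, the ISO 8601
timestamps and the function names are assumptions, not part of TAPIR or
any existing data standard.

```python
from datetime import datetime

# Concepts a provider would need to map before incremental harvesting
# could be trusted (assumed set, based on the names suggested above).
REQUIRED_CONCEPTS = {"dateLastModified", "recordStatus"}


def choose_strategy(mapped_concepts):
    """Pick incremental harvesting only when both concepts are mapped."""
    if REQUIRED_CONCEPTS <= set(mapped_concepts):
        return "incremental"
    return "full"


def apply_incremental(cache, records, last_harvest):
    """Apply records modified after last_harvest to a dict keyed by record id.

    Records flagged as deleted are dropped from the cache; all others are
    inserted or overwritten. Returns the number of changes applied.
    """
    applied = 0
    for rec in records:
        modified = datetime.fromisoformat(rec["dateLastModified"])
        if modified <= last_harvest:
            continue  # unchanged since the previous harvest
        if rec.get("recordStatus") == "deleted":
            cache.pop(rec["id"], None)
        else:
            cache[rec["id"]] = rec
        applied += 1
    return applied
```

A provider that maps neither concept simply falls back to a full harvest,
so existing installations would keep working unchanged.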
>>>>
>>>> Regards,
>>>> --
>>>> Renato
>>>>
>>>> On 5 May 2008 at 11:24, Markus Döring wrote:
>>>>
>>>>> Phil,
>>>>> incremental harvesting is not implemented on the GBIF side as far
>>>>> as I am aware. And I don't think that will be a simple thing to
>>>>> implement on the current system. Also, even if we can detect only
>>>>> the changed records since the last harvesting via dateLastModified
>>>>> we still have no information about deletions. We could have an
>>>>> arrangement saying that you keep deleted records as empty records
>>>>> with just the ID and nothing else (I vaguely remember LSIDs were
>>>>> supposed to work like this too). But that also needs to be
>>>>> supported on your side then, never entirely removing any record. I
>>>>> will have a discussion with the others at GBIF about that.
>>>>>
>>>>> Markus
>>>> _______________________________________________
>>>> tdwg-tapir mailing list
>>>> tdwg-tapir at lists.tdwg.org
>>>> http://lists.tdwg.org/mailman/listinfo/tdwg-tapir

-- 

Australian Centre for Plant BIodiversity Research<------------------+
National            greg whitBread             voice: +61 2 62509 482
Botanic Integrated Botanical Information System  fax: +61 2 62509 599
Gardens                      S........ I.T. happens.. ghw at anbg.gov.au
+----------------------------------------->GPO Box 1777 Canberra 2601
