[tdwg-tapir] Fwd: Tapir protocol - Harvest methods?[SEC=UNCLASSIFIED]

Tim Robertson trobertson at gbif.org
Wed May 14 14:12:56 CEST 2008

Hi Roger,

Right, so I think we were talking over each other and both agree that the
GUIDs (record and 'data source') and the resolution mechanism are vital, along
with the schemas etc. for the full record response document.

This is slightly cleverer than a sitemap - a site map says "hey here are the
URIs of interest" but then you must resolve each one and build your full
text index (if you are called Google).
What we are proposing is a URI, plus a local index (the DwC fields) that is
enough for some instances (GBIF portal in its current state) to not have to
resolve each record afterwards.  It would also act as a seed for OAI-PMH style

Of course this does not help with aggregators who cannot maintain GUIDs -
but that is a separate problem independent of any transfer mechanism.

Do you still have strong objections to this kind of approach?



-----Original Message-----
From: Roger Hyam (TDWG) [mailto:rogerhyam at mac.com] 
Sent: Wednesday, May 14, 2008 1:39 PM
To: Tim Robertson
Cc: 'Markus Döring'; tdwg-tapir at lists.tdwg.org
Subject: Re: [tdwg-tapir] Fwd: Tapir protocol - Harvest


Ahh you hit the nail on the head. If these sitemaps contain just the
indexing fields for records (and there is potentially more information
available from another source) then there needs to be an unambiguous
mechanism to link the things in the sitemaps to the things available via
another means. i.e. GUIDs (could be URIs of various flavours including

LSID Authority plus sitemap would be good.

So we must mandate the use of GUIDs - your beer is practically safe.


On 14 May 2008, at 12:12, Tim Robertson wrote:

> Hi Roger,
> <Homer style>Hmmm free beer.
> Hang on, if worrying about trying to transfer large data is not a good 
> incentive for standardising a transfer mechanism, is free beer any 
> better?
> ;o)
> But seriously,
> If the proposal was along the lines of a Tab file:
> - LSID kingdom phylum class order basis_of_record....
> And then supporting files (star schema) with:
> - LSID latitude longitude ....
> If the wrappers generated these kind of structures using the same 
> configuration generated when a user installed it, would you feel 
> happier?
> This is really what Markus and I are proposing, and we fully support 
> all the GUID generation work and I for one am desperate for it, 
> including the BCI "datasource" level GUIDs.
> The analogy to sitemaps is quite simple - these index files do not 
> provide the full detail - they provide the means to build an index 
> based on DwC concepts, that would then facilitate the accession of the 
> full detail record
> - e.g. through LSID.  The LSID/GUID part is the same as the sitemap 
> URI - no?
> Tim
> -----Original Message-----
> From: Roger Hyam (TDWG) [mailto:rogerhyam at mac.com]
> Sent: Wednesday, May 14, 2008 12:58 PM
> To: Tim Robertson
> Cc: 'Markus Döring'; 'Hiscom-L Mailing List ((E-mail))'; 
> tdwg-tapir at lists.tdwg.org
> Subject: Re: [tdwg-tapir] Fwd: Tapir protocol - Harvest 
> Hi Tim,
> The thing about the sitemaps is that they describe resources with URIs; 
> they are not just a dump of an Excel file.
> I will buy you a beer in Oz if any proposal that is put forward 
> mandates the use of GUIDs for primary keys in the CSV files (other 
> than perhaps the additional files Markus was proposing of one to many 
> relationships).
> I'd buy you several beers if you manage to get it accepted :)
> All the best,
> Roger
> BTW: Another way to represent a graph of data (other than a series of 
> linked csv files) would be to do it in RDF as Turtle then zipped. This 
> does away with the need of a separate dictionary to describe what the 
> columns mean, has to be UTF-8, can include data types etc ... A script 
> to explode this back to tables probably wouldn't be too slow but this 
> is probably just fantasy on my part.
> On 14 May 2008, at 11:21, Tim Robertson wrote:
>> Roger writes "I worry that we are working out how to move data about 
>> quickly"
>> That is exactly what this is for, but why is it a worry (other than 
>> the likes of GBIF who really are worrying about moving data around 
>> quickly since everyone is shouting about latency problems)?
>> It is a 166 times (3meg versus 500meg) more efficient transfer of a 
>> data source for those wishing to transfer the whole thing.  It is 
>> still standards compliant for the document passed across (DwC + flat 
>> extension schemas), and by incorporating its generation into tools 
>> like a TAPIR wrapper, would ensure this.  The reality is, many of the 
>> very large datasets have to come to GBIF like this - the transfer 
>> protocols existing just do not perform.
>> Furthermore, think how much easier it would be for someone like 
>> Catalogue of Life or ITIS to put up a service that says "hey, you 
>> give me the URL to your Locally generated DwC Index File and I'll 
>> give you back a report containing YOUR occurrence GUID, and MY LSID 
>> for your identification".  Isn't that a good thing?
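The report described above could be produced with very little code once the index file has been parsed. What follows is only a sketch under stated assumptions: the (occurrence GUID, scientific name) row layout, the function name, the lookup table and the example.org LSID are all hypothetical, and a real service would match names far more carefully than by exact string.

```python
def match_report(index_rows, lsid_by_name):
    """Pair each occurrence GUID with the service's LSID for its name.

    index_rows: (occurrence_guid, scientific_name) pairs from the index file.
    lsid_by_name: the service's own name -> LSID table (hypothetical).
    Names the service does not know are reported with None.
    """
    report = []
    for occurrence_guid, scientific_name in index_rows:
        report.append((occurrence_guid, lsid_by_name.get(scientific_name)))
    return report


# Hypothetical data: one known name and one misspelt, unknown name.
lsid_by_name = {"Polygala vulgaris L.": "urn:lsid:example.org:names:1"}
index_rows = [
    ("occ-103", "Polygala vulgaris L."),
    ("occ-104", "Polygala vulgarus"),
]
report = match_report(index_rows, lsid_by_name)
```

The rows that come back with None are also the list a "check my names" tool would hand to the provider.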
>> In my view these files are additional to any existing interfaces, 
>> only meet certain data type requirements and by no means detract from 
>> any of the important work (both technical and social aspects) on GUID 
>> assigning, document schemas etc.  Therefore, like sitemaps became a 
>> requirement for large web sites, I think a more efficient standards 
>> based (than just dump your data and we'll handle it) approach is 
>> required for our community.
>> -----Original Message-----
>> From: tdwg-tapir-bounces at lists.tdwg.org 
>> [mailto:tdwg-tapir-bounces at lists.tdwg.org] On Behalf Of Roger Hyam
>> (TDWG)
>> Sent: Wednesday, May 14, 2008 11:57 AM
>> To: Markus Döring
>> Cc: Hiscom-L Mailing List ((E-mail)); tdwg-tapir at lists.tdwg.org
>> Subject: Re: [tdwg-tapir] Fwd: Tapir protocol - Harvest 
>> Generally if we are going to have csv files for data transfer we 
>> don't need to have software implementations just some documentation 
>> on what the csv files should contain. Something along the lines of:
>> 1) Make a report from your database as a csv file(s) with the 
>> following columns...
>> 2) Zip it up.
>> 3) Either put it on a webserver and send us  the URL or upload it 
>> using this webform.
>> We don't need to bother with TAPIR etc. You could even only produce a 
>> CSV file of the records that have changed so big data sets needn't be 
>> a problem.
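Steps 1 and 2 above might look like the following in Python. This is a minimal sketch, assuming the report has already been pulled from the database as a list of rows; the function names and column names are hypothetical, and gzip with tab delimiters is used here (as proposed elsewhere in the thread) rather than a zip archive.

```python
import csv
import gzip


def write_gzipped_tab_file(path, header, rows):
    """Write a header and rows as a UTF-8, tab-delimited, gzip-compressed file."""
    with gzip.open(path, "wt", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(header)
        writer.writerows(rows)


def read_gzipped_tab_file(path):
    """Read the file back as a list of rows (each a list of strings)."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        return list(csv.reader(handle, delimiter="\t"))
```

Step 3 is then just a matter of placing the resulting file somewhere a harvester can fetch it.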
>> I worry that we are working out how to move data about quickly and 
>> forgetting that the real goal is to integrate data and that will only 
>> come if people have GUIDs on the stuff they own and use other people's 
>> GUIDs in their data.  Solutions based around CSV files do nothing to 
>> move people in that direction and I would suspect lead to making 
>> matters worse.
>> Finding ourselves in a hole, digging quicker may not be the best 
>> option.
>> Roger
>> -------------------------------------------------------------
>> Roger Hyam
>> Roger at BiodiversityCollectionsIndex.org
>> http://www.BiodiversityCollectionsIndex.org
>> -------------------------------------------------------------
>> Royal Botanic Garden Edinburgh
>> 20A Inverleith Row, Edinburgh, EH3 5LR, UK
>> Tel: +44 131 552 7171 ext 3015
>> Fax: +44 131 248 2901
>> http://www.rbge.org.uk/
>> -------------------------------------------------------------
>> On 14 May 2008, at 10:21, Markus Döring wrote:
>>> Interesting that we all come to the same conclusions...
>>> The trouble I had with just a simple flat csv file is repeating 
>>> properties like multiple image urls. ABCD clients don't use ABCD just 
>>> because it's complex, but because they want to transport this 
>>> relational data. We were considering 2 solutions to extending this 
>>> csv approach. The first would be to have a single large denormalised 
>>> csv file with many rows for the same record. It would require 
>>> knowledge about the related entities though and could grow in size 
>>> rapidly. The second idea, which we are inclined to adopt, is allowing a 
>>> single level of 1-many related entities. It is basically a "star" 
>>> design with the core dwc table in the center and any number of 
>>> extension tables around it.
>>> Each "table" aka csv file will have the record id as the first 
>>> column, so the files can be related easily and it only needs a 
>>> single identifier per record and not for the extension entities. 
>>> This would give a lot of flexibility while keeping things pretty 
>>> simple to deal with. It would even satisfy the ABCD needs as I 
>>> haven't yet seen anyone requiring 2 levels of related tables (other 
>>> than lookup tables).
>>> Those extensions could even be a simple 1-1 relation, but would keep 
>>> things semantically together just like an xml namespace. The darwin 
>>> core extensions would be good for example.
>>> So we could have a gzipped set of files, maybe with a simple 
>>> metafile indicating the semantics of the columns for each file.
>>> An example could look like this:
>>> # darwincore.csv
>>> 102    Aster alpinus subsp. parviceps    ...
>>> 103    Polygala vulgaris    ...
>>> # curatorial.csv
>>> 102    Kew Herbarium
>>> 103    Reading Herbarium
>>> # identification.csv
>>> 102    2003-05-04    Karl Marx    Aster alpinus L.
>>> 102    2007-01-11    Mark Twain    Aster korshinskyi Tamamsch.
>>> 102    2007-09-13    Roger Hyam    Aster alpinus subsp. parviceps Novopokr.
>>> 103    2001-02-21    Steve Bekow    Polygala vulgaris L.
>>> I know this looks old fashioned, but it is just so simple and gives 
>>> us so much flexibility.
>>> Markus
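To illustrate how a consumer might relate the files in the "star" layout above, here is a minimal Python sketch. It assumes tab-delimited files with no header row and the record id in the first column; the function names and the shape of the returned dictionary are hypothetical, and the sample rows are abbreviated from the example in the message.

```python
import csv
import io
from collections import defaultdict


def read_tab_rows(text):
    """Parse tab-delimited text into rows, skipping blank lines."""
    return [row for row in csv.reader(io.StringIO(text), delimiter="\t") if row]


def attach_extension(core_rows, extension_rows):
    """Group extension rows under the core record with the same first-column id."""
    by_id = defaultdict(list)
    for row in extension_rows:
        by_id[row[0]].append(row[1:])
    return {
        row[0]: {"core": row[1:], "extension": by_id.get(row[0], [])}
        for row in core_rows
    }


core = read_tab_rows(
    "102\tAster alpinus subsp. parviceps\n"
    "103\tPolygala vulgaris\n"
)
identifications = read_tab_rows(
    "102\t2003-05-04\tKarl Marx\tAster alpinus L.\n"
    "102\t2007-01-11\tMark Twain\tAster korshinskyi Tamamsch.\n"
    "103\t2001-02-21\tSteve Bekow\tPolygala vulgaris L.\n"
)
records = attach_extension(core, identifications)
```

Because every file shares the same key column, each extension can be attached independently, and an extension a consumer does not understand can simply be ignored.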
>>> On 14 May, 2008, at 24:39, Greg Whitbread wrote:
>>>> We have used a very similar protocol to assemble the latest AVH 
>>>> cache.
>>>> It should be noted that this is an as-well-as protocol that only 
>>>> works because we have an established semantic standard (hispid/ 
>>>> abcd).
>>>> greg
>>>> trobertson at gbif.org wrote:
>>>>> Hi All,
>>>>> This is very interesting to me, as I came up with the same 
>>>>> conclusion while harvesting for GBIF.
>>>>> As a "harvester of all records" it is best described with an
>>>>> example:
>>>>> - Complete Inventory of ScientificNames: 7 minutes @ the limited 
>>>>> 200 records per page
>>>>> - Complete Harvesting of records:
>>>>> - 260,000 records
>>>>> - 9 hours harvesting duration
>>>>> - 500MB TAPIR+DwC XML returned (DwC 1.4 with geospatial and 
>>>>> curatorial extensions)
>>>>> - Extraction of DwC records from harvested XML: <2 minutes
>>>>> - Resulting file size 32MB, Gzipped to <3MB
>>>>> I spun hard drives for 9 hours, and took up bandwidth that is paid 
>>>>> for, to retrieve something that could have been generated provider 
>>>>> side in minutes and transferred in seconds (3MB).
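The figures above can be turned into a quick back-of-the-envelope calculation. The sketch below assumes the 200-records-per-page limit quoted for the inventory also applied to the record harvest, which the message does not state, and it treats "<3MB" as exactly 3 MB, so the size ratio is a lower bound.

```python
import math

# Figures quoted in the message.
records = 260_000
harvest_hours = 9
xml_mb = 500
gzipped_mb = 3  # quoted as "<3MB", so the ratio below is a lower bound

# Assumption: the 200-records-per-page limit also applied to the harvest.
page_size = 200

requests = math.ceil(records / page_size)
seconds_per_request = harvest_hours * 3600 / requests
size_ratio = xml_mb / gzipped_mb

print(f"{requests} paged requests, about {seconds_per_request:.1f} s each")
print(f"XML is at least {size_ratio:.0f} times larger than the gzipped file")
```

Under that assumption the harvest works out to 1,300 paged requests at roughly 25 seconds each, and the 500 MB to 3 MB comparison is where the "166 times" figure quoted elsewhere in the thread comes from.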
>>>>> I sent a proposal to TDWG last year termed "datamaps" which was 
>>>>> effectively what you are describing, and I based it on the 
>>>>> Sitemaps protocol, but I got nowhere with it.  With Markus, we are 
>>>>> making more progress and I have spoken with several GBIF data 
>>>>> providers about a proposed new standard for full dataset 
>>>>> harvesting and it has been received well.  So Markus and I have 
>>>>> started a new proposal and have a working name of 'Localised DwC 
>>>>> Index' file generation (it is an index if you have more than DwC 
>>>>> data, and DwC is still standards compliant) which is really a
>>>>> GZipped Tab file dump of the data, which is slightly extensible.
>>>>> The document is not ready to circulate yet but the benefits section 
>>>>> reads currently:
>>>>> - Provider database load reduced, allowing it to serve real 
>>>>> distributed queries rather than "full datasource" harvesters
>>>>> - Providers can choose to publish their index as it suits them, 
>>>>> giving control back to the provider
>>>>> - Localised index generation can be built into tools not yet 
>>>>> capable of integrating with TDWG protocol networks such as GBIF
>>>>> - Harvesters receive a full dataset view in one request, making it 
>>>>> very easy to determine what records are eligible for deletion
>>>>> - It becomes very simple to write clients that consume entire 
>>>>> datasets.
>>>>> E.g. data cleansing tools that the provider can run:
>>>>> -  Give me ISO Country Codes for my dataset
>>>>>  -  The application pulls down the provider's index file, generates 
>>>>> ISO country code, returns a simple table using the provider's own 
>>>>> identifier
>>>>> - Check my names for spelling mistakes
>>>>> - Application skims over the records and provides a list that are 
>>>>> not known to the application
>>>>> - Providers such as UK NBN cannot serve 20 million records to the 
>>>>> GBIF index using the existing protocols efficiently.
>>>>> - They have the ability to generate a localised index however
>>>>> - Harvesters can very quickly build up searchable indexes and it 
>>>>> is easy to create large indices.
>>>>> - Node Portal can easily aggregate index data files
>>>>> - true index to data, not an illusion of a cache. More like Google 
>>>>> sitemaps
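The point in the list above about deletions follows directly from having the full dataset in one file: anything the harvester holds that is absent from the new index has been removed at the source. A minimal sketch, assuming record identifiers are stable between dumps; the function name and the sample ids are hypothetical.

```python
def diff_ids(cached_ids, index_ids):
    """Compare previously harvested ids with the ids in a fresh full index.

    Returns ids that disappeared (eligible for deletion) and ids that are new.
    """
    cached, current = set(cached_ids), set(index_ids)
    return {
        "deleted": sorted(cached - current),
        "added": sorted(current - cached),
    }


changes = diff_ids(["102", "103", "104"], ["102", "104", "105"])
```

Detecting modified records would additionally need a per-record comparison or a modification date, which this sketch does not attempt.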
>>>>> It is the ease with which one can offer tools to data providers that 
>>>>> really interests me.  The technical threshold required to produce 
>>>>> services that offer reporting tools on people's data is really very 
>>>>> low with this mechanism.  This and the fact that large datasets 
>>>>> will be harvestable - we have even considered the likes of 
>>>>> bit-torrent for the large ones although I think this is overkill.
>>>>> As a consumer therefore I fully support this move as a valuable 
>>>>> addition to the wrapper tools.
>>>>> Cheers
>>>>> Tim
>>>>> (wrote the GBIF harvesting, and new to this list)
>>>>>> Begin forwarded message:
>>>>>>> From: "Aaron D. Steele" <eightysteele at gmail.com>
>>>>>>> Date: 13 de mayo de 2008 22:40:09 GMT+02:00
>>>>>>> To: tdwg-tapir at lists.tdwg.org
>>>>>>> Cc: Aaron Steele <asteele at berkeley.edu>
>>>>>>> Subject: Re: [tdwg-tapir] Tapir protocol - Harvest methods?
>>>>>>> at berkeley we've recently prototyped a simple php program that 
>>>>>>> uses an existing tapirlink installation to periodically dump 
>>>>>>> tapir resources into a csv file. the solution is totally generic 
>>>>>>> and can dump darwin core (and technically abcd schema, although 
>>>>>>> it's currently untested). the resulting csv files are zip 
>>>>>>> archived and made accessible using a web service. it's a simple 
>>>>>>> approach that has proven to be, at least internally, quite 
>>>>>>> reliable and useful.
>>>>>>> for example, several of our caching applications use the web 
>>>>>>> service to harvest csv data from tapirlink resources using the 
>>>>>>> following process:
>>>>>>> 1) download latest csv dump for a resource using the web 
>>>>>>> service.
>>>>>>> 2) flush all locally cached records for the resource.
>>>>>>> 3) bulk load the latest csv data into the cache.
>>>>>>> in this way, cached data are always synchronized with the 
>>>>>>> resource and there's no need to track new, deleted, or changed 
>>>>>>> records. as an aside, each time these cached data are queried by 
>>>>>>> the caching application or selected in the user interface, 
>>>>>>> log-only search requests are sent back to the resource.
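Steps 2 and 3 of the process above (flush, then bulk load) can be sketched with SQLite as a stand-in cache. This is not the Berkeley implementation: the table layout, column names and function name are assumptions, and the download in step 1 is left out, with the CSV passed in as text.

```python
import csv
import io
import sqlite3


def reload_resource(conn, resource, csv_text):
    """Replace all cached rows for one resource with the rows of a fresh dump."""
    rows = [
        (resource, row[0], row[1])
        for row in csv.reader(io.StringIO(csv_text))
        if row
    ]
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute("DELETE FROM cache WHERE resource = ?", (resource,))
        conn.executemany(
            "INSERT INTO cache (resource, record_id, scientific_name) "
            "VALUES (?, ?, ?)",
            rows,
        )
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cache (resource TEXT, record_id TEXT, scientific_name TEXT)"
)
reload_resource(conn, "demo", "102,Aster alpinus\n103,Polygala vulgaris\n")
# A later dump no longer contains record 102, so it vanishes from the cache.
reload_resource(conn, "demo", "103,Polygala vulgaris\n")
```

Doing the flush and the load in one transaction means readers never see an empty cache, and a failed load leaves the previous data in place.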
>>>>>>> after discussion with renato giovanni and john wieczorek, we've 
>>>>>>> decided that merging this functionality into the tapirlink 
>>>>>>> codebase would benefit the broader community. csv generation 
>>>>>>> support would be declared through capabilities. although 
>>>>>>> incremental harvesting wouldn't be immediately implemented, we 
>>>>>>> could certainly extend the service to include it later.
>>>>>>> i'd like to pause here to gauge the consensus, thoughts, 
>>>>>>> concerns, and ideas of others. anyone?
>>>>>>> thanks,
>>>>>>> aaron
>>>>>>> 2008/5/5 Kevin Richards <RichardsK at landcareresearch.co.nz>:
>>>>>>>> I think I agree here.
>>>>>>>> The harvesting "procedure" is really defined outside the Tapir 
>>>>>>>> protocol, is it not?  So it is really an agreement between the 
>>>>>>>> harvester and the harvestees.
>>>>>>>> So what is really needed here is the standard procedure for 
>>>>>>>> maintaining a "harvestable" dataset and the standard procedure 
>>>>>>>> for harvesting that dataset.
>>>>>>>> We have a general rule at Landcare, that we never delete 
>>>>>>>> records in our datasets - they are either deprecated in favour 
>>>>>>>> of another record, and so the resolution of that record would 
>>>>>>>> point to the new record, or they are set to a state of 
>>>>>>>> "deleted", but are still kept in the dataset, and can be 
>>>>>>>> resolved (which would indicate a state of deleted).
>>>>>>>> Kevin
>>>>>>>>>>> "Renato De Giovanni" <renato at cria.org.br> 6/05/2008 7:33 a.m.
>>>>>>>> Hi Markus,
>>>>>>>> I would suggest creating new concepts for incremental
>>>>>>>> harvesting, either in the data standards themselves or in some
>>>>>>>> new extension.
>>>>>>>> In the case of TAPIR, GBIF could easily check the mapped
>>>>>>>> concepts before deciding between incremental or full harvesting.
>>>>>>>> Actually it could be just one new concept such as "recordStatus"
>>>>>>>> or "deletionFlag". Or perhaps you could also want to create your
>>>>>>>> own definition for dateLastModified indicating which set of
>>>>>>>> concepts should be considered to see if something has changed or
>>>>>>>> not, but I guess this level of granularity would be difficult to
>>>>>>>> support.
>>>>>>>> Regards,
>>>>>>>> --
>>>>>>>> Renato
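As an illustration of how a harvester might use such concepts, the sketch below filters on dateLastModified and honours a recordStatus flag. The concept names come from the message above; the "id" key, the "deleted" value, the dictionary-based records and both function names are assumptions made for the example.

```python
from datetime import date


def changed_since(records, last_harvest):
    """Select records modified after the previous harvest date."""
    return [r for r in records if r["dateLastModified"] > last_harvest]


def apply_incremental(cache, changed_records):
    """Apply changed records to a cache keyed by record id.

    Records flagged as deleted are dropped from the cache; all other
    records are inserted or overwritten.
    """
    for record in changed_records:
        if record.get("recordStatus") == "deleted":
            cache.pop(record["id"], None)
        else:
            cache[record["id"]] = record
    return cache


cache = {
    "102": {"id": "102", "dateLastModified": date(2008, 1, 10)},
    "103": {"id": "103", "dateLastModified": date(2008, 1, 10)},
}
provider_records = [
    {"id": "102", "dateLastModified": date(2008, 1, 10)},
    {"id": "103", "dateLastModified": date(2008, 5, 1), "recordStatus": "deleted"},
    {"id": "104", "dateLastModified": date(2008, 5, 2)},
]
changed = changed_since(provider_records, date(2008, 4, 1))
apply_incremental(cache, changed)
```

This only works if the provider keeps deleted records visible with a flag rather than removing them, since otherwise the harvester never sees them in the changed set.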
>>>>>>>> On 5 May 2008 at 11:24, Markus Döring wrote:
>>>>>>>>> Phil,
>>>>>>>>> incremental harvesting is not implemented on the GBIF side as
>>>>>>>>> far as I am aware. And I don't think that will be a simple thing
>>>>>>>>> to implement on the current system. Also, even if we can detect
>>>>>>>>> only the changed records since the last harvesting via
>>>>>>>>> dateLastModified we still have no information about deletions.
>>>>>>>>> We could have an arrangement saying that you keep deleted
>>>>>>>>> records as empty records with just the ID and nothing else (I
>>>>>>>>> vaguely remember LSIDs were supposed to work like this too).
>>>>>>>>> But that also needs to be supported on your side then, never
>>>>>>>>> entirely removing any record. I will have a discussion with the
>>>>>>>>> others at GBIF about that.
>>>>>>>>> Markus
>>>>>>>> _______________________________________________
>>>>>>>> tdwg-tapir mailing list
>>>>>>>> tdwg-tapir at lists.tdwg.org
>>>>>>>> http://lists.tdwg.org/mailman/listinfo/tdwg-tapir
>>>> --
>>>> Australian Centre for Plant BIodiversity
>>>> Research<------------------+
>>>> National            greg whitBread             voice: +61 2 62509
>>>> 482
>>>> Botanic Integrated Botanical Information System  fax: +61 2 62509
>>>> 599
>>>> Gardens                      S........ I.T. happens..
>>>> ghw at anbg.gov.au
>>>> +----------------------------------------->GPO Box 1777 Canberra
>>>> 2601
>>>> ------
