[tdwg-tapir] Fwd: Tapir protocol - Harvest methods?

Markus Döring mdoering at gbif.org
Thu May 15 01:00:12 CEST 2008


I agree with Tim that it would be better to keep this proposal/
specification separate from TAPIR. That said, the feature could still
be indicated in the TAPIR capabilities response. But an important
reason to have these files is to get more providers on board, so they
should also be able to implement this without the TAPIR overhead.

A separate metafile would also hold the timestamp of the file's last
generation, so keeping it separate has additional advantages.
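
For illustration only, such a metafile could be as simple as the
following (all element and attribute names are invented here; no
schema for this exists yet):

  <?xml version="1.0" encoding="UTF-8"?>
  <!-- hypothetical metafile describing one dump file -->
  <dumpfile generated="2008-05-14T21:00:00Z"
            location="http://example.org/dumps/occurrence.txt.gz"
            format="text/tab-separated-values"
            compression="gzip">
    <!-- column order of the data file; headers live here, not in the dump -->
    <field index="1" concept="http://rs.tdwg.org/dwc/dwcore/GlobalUniqueIdentifier"/>
    <field index="2" concept="http://rs.tdwg.org/dwc/dwcore/ScientificName"/>
  </dumpfile>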

Markus



On 14 May, 2008, at 21:44, trobertson at gbif.org wrote:

> Hi Renato,
>
> Do you think this should really go under the TAPIR spec?
>
> Sure, we want the wrappers to produce it, but it's just a document at
> a URL, and it can be described in such a simple way that loads of
> other people could incorporate it without getting into the TAPIR
> specs; nor could they claim any TAPIR compliance just because they
> can do a 'select to outfile'.
>
> I would also request that the headers are not in the data file but in
> the metafile. It is way easier to dump a big DB to this 'document
> standard' without needing to worry about how to get headers into a
> 20 GB file.
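>
> For example, a plain MySQL 'select into outfile' (table and column
> names invented for the sketch) already produces exactly such a
> header-less data file:
>
>   -- dump the mapped columns as a tab-delimited file, no header row
>   SELECT guid, scientific_name, locality
>     INTO OUTFILE '/tmp/occurrence.txt'
>     FIELDS TERMINATED BY '\t'
>     LINES TERMINATED BY '\n'
>   FROM occurrence;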
>
> Just some more thoughts
>
> Cheers
>
> Tim
>
>
>
>> I agree with Markus about using a simple data format. Relational
>> database dumps would require standard database structures or would
>> expose specific things that are already encapsulated by abstraction
>> layers (conceptual schemas).
>>
>> I'm not sure about the best way to represent complex data structures
>> like ABCD, but for simpler providers such as TapirLink/DwC, the idea
>> was to create a new script responsible for dumping all mapped
>> concepts of a specific data source into a single file. Providers
>> could periodically call this script from a cron job to regenerate
>> the dump. The first line in the dump file would indicate the concept
>> identifiers (GUIDs) associated with each column, to make it a
>> generic solution (and more compatible with existing applications).
>> Content could be tab-delimited and compressed at the end.
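>>
>> A rough sketch of what that script could do (Python here purely for
>> illustration - TapirLink itself is PHP - and all names and the query
>> are invented):
>>
>>   # hypothetical dump script: header row of concept GUIDs, then
>>   # tab-delimited rows, gzip-compressed at the end
>>   import csv, gzip, sqlite3  # sqlite3 stands in for the real datasource
>>
>>   concepts = [
>>       "http://rs.tdwg.org/dwc/dwcore/GlobalUniqueIdentifier",
>>       "http://rs.tdwg.org/dwc/dwcore/ScientificName",
>>   ]
>>
>>   conn = sqlite3.connect("datasource.db")
>>   with gzip.open("dump.txt.gz", "wt", newline="") as out:
>>       writer = csv.writer(out, delimiter="\t")
>>       writer.writerow(concepts)  # first line: GUIDs of mapped concepts
>>       for row in conn.execute("SELECT guid, sciname FROM occurrence"):
>>           writer.writerow(row)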
>>
>> Harvesters could use this "seed" file for the initial data import,
>> and then potentially use incremental harvesting to update the cache.
>> But in this case it would be necessary to know when the dump file
>> was generated.
>>
>> To use the existing TAPIR infrastructure, we would also need to know
>> which providers support the dump files. Aaron's idea, when he first
>> discussed it with me, was to use a new custom operation. This makes
>> sense to me, but it would require a small change in the protocol to
>> add a custom slot in the operations section of capabilities
>> responses. Curiously, this approach would allow the existence of
>> TAPIR "static providers" - the simplest possible category, even
>> simpler than TapirLite. They would not support inventories, searches
>> or query templates, but would make the dump file available through
>> the new custom operation. Metadata, capabilities and ping could be
>> just static files served by a very simple script.
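>>
>> As a sketch only, the capabilities response could then carry
>> something like this (the <custom> slot does not exist in the current
>> schema - that is exactly the change being discussed - and the names
>> are invented):
>>
>>   <operations>
>>     <!-- existing slots: search, inventory, ... -->
>>     <custom>
>>       <!-- hypothetical: advertises the dump file and where to get it -->
>>       <operation name="dump"
>>                  accessPoint="http://example.org/dumps/occurrence.txt.gz"/>
>>     </custom>
>>   </operations>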
>>
>> If this approach makes sense, I think these are the points that
>> still need to be addressed:
>>
>> 1) Decide how to indicate the timestamp associated with the dump
>> file.
>> 2) Change the TAPIR schema (or figure out another solution to
>> advertise the new capability, but always remembering that in the
>> TAPIR context a single provider instance can host multiple data
>> sources that are usually distinguished by a query parameter in the
>> URL, so I'm not sure how a sitemaps approach could be used).
>> 3) Decide how to represent complex data such as ABCD (if using
>> multiple files, I would suggest compressing them together and
>> serving them as a single file - see the sketch below).
>> 4) Write a short specification describing the new custom operation
>> and the data format.
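>>
>> For point 3, the packaging could be as trivial as this (a minimal
>> sketch, with all file names invented):
>>
>>   # hypothetical packaging step: core + extension files in one archive
>>   import zipfile
>>
>>   with zipfile.ZipFile("abcd-dump.zip", "w", zipfile.ZIP_DEFLATED) as z:
>>       z.write("core.txt")             # one row per unit
>>       z.write("identifications.txt")  # extension rows, GUID in column 1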
>>
>> I'm happy to change the schema if there's consensus about this.
>>
>> Best Regards,
>> --
>> Renato
>>
>>
>>> It would keep the relations, but we don't really want any
>>> relational structure to be served up. And using sqlite binaries for
>>> the DwC star schema would not be easier to work with than plain
>>> text files. Text files can even be loaded into Excel straight away,
>>> can be versioned with svn, and so on. If there is a geospatial
>>> extension file which has the GUID in the first column, applications
>>> might grab that directly and not even touch the central core file
>>> if they only want location data.
>>>
>>> I'd prefer to stick with a CSV or tab-delimited file. The simpler
>>> the better. And it also can't get corrupted as easily.
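>>>
>>> To illustrate the extension-file point above (a minimal sketch; the
>>> file name and columns are invented):
>>>
>>>   # hypothetical consumer that reads only the geospatial extension
>>>   # file, never touching the core file
>>>   import csv
>>>
>>>   with open("geospatial.txt", newline="") as f:
>>>       reader = csv.reader(f, delimiter="\t")
>>>       next(reader)                   # skip the header row of GUIDs
>>>       for guid, lat, lon in reader:  # GUID links back to the core file
>>>           print(guid, lat, lon)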
>>>
>>> Markus
>>>
>>>
>>>
>>> On 14 May, 2008, at 15:25, Aaron D. Steele wrote:
>>>
>>>> For preserving relational data, we could also just dump TapirLink
>>>> resources to an SQLite database file (http://www.sqlite.org), zip
>>>> it up, and again make it available via the web service. We use
>>>> SQLite internally for many projects, and it's both easy to use and
>>>> well supported by JDBC, PHP, Python, etc.
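>>>>
>>>> A minimal sketch of that idea (all table, column and file names
>>>> invented):
>>>>
>>>>   # hypothetical: write resource rows into an SQLite file, then
>>>>   # zip the result for download
>>>>   import sqlite3, zipfile
>>>>
>>>>   db = sqlite3.connect("resource.sqlite")
>>>>   db.execute("CREATE TABLE occurrence (guid TEXT PRIMARY KEY, name TEXT)")
>>>>   db.executemany("INSERT INTO occurrence VALUES (?, ?)",
>>>>                  [("urn:uuid:1", "Puma concolor")])  # rows from the mapping
>>>>   db.commit()
>>>>   db.close()
>>>>
>>>>   with zipfile.ZipFile("resource.sqlite.zip", "w", zipfile.ZIP_DEFLATED) as z:
>>>>       z.write("resource.sqlite")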
>>>>
>>>> Would something like this be a useful option?
>>>>
>>>> thanks,
>>>> aaron



