[tdwg-tapir] Fwd: Tapir protocol - Harvest methods?

trobertson at gbif.org
Wed May 14 21:44:57 CEST 2008


Hi Renato,

Do you think this really belongs under the TAPIR spec?

Sure, we want the wrappers to produce it, but it's just a document on a URL
and can be described in such a simple way that loads of other people could
incorporate it without getting into the TAPIR specs; nor should anyone be
able to claim TAPIR compliance just because they can do a 'select to outfile'.

I would also request that the headers live in the metafile rather than in
the data file.  It is far easier to dump a big DB to this 'document
standard' without needing to worry about how to get headers into a 20 GB
file.
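
To make that concrete, the metafile could be as small as this (a purely
illustrative sketch; the key names are invented and the concept URIs are
examples, not a proposal):

    # dump.meta -- describes occurrences.txt.gz
    file=occurrences.txt.gz
    generated=2008-05-14T12:00:00Z
    delimiter=TAB
    column.1=http://rs.tdwg.org/dwc/dwcore/GlobalUniqueIdentifier
    column.2=http://rs.tdwg.org/dwc/dwcore/ScientificName
    column.3=http://rs.tdwg.org/dwc/geospatial/DecimalLatitude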

Just some more thoughts

Cheers

Tim



> I agree with Markus about using a simple data format. Relational database
> dumps would require standard database structures or would expose specific
> things that are already encapsulated by abstraction layers (conceptual
> schemas).
>
> I'm not sure about the best way to represent complex data structures like
> ABCD, but for simpler providers such as TapirLink/DwC, the idea was to
> create a new script responsible for dumping all mapped concepts of a
> specific data source into a single file. Providers could periodically call
> this script from a cron job to regenerate the dump. The first line in the
> dump file would indicate the concept identifiers (GUIDs) associated with
> each column to make it a generic solution (and more compatible with
> existing applications). Content could be tab-delimited, with the whole
> file compressed at the end.
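>
> A minimal sketch of what such a dump script could do (Python just for
> brevity, since TapirLink itself is PHP; all names here are illustrative):
>
>     import csv
>     import gzip
>
>     # Mapped concepts for one data source: concept GUID -> source column.
>     # These particular mappings are hypothetical examples.
>     CONCEPTS = {
>         "http://rs.tdwg.org/dwc/dwcore/GlobalUniqueIdentifier": "guid",
>         "http://rs.tdwg.org/dwc/dwcore/ScientificName": "sci_name",
>     }
>
>     def dump(rows, path="dump.txt.gz"):
>         """Write all records tab-delimited and gzip-compressed; the
>         first line lists the concept GUIDs, one per column."""
>         with gzip.open(path, "wt", newline="") as f:
>             out = csv.writer(f, delimiter="\t")
>             out.writerow(CONCEPTS.keys())  # header line of GUIDs
>             for row in rows:  # each row: a dict keyed by source column
>                 out.writerow(row[col] for col in CONCEPTS.values())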
>
> Harvesters could use this "seed" file for the initial data import, and
> then potentially use incremental harvesting to update the cache. But in
> this case it would be necessary to know when the dump file was generated.
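>
> A sketch of the harvester's side of that decision (illustrative only;
> 'dump_generated' stands for whatever timestamp mechanism we end up
> choosing, see point 1 below):
>
>     def plan_harvest(last_harvest, dump_generated):
>         # First contact: load the seed file, then harvest incrementally
>         # from the moment the dump was generated, so changes made
>         # between dump generation and first harvest are not missed.
>         if last_harvest is None:
>             return ("seed", dump_generated)
>         # Otherwise just update the cache incrementally.
>         return ("incremental", last_harvest)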
>
> To use the existing TAPIR infrastructure, we would also need to know which
> providers support the dump files. Aaron's idea, when he first discussed
> it with me, was to use a new custom operation. This makes sense to me, but
> would require a small change in the protocol to add a custom slot in the
> operations section of capabilities responses. Curiously, this approach
> would allow the existence of TAPIR "static providers" - the simplest
> possible category, even simpler than TapirLite. They would not support
> inventories, searches or query templates, but would make the dump file
> available through the new custom operation. Metadata, capabilities and
> ping could be just static files served by a very simple script.
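>
> Just to illustrate the idea, the capabilities response could advertise
> the dump with something like this (element and attribute names are
> invented here; the actual ones would be defined by the schema change):
>
>     <operations>
>       ...
>       <custom>
>         <operation name="dumpfile"
>                    location="http://example.net/tapir.php/mydata?op=dump"/>
>       </custom>
>     </operations>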
>
> If this approach makes sense, I think these are the points that still need
> to be addressed:
>
> 1) Decide about how to indicate the timestamp associated with the dump
> file.
> 2) Change the TAPIR schema (or figure out another solution to advertise
> the new capability, but always remembering that in the TAPIR context a
> single provider instance can host multiple data sources that are usually
> distinguished by a query parameter in the URL, so I'm not sure how a
> sitemaps approach could be used).
> 3) Decide about how to represent complex data such as ABCD (if using
> multiple files, I would suggest compressing them together and serving
> them as a single file; see the sketch after this list).
> 4) Write a short specification to describe the new custom operation and
> the data format.
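>
> For point 3, packing the files together could be as simple as this
> (a sketch; the file names are invented):
>
>     import zipfile
>
>     # Combine a multi-file dump (e.g. ABCD content split across
>     # several files) into a single downloadable archive.
>     with zipfile.ZipFile("dump.zip", "w", zipfile.ZIP_DEFLATED) as z:
>         for name in ("units.txt", "identifications.txt", "gathering.txt"):
>             z.write(name)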
>
> I'm happy to change the schema if there's consensus about this.
>
> Best Regards,
> --
> Renato
>
>
>> It would keep the relations, but we don't really want any relational
>> structure to be served up.
>> And using sqlite binaries for the DwC star schema would not be easier
>> to work with than plain text files. They can even be loaded into Excel
>> straight away, versioned with svn, and so on. If there is a
>> geospatial extension file which has the GUID in the first column,
>> applications might grab that directly and not even touch the central
>> core file if they only want location data.
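>>
>> For example, an application wanting only coordinates could get away
>> with this (a sketch; the file name and column layout are assumed):
>>
>>     import csv
>>
>>     # Read just the geospatial extension file; the record GUID is in
>>     # the first column, so the central core file is never opened.
>>     with open("geospatial_extension.txt", newline="") as f:
>>         for row in csv.reader(f, delimiter="\t"):
>>             guid, lat, lon = row[0], row[1], row[2]
>>             print(guid, lat, lon)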
>>
>> I'd prefer to stick with a CSV or tab-delimited file.
>> The simpler the better. And it also can't get corrupted as easily.
>>
>> Markus
>>
>>
>>
>> On 14 May, 2008, at 15:25, Aaron D. Steele wrote:
>>
>>> For preserving relational data, we could also just dump TapirLink
>>> resources to an SQLite database file (http://www.sqlite.org), zip it
>>> up, and again make it available via the web service. We use SQLite
>>> internally for many projects, and it's both easy to use and well
>>> supported by JDBC, PHP, Python, etc.
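>>>
>>> For example, in Python the whole round trip is only a few lines
>>> (a sketch; the table layout is invented):
>>>
>>>     import sqlite3
>>>     import zipfile
>>>
>>>     # Dump one resource to an SQLite file, then zip it for download.
>>>     con = sqlite3.connect("resource.db")
>>>     con.execute("CREATE TABLE IF NOT EXISTS occurrence"
>>>                 " (guid TEXT, sci_name TEXT)")
>>>     con.execute("INSERT INTO occurrence VALUES (?, ?)",
>>>                 ("urn:example:1", "Puma concolor"))
>>>     con.commit()
>>>     con.close()
>>>
>>>     with zipfile.ZipFile("resource.zip", "w", zipfile.ZIP_DEFLATED) as z:
>>>         z.write("resource.db")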
>>>
>>> Would something like this be a useful option?
>>>
>>> Thanks,
>>> Aaron
>
>
> _______________________________________________
> tdwg-tapir mailing list
> tdwg-tapir at lists.tdwg.org
> http://lists.tdwg.org/mailman/listinfo/tdwg-tapir
>
>




