Re: [tdwg-tapir] Fwd: Tapir protocol - Harvest methods?

14 May 2008

Hi Renato,

Do you think this should really go under the TAPIR spec?

Sure, we want the wrappers to produce it, but it's just a document at a URL
and can be described so simply that loads of other people could
incorporate it without getting into the TAPIR specs. Nor could they claim
any TAPIR compliance just because they can do a 'select to outfile'.

I would also request that the headers go in the metafile rather than in
the data file.  It is way easier to dump a big DB to this 'document
standard' without needing to worry about how to get headers into a 20 GB
file.
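
As a rough sketch of what that split could look like: the data file stays a
plain headerless dump, and a small metafile lists the columns in order. The
JSON layout, the key names and both function names are assumptions made for
illustration only; nothing here has been agreed on the list.

```python
import json

def write_metafile(meta_path, columns, data_file):
    """Describe a headerless dump: the column order lives here, not in the data."""
    # Hypothetical metafile layout (JSON with these three keys).
    meta = {"data_file": data_file, "delimiter": "\t", "columns": list(columns)}
    with open(meta_path, "w", encoding="utf-8") as fh:
        json.dump(meta, fh, indent=2)
    return meta

def read_rows(meta_path, data_path):
    """Yield each data row as a dict keyed by the column names in the metafile."""
    with open(meta_path, encoding="utf-8") as fh:
        meta = json.load(fh)
    with open(data_path, encoding="utf-8") as fh:
        for line in fh:
            values = line.rstrip("\r\n").split(meta["delimiter"])
            yield dict(zip(meta["columns"], values))
```

The database only ever has to write the data file, so a plain
'select to outfile' is enough on the provider side.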

Just some more thoughts

Cheers

Tim
...
I agree with Markus about using a simple data format. Relational database
dumps would require standard database structures or would expose specific
things that are already encapsulated by abstraction layers (conceptual
schemas).
I'm not sure about the best way to represent complex data structures like
ABCD, but for simpler providers such as TapirLink/Dwc, the idea was to
create a new script responsible for dumping all mapped concepts of a
specific data source into a single file. Providers could periodically call
this script from a cron job to regenerate the dump. The first line in the
dump file would indicate the concept identifiers (GUIDs) associated with
each column to make it a generic solution (and more compatible with
existing applications). Content could be tab-delimited and in the end
compressed.
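
A minimal sketch of such a dump script, assuming the mapped records are
already available as rows of values. The function name and the example.org
concept identifiers are placeholders, not real TapirLink code or real
concept GUIDs.

```python
import csv
import gzip

def write_dump(path, concept_ids, rows):
    """Write a gzip-compressed, tab-delimited dump file.

    The first line carries the concept identifier for each column;
    every following line is one record. Returns the number of records.
    """
    count = 0
    with gzip.open(path, "wt", encoding="utf-8", newline="") as fh:
        writer = csv.writer(fh, delimiter="\t", lineterminator="\n")
        writer.writerow(concept_ids)
        for row in rows:
            writer.writerow(row)
            count += 1
    return count

if __name__ == "__main__":
    # Placeholder identifiers and records, for illustration only.
    ids = [
        "http://example.org/concepts/GlobalUniqueIdentifier",
        "http://example.org/concepts/ScientificName",
    ]
    records = [["urn:x:1", "Puma concolor"], ["urn:x:2", "Tapirus terrestris"]]
    write_dump("dump.txt.gz", ids, records)
```

A cron job could then call the script at whatever interval suits the
provider.
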
Harvesters could use this "seed" file for the initial data import, and
then potentially use incremental harvesting to update the cache. But in
this case it would be necessary to know when the dump file was generated.
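
To illustrate why the timestamp matters, here is a small sketch of the
decision a harvester might make. The function name, the step labels and the
fallback to a full harvest when no timestamp is known are all assumptions
for illustration.

```python
from datetime import datetime, timezone

def plan_harvest(dump_generated_at, last_harvested_at=None):
    """Return the harvester's steps as a list of (action, since) tuples.

    dump_generated_at: when the provider wrote the dump, or None if unknown.
    last_harvested_at: when the harvester last updated its cache, or None
                       if the cache is still empty.
    """
    if last_harvested_at is not None:
        # Cache already exists: only ask for records changed since the last run.
        return [("incremental", last_harvested_at)]
    if dump_generated_at is None:
        # Without a timestamp we cannot tell which later changes the dump
        # missed, so the seed cannot be safely combined with incremental updates.
        return [("full", None)]
    # Import the dump, then fetch everything modified after it was generated.
    return [("seed", None), ("incremental", dump_generated_at)]

generated = datetime(2008, 5, 14, 12, 0, tzinfo=timezone.utc)
steps = plan_harvest(generated)
```
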
To use the existing TAPIR infrastructure, we would also need to know which
providers support the dump files. Aaron's idea, when he first discussed it
with me, was to use a new custom operation. This makes sense to me, but
would require a small change in the protocol to add a custom slot in the
operations section of capabilities responses. Curiously, this approach
would allow the existence of TAPIR "static providers" - the simplest
possible category, even simpler than TapirLite. They would not support
inventories, searches or query templates, but would make the dump file
available through the new custom operation. Metadata, capabilities and
ping could be just static files served by a very simple script.
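
A very rough sketch of what such a script might do is below. It assumes the
operation is selected by an `op` query parameter and that a request without
one is answered with metadata. The `source` parameter, the `dump` operation
name and the file layout are hypothetical; the real names would have to come
from the specification mentioned in point 4.

```python
from urllib.parse import parse_qs

# Hypothetical layout: one directory per data source holding pre-generated files.
STATIC_FILES = {
    "ping": "ping.xml",
    "metadata": "metadata.xml",
    "capabilities": "capabilities.xml",
    "dump": "dump.txt.gz",  # hypothetical name for the new custom operation
}

def resolve_static_file(query_string, default_source="default"):
    """Map a request's query string to the static file that answers it.

    Returns a relative path such as 'birds/ping.xml', or None when the
    request asks for something a static provider does not offer
    (inventories, searches) or names an unsafe data source.
    """
    params = parse_qs(query_string)
    op = params.get("op", ["metadata"])[0]
    source = params.get("source", [default_source])[0]
    # Refuse anything that could escape the data directory.
    if not source.replace("_", "").replace("-", "").isalnum():
        return None
    filename = STATIC_FILES.get(op)
    if filename is None:
        return None
    return f"{source}/{filename}"
```
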
If this approach makes sense, I think these are the points that still need
to be addressed:
1) Decide how to indicate the timestamp associated with the dump
file.
2) Change the TAPIR schema (or figure out another solution to advertise
the new capability, but always remembering that in the TAPIR context a
single provider instance can host multiple data sources that are usually
distinguished by a query parameter in the URL, so I'm not sure how a
sitemaps approach could be used).
3) Decide how to represent complex data such as ABCD (if using
multiple files, I would suggest compressing them together and serving
them as a single file).
4) Write a short specification to describe the new custom operation and
the data format.
I'm happy to change the schema if there's consensus about this.
Best Regards,
--
Renato
...
it would keep the relations, but we don't really want any relational
structure to be served up.
And using sqlite binaries for the dwc star schema would not be easier
to work with than plain text files. Text files can even be loaded into
excel straight away, can be versioned with svn and so on. If there is a
geospatial extension file which has the GUID in the first column,
applications might grab that directly and not even touch the central
core file if they only want location data.
I'd prefer to stick with a csv or tab delimited file.
The simpler the better. And it also can't get corrupted as easily.
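
To make the extension-file point concrete, here is a small sketch of an
application reading only a geospatial extension file. It assumes a
tab-delimited file with a header line and the GUID in the first column; the
function name and the column names in the usage note are made up.

```python
import csv

def load_extension(path, delimiter="\t"):
    """Group the rows of an extension file by the GUID in the first column.

    Returns {guid: [ {column: value, ...}, ... ]}; a list is used because
    one core record may have several extension rows.
    """
    records = {}
    with open(path, encoding="utf-8", newline="") as fh:
        reader = csv.reader(fh, delimiter=delimiter)
        header = next(reader)  # assumed: first line names the columns
        for row in reader:
            if not row:
                continue  # skip blank lines
            guid = row[0]
            records.setdefault(guid, []).append(dict(zip(header[1:], row[1:])))
    return records
```

An application that only wants coordinates could call
`load_extension("geo.txt")` and never open the core file.
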
Markus
On 14 May, 2008, at 15:25, Aaron D. Steele wrote:
...
for preserving relational data, we could also just dump tapirlink
resources to an sqlite database file (http://www.sqlite.org), zip it
up, and again make it available via the web service. we use sqlite
internally for many projects, and it's both easy to use and well
supported by jdbc, php, python, etc.
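
a minimal sketch of that idea, assuming the resource's records are already
available as rows. the function name, table name and column names are
placeholders, not tapirlink code.

```python
import os
import sqlite3
import zipfile

def dump_to_sqlite_zip(db_path, zip_path, table, columns, rows):
    """Write rows into a new SQLite file, then zip that file.

    db_path must not already contain the table; table and column names
    are assumed to come from trusted configuration, not user input.
    """
    cols = ", ".join(f'"{c}"' for c in columns)
    marks = ", ".join("?" for _ in columns)
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(f'CREATE TABLE "{table}" ({cols})')
        conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', rows)
        conn.commit()
    finally:
        conn.close()
    # Store only the file name in the archive, not the full local path.
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.write(db_path, arcname=os.path.basename(db_path))
```
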
would something like this be a useful option?
thanks,
aaron

trobertson＠gbif.org