Re: [tdwg-tapir] Fwd: Tapir protocol - Harvest methods?

21 May 2008

it's my intuition that harvesting data in *different formats* is going
to become a dominant use case handled by data providers worldwide. for
example, some clients will want csv or star, while others will want
xml or sqlite. i'd like to explore adding a simple plug-in
architecture to tapirlink that, given a format plug-in (for example,
csv_plugin.php), creates a resource data dump in that format which can
be zip archived (along with any other metadata files required by the
format) and downloaded by clients. in this way, as new formats are
requested by the community, new format plug-ins can be added. it's a
simple approach that's scalable, improves interoperability with
clients, and avoids the need to agree on a single format to support.
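
to make the plug-in idea concrete, here is a minimal sketch in python (not tapirlink's php, and not existing tapirlink code). the registry FORMAT_PLUGINS, the format_plugin decorator, build_dump and the file names data.csv / data.xml are all hypothetical names chosen for illustration. each plug-in turns a list of records into one or more files, and a shared function zips whatever the plug-in returns:

```python
import csv
import io
import zipfile
import xml.etree.ElementTree as ET

# hypothetical registry: format name -> function returning {filename: bytes}
FORMAT_PLUGINS = {}

def format_plugin(name):
    """register the decorated function as the plug-in for one format."""
    def register(func):
        FORMAT_PLUGINS[name] = func
        return func
    return register

@format_plugin("csv")
def dump_csv(records):
    # one header row, then one row per record; columns sorted by field name
    fields = sorted({key for record in records for key in record})
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return {"data.csv": buf.getvalue().encode("utf-8")}

@format_plugin("xml")
def dump_xml(records):
    # assumes every field name is also a valid xml element name
    root = ET.Element("records")
    for record in records:
        element = ET.SubElement(root, "record")
        for key, value in sorted(record.items()):
            ET.SubElement(element, key).text = str(value)
    return {"data.xml": ET.tostring(root, encoding="utf-8")}

def build_dump(records, fmt):
    """return the bytes of a zip archive holding the dump in format fmt."""
    if fmt not in FORMAT_PLUGINS:
        raise ValueError("unsupported format: %s" % fmt)
    files = FORMAT_PLUGINS[fmt](records)
    archive = io.BytesIO()
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for filename, data in files.items():
            zf.writestr(filename, data)
    return archive.getvalue()
```

under these assumptions, supporting a new format means adding one decorated function; build_dump and the download path stay unchanged.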

i'd also like to explore using a new 'harvest' tapir operation to
facilitate harvest requests. for example:

tapir.php/myresource?op=harvest&format=csv&sbn=604800

the optional sbn parameter above stands for seconds before now. you
can interpret the above request as:

"i want to download a csv dump of myresource only if it has been
created within the last week (604,800 seconds)."
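
a rough python sketch of how a server might read this request and apply the sbn check. the function names parse_harvest_request and dump_is_acceptable are hypothetical, and what the server does when the dump is too old (regenerate it, or return an error) is deliberately left open, since that hasn't been decided:

```python
import time
from urllib.parse import parse_qs, urlsplit

def parse_harvest_request(url):
    """return (resource, format, sbn); sbn is None when the parameter is omitted."""
    parts = urlsplit(url)
    params = parse_qs(parts.query)
    if params.get("op", [""])[0] != "harvest":
        raise ValueError("not a harvest request")
    if "format" not in params:
        raise ValueError("missing format parameter")
    # the resource is the last path segment, e.g. tapir.php/myresource
    resource = parts.path.rstrip("/").rsplit("/", 1)[-1]
    sbn = int(params["sbn"][0]) if "sbn" in params else None
    return resource, params["format"][0], sbn

def dump_is_acceptable(dump_created_at, sbn, now=None):
    """true if the dump was created within the last sbn seconds.

    dump_created_at and now are unix timestamps; with no sbn any dump is fine.
    """
    if sbn is None:
        return True
    if now is None:
        now = time.time()
    return now - dump_created_at <= sbn
```

for the example request, parse_harvest_request returns ("myresource", "csv", 604800), and a dump created eight days ago would fail the check.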

this approach might be somewhat controversial since it involves
potential changes in the tapir protocol that not everyone agrees with.
on the other hand, after consulting with renato and john, i don't see
any harm in prototyping these new features, and giving the community
the opportunity to experiment with concrete harvesting functionality
before coming to a general consensus.

if you're keen on collaborating, i've created a new branch to
prototype these ideas in:
https://digir.svn.sourceforge.net/svnroot/digir/tapirlink/branches/harvest

thoughts? concerns?

thanks,
aaron

On Wed, May 21, 2008 at 11:16 AM, Renato De Giovanni <renato@cria.org.br> wrote:
...
Markus,

If we want to ensure the lowest possible barrier for providers, then
I think zipped csv files need to be supported. If we really want to
handle complex data using the same format, then we need something
like the csv star scheme you mentioned (with well-defined rules about
all files and how the records are related).

The limitation in this case is that we would only handle one-level
relationships (not a generic solution) and providers with complex
data would probably need to write some code to generate the dumps
(not sure how many providers would do it) - unless wrappers that can
handle complex data implement additional functionality to produce
these dumps.

On the other hand, if we allow more than one format, complex data
could be handled with compact XML representations (in a generic way)
which could be automatically produced by existing wrappers.

So my understanding is that the biggest decision is: Use a single
format (csv) with additional rules for complex data, or allow
different formats (one for simple and another for complex data).

Although I know it's usually much better for clients to deal with a
single format, my *feeling* in this case is that it would be more
effective to allow different formats. I'm also not sure if it would
be easier for clients to handle additional star scheme rules when
importing complex data than it would be to parse a single XML file
encoded in some compact structure.

Just some thoughts...

Best Regards,
--
Renato

On 20 May 2008 at 17:36, Markus Döring wrote:
...
Renato,
complex data can also be represented by tab files, with a file for
each extension that has a pointer in the first column.
That is what we originally had in mind with the star scheme.
Markus
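
As an illustration of the star scheme described above, here is a minimal Python sketch of how a client might join a core tab file with one extension file. It assumes, beyond what the message states, that the files have no header row and that the first column of the core file holds the record id that the extension's first column points to; the function names read_tab and attach_extension are hypothetical.

```python
import csv
import io
from collections import defaultdict

def read_tab(text):
    """Parse tab-delimited text into a list of rows (lists of strings)."""
    return list(csv.reader(io.StringIO(text), delimiter="\t"))

def attach_extension(core_rows, ext_rows):
    """Group extension rows under the core record their first column points to."""
    by_id = defaultdict(list)
    for row in ext_rows:
        by_id[row[0]].append(row[1:])
    return {
        row[0]: {"core": row[1:], "extension": by_id.get(row[0], [])}
        for row in core_rows
    }
```

Because each extension row points only at a core record, this handles one level of relationship, which matches the limitation Renato raises in his reply.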

Aaron D. Steele