Hi Guido,
Thanks again for your thoughts.
(as you already know, I'm copying this message to the mailing list)
Incremental harvesting does require some sort of "modified since" parameter used against the last harvesting date. Ideally you should also be able to get all deleted records in another step using a "deleted since" parameter. The way to achieve this in TAPIR is to define two concepts and then use them in filters if the providers have mapped those concepts. I think we shouldn't force providers to have the corresponding content, and networks should remain completely free to define their own data abstraction layers.
In your case, you don't need to embed such concepts in all query templates as filter conditions, unless each query template returns completely different things. You can try to define a single query template just for harvesting.
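To make the idea concrete, here is a minimal sketch of what a harvesting client could do against such a single query template. The template URL and the "date" parameter name are purely hypothetical examples, not part of the TAPIR specification; the point is only that the client passes its last harvest date to a template whose filter compares a provider-mapped "last modified" concept against it:

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

def build_harvest_url(endpoint, template, last_harvest):
    """Build an incremental-harvest request: one query template,
    parameterized by the last harvest date. The 'template' and
    'date' parameter names here are hypothetical."""
    params = {
        "op": "search",
        "template": template,
        "date": last_harvest.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    return endpoint + "?" + urlencode(params)

url = build_harvest_url(
    "http://example.org/tapir.php",
    "http://example.org/templates/harvest.xml",
    datetime(2007, 5, 1, tzinfo=timezone.utc),
)
# A second call with a "deleted since" template would cover removals.
```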
Regarding dump files, even if a provider is developed and configured to return a dump file behind certain service calls, there may be lots of other queries that can return almost all records. For this reason providers may still want to limit the number of records that can be returned and advertise this limitation through capabilities, so the situation can get a bit confusing for clients. I still think it makes more sense to simply allow providers to declare dump files separately. Please note that this would be an optional feature, so you're totally free to decide whether you want to implement it or not.
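From the client side, the decision logic would stay simple if dumps are declared separately: prefer a declared dump file for a full harvest, otherwise fall back to paging under the advertised record limit. A rough sketch, with entirely hypothetical key names for what a parsed capabilities response might contain:

```python
def plan_full_harvest(capabilities):
    """Decide how to fetch everything from a provider.
    'dump_url' and 'max_records' are hypothetical keys standing in
    for a separately declared dump file and the advertised limit
    on records per search response."""
    dump = capabilities.get("dump_url")
    if dump:
        # Provider declared a dump file: just download it.
        return ("download", dump)
    # Otherwise page through search responses under the limit.
    limit = capabilities.get("max_records", 100)
    return ("paged-search", limit)
```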
Regarding the dump format, I'm certainly happy to leave it open to any option that makes sense.
Thanks again, -- Renato
Hi Renato,
this whole thing sounds good to me, just two comments:
In order to facilitate incremental updates (which I am absolutely in favor of), each query requires some sort of timestamp parameter specifying the maximum age of the data to return, i.e. a "modified_since" parameter. This should then become an inherently permitted part of every query, without having to define it in each template.
Using a dump file should be up to the individual TAPIR providers, since it easily hides behind the web front-end. That file just needs to contain what querying the database would return anyway, as a sort of file-based cache. Returning the dump instead of querying the database can be done either completely inside the TAPIR provider (invisible to the client), or via a redirect (possibly one generated dynamically). So it should not be part of the specification, imho. If you decide, however, to include dumps in the capabilities, the format should definitely be customizable, not strictly bound to XML or CSV, because XML is overkill for some data, while CSV is too flat for other data. How about JSON, in addition? It's a nice combination of CSV's simplicity and XML's power to express hierarchical content. In the future, some might want RDF as a further format ... to be continued. In order not to unnecessarily hamper TAPIR's acceptance, it should really be up to the individual providers which format to use.
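To illustrate why CSV is "too flat" for some data while JSON is not, here is a small sketch. The record structure below (a specimen with a nested list of identifications) is an invented example, not any TAPIR-defined schema:

```python
import csv, json, io

# One record with a nested 'identifications' list (invented example):
record = {
    "id": "u-123",
    "scientific_name": "Puma concolor",
    "identifications": [
        {"name": "Felis concolor", "date": "1901"},
        {"name": "Puma concolor", "date": "1993"},
    ],
}

# JSON keeps the hierarchy intact in one self-describing value:
as_json = json.dumps(record)

# CSV has to flatten it: one row per identification,
# repeating the parent fields on every row.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["id", "scientific_name", "ident_name", "ident_date"])
for ident in record["identifications"]:
    writer.writerow([record["id"], record["scientific_name"],
                     ident["name"], ident["date"]])
```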
So far my two cents, Guido