[tdwg-tapir] Tapir protocol - Harvest methods?

Wed Apr 30 14:58:00 CEST 2008

Hi Stan,

Just a few comments about TAPIR and OAI-PMH.

I'm not sure if there's any core functionality offered by OAI-PMH that
cannot be easily replicated with TAPIR. The main ingredients would be:

* A short list of concepts, basically record identifier, record timestamp,
set membership and deletion flag. These would be the main concepts
associated with request parameters and filters.
* An extra list of concepts (or perhaps only one wrapper concept for XML
content) that would be used to return the complete record representation
in responses.
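
To make the two lists above a bit more concrete, here is a minimal Python
sketch of what such a record and an OAI-PMH-style selective filter could
look like. Everything in it is hypothetical: the class name, the field
names and the select_records helper are only illustrations, and nothing
here comes from the TAPIR or OAI-PMH specifications.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import FrozenSet, Iterable, List, Optional


@dataclass(frozen=True)
class HarvestRecord:
    """Hypothetical record carrying the short list of harvesting concepts."""
    identifier: str                     # record identifier
    timestamp: datetime                 # last modification time
    sets: FrozenSet[str] = frozenset()  # set membership
    deleted: bool = False               # deletion flag
    xml_content: Optional[str] = None   # wrapper for the full representation


def select_records(records: Iterable[HarvestRecord],
                   since: Optional[datetime] = None,
                   until: Optional[datetime] = None,
                   set_name: Optional[str] = None) -> List[HarvestRecord]:
    """Keep records inside the [since, until] window and in the given set."""
    selected = []
    for record in records:
        if since is not None and record.timestamp < since:
            continue
        if until is not None and record.timestamp > until:
            continue
        if set_name is not None and set_name not in record.sets:
            continue
        # Deleted records are kept on purpose so a harvester can see the flag.
        selected.append(record)
    return selected
```

In a real provider the same conditions would be expressed as a filter on
the mapped concepts rather than applied in memory like this.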

On the other hand, there are many functionalities in TAPIR that cannot be
replicated in OAI-PMH since TAPIR is a generic search protocol. In some
situations, and depending on how data providers are implemented, this can
make TAPIR more efficient even in data harvesting scenarios. In OAI-PMH it
may be necessary to send multiple requests to retrieve all data from a
single record (in case there are multiple metadata prefixes
associated with the record). Also note that GBIF is using a name range
query template for harvesting TAPIR providers - this approach was
developed after years of experience and seems to give the best performance
for them. I'm not sure if GBIF could use a similar strategy for an OAI-PMH
provider, i.e., retrieving approximately the same number of records in
sequential requests using a custom filter that potentially forces the
local database to use an index. In TAPIR this can be done with an
inventory request (with "count" activated) and subsequent searches using a
parameterized range filter guaranteed to return a certain number of
records.
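
As a rough illustration of that two-step strategy (this is not GBIF's
actual harvester code), the Python sketch below takes the counts from an
inventory and groups the sorted names into ranges of roughly equal size.
The inventory is assumed to have been parsed already into a dict mapping
each name to its record count; the function name and the "target"
parameter are hypothetical.

```python
from typing import Dict, List, Tuple


def build_name_ranges(inventory: Dict[str, int],
                      target: int) -> List[Tuple[str, str, int]]:
    """Group sorted names into inclusive (lower, upper, count) ranges.

    A range is closed as soon as it holds at least `target` records, so
    only the last range can fall below `target`.
    """
    if target <= 0:
        raise ValueError("target must be positive")

    ranges: List[Tuple[str, str, int]] = []
    lower = None   # first name of the range being built
    upper = None   # last name seen so far
    running = 0    # records accumulated in the current range

    for name in sorted(inventory):
        if lower is None:
            lower = name
        upper = name
        running += inventory[name]
        if running >= target:
            ranges.append((lower, upper, running))
            lower, running = None, 0

    # Whatever is left over forms a final, smaller range.
    if lower is not None:
        ranges.append((lower, upper, running))
    return ranges
```

Each (lower, upper) pair would then be substituted into the parameterized
range filter of a search request, one request per range. Since the filter
compares against a single concept, the local database has a chance to use
an index on it.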

I realize there may be other reasons to expose data using OAI-PMH (more
available tools or compatibility with other networks). In that case, I
should point to this interesting work, where Kevin Richards ended up
implementing an OAI-PMH service on top of TAPIR in fewer than 50 lines of
code:

http://wiki.tdwg.org/twiki/bin/view/TAPIR/TapirOAIPMH

Best Regards,
--
Renato

> Phil,
>
> TAPIR was intended to be a unification of DiGIR and BioCASE. There are a
> few implementations of providers but fewer instances of portals built on
> TAPIR. Networks built on DiGIR may eventually switch to TAPIR, but that
> remains to be seen.  DiGIR and BioCASE were designed for distributed
> queries, not really harvesting.  I understand harvesting can be done more
> simply and efficiently by other approaches, such as OAI-PMH.  If the
> sensibilities of data providers evolve to accept and allow harvesting
> (which seems likely), we may see "networks" built on that architecture,
> instead of distributed queries.
>
> If your only goal is to provide data to GBIF, I would suggest installing
> TAPIR (unless Tim Robertson tells you something else).  If you are
> concerned about providing data to other networks, like www.SERNEC.org,
> you'll need a DiGIR provider, too.  (Such is the nature of technical
> transition.)
>
> -Stan
>
> Stanley D. Blum, Ph.D.
> Research Information Manager
> California Academy of Sciences
> 875 Howard St.
> San Francisco,  CA
> +1 (415) 321-8183