Re: [tdwg-tapir] Tapir protocol - Harvest methods?

30 Apr 2008


Hi Markus,
As you indicate, the provider side data cache, one that reflects
actual changes in content according to an agreed data model (and so
providing true date last modified), is crucial to efficient
propagation of content through the various networks.  Such a cache, if
implemented properly, can also provide an effective basis for both push
and pull models of data transfer.  Indeed some data providers may
implement a mechanism where they allow other providers to push content
to their cache, thus enabling those with limited connectivity or
expertise for running a server to contribute to a network.  In such a
model, the only really important pieces of information (for
synchronization) are a unique identifier for each record and
timestamps indicating when the object was created and last modified.
Provenance metadata should also be captured unless the intended
outcome is an entirely anonymous network.   Such a push+pull approach
is being implemented for the fishnet network, and results thus far
have been satisfying.
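
As a rough illustration of the push+pull idea above, here is a minimal sketch of a cache keyed by a unique identifier that tracks created and last-modified timestamps plus a provenance field. The class and method names (ProviderCache, push, pull) are hypothetical and are not taken from the fishnet implementation; the store is in memory only.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List, Optional


@dataclass
class CachedRecord:
    record_id: str      # unique identifier for the record
    source: str         # provenance: who contributed the record
    content: dict
    created: datetime
    modified: datetime


class ProviderCache:
    """In-memory stand-in for a provider-side cache (hypothetical)."""

    def __init__(self) -> None:
        self._records: Dict[str, CachedRecord] = {}

    def push(self, record_id: str, source: str, content: dict,
             now: Optional[datetime] = None) -> bool:
        """Store a record; return True only if the cache actually changed."""
        now = now or datetime.now(timezone.utc)
        existing = self._records.get(record_id)
        if existing is None:
            self._records[record_id] = CachedRecord(
                record_id, source, content, now, now)
            return True
        if existing.content == content:
            return False  # unchanged content keeps its old timestamp
        existing.content = content
        existing.source = source
        existing.modified = now
        return True

    def pull(self, since: Optional[datetime] = None) -> List[CachedRecord]:
        """Return records modified after `since` (all records if None)."""
        return sorted(
            (r for r in self._records.values()
             if since is None or r.modified > since),
            key=lambda r: r.modified,
        )
```

A harvester would remember the time of its last successful pull and pass it as `since` on the next run, so a re-push of identical content costs nothing downstream.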

regards,
  Dave V.

On Wed, Apr 30, 2008 at 7:38 AM, Markus Döring <mdoering@gbif.org> wrote:
...
Interesting.
Indeed, a stable identifier is vital for many things. So is date last
modified for incremental harvesting (using whatever protocol, as Roger
explained).
And that is why I want to continue some of WASABI's ideas of having a data
cache on the *provider* side. The provider software fills this cache from
the live db anytime the provider wants to publish his data and the date last
modified gets calculated per record. Also GUIDs can be assigned in this
process based on stable local IDs. And from this cache different protocols
incl. TAPIRlite & OAI-PMH can easily be served. At GBIF we would even like to
go further and create "local index files" (our current working title for
this) for very efficient harvesting which can be downloaded as a static
compressed single file - much like Google uses sitemaps for indexing. I am
currently preparing a document on this with Tim Robertson and we are happy
to hear your thoughts on this in a few weeks.
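
One way the per-record "date last modified" and the GUID assignment described above might be computed is sketched below. This is an assumption-laden illustration, not the WASABI or GBIF design: it detects change by hashing each row's content, and it derives GUIDs as name-based UUIDs from the local ID. The namespace URL and the function names are hypothetical.

```python
import hashlib
import json
import uuid
from datetime import datetime, timezone
from typing import Dict, Optional

# Hypothetical namespace; a real provider would pick its own stable value.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "http://example.org/provider")


def fingerprint(row: dict) -> str:
    """Stable hash of a row's content, independent of key order."""
    payload = json.dumps(row, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def refresh_cache(cache: Dict[str, dict], live_rows: Dict[str, dict],
                  now: Optional[datetime] = None) -> int:
    """Update `cache` from `live_rows` (keyed by local ID).

    Returns the number of records that were added or changed.
    """
    now = now or datetime.now(timezone.utc)
    changed = 0
    for local_id, row in live_rows.items():
        digest = fingerprint(row)
        entry = cache.get(local_id)
        if entry is not None and entry["digest"] == digest:
            continue  # content unchanged: keep the old date last modified
        cache[local_id] = {
            # The same local ID always yields the same GUID.
            "guid": str(uuid.uuid5(NAMESPACE, local_id)),
            "digest": digest,
            "modified": now,
            "data": row,
        }
        changed += 1
    return changed
```

The sketch deliberately ignores deletions; a real cache would also need to flag records that have disappeared from the live db.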
Markus
On 30 Apr, 2008, at 15:55, Roger Hyam (TDWG) wrote:
Hi Renato and all,
The issue of harvesting isn't really a protocol one. In order to be able to
have an efficient harvesting strategy (i.e. do incremental harvests) data
suppliers need to:
* uniquely identify objects (records or items or whatever)
* keep track of when they change these items
My understanding is that GBIF (and I guess other indexers) have to
completely re-index the majority of data sources because these two things
are not implemented consistently or at all by many of the suppliers. GBIF
are now running out of resources and can't keep re-indexing every record
every time. This is especially ironic as most of the records are from
archives where the data rarely changes. It also means that data from the
GBIF cache isn't comparable over time. If a data set is dropped and replaced
by a new version with subtly different data points the consumer can't know
if the different data points are additions or corrections to the old data
points.
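
To make the comparability point concrete: once every record carries a stable identifier, two harvests of the same data set can be compared directly. The following is only a sketch with hypothetical names (Diff, compare_snapshots), assuming each harvest is held as a mapping from identifier to record content.

```python
from typing import Dict, List, NamedTuple


class Diff(NamedTuple):
    added: List[str]      # identifiers present only in the new harvest
    corrected: List[str]  # identifiers present in both, content differs
    removed: List[str]    # identifiers present only in the old harvest


def compare_snapshots(old: Dict[str, dict], new: Dict[str, dict]) -> Diff:
    """Compare two harvests keyed by stable record identifier."""
    added = sorted(k for k in new if k not in old)
    removed = sorted(k for k in old if k not in new)
    corrected = sorted(k for k in new if k in old and new[k] != old[k])
    return Diff(added, corrected, removed)
```

Without the identifiers there is no key to join on, so a changed data point is indistinguishable from one deletion plus one addition.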
The TAPIR protocol does not require records to have ids and modification
dates. There is no reason for it to do so. The protocol may even be useful
in applications where one positively does not want to enforce this.
If data providers who do implement TAPIR do supply ids and modification
dates in a uniform way then it would be possible to incrementally harvest
from them. It might even be possible to layer the OAI-PMH protocol over the
top of TAPIR to make it more generic - as Kevin's work shows.
If TAPIR data sources don't supply ids and modification dates or they don't
supply them in a "standard" way then efficient incremental harvesting is
near enough impossible. One would have to do an inventory call where all the
records began with "A"  then with "B" etc.
OAI-PMH mandates the notions of ids (indeed GUIDs) and modification dates
but obviously doesn't have a notion of search/query at all.
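
For comparison, an incremental harvest in OAI-PMH comes down to a ListRecords request carrying a "from" date. The sketch below only builds the request URL; the verb and the parameter names (verb, metadataPrefix, from, set) are those of OAI-PMH, while the helper name and the base URL used in the usage note are made up for illustration.

```python
from datetime import date
from typing import Optional
from urllib.parse import urlencode


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc",
                     since: Optional[date] = None,
                     set_spec: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords URL, optionally limited by date and set."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if since is not None:
        params["from"] = since.isoformat()  # only records changed since then
    if set_spec is not None:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)
```

For example, list_records_url("http://example.org/oai", since=date(2008, 4, 1)) asks only for records changed on or after 1 April 2008. There is no parameter for an arbitrary search filter, which is the limitation noted above.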
My belief/opinion is that the primary purpose of many people exposing data
is to get it indexed (harvested) by GBIF.  "Just" supplying data through
TAPIR for this purpose does not make GBIF's job easy or scalable. Providers
should also supply GUIDs and modification dates. If they supply the GUIDs
and modification dates the protocol is not so important - RSS or Atom
anyone?
I would go so far as saying that if data providers can't supply these two
pieces of information they shouldn't expose their data as they are just
polluting the global data pool - but that would probably be me saying way
too much just to be provocative!
Hope my ranting is informative,
All the best,
Roger
-------------------------------------------------------------
Roger Hyam
Roger@BiodiversityCollectionsIndex.org
http://www.BiodiversityCollectionsIndex.org
-------------------------------------------------------------
Royal Botanic Garden Edinburgh
20A Inverleith Row, Edinburgh, EH3 5LR, UK
Tel: +44 131 552 7171 ext 3015
Fax: +44 131 248 2901
http://www.rbge.org.uk/
-------------------------------------------------------------
On 30 Apr 2008, at 13:58, Renato De Giovanni wrote:
Hi Stan,
Just a few comments about TAPIR and OAI-PMH.
I'm not sure if there's any core functionality offered by OAI-PMH that
cannot be easily replicated with TAPIR. The main ingredients would be:
* A short list of concepts, basically record identifier, record timestamp,
set membership and deletion flag. These would be the main concepts
associated with request parameters and filters.
* An extra list of concepts (or perhaps only one wrapper concept for XML
content) that would be used to return the complete record representation
in responses.
On the other hand, there are many functionalities in TAPIR that cannot be
replicated in OAI-PMH since TAPIR is a generic search protocol. In some
situations, and depending on how data providers are implemented, this can
make TAPIR more efficient even in data harvesting scenarios. In OAI-PMH it
may be necessary to send multiple requests to retrieve all data from a
single record (in case there are multiple metadata prefixes
associated with the record). Also note that GBIF is using a name range
query template for harvesting TAPIR providers - this approach has been
created after years of experience and seems to give the best performance
for them. I'm not sure if GBIF could use a similar strategy for an OAI-PMH
provider, i.e., retrieving approximately the same number of records in
sequential requests using a custom filter that potentially forces the
local database to use an index. In TAPIR this can be done with an
inventory request (with "count" activated) and subsequent searches using a
parameterized range filter guaranteed to return a certain number of
records.
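
The name-range strategy described above might be planned roughly as follows. This is a sketch, not GBIF's actual query template: it assumes the inventory response has already been parsed into (name, count) pairs, and it greedily groups them into ranges of at most about `page_size` records. The function name plan_ranges is hypothetical.

```python
from typing import List, Optional, Tuple


def plan_ranges(inventory: List[Tuple[str, int]],
                page_size: int) -> List[Tuple[str, str, int]]:
    """Group (name, count) pairs into ranges of roughly `page_size` records.

    Returns (first_name, last_name, record_count) tuples with inclusive
    bounds. A single name with more records than `page_size` gets a range
    of its own.
    """
    ranges: List[Tuple[str, str, int]] = []
    start: Optional[str] = None
    last: Optional[str] = None
    total = 0
    for name, count in sorted(inventory):
        # Close the current range if this name would push it over the limit.
        if start is not None and total + count > page_size:
            ranges.append((start, last, total))
            start, total = None, 0
        if start is None:
            start = name
        last = name
        total += count
    if start is not None:
        ranges.append((start, last, total))
    return ranges
```

Each resulting range would then be sent as one search whose filter is "name >= first_name and name <= last_name", which gives the local database a chance to use an index on the name column.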
I realize there may be other reasons to expose data using OAI-PMH (more
available tools or compatibility with other networks). In this case, I
should point to this interesting work where in the end Kevin Richards
implemented an OAI-PMH service on top of TAPIR using less than 50 lines of
code:
http://wiki.tdwg.org/twiki/bin/view/TAPIR/TapirOAIPMH
Best Regards,
--
Renato
Phil,
TAPIR was intended to be a unification of DiGIR and BioCASE. There are a few
implementations of providers but fewer instances of portals built on TAPIR.
Networks built on DiGIR may eventually switch to TAPIR, but that remains to
be seen.  DiGIR and BioCASE were designed for distributed queries, not really
harvesting.  I understand harvesting can be done more simply and efficiently
by other approaches, such as OAI-PMH.  If the sensibilities of data providers
evolve to accept and allow harvesting (which seems likely), we may see
"networks" built on that architecture, instead of distributed queries.
If your only goal is to provide data to GBIF, I would suggest installing
TAPIR (unless Tim Robertson tells you something else).  If you are concerned
about providing data to other networks, like www.SERNEC.org, you'll need a
DiGIR provider, too.  (Such is the nature of technical transition.)
-Stan
Stanley D. Blum, Ph.D.
Research Information Manager
California Academy of Sciences
875 Howard St.
San Francisco,  CA
+1 (415) 321-8183
_______________________________________________
tdwg-tapir mailing list
tdwg-tapir@lists.tdwg.org
http://lists.tdwg.org/mailman/listinfo/tdwg-tapir

Dave Vieglais