[tdwg-tapir] Tapir protocol - Harvest methods?

Wed Apr 30 15:55:56 CEST 2008

Hi Renato and all,

The issue of harvesting isn't really a protocol one. In order to have
an efficient harvesting strategy (i.e. do incremental harvests), data
suppliers need to:

* uniquely identify objects (records or items or whatever)
* keep track of when they change these items

My understanding is that GBIF (and I guess other indexers) have to  
completely re-index the majority of data sources because these two  
things are not implemented consistently or at all by many of the  
suppliers. GBIF are now running out of resources and can't keep
re-indexing every record every time. This is especially ironic as most of
the records are from archives where the data rarely changes. It also
means that data from the GBIF cache isn't comparable over time. If a
data set is dropped and replaced by a new version with subtly
different data points, the consumer can't know whether the different
data points are additions or corrections to the old data points.
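
A minimal sketch of why stable ids matter here, assuming each harvested record is held as an id mapped to a (modification date, data) pair. The ids, dates and values are invented for illustration and are not taken from GBIF or any provider:

```python
def diff_snapshots(old, new):
    """Classify changes between two harvests keyed by a stable record id.

    `old` and `new` map a record id to a (modified, data) tuple.
    Without stable ids none of these three lists can be computed.
    """
    added = sorted(k for k in new if k not in old)
    removed = sorted(k for k in old if k not in new)
    changed = sorted(k for k in new if k in old and new[k] != old[k])
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical snapshots of the same data set taken a year apart.
old = {
    "urn:example:1": ("2007-01-10", {"name": "Quercus robur", "lat": 55.96}),
    "urn:example:2": ("2007-01-10", {"name": "Fagus sylvatica", "lat": 55.97}),
}
new = {
    "urn:example:1": ("2008-03-02", {"name": "Quercus robur", "lat": 55.965}),
    "urn:example:3": ("2008-03-02", {"name": "Betula pendula", "lat": 55.95}),
}

result = diff_snapshots(old, new)
print(result)
```

With ids, the corrected latitude on record 1 shows up as a change rather than as one record vanishing and an unrelated one appearing.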

The TAPIR protocol does not require records to have ids and
modification dates. There is no reason for it to do so. The protocol
may even be useful in applications where one positively does not want
to enforce this.

If data providers who do implement TAPIR do supply ids and  
modification dates in a uniform way then it would be possible to  
incrementally harvest from them. It might even be possible to layer  
the OAI-PMH protocol over the top of TAPIR to make it more generic -  
as Kevin's work shows.
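
A rough sketch of an incremental harvest under those conditions, assuming every record carries an "id" and a "modified" timestamp. The function names and record layout are hypothetical, and the provider is simulated with an in-memory list; a real harvester would send the date filter to the provider instead:

```python
from datetime import datetime, timezone


def fetch_modified_since(provider_records, since):
    """Stand-in for a provider-side filter on the modification date."""
    return [r for r in provider_records if r["modified"] > since]


def incremental_harvest(provider_records, cache, last_harvest):
    """Update `cache` in place with records changed after `last_harvest`.

    Returns the ids that were fetched, so unchanged records are never
    transferred or re-indexed.
    """
    touched = []
    for rec in fetch_modified_since(provider_records, last_harvest):
        cache[rec["id"]] = rec
        touched.append(rec["id"])
    return touched


def utc(y, m, d):
    return datetime(y, m, d, tzinfo=timezone.utc)


# Hypothetical provider content.
provider = [
    {"id": "urn:example:1", "modified": utc(2007, 1, 10), "name": "Quercus robur"},
    {"id": "urn:example:2", "modified": utc(2008, 4, 2), "name": "Fagus sylvatica"},
    {"id": "urn:example:3", "modified": utc(2008, 4, 20), "name": "Betula pendula"},
]
cache = {"urn:example:1": provider[0]}

touched = incremental_harvest(provider, cache, last_harvest=utc(2008, 4, 1))
print(touched)
```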

If TAPIR data sources don't supply ids and modification dates or they
don't supply them in a "standard" way then efficient incremental
harvesting is near enough impossible. One would have to do an
inventory call for all the records beginning with "A", then with "B", etc.
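
To illustrate that fallback, here is a sketch that turns an inventory of names with counts into name ranges of roughly equal size, along the lines of the range-based approach Renato describes in the quoted message below. The function name, the page size and the counts are invented for illustration, and this simple grouping only approximates the page size rather than guaranteeing it:

```python
def name_ranges(inventory, page_size):
    """Group (name, count) pairs into contiguous name ranges.

    Each range is closed as soon as it holds at least `page_size` records.
    Returns a list of (first_name, last_name, total_records) tuples.
    """
    ranges = []
    start = None
    last = None
    total = 0
    for name, count in sorted(inventory):
        if start is None:
            start = name
        last = name
        total += count
        if total >= page_size:
            ranges.append((start, last, total))
            start, total = None, 0
    if start is not None:
        # Remaining names that did not fill a whole page.
        ranges.append((start, last, total))
    return ranges


# Hypothetical inventory response: distinct names with record counts.
inventory = [
    ("Abies alba", 40),
    ("Acer campestre", 70),
    ("Betula pendula", 30),
    ("Carex nigra", 90),
    ("Fagus sylvatica", 20),
]

ranges = name_ranges(inventory, page_size=100)
for first, last, total in ranges:
    print(first, "..", last, total)
```

Every range still has to be fetched in full on every harvest, which is exactly the cost that ids and modification dates would avoid.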

OAI-PMH mandates the notions of ids (indeed GUIDs) and modification  
dates but obviously doesn't have a notion of search/query at all.
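
For comparison, a sketch of what a date-bounded OAI-PMH harvest request looks like. The verb "ListRecords" and the arguments "metadataPrefix", "from" and "set" come from the OAI-PMH specification; the base URL and the helper function name are placeholders made up for this example:

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix, since=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL.

    `since` is a UTC date such as "2008-04-01"; when given, the repository
    should return only records created, changed or deleted from that date.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if since is not None:
        params["from"] = since
    if set_spec is not None:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# http://example.org/oai is a placeholder, not a real repository.
url = list_records_url("http://example.org/oai", "oai_dc", since="2008-04-01")
print(url)
```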

My belief/opinion is that the primary purpose of many people exposing
data is to get it indexed (harvested) by GBIF. "Just" supplying data
through TAPIR for this purpose does not make GBIF's job easy or
scalable. Providers should also supply GUIDs and modification dates.
If they supply the GUIDs and modification dates the protocol is not so
important - RSS or Atom anyone?
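
To make the Atom remark concrete: an Atom entry already has slots for exactly these two pieces of information, "id" and "updated". A small sketch using Python's standard library; the GUID, timestamp and title are invented, and how a provider would map its records onto entries is an assumption, not an agreed practice:

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
ET.register_namespace("", ATOM)  # serialise Atom as the default namespace


def atom_entry(guid, modified, title):
    """Build an Atom <entry> carrying a record's GUID and modification date."""
    entry = ET.Element("{%s}entry" % ATOM)
    ET.SubElement(entry, "{%s}id" % ATOM).text = guid
    ET.SubElement(entry, "{%s}updated" % ATOM).text = modified
    ET.SubElement(entry, "{%s}title" % ATOM).text = title
    return entry


# Hypothetical record.
entry = atom_entry("urn:example:specimen:1", "2008-04-30T13:55:56Z", "Quercus robur")
xml_text = ET.tostring(entry, encoding="unicode")
print(xml_text)
```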

I would go so far as saying that if data providers can't supply these  
two pieces of information they shouldn't expose their data as they are  
just polluting the global data pool - but that would probably be me  
saying way too much just to be provocative!

Hope my ranting is informative,

All the best,

Roger

-------------------------------------------------------------
Roger Hyam
Roger at BiodiversityCollectionsIndex.org
http://www.BiodiversityCollectionsIndex.org
-------------------------------------------------------------
Royal Botanic Garden Edinburgh
20A Inverleith Row, Edinburgh, EH3 5LR, UK
Tel: +44 131 552 7171 ext 3015
Fax: +44 131 248 2901
http://www.rbge.org.uk/
-------------------------------------------------------------

On 30 Apr 2008, at 13:58, Renato De Giovanni wrote:

> Hi Stan,
>
> Just a few comments about TAPIR and OAI-PMH.
>
> I'm not sure if there's any core functionality offered by OAI-PMH that
> cannot be easily replicated with TAPIR. The main ingredients would be:
>
> * A short list of concepts, basically record identifier, record
>   timestamp, set membership and deletion flag. These would be the main
>   concepts associated with request parameters and filters.
> * An extra list of concepts (or perhaps only one wrapper concept for
>   XML content) that would be used to return the complete record
>   representation in responses.
>
> On the other hand, there are many functionalities in TAPIR that cannot
> be replicated in OAI-PMH since TAPIR is a generic search protocol. In
> some situations, and depending on how data providers are implemented,
> this can make TAPIR more efficient even in data harvesting scenarios.
> In OAI-PMH it may be necessary to send multiple requests to retrieve
> all data from a single record (in case there are multiple metadata
> prefixes associated with the record). Also note that GBIF is using a
> name range query template for harvesting TAPIR providers - this
> approach has been created after years of experience and seems to give
> the best performance for them. I'm not sure if GBIF could use a similar
> strategy for an OAI-PMH provider, i.e., retrieving approximately the
> same number of records in sequential requests using a custom filter
> that potentially forces the local database to use an index. In TAPIR
> this can be done with an inventory request (with "count" activated)
> and subsequent searches using a parameterized range filter guaranteed
> to return a certain number of records.
>
> I realize there may be other reasons to expose data using OAI-PMH
> (more available tools or compatibility with other networks). In this
> case, I should point to this interesting work where in the end Kevin
> Richards implemented an OAI-PMH service on top of TAPIR using less
> than 50 lines of code:
>
> http://wiki.tdwg.org/twiki/bin/view/TAPIR/TapirOAIPMH
>
> Best Regards,
> --
> Renato
>
>
>> Phil,
>>
>> TAPIR was intended to be a unification of DiGIR and BioCASE. There are
>> a few implementations of providers but fewer instances of portals
>> built on TAPIR. Networks built on DiGIR may eventually switch to TAPIR,
>> but that remains to be seen. DiGIR and BioCASE were designed for
>> distributed queries, not really harvesting. I understand harvesting
>> can be done more simply and efficiently by other approaches, such as
>> OAI-PMH. If the sensibilities of data providers evolve to accept and
>> allow harvesting (which seems likely), we may see "networks" built on
>> that architecture, instead of distributed queries.
>>
>> If your only goal is to provide data to GBIF, I would suggest
>> installing TAPIR (unless Tim Robertson tells you something else). If
>> you are concerned about providing data to other networks, like
>> www.SERNEC.org, you'll need a DiGIR provider, too. (Such is the nature
>> of technical transition.)
>>
>> -Stan
>>
>> Stanley D. Blum, Ph.D.
>> Research Information Manager
>> California Academy of Sciences
>> 875 Howard St.
>> San Francisco,  CA
>> +1 (415) 321-8183
>
>
> _______________________________________________
> tdwg-tapir mailing list
> tdwg-tapir at lists.tdwg.org
> http://lists.tdwg.org/mailman/listinfo/tdwg-tapir
