[tdwg-tapir] Tapir protocol - Harvest methods?

Dave Vieglais vieglais at ku.edu
Wed Apr 30 17:43:23 CEST 2008


Hi Roger,
Your summation is correct - some source database implementations fail
to record a date last modified for individual records of the data model
being exposed by the data provider (e.g. the Darwin Core model over
DiGIR, TAPIR, WASABI, etc.), so the only effective mechanism for change
detection is to record a hash of a normalized form of the content of
each record (ensuring consistent field order and formatting).  The
actual data from the record can also be captured in the cache, or the
cache interface can simply reconstruct the records from the source
database.  The important part of the process is rendering content
according to the model being exposed to the networks, and recording
the hash and timestamps, to enable content change tracking.
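A minimal sketch of that normalize-then-hash step (the record shape and
field names here are made up for illustration, not taken from any
particular provider):

```python
import hashlib
import json

def record_hash(record: dict) -> str:
    """Hash a normalized form of a record: field order and value
    formatting are made consistent before hashing, so identical
    content always yields the same digest."""
    # Sort keys and strip surrounding whitespace to normalize
    # field order and formatting before serializing.
    normalized = json.dumps(
        {k: str(v).strip() for k, v in record.items()},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

# Hypothetical Darwin Core-style record:
rec = {"ScientificName": "Puma concolor", "Country": "US "}
same = {"Country": "US", "ScientificName": "Puma concolor"}
assert record_hash(rec) == record_hash(same)
```

Any change to a field value changes the digest, while field order and
incidental whitespace differences do not.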

I've found this process to be very efficient: running change detection
across an entire dataset of about 200k records takes only a couple of
minutes without any particular optimization.  That is too slow for
realtime access, but by caching the change information, a data provider
could easily be modified to support proper change detection as required
by protocols such as OAI-PMH.

regards,
  Dave V.



On Wed, Apr 30, 2008 at 8:11 AM, Roger Hyam (TDWG) <rogerhyam at mac.com> wrote:
>
>
> Markus,
>
> Martin and I were just having a conversation along these lines.
>
> The trouble is with the notion of whether a "record" has changed when the
> internal database may not have the same notion of a record as the wrapper
> software. That is, the record visible to the outside is the result of a
> query, and the only way to know whether the data in a particular query
> result row has changed is to run the query again.
>
> Our thoughts were along the lines of hashing the query results, so that you
> can run a periodic exhaustive crawl of the data locally and update your
> local cache with the changes - though I guess you could just hash a
> serialization of the object.
>
> Your local cache would only need to contain three fields - object_id,
> last_mod, object_serialization - and could be very generic. An adaptor to
> the client database would just have to run a query to generate the
> serialized objects. I wrote an implementation of OAI-PMH on top of a table
> like this and it worked really easily. The problem is still the adaptor to
> the client database. Someone has to do the mapping, and they could use
> TAPIR to do that...
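> A minimal sketch of that cache table and the "only bump last_mod when the
> serialization actually changed" logic (SQLite here, and all the names and
> values are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE cache (
        object_id TEXT PRIMARY KEY,
        last_mod  TEXT NOT NULL,
        object_serialization TEXT NOT NULL
    )
""")

def upsert(object_id: str, serialization: str, now: str) -> None:
    """Insert a new object, or update last_mod only when the
    serialized object has actually changed."""
    row = conn.execute(
        "SELECT object_serialization FROM cache WHERE object_id = ?",
        (object_id,),
    ).fetchone()
    if row is None:
        conn.execute("INSERT INTO cache VALUES (?, ?, ?)",
                     (object_id, now, serialization))
    elif row[0] != serialization:
        conn.execute(
            "UPDATE cache SET last_mod = ?, object_serialization = ? "
            "WHERE object_id = ?",
            (now, serialization, object_id))

upsert("obj-1", "<specimen>A</specimen>", "2008-04-30T10:00:00Z")
upsert("obj-1", "<specimen>A</specimen>", "2008-05-01T10:00:00Z")  # unchanged: last_mod stays
upsert("obj-1", "<specimen>B</specimen>", "2008-05-02T10:00:00Z")  # changed: last_mod bumps
```

> With a table like this, answering an OAI-PMH ListRecords request with a
> "from" date is just a query against last_mod.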
>
> At this point the coffee was drunk and we had to stop...
>
> All the best with it,
>
> Roger
>
>
>
>
>
> -------------------------------------------------------------
> Roger Hyam
> Roger at BiodiversityCollectionsIndex.org
> http://www.BiodiversityCollectionsIndex.org
> -------------------------------------------------------------
> Royal Botanic Garden Edinburgh
> 20A Inverleith Row, Edinburgh, EH3 5LR, UK
> Tel: +44 131 552 7171 ext 3015
> Fax: +44 131 248 2901
> http://www.rbge.org.uk/
> -------------------------------------------------------------
>
>
>
>
>
>
> On 30 Apr 2008, at 15:38, Markus Döring wrote:
>
> Interesting.
> Indeed, a stable identifier is vital for many things. So is date last
> modified for incremental harvesting (using whatever protocol, as Roger
> explained).
>
> And that is why I want to continue some of WASABI's ideas of having a data
> cache on the *provider* side. The provider software fills this cache from
> the live db whenever the provider wants to publish their data, and the date
> last modified gets calculated per record. GUIDs can also be assigned in this
> process, based on stable local IDs. And from this cache different protocols,
> including TAPIRlite and OAI-PMH, can easily be served. At GBIF we would even
> like to go further and create "local index files" (our current working title
> for this) for very efficient harvesting, which can be downloaded as a single
> static compressed file - much like Google uses sitemaps for indexing. I am
> currently preparing a document on this with Tim Robertson and we are happy
> to hear your thoughts on it in a few weeks.
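> As a rough sketch of what such a static, compressed index file could look
> like (the file name, columns, and values below are made up for illustration -
> the real format is what the forthcoming document will define):

```python
import csv
import gzip

# Hypothetical provider-side cache rows: (guid, last_modified, serialized record).
rows = [
    ("urn:example:occ:1", "2008-04-01T12:00:00Z", "<record>...</record>"),
    ("urn:example:occ:2", "2008-04-15T09:30:00Z", "<record>...</record>"),
]

# Write a single compressed file a harvester can fetch in one request,
# much like a sitemap: one tab-separated line per record.
with gzip.open("local_index.csv.gz", "wt", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["guid", "last_modified", "record"])
    writer.writerows(rows)
```

> The harvester then downloads one file, diffs the GUID/timestamp pairs
> against its cache, and fetches nothing record-by-record.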
>
> Markus
>
>
>
>
>
> On 30 Apr, 2008, at 15:55, Roger Hyam (TDWG) wrote:
>
>
> Hi Renato and all,
>
> The issue of harvesting isn't really a protocol one. In order to have an
> efficient harvesting strategy (i.e. do incremental harvests), data
> suppliers need to:
>
> * uniquely identify objects (records, items, or whatever)
> * keep track of when these items change
>
> My understanding is that GBIF (and I guess other indexers) have to
> completely re-index the majority of data sources because these two things
> are not implemented consistently or at all by many of the suppliers. GBIF
> are now running out of resources and can't keep re-indexing every record
> every time. This is especially ironic as most of the records are from
> archives where the data rarely changes. It also means that data from the
> GBIF cache isn't comparable over time. If a data set is dropped and replaced
> by a new version with subtly different data points the consumer can't know
> if the different data points are additions or corrections to the old data
> points.
>
> The TAPIR protocol does not require records to have ids and modification
> dates. There is no reason for it to do so. The protocol may even be useful
> in applications where one positively does not want to enforce this.
>
> If data providers who do implement TAPIR do supply ids and modification
> dates in a uniform way then it would be possible to incrementally harvest
> from them. It might even be possible to layer the OAI-PMH protocol over the
> top of TAPIR to make it more generic - as Kevin's work shows.
>
> If TAPIR data sources don't supply ids and modification dates, or don't
> supply them in a "standard" way, then efficient incremental harvesting is
> near enough impossible. One would have to do an inventory call for all the
> records beginning with "A", then with "B", and so on.
>
> OAI-PMH mandates the notions of ids (indeed GUIDs) and modification dates
> but obviously doesn't have a notion of search/query at all.
>
> My belief/opinion is that the primary purpose of many people exposing data
> is to get it indexed (harvested) by GBIF.  "Just" supplying data through
> TAPIR for this purpose does not make GBIF's job easy or scalable. Providers
> should also supply GUIDs and modification dates. If they supply the GUIDs
> and modification dates, the protocol is not so important - RSS or Atom,
> anyone?
>
> I would go so far as saying that if data providers can't supply these two
> pieces of information they shouldn't expose their data as they are just
> polluting the global data pool - but that would probably be me saying way
> too much just to be provocative!
>
> Hope my ranting is informative,
>
> All the best,
>
> Roger
>
>
>
>
> -------------------------------------------------------------
> Roger Hyam
> Roger at BiodiversityCollectionsIndex.org
> http://www.BiodiversityCollectionsIndex.org
> -------------------------------------------------------------
> Royal Botanic Garden Edinburgh
> 20A Inverleith Row, Edinburgh, EH3 5LR, UK
> Tel: +44 131 552 7171 ext 3015
> Fax: +44 131 248 2901
> http://www.rbge.org.uk/
> -------------------------------------------------------------
>
>
>
>
>
> On 30 Apr 2008, at 13:58, Renato De Giovanni wrote:
> Hi Stan,
>
> Just a few comments about TAPIR and OAI-PMH.
>
> I'm not sure if there's any core functionality offered by OAI-PMH that
> cannot be easily replicated with TAPIR. The main ingredients would be:
>
> * A short list of concepts, basically record identifier, record timestamp,
> set membership and deletion flag. These would be the main concepts
> associated with request parameters and filters.
> * An extra list of concepts (or perhaps only one wrapper concept for XML
> content) that would be used to return the complete record representation
> in responses.
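> As a rough sketch, that mapping might be written down like this (the
> concept URIs are invented for illustration - a real TAPIR output model
> would define its own):

```python
# OAI-PMH notion -> hypothetical TAPIR concept id.
oai_to_tapir = {
    "identifier":     "http://example.org/harvest/recordIdentifier",
    "datestamp":      "http://example.org/harvest/recordTimestamp",
    "setSpec":        "http://example.org/harvest/setMembership",
    "status=deleted": "http://example.org/harvest/deletionFlag",
    # Wrapper concept carrying the complete record representation:
    "metadata":       "http://example.org/harvest/recordXML",
}
```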
>
> On the other hand, there are many functionalities in TAPIR that cannot be
> replicated in OAI-PMH since TAPIR is a generic search protocol. In some
> situations, and depending on how data providers are implemented, this can
> make TAPIR more efficient even in data harvesting scenarios. In OAI-PMH it
> may be necessary to send multiple requests to retrieve all data from a
> single record (when there are multiple metadata prefixes associated with
> the record). Also note that GBIF is using a name range query template for
> harvesting TAPIR providers - an approach developed over years of
> experience that seems to give the best performance for them. I'm not sure
> if GBIF could use a similar strategy for an OAI-PMH
> provider, i.e., retrieving approximately the same number of records in
> sequential requests using a custom filter that potentially forces the
> local database to use an index. In TAPIR this can be done with an
> inventory request (with "count" activated) and subsequent searches using a
> parameterized range filter guaranteed to return a certain number of
> records.
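> A sketch of that range-splitting idea (plain Python, with a made-up
> inventory list standing in for a real TAPIR inventory response):

```python
def harvest_by_name_ranges(names, page_size=3):
    """Split a sorted inventory of names into ranges of roughly
    page_size records; each (lower, upper) pair then becomes one
    search request with a 'name >= lower AND name <= upper' style
    filter, which the local database can answer from an index."""
    names = sorted(names)
    ranges = []
    for i in range(0, len(names), page_size):
        chunk = names[i:i + page_size]
        ranges.append((chunk[0], chunk[-1]))
    return ranges

# Hypothetical inventory of scientific names:
inventory = ["Abies alba", "Acer rubrum", "Betula pendula",
             "Picea abies", "Pinus nigra", "Quercus robur", "Ulmus minor"]
ranges = harvest_by_name_ranges(inventory)
```

> Each range is guaranteed to return a bounded number of records, so the
> harvester's requests stay uniformly sized.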
>
> I realize there may be other reasons to expose data using OAI-PMH (more
> available tools, or compatibility with other networks). In that case, I
> should point to this interesting piece of work, in which Kevin Richards
> implemented an OAI-PMH service on top of TAPIR in less than 50 lines of
> code:
>
> http://wiki.tdwg.org/twiki/bin/view/TAPIR/TapirOAIPMH
>
> Best Regards,
> --
> Renato
>
>
> Phil,
>
> TAPIR was intended to be a unification of DiGIR and BioCASE. There are a
> few implementations of providers but fewer instances of portals built on
> TAPIR. Networks built on DiGIR may eventually switch to TAPIR, but that
> remains to be seen.  DiGIR and BioCASE were designed for distributed
> queries, not really for harvesting.  I understand harvesting can be done
> more simply and efficiently by other approaches, such as OAI-PMH.  If the
> sensibilities of data providers evolve to accept and allow harvesting
> (which seems likely), we may see "networks" built on that architecture
> instead of distributed queries.
>
> If your only goal is to provide data to GBIF, I would suggest installing
> TAPIR (unless Tim Robertson tells you something else).  If you are
> concerned about providing data to other networks, like www.SERNEC.org,
> you'll need a DiGIR provider, too.  (Such is the nature of technical
> transition.)
>
> -Stan
>
> Stanley D. Blum, Ph.D.
> Research Information Manager
> California Academy of Sciences
> 875 Howard St.
> San Francisco,  CA
> +1 (415) 321-8183
>
>
> _______________________________________________
> tdwg-tapir mailing list
> tdwg-tapir at lists.tdwg.org
> http://lists.tdwg.org/mailman/listinfo/tdwg-tapir
>
>
>
>
>
>

