[tdwg-tapir] Tapir protocol - Harvest methods?
Just starting with Tapir/DiGIR - I have 2 questions:
* I would like to know whether the Tapir protocol is now preferred over DiGIR. We have a DiGIR implementation that we want to move away from, and we plan to bring up a Tapir one in its place. Is this normal, or do organizations run both so that older clients can keep harvesting?
* What is a good way to harvest data from Tapir and/or DiGIR? We want to do this internally to test our implementation before we open it up to the world. How can I do this? (We run Windows and Linux as clients.)
Thank you
Phil
--
Phil Cryer
Open Source Development
Missouri Botanical Garden
Phil,
Is the "DiGIR implementation that you want to move away from" just a DiGIR service? Or is it something else?
I would only keep a parallel DiGIR service if there are older clients that can only talk to it and for some reason (time/resources) can't be updated. I'm not sure if this is your case.
Also, when you said that you want to "test your implementation", did you mean that you want to test a TAPIR service, or is it some other application based on TAPIR? If you just want to test a TAPIR service, you could simply run TapirTester on it instead of developing your own harvester:
Note: If necessary, the existing tests can be improved. New ones can also be created (TapirTester is open source).
Hope this helps, -- Renato
On 28 Apr 2008 at 10:39, Phil Cryer wrote:
Just starting with Tapir/DiGIR - I have 2 questions:
- I would like to know if the Tapir protocol is the preferred method over DiGIR. We have a DiGIR implementation that we want to move away from, and bring up a Tapir one in its place. Is this normal, or do organizations run both to facilitate their older clients to do harvesting?
- What is a method to harvest data from Tapir, and/or DiGIR - we want to do this internally to test our implementation before we open up to the world, how can I do this (we run Windows and Linux as clients)
Thank you
Phil
Phil Cryer Open Source Development Missouri Botanical Garden
So we have DiGIR running at Mobot for Tropicos data, and clients hit it to harvest data. I was just wondering whether people are still deploying DiGIR at all, or just using Tapir by default. Tapir seems to have taken over from DiGIR, and I want to know if that's a 'standard' that we should follow.
For testing, yes, we're talking more about performance: making sure our network and server will handle X load. So I guess what I really want to know is how clients attach to a Tapir server and how they pull the data from us.
Sorry if this is such a newbie question, but I can't understand this aspect from the docs I've read.
Thanks for the reply!
Phil
-----Original Message----- From: tdwg-tapir-bounces@lists.tdwg.org [mailto:tdwg-tapir-bounces@lists.tdwg.org] On Behalf Of Renato De Giovanni Sent: Monday, April 28, 2008 1:54 PM To: tdwg-tapir@lists.tdwg.org Subject: Re: [tdwg-tapir] Tapir protocol - Harvest methods?
Phil,
Is the "DiGIR implementation that you want to move away from" just a DiGIR service? Or is it something else?
I would only keep a parallel DiGIR service if there are older clients that can only talk to it and for some reason (time/resources) can't be updated. I'm not sure if this is your case.
Also, when you said that you want to "test your implementation", did you mean that you want to test a TAPIR service, or is it some other application based on TAPIR? If you just want to test a TAPIR service, you could simply run TapirTester on it instead of developing your own harvester:
Note: If necessary, the existing tests can be improved. New ones can also be created (TapirTester is open source).
Hope this helps, -- Renato
_______________________________________________ tdwg-tapir mailing list tdwg-tapir@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-tapir
Phil,
TAPIR was intended to be a unification of DiGIR and BioCASE. There are a few implementations of providers but fewer instances of portals built on TAPIR. Networks built on DiGIR may eventually switch to TAPIR, but that remains to be seen. DiGIR and BioCASE were designed for distributed queries, not really harvesting. I understand harvesting can be done more simply and efficiently by other approaches, such as OAI-PMH. If the sensibilities of data providers evolve to accept and allow harvesting (which seems likely), we may see "networks" built on that architecture, instead of distributed queries.
If your only goal is to provide data to GBIF, I would suggest installing TAPIR (unless Tim Robertson tells you something else). If you are concerned about providing data to other networks, like www.SERNEC.org, you'll need a DiGIR provider, too. (Such is the nature of technical transition.)
-Stan
Stanley D. Blum, Ph.D. Research Information Manager California Academy of Sciences 875 Howard St. San Francisco, CA +1 (415) 321-8183
-----Original Message----- From: tdwg-tapir-bounces@lists.tdwg.org [mailto:tdwg-tapir-bounces@lists.tdwg.org] On Behalf Of Phil Cryer Sent: Monday, April 28, 2008 1:22 PM To: Renato De Giovanni; tdwg-tapir@lists.tdwg.org Subject: RE: [tdwg-tapir] Tapir protocol - Harvest methods?
So we have DiGIR running at Mobot for Tropicos data, and clients hit it to harvest data. I was just wondering if people are still deploying DiGIR at all, or are they just using Tapir by default? It seems to have taken over for DiGIR, and I want to know if that's a 'standard' that we should follow.
For testing, yes, we're talking more of performance; make sure our network and server will handle X load. So I guess I want to know more of, how do clients attach to a Tapir server, how do they pull the data from us?
Sorry if this is such a newbie question, but I can't understand this aspect from the docs I've read.
Thanks for the reply!
Phil
Phil, from the GBIF side it doesn't matter whether you use DiGIR or TAPIR. Both protocols are currently supported by the GBIF indexer. If you use TapirLink, simply mapping to DarwinCore is enough. For other TAPIRlite providers, please make sure your service works with the following two DarwinCore TAPIR templates found at TDWG:
http://rs.tdwg.org/tapir/cs/dwc/1.4/template/dwc_sci_name_range.xml
http://rs.tdwg.org/tapir/cs/dwc/1.4/template/dwc_unfiltered_search.xml
At GBIF we are currently also thinking about a much simpler provider software tailored for harvesting. That will reduce load on providers enormously while still supporting basic TAPIR capabilities for true distributed queries. We will keep this list informed once we have thought this through.
Markus
-- Markus Döring, Berlin Senior Software Developer GBIF Secretariat mdoering@gbif.org
On 28 Apr, 2008, at 23:02, Blum, Stan wrote:
Phil,
TAPIR was intended to be a unification of DiGIR and BioCASE. There are a few implementations of providers but fewer instances of portals built on TAPIR. Networks built on DiGIR may eventually switch to TAPIR, but that remains to be seen. DiGIR and BioCASE were designed for distributed queries, not really harvesting. I understand harvesting can be done more simply and efficiently by other approaches, such as OAI-PMH. If the sensibilities of data providers evolve to accept and allow harvesting (which seems likely), we may see "networks" built on that architecture, instead of distributed queries.
If your only goal is to provide data to GBIF, I would suggest installing TAPIR (unless Tim Robertson tells you something else). If you are concerned about providing data to other networks, like www.SERNEC.org, you'll need a DiGIR provider, too. (Such is the nature of technical transition.)
-Stan
Stanley D. Blum, Ph.D. Research Information Manager California Academy of Sciences 875 Howard St. San Francisco, CA +1 (415) 321-8183
Phil, having said that GBIF is happy with both DiGIR and TAPIR, I still wanted to raise one important issue: the DiGIR PHP code is rather old now and is no longer being maintained by anyone. I had problems myself getting it running with PHP5, whereas TapirLink installed in 2 minutes without any problem. So for new provider installations GBIF definitely recommends using TAPIR over DiGIR.
Markus
On 29 Apr, 2008, at 24:34, Markus Döring wrote:
Phil, from the GBIF side it doesn't matter whether you use DiGIR or TAPIR. Both protocols are currently supported by the GBIF indexer. If you use TapirLink simply mapping to DarwinCore is enough. For other TAPIRlite providers please make sure your service works with the 2 following DarwinCore TAPIR templates found at TDWG:
http://rs.tdwg.org/tapir/cs/dwc/1.4/template/dwc_sci_name_range.xml
http://rs.tdwg.org/tapir/cs/dwc/1.4/template/dwc_unfiltered_search.xml
At GBIF we are currently also thinking about a much simpler provider software tailored for harvesting. That will reduce load on providers enormously while still supporting basic TAPIR capabilities for true distributed queries. We will keep this list informed once we have thought this through.
Markus
-- Markus Döring, Berlin Senior Software Developer GBIF Secretariat mdoering@gbif.org
Hi Stan,
Just a few comments about TAPIR and OAI-PMH.
I'm not sure if there's any core functionality offered by OAI-PMH that cannot be easily replicated with TAPIR. The main ingredients would be:
* A short list of concepts, basically record identifier, record timestamp, set membership and deletion flag. These would be the main concepts associated with request parameters and filters.
* An extra list of concepts (or perhaps only one wrapper concept for XML content) that would be used to return the complete record representation in responses.
On the other hand, there are many features in TAPIR that cannot be replicated in OAI-PMH, since TAPIR is a generic search protocol. In some situations, and depending on how data providers are implemented, this can make TAPIR more efficient even in data harvesting scenarios. In OAI-PMH it may be necessary to send multiple requests to retrieve all data from a single record (in case there are multiple metadata prefixes associated with the record). Also note that GBIF is using a name range query template for harvesting TAPIR providers - this approach was developed after years of experience and seems to give the best performance for them. I'm not sure if GBIF could use a similar strategy for an OAI-PMH provider, i.e., retrieving approximately the same number of records in sequential requests using a custom filter that potentially forces the local database to use an index. In TAPIR this can be done with an inventory request (with "count" activated) and subsequent searches using a parameterized range filter guaranteed to return a certain number of records.
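To make the inventory-plus-range idea concrete, here is a minimal sketch in Python. It is not GBIF's actual harvester code. It assumes the inventory response has already been parsed into a sorted list of (scientific name, record count) pairs, and the function name `plan_ranges` and the sample names are made up for illustration. It groups consecutive names into ranges that each hold roughly a target number of records, so that each range can then be fetched with one range-filtered search.

```python
from typing import List, Tuple


def plan_ranges(counts: List[Tuple[str, int]], target: int) -> List[Tuple[str, str, int]]:
    """Group a sorted (name, count) inventory into contiguous name ranges.

    Each range is closed as soon as it holds at least `target` records;
    the last range keeps whatever is left over.
    Returns (lower_name, upper_name, record_count) tuples.
    """
    ranges = []
    lower = None
    upper = None
    total = 0
    for name, count in counts:
        if lower is None:
            lower = name  # first name of a new range
        upper = name
        total += count
        if total >= target:
            ranges.append((lower, upper, total))
            lower, total = None, 0
    if lower is not None:
        ranges.append((lower, upper, total))  # leftover records
    return ranges


# Hypothetical inventory, as it might look after parsing an inventory
# response with "count" activated.
inventory = [
    ("Abies alba", 40),
    ("Acer rubrum", 70),
    ("Betula nana", 30),
    ("Carex nigra", 90),
    ("Zea mays", 10),
]
batches = plan_ranges(inventory, target=100)
for lower, upper, n in batches:
    print(f"search names from {lower!r} to {upper!r}: {n} records")
```

Because the bounds are values of a single concept, the provider's database can usually answer each range with an index lookup, which is the efficiency point made above.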
I realize there may be other reasons to expose data using OAI-PMH (more available tools or compatibility with other networks). In this case, I should point to this interesting work, where in the end Kevin Richards implemented an OAI-PMH service on top of TAPIR using fewer than 50 lines of code:
http://wiki.tdwg.org/twiki/bin/view/TAPIR/TapirOAIPMH
Best Regards, -- Renato
Phil,
I guess you wanted something like a graphic representation showing the number of providers/records using TAPIR versus DiGIR over time. I also wanted to see this. The closest thing we can probably do now is to query GBIF's UDDI registry. Today the figures are:
* 64 TAPIR providers.
* 190 DiGIR providers.
* 157 BioCASe providers.
But I have no idea how this is changing over time, and I don't know what happens when a registered DiGIR provider switches to TAPIR (whether the old record gets deleted in the registry or not). I also don't know if all providers there are active, and we should certainly consider that not all existing providers (DiGIR or TAPIR) are registered there.
I cannot speak for the other networks, but I know that the speciesLink network is planning to migrate all local providers to TAPIR. Not sure when exactly this is going to happen.
There are many ways you can harvest data from a TAPIR service. Markus mentioned how GBIF is doing this. The easiest way is probably using filterless search requests with TAPIR paging.
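As a rough illustration of filterless paging, the Python sketch below only builds the sequence of request URLs; it does not contact any server. The endpoint URL is hypothetical, and the key-value parameter names (`op`, `template`, `start`, `limit`) are an assumption based on my reading of TAPIR key-value requests, so please check them against the protocol specification and your provider's capabilities response. The template is the unfiltered search template Markus mentioned earlier in the thread. The total record count is assumed to be known already, for example from a counted request.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical provider endpoint; replace with your own access point.
ENDPOINT = "http://example.org/tapirlink/tapir.php/demo"
# Unfiltered search template cited earlier in this thread.
TEMPLATE = "http://rs.tdwg.org/tapir/cs/dwc/1.4/template/dwc_unfiltered_search.xml"


def paged_search_urls(endpoint, template, total, page_size):
    """Yield one search URL per page, covering `total` records.

    Parameter names are assumed, not taken from this thread.
    """
    for start in range(0, total, page_size):
        params = {
            "op": "search",
            "template": template,
            "start": start,
            "limit": page_size,
        }
        yield endpoint + "?" + urlencode(params)


urls = list(paged_search_urls(ENDPOINT, TEMPLATE, total=250, page_size=100))
first_query = parse_qs(urlsplit(urls[0]).query)
last_query = parse_qs(urlsplit(urls[-1]).query)
for url in urls:
    print(url)
```

A load test could then fetch these URLs sequentially (or with a few parallel workers) from the Windows and Linux clients and time the responses.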
Best Regards, -- Renato
Hi Renato and all,
The issue of harvesting isn't really a protocol one. In order to have an efficient harvesting strategy (i.e., to do incremental harvests), data suppliers need to:
* uniquely identify objects (records or items or whatever)
* keep track of when they change these items
My understanding is that GBIF (and I guess other indexers) have to completely re-index the majority of data sources because these two things are not implemented consistently, or at all, by many of the suppliers. GBIF are now running out of resources and can't keep re-indexing every record every time. This is especially ironic as most of the records are from archives where the data rarely changes. It also means that data from the GBIF cache isn't comparable over time. If a data set is dropped and replaced by a new version with subtly different data points, the consumer can't know if the different data points are additions or corrections to the old data points.
The TAPIR protocol does not require records to have ids and modification dates. There is no reason for it to do so. The protocol may even be useful in applications where one positively does not want to enforce this.
If data providers who do implement TAPIR do supply ids and modification dates in a uniform way then it would be possible to incrementally harvest from them. It might even be possible to layer the OAI-PMH protocol over the top of TAPIR to make it more generic - as Kevin's work shows.
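A minimal sketch of what incremental harvesting looks like on the consumer side once every record carries an id and a modification date. This is Python written for illustration only: the `Record` class, the `urn:example:` identifiers and the in-memory "provider" list are all hypothetical stand-ins, and in practice the date filter would be sent to the provider rather than applied locally.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Record:
    guid: str           # stable identifier supplied by the provider
    modified: datetime  # date the provider last changed the record
    data: str


def changed_since(records: Iterable[Record], last_harvest: datetime) -> List[Record]:
    """Return only the records modified after the previous harvest."""
    return [r for r in records if r.modified > last_harvest]


def apply_changes(cache: Dict[str, Record], changes: Iterable[Record]) -> None:
    """Insert or overwrite cached records, keyed by their stable id."""
    for record in changes:
        cache[record.guid] = record


# Hypothetical provider content at harvest time.
provider = [
    Record("urn:example:1", datetime(2008, 3, 1), "Quercus alba"),
    Record("urn:example:2", datetime(2008, 4, 15), "Quercus rubra L."),
    Record("urn:example:3", datetime(2008, 2, 10), "Acer rubrum"),
]
# Harvester's cache as left by the previous harvest.
cache = {
    "urn:example:1": provider[0],
    "urn:example:2": Record("urn:example:2", datetime(2008, 1, 5), "Quercus rubra"),
    "urn:example:3": provider[2],
}
last_harvest = datetime(2008, 4, 1)

changes = changed_since(provider, last_harvest)
apply_changes(cache, changes)
print(f"{len(changes)} of {len(provider)} records re-indexed")
```

Without the id, the harvester cannot tell which cached record a changed one replaces; without the date, it has to fetch everything again.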
If TAPIR data sources don't supply ids and modification dates or they don't supply them in a "standard" way then efficient incremental harvesting is near enough impossible. One would have to do an inventory call where all the records began with "A" then with "B" etc.
OAI-PMH mandates the notions of ids (indeed GUIDs) and modification dates but obviously doesn't have a notion of search/query at all.
My belief/opinion is that the primary purpose of many people exposing data is to get it indexed (harvested) by GBIF. "Just" supplying data through TAPIR for this purpose does not make GBIF's job easy or scalable. Providers should also supply GUIDs and modification dates. If they supply the GUIDs and modification dates, the protocol is not so important - RSS or Atom anyone?
I would go so far as saying that if data providers can't supply these two pieces of information they shouldn't expose their data as they are just polluting the global data pool - but that would probably be me saying way too much just to be provocative!
Hope my ranting is informative,
All the best,
Roger
Roger Hyam Roger@BiodiversityCollectionsIndex.org http://www.BiodiversityCollectionsIndex.org
Royal Botanic Garden Edinburgh 20A Inverleith Row, Edinburgh, EH3 5LR, UK Tel: +44 131 552 7171 ext 3015 Fax: +44 131 248 2901 http://www.rbge.org.uk/
On 30 Apr 2008, at 13:58, Renato De Giovanni wrote:
Hi Stan,
Just a few comments about TAPIR and OAI-PMH.
I'm not sure if there's any core functionality offered by OAI-PMH that cannot be easily replicated with TAPIR. The main ingredients would be:
- A short list of concepts, basically record identifier, record timestamp, set membership and deletion flag. These would be the main concepts associated with request parameters and filters.
- An extra list of concepts (or perhaps only one wrapper concept for XML content) that would be used to return the complete record representation in responses.
On the other hand, there are many functionalities in TAPIR that cannot be replicated in OAI-PMH since TAPIR is a generic search protocol. In some situations, and depending on how data providers are implemented, this can make TAPIR more efficient even in data harvesting scenarios. In OAI-PMH it may be necessary to send multiple requests to retrieve all data from a single record (in case there are multiple metadata prefixes associated with the record). Also note that GBIF is using a name range query template for harvesting TAPIR providers - this approach has been created after years of experience and seems to give the best performance for them. I'm not sure if GBIF could use a similar strategy for an OAI-PMH provider, i.e., retrieving approximately the same number of records in sequential requests using a custom filter that potentially forces the local database to use an index. In TAPIR this can be done with an inventory request (with "count" activated) and subsequent searches using a parameterized range filter guaranteed to return a certain number of records.
I realize there may be other reasons to expose data using OAI-PMH (more available tools or compatibility with other networks). In this case, I should point to this interesting work where in the end Kevin Richards implemented an OAI-PMH service on top of TAPIR using less than 50 lines of code:
http://wiki.tdwg.org/twiki/bin/view/TAPIR/TapirOAIPMH
Best Regards,
Renato
Interesting. Indeed, a stable identifier is vital for many things. So is the date last modified for incremental harvesting (using whatever protocol, as Roger explained).
And that is why I want to continue some of WASABI's ideas of having a data cache on the *provider* side. The provider software fills this cache from the live db any time the provider wants to publish their data, and the date last modified gets calculated per record. GUIDs can also be assigned in this process, based on stable local IDs. And from this cache different protocols, including TAPIRlite and OAI-PMH, can easily be served. At GBIF we would even like to go further and create "local index files" (our current working title for this) for very efficient harvesting, which can be downloaded as a single static compressed file - much like Google uses sitemaps for indexing. I am currently preparing a document on this with Tim Robertson, and we will be happy to hear your thoughts on it in a few weeks.
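One way such a provider-side cache could work is sketched below in Python. This is only an illustration under stated assumptions, not the WASABI or GBIF design: the date last modified is derived by comparing a content hash of each live row against the cached copy, GUIDs are name-based UUIDs built from the stable local id, and the namespace `example.org`, the function names and the sample rows are all hypothetical.

```python
import hashlib
import uuid
from datetime import datetime, timezone

# Hypothetical namespace for this provider; any stable value would do.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.org")


def make_guid(local_id: str) -> str:
    """Derive a GUID deterministically from a stable local id."""
    return str(uuid.uuid5(NAMESPACE, local_id))


def refresh_cache(cache: dict, live_rows: dict, now: datetime) -> dict:
    """Copy live rows into the cache, stamping `now` only on changed rows.

    `live_rows` maps local id -> row (a dict of field name -> value).
    A row counts as changed when its content hash differs from the cache.
    """
    for local_id, row in live_rows.items():
        guid = make_guid(local_id)
        digest = hashlib.sha256(repr(sorted(row.items())).encode("utf-8")).hexdigest()
        entry = cache.get(guid)
        if entry is None or entry["hash"] != digest:
            cache[guid] = {"hash": digest, "modified": now, "row": dict(row)}
    return cache


cache = {}
t1 = datetime(2008, 4, 1, tzinfo=timezone.utc)
t2 = datetime(2008, 4, 30, tzinfo=timezone.utc)

# First publication: both rows are new, so both are stamped t1.
refresh_cache(cache, {"1": {"name": "Acer rubrum"}, "2": {"name": "Quercus alba"}}, t1)
# Second publication: only row 2 changed, so only it is stamped t2.
refresh_cache(cache, {"1": {"name": "Acer rubrum"}, "2": {"name": "Quercus alba L."}}, t2)

for guid, entry in cache.items():
    print(guid, entry["modified"].date(), entry["row"]["name"])
```

The sketch does not handle rows deleted from the live db; a real cache would need a deletion flag as well.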
Markus
On 30 Apr, 2008, at 15:55, Roger Hyam (TDWG) wrote:
Hi Renato and all,
The issue of harvesting isn't really a protocol one. In order to be able to have an efficient harvesting strategy (i.e. do incremental harvests) data suppliers need:
* uniquely identify objects (records or items or whatever)
* keep track of when they change these items
My understanding is that GBIF (and I guess other indexers) have to completely re-index the majority of data sources because these two things are not implemented consistently or at all by many of the suppliers. GBIF are now running out of resources and can't keep re-indexing every record every time. This is especially ironic as most of the records are from archives where the data rarely changes. It also means that data from the GBIF cache isn't comparable over time. If a data set is dropped and replaced by a new version with subtly different data points the consumer can't know if the different data points are additions or corrections to the old data points.
The TAPIR protocol does not require records to have ids and modification dates. There is no reason for it to do so. The protocol may even be useful in applications where one positively does not want to enforce this.
If data providers who do implement TAPIR do supply ids and modification dates in a uniform way then it would be possible to incrementally harvest from them. It might even be possible to layer the OAI-PMH protocol over the top of TAPIR to make it more generic - as Kevin's work shows.
If TAPIR data sources don't supply ids and modification dates or they don't supply them in a "standard" way then efficient incremental harvesting is near enough impossible. One would have to do an inventory call where all the records began with "A" then with "B" etc.
OAI-PMH mandates the notions of ids (indeed GUIDs) and modification dates but obviously doesn't have a notion of search/query at all.
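For comparison, a selective OAI-PMH harvest is just an HTTP request with a date argument. The sketch below builds a ListRecords URL in Python; the verb and the metadataPrefix, from and set arguments come from the OAI-PMH specification, while the base URL and the function name are made up for illustration.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix, from_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL.

    from_date restricts the response to records changed on or after
    that datestamp; set_spec restricts it to one set.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)
```

For example, list_records_url("http://example.org/oai", "oai_dc", from_date="2008-04-01") asks a (hypothetical) repository for everything changed since 1 April 2008.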
My belief/opinion is that the primary purpose of many people exposing data is to get it indexed (harvested) by GBIF. "Just" supplying data through TAPIR for this purpose does not make GBIF's job easy or scalable. Providers should also supply GUIDs and modification dates. If they supply the GUIDs and modification dates the protocol is not so important - RSS or Atom anyone?
I would go so far as saying that if data providers can't supply these two pieces of information they shouldn't expose their data as they are just polluting the global data pool - but that would probably be me saying way too much just to be provocative!
Hope my ranting is informative,
All the best,
Roger
Roger Hyam Roger@BiodiversityCollectionsIndex.org http://www.BiodiversityCollectionsIndex.org
Royal Botanic Garden Edinburgh 20A Inverleith Row, Edinburgh, EH3 5LR, UK Tel: +44 131 552 7171 ext 3015 Fax: +44 131 248 2901 http://www.rbge.org.uk/
On 30 Apr 2008, at 13:58, Renato De Giovanni wrote:
Hi Stan,
Just a few comments about TAPIR and OAI-PMH.
I'm not sure if there's any core functionality offered by OAI-PMH that cannot be easily replicated with TAPIR. The main ingredients would be:
- A short list of concepts, basically record identifier, record timestamp, set membership and deletion flag. These would be the main concepts associated with request parameters and filters.
- An extra list of concepts (or perhaps only one wrapper concept for XML content) that would be used to return the complete record representation in responses.
On the other hand, there are many functionalities in TAPIR that cannot be replicated in OAI-PMH since TAPIR is a generic search protocol. In some situations, and depending on how data providers are implemented, this can make TAPIR more efficient even in data harvesting scenarios. In OAI-PMH it may be necessary to send multiple requests to retrieve all data from a single record (in case there are multiple metadata prefixes associated with the record). Also note that GBIF is using a name range query template for harvesting TAPIR providers - this approach has been created after years of experience and seems to give the best performance for them. I'm not sure if GBIF could use a similar strategy for an OAI-PMH provider, i.e., retrieving approximately the same number of records in sequential requests using a custom filter that potentially forces the local database to use an index. In TAPIR this can be done with an inventory request (with "count" activated) and subsequent searches using a parameterized range filter guaranteed to return a certain number of records.
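The range-planning step can be sketched roughly as follows in Python. It assumes the inventory response has already been parsed into (name, count) pairs sorted by name; the function name and the sample names are invented, and this is not GBIF's actual harvesting code, only an approximation of the idea.

```python
def name_ranges(inventory, target):
    """Group a sorted name inventory into ranges of roughly target records.

    inventory: list of (scientific_name, record_count) sorted by name.
    Returns a list of (lower_name, upper_name, total_records); each
    range is inclusive and closes once it holds at least target records.
    """
    ranges = []
    lower = upper = None
    total = 0
    for name, count in inventory:
        if lower is None:
            lower = name  # first name of a new range
        upper = name
        total += count
        if total >= target:
            ranges.append((lower, upper, total))
            lower, total = None, 0
    if lower is not None:  # leftover names that did not fill a range
        ranges.append((lower, upper, total))
    return ranges
```

Each (lower, upper) pair would then be passed as the parameters of one range-filtered search, so every request returns a similar number of records and the provider's database can use its name index.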
I realize there may be other reasons to expose data using OAI-PMH (more available tools or compatibility with other networks). In this case, I should point to this interesting work where in the end Kevin Richards implemented an OAI-PMH service on top of TAPIR using less than 50 lines of code:
http://wiki.tdwg.org/twiki/bin/view/TAPIR/TapirOAIPMH
Best Regards,
Renato
Phil,
TAPIR was intended to be a unification of DiGIR and BioCASE. There are a few implementations of providers but fewer instances of portals built on TAPIR. Networks built on DiGIR may eventually switch to TAPIR, but that remains to be seen. DiGIR and BioCASE were designed for distributed queries, not really harvesting. I understand harvesting can be done more simply and efficiently by other approaches, such as OAI-PMH. If the sensibilities of data providers evolve to accept and allow harvesting (which seems likely), we may see "networks" built on that architecture, instead of distributed queries.
If your only goal is to provide data to GBIF, I would suggest installing TAPIR (unless Tim Robertson tells you something else). If you are concerned about providing data to other networks, like www.SERNEC.org, you'll need a DiGIR provider, too. (Such is the nature of technical transition.)
-Stan
Stanley D. Blum, Ph.D. Research Information Manager California Academy of Sciences 875 Howard St. San Francisco, CA +1 (415) 321-8183
tdwg-tapir mailing list tdwg-tapir@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-tapir
Markus,
Martin and I were just having a conversation along these lines.
The trouble is with the notion of whether a "record" has changed when the internal database may not have the same notion of a record as the wrapper software. i.e. the record visible to the outside is the result of a query and the only way to know whether the data in a particular query result row has changed is to run the query again.
Our thoughts were along the lines of doing a hash of the query results so that you can just run a periodic exhaustive crawl of the data locally and update your local cache with changes but I guess you could just use a serialization of the object as the hash.
Your local cache would only need to contain three fields: object_id, last_mod, object_serialization and could be very generic. An adaptor to the client database would just have to run a query to generate the serialized objects. I wrote an implementation of OAI-PMH on top of a table like this and it worked really easily. The problem is still the adaptor to the client database. Someone has to do the mapping and they could use TAPIR to do that...
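A table like this is small enough to sketch. The Python below uses SQLite with the three field names from the message; the table name, the function names and the use of ISO 8601 strings for last_mod are assumptions, and this is not the OAI-PMH implementation mentioned above.

```python
import sqlite3


def open_cache(path=":memory:"):
    """Open (and create if needed) the three-field cache table."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS cache ("
        "object_id TEXT PRIMARY KEY, "
        "last_mod TEXT NOT NULL, "
        "object_serialization TEXT NOT NULL)"
    )
    return db


def refresh(db, object_id, serialization, now):
    """Store one serialized object; return True if it was new or changed.

    last_mod is only touched when the serialization differs from the
    cached one, so an exhaustive local crawl leaves unchanged rows alone.
    """
    row = db.execute(
        "SELECT object_serialization FROM cache WHERE object_id = ?",
        (object_id,),
    ).fetchone()
    if row is not None and row[0] == serialization:
        return False
    db.execute(
        "INSERT OR REPLACE INTO cache "
        "(object_id, last_mod, object_serialization) VALUES (?, ?, ?)",
        (object_id, now, serialization),
    )
    db.commit()
    return True


def modified_since(db, since):
    """Return ids of objects whose last_mod is later than since."""
    rows = db.execute(
        "SELECT object_id FROM cache WHERE last_mod > ? ORDER BY object_id",
        (since,),
    )
    return [r[0] for r in rows]
```

The adaptor to the client database stays outside this sketch: it only has to produce (object_id, serialization) pairs and call refresh for each one.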
At this point the coffee was drunk and we had to stop...
All the best with it,
Roger
Roger Hyam Roger@BiodiversityCollectionsIndex.org http://www.BiodiversityCollectionsIndex.org
Royal Botanic Garden Edinburgh 20A Inverleith Row, Edinburgh, EH3 5LR, UK Tel: +44 131 552 7171 ext 3015 Fax: +44 131 248 2901 http://www.rbge.org.uk/
On 30 Apr 2008, at 15:38, Markus Döring wrote:
Interesting. Indeed, a stable identifier is vital for many things. So is date last modified for incremental harvesting (using whatever protocol, as Roger explained).
Hi Markus,
As you indicate, the provider-side data cache, one that reflects actual changes in content according to an agreed data model (and so provides a true date last modified), is crucial to efficient propagation of content through the various networks. Such a cache, if implemented properly, can also provide an effective basis for both push and pull models of data transfer. Indeed, some data providers may implement a mechanism where they allow other providers to push content to their cache, thus enabling those with limited connectivity or expertise for running a server to contribute to a network. In such a model, the only really important pieces of information (for synchronization) are a unique identifier for each record and timestamps indicating when the object was created and last modified. Provenance metadata should also be captured unless the intended outcome is an entirely anonymous network. Such a push+pull approach is being implemented for the fishnet network, and results thus far have been satisfying.
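A minimal sketch of the synchronization rule this implies, in Python: whether content is pushed or pulled, a record replaces the cached copy only if its last-modified timestamp is newer. The record layout ("created", "modified", "source") and the function name are assumptions for illustration and do not describe the fishnet implementation.

```python
def merge_into_cache(cache, incoming):
    """Merge pushed or pulled records into a cache keyed by record id.

    Each record is a dict with "created", "modified" (ISO 8601 strings
    in one consistent format) and "source" (provenance). The copy with
    the later "modified" timestamp wins.
    Returns the ids that were added or replaced.
    """
    updated = []
    for record_id, record in incoming.items():
        cached = cache.get(record_id)
        if cached is None or record["modified"] > cached["modified"]:
            cache[record_id] = record
            updated.append(record_id)
    return sorted(updated)
```

Because the rule only looks at the id and the timestamps, the same function serves a provider pushing to a partner's cache and a harvester pulling from it.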
regards, Dave V.
Hi Roger,
Your summation is correct - some source database implementations fail to record date last modified for individual records of the data model being exposed by the data provider (e.g. the Darwin Core model over DiGIR, TAPIR, WASABI, etc.), and so the only effective mechanism for change detection is to record a hash of a normalized form (ensuring field order and formatting consistency) of the content of each record. The actual data from the record can be captured in the cache as well, or the cache interface can simply reconstruct the records from the source database. The important part of the process is rendering content according to the model being exposed to the networks, and recording the hash and time stamps, to enable content change tracking.
I've found this process to be very efficient, processing a change detection across an entire dataset of about 200k records in only a couple of minutes without any particular optimization implemented. Too slow for realtime access, but by caching the change information, a data provider could easily be modified to enable proper change detection to support the requirements of protocols such as OAI-PMH.
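The hash-based change detection described above might look roughly like this in Python. The normalization rules (fields sorted by name, whitespace collapsed) and the choice of SHA-256 are assumptions; the message does not specify either, and the function names are hypothetical.

```python
import hashlib


def normalized_hash(record):
    """Hash a record after normalizing field order and whitespace."""
    parts = []
    for key in sorted(record):
        value = record[key]
        text = "" if value is None else " ".join(str(value).split())
        parts.append(f"{key}={text}")
    return hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()


def detect_changes(records, known_hashes):
    """Compare current records with the hashes from the previous run.

    records: dict of record id -> dict of field values.
    known_hashes: dict of record id -> hash stored last time.
    Returns (added, changed, deleted, current_hashes).
    """
    current = {rid: normalized_hash(rec) for rid, rec in records.items()}
    added = sorted(rid for rid in current if rid not in known_hashes)
    changed = sorted(
        rid for rid in current
        if rid in known_hashes and known_hashes[rid] != current[rid]
    )
    deleted = sorted(rid for rid in known_hashes if rid not in current)
    return added, changed, deleted, current
```

Only the ids in added and changed need a new last-modified timestamp; storing current_hashes for the next run is what makes the periodic crawl cheap.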
regards, Dave V.
On Mon, 2008-04-28 at 17:34 -0500, Markus Döring wrote:
Phil, from the GBIF side it doesn't matter whether you use DiGIR or TAPIR. Both protocols are currently supported by the GBIF indexer. If you use TapirLink, simply mapping to DarwinCore is enough. For other TAPIRlite providers, please make sure your service works with the following 2 DarwinCore TAPIR templates found at TDWG:
http://rs.tdwg.org/tapir/cs/dwc/1.4/template/dwc_sci_name_range.xml
http://rs.tdwg.org/tapir/cs/dwc/1.4/template/dwc_unfiltered_search.xml
Markus, I've gotten DiGIR back in line and will start tracking it to see what kind of usage we're experiencing. After that I want to bring up Tapir, mapping out data via ABCD, and then I will speak to you so we can determine whether I have things configured most efficiently. I'm interested in how we can have the harvester pull only the latest data... I'll think about that.
Phil
At GBIF we are currently also thinking about a much simpler provider software tailored for harvesting. That will reduce load on providers enormously while still supporting basic TAPIR capabilities for true distributed queries. We will keep this list informed once we have thought this through.
Markus
-- Markus Döring, Berlin Senior Software Developer GBIF Secretariat mdoering@gbif.org
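On the question of how clients actually pull data from a TAPIR provider: a client sends HTTP requests to the provider's access point, and in the key-value-pair form these are ordinary GET URLs. The Python sketch below builds a ping request and a paged series of search requests using one of the templates listed above. The access point URL is hypothetical, and the parameter names (op, template, start, limit) are written from memory of the TAPIR KVP binding, so they should be checked against the specification before being relied on.

```python
from urllib.parse import urlencode

# Template URL quoted in the message above.
UNFILTERED_SEARCH = (
    "http://rs.tdwg.org/tapir/cs/dwc/1.4/template/dwc_unfiltered_search.xml"
)


def tapir_url(access_point, op, **params):
    """Build a KVP-style GET URL for one TAPIR operation."""
    query = {"op": op}
    query.update(params)
    return access_point + "?" + urlencode(query)


def paged_search_urls(access_point, template, total, page_size):
    """Build one search URL per page, assuming a zero-based start index."""
    return [
        tapir_url(access_point, "search",
                  template=template, start=start, limit=page_size)
        for start in range(0, total, page_size)
    ]
```

For an internal load test, the resulting URLs could be fetched in a loop from a Windows or Linux client with any HTTP library; the number of records per page that a provider will accept is configuration dependent.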
-----Original Message----- From: tdwg-tapir-bounces@lists.tdwg.org [mailto:tdwg-tapir-bounces@lists.tdwg.org] On Behalf Of Phil Cryer Sent: Monday, April 28, 2008 1:22 PM To: Renato De Giovanni; tdwg-tapir@lists.tdwg.org Subject: RE: [tdwg-tapir] Tapir protocol - Harvest methods?
So we have DiGIR running at Mobot for Tropicos data, and clients hit it to harvest data. I was just wondering if people are still deploying DiGIR at all, or are they just using Tapir by default? It seems to have taken over for DiGIR, and I want to know if that's a 'standard' that we should follow.
For testing, yes, we're talking more of performance; make sure our network and server will handle X load. So I guess I want to know more of, how do clients attach to a Tapir server, how do they pull the data from us?
Sorry if this is such a newbie question, but I can't understand this aspect from the docs I've read.
Thanks for the reply!
Phil
Thanks Renato, this does help. For now I need to get DiGIR back up so current users can hit it. I think I'm at that point now, but I want to be able to hit it like the users do. What would I do? Just a simple query within the web client window? I assume that does the same thing... still new to this.
As for Tapir, I have it set up and am mapping fields now, and will hit the tester soon.
Thanks
Phil
Phil,
DiGIR and TAPIR are generic query protocols, so if you don't know all your clients you could probably check the corresponding log directory to see what kind of requests you're receiving. In the case of TapirLink, there's also a web interface where you can find provider query statistics (statistics tracking is enabled by default).
Best Regards, -- Renato
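One way to act on the suggestion of checking the logs is to tally the operations that clients request. The Python sketch below does this for TAPIR GET requests, under the assumption that the requests show up in a web server access log whose lines contain the quoted request line (as in the common Apache log format). The log layout and the `op` parameter name are assumptions; the provider's own log files may be structured differently.

```python
import re
from collections import Counter
from urllib.parse import parse_qs, urlsplit

# Matches the quoted request line of a common-format access log entry.
REQUEST_RE = re.compile(r'"(?:GET|POST) (\S+) HTTP/[\d.]+"')


def count_operations(log_lines):
    """Count requests per value of the 'op' query parameter."""
    counts = Counter()
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if not match:
            continue  # not a request line we recognise
        query = parse_qs(urlsplit(match.group(1)).query)
        op = query.get("op", ["(none)"])[0]
        counts[op] += 1
    return counts
```

Requests sent as XML in a POST body carry no `op` parameter in the URL and end up under "(none)" here.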
On 1 May 2008 at 16:19, Phil Cryer wrote:
Thanks Renato, this does help. For now I need to get DiGIR back up so current users can hit it, I think I'm at that point now, but I want to be able to hit it like the users do. What would I do? Just a simple query within the web client window? I assume that does the same thing...still new to this.
As for Tapir, I have it setup and am mapping fields now, will hit the tester soon.
Thanks
Phil
Phil, incremental harvesting is not implemented on the GBIF side as far as I am aware. And I don't think that will be a simple thing to implement on the current system. Also, even if we can detect only the records changed since the last harvesting via dateLastModified, we still have no information about deletions. We could have an arrangement saying that you keep deleted records as empty records with just the ID and nothing else (I vaguely remember LSIDs were supposed to work like this too). But that also needs to be supported on your side then, never entirely removing any record. I will have a discussion with the others at GBIF about that.
Markus
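The arrangement described above (changed records detected by dateLastModified, deleted records kept as empty records) can be sketched in Python as follows. The field names and the dict-based cache are illustrative assumptions, not an agreed GBIF format. The sketch also assumes that an "empty" record still carries its dateLastModified, since otherwise a date filter would never return it to the harvester.

```python
# Fields that identify a record but carry no content of their own (assumed names).
BOOKKEEPING_FIELDS = {"id", "dateLastModified"}


def changed_since(records, last_harvest):
    """Select records modified after last_harvest (ISO dates compare as strings)."""
    return [r for r in records if r.get("dateLastModified", "") > last_harvest]


def apply_incremental(cache, changed_records):
    """Update cache (dict of id -> record) from a list of changed records.

    A record with no content beyond its bookkeeping fields is treated as a
    tombstone and removed from the cache.
    """
    for record in changed_records:
        has_content = any(
            value not in (None, "")
            for key, value in record.items()
            if key not in BOOKKEEPING_FIELDS
        )
        if has_content:
            cache[record["id"]] = record
        else:
            cache.pop(record["id"], None)
    return cache
```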
Hi Markus,
I would suggest creating new concepts for incremental harvesting, either in the data standards themselves or in some new extension. In the case of TAPIR, GBIF could easily check the mapped concepts before deciding between incremental or full harvesting.
Actually it could be just one new concept such as "recordStatus" or "deletionFlag". Or perhaps you might also want to create your own definition for dateLastModified, indicating which set of concepts should be considered when deciding whether something has changed, but I guess this level of granularity would be difficult to support.
Regards, -- Renato
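The idea of checking the mapped concepts before choosing a harvesting mode could look roughly like the Python sketch below. The concept names come from the suggestion above; the concept identifiers in the usage example and the rule of matching on the last path segment are hypothetical. In practice the list of mapped concepts would come from the provider's capabilities response.

```python
# Concepts a provider would need to map before incremental harvesting is safe.
# "recordStatus" is only a proposed name, not part of any published standard.
REQUIRED_FOR_INCREMENTAL = {"dateLastModified", "recordStatus"}


def choose_harvest_mode(mapped_concepts):
    """Return 'incremental' if all required concepts are mapped, else 'full'."""
    local_names = {concept.rsplit("/", 1)[-1] for concept in mapped_concepts}
    if REQUIRED_FOR_INCREMENTAL <= local_names:
        return "incremental"
    return "full"
```

For example, a provider that maps only `http://example.org/concepts/dateLastModified` would still be harvested in full, because deletions could not be detected.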
Hi Markus,
Without the notion of incremental harvesting there is little notion of harvesting at all I think. The supplier may as well burn a table to CD and mail it to you (or gzip it or something). The EML model of having a data set (table) bound to a descriptive file is more appropriate than an online data provider one.
If supplier 'A' provides 10,000 records this week, then replaces them with 10,001 next week and with 9,999 the week after, how many records do we have from the point of view of a data consumer? 30k, or just 3 (with ~10k data points in each)? It is a very different way to look at the data from the original specimen-based one that we started with. If the data represents an entomological collection it seems crazy (we are not replacing the specimens each week); if it represents bird sightings it seems sensible (these may be different studies and are not replacements but separate data sets).
Are we trying to combine two kinds of data that don't fit together very well?
I keep coming back to the need to know how people will use the data...
All the best,
Roger
I think I agree here.
The harvesting "procedure" is really defined outside the Tapir protocol, is it not? So it is really an agreement between the harvester and the harvestees.
So what is really needed here is the standard procedure for maintaining a "harvestable" dataset and the standard procedure for harvesting that dataset. We have a general rule at Landcare that we never delete records in our datasets: they are either deprecated in favour of another record, so that the resolution of that record would point to the new record, or they are set to a state of "deleted" but are still kept in the dataset and can be resolved (which would indicate a state of deleted).
Kevin
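A minimal Python sketch of the never-delete rule described above: records are kept forever and carry a state, and resolving a deprecated record follows the pointer to its replacement. The state names and the `replaced_by` field are hypothetical and not taken from Landcare's actual schema.

```python
def resolve(records, record_id):
    """Follow deprecation pointers and return (final_id, final_state).

    records maps id -> {"state": ..., "replaced_by": ...}; a record is
    'active', 'deleted', or 'deprecated' (with 'replaced_by' set).
    """
    visited = set()
    current = record_id
    while records[current]["state"] == "deprecated":
        if current in visited:
            raise ValueError("deprecation cycle at " + current)
        visited.add(current)
        current = records[current]["replaced_by"]
    return current, records[current]["state"]
```

A harvester working against such a dataset never has to infer deletions from missing records: every identifier it has ever seen still resolves to something.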
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ WARNING: This email and any attachments may be confidential and/or privileged. They are intended for the addressee only and are not to be read, used, copied or disseminated by anyone receiving them in error. If you are not the intended recipient, please notify the sender by return email and delete this message and any attachments.
The views expressed in this email are those of the sender and do not necessarily reflect the official views of Landcare Research. http://www.landcareresearch.co.nz ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
at berkeley we've recently prototyped a simple php program that uses an existing tapirlink installation to periodically dump tapir resources into a csv file. the solution is totally generic and can dump darwin core (and technically abcd schema, although it's currently untested). the resulting csv files are zip archived and made accessible using a web service. it's a simple approach that has proven to be, at least internally, quite reliable and useful.
for example, several of our caching applications use the web service to harvest csv data from tapirlink resources using the following process:
1) download latest csv dump for a resource using the web service.
2) flush all locally cached records for the resource.
3) bulk load the latest csv data into the cache.
in this way, cached data are always synchronized with the resource and there's no need to track new, deleted, or changed records. as an aside, each time these cached data are queried by the caching application or selected in the user interface, log-only search requests are sent back to the resource.
after discussion with renato giovanni and john wieczorek, we've decided that merging this functionality into the tapirlink codebase would benefit the broader community. csv generation support would be declared through capabilities. although incremental harvesting wouldn't be immediately implemented, we could certainly extend the service to include it later.
i'd like to pause here to gauge the consensus, thoughts, concerns, and ideas of others. anyone?
thanks, aaron
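The download, flush, and bulk-load process described above can be sketched in Python as follows. This is not the Berkeley PHP program. The cache table, its columns, and the CSV column names are hypothetical. To keep the example self-contained, the download step is represented by the zip archive's bytes being passed in; in practice they would come from an HTTP request to the web service.

```python
import csv
import io
import sqlite3
import zipfile


def refresh_cache(conn, resource, zip_bytes, csv_name):
    """Replace all cached rows for one resource with the contents of a CSV dump.

    zip_bytes is the downloaded zip archive; csv_name is the CSV file inside it.
    Returns the number of rows loaded.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        text = archive.read(csv_name).decode("utf-8")
    rows = list(csv.DictReader(io.StringIO(text)))

    # Flush and reload in a single transaction so that readers never see
    # a half-empty cache if the load fails part way through.
    with conn:
        conn.execute("DELETE FROM cache WHERE resource = ?", (resource,))
        conn.executemany(
            "INSERT INTO cache (resource, record_id, scientific_name) "
            "VALUES (?, ?, ?)",
            [(resource, row["id"], row["scientific_name"]) for row in rows],
        )
    return len(rows)
```

Because every refresh replaces the whole resource, new, changed, and deleted records need no separate tracking, at the cost of transferring the full dump each time.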
I think this is a great idea. I have thought a bit about how we can "build upon" the Tapir protocol and services that currently exist, and this post reminded me of a few ideas that I would like to look at. One in particular is extending the type of data sources that the Tapir configurator tools can connect to. I have done this a little in my TapirDotNET implementation, where you can connect a concept to an LSID data source (i.e. it resolves the LSID and returns the resulting xml as the value for that mapped Tapir concept). But connecting to web services, etc., and also providing a "Tapir API" for the advanced user to programmatically provide data through a Tapir service would also be cool. Any thoughts?
Kevin
Hi Kevin,
This is the same as what I do for WFS... I can't offer the full rich schema in WFS for the GBIF density layers, but by putting in what I call a "callback url" as a mapped concept (feature) the client calls a rest service to get back the (in this case) RDF for extra info. This is analogous to your LSID mapped concept. I can't see a better way of doing it, as a WFS response contains a flat structure.
Cheers
Tim
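The callback pattern described above can be illustrated with a short Python sketch: the flat response carries only simple values plus a URL, and a client that wants the richer form calls that URL. The field names and the detail service URL are hypothetical, and the fetch function is passed in so the sketch does not depend on any particular HTTP library.

```python
from urllib.parse import quote


def to_flat_feature(record, detail_service):
    """Flatten a record and add a callback URL pointing at its rich form."""
    return {
        "id": record["id"],
        "count": record["count"],
        "callback_url": detail_service.rstrip("/") + "/"
        + quote(record["id"], safe=""),
    }


def fetch_details(feature, fetch):
    """Retrieve the rich representation by calling back to the service."""
    return fetch(feature["callback_url"])
```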
I think this is a great idea. I have thought a bit about how we can "build upon" then tapir protocol and services that currently exist, and this post reminded me of a few that I would like to look at. One in particular is extending the type of data sources that the Tapir configurator tools can connect to - I have done this a little in my TapirDotNET implementation where you can connect a concept to an LSID data source (ie it resolves the LSID and returns the resulting xml as the value for that mapped Tapir concept). But connecting to web services, etc, and also providing a "Tapir API" for the advanced user to programmatically provide data through a Tapir service would also be cool. Any thoughts?
Kevin
"Aaron D. Steele" eightysteele@gmail.com 14/05/2008 8:40 a.m.
at berkeley we've recently prototyped a simple php program that uses an existing tapirlink installation to periodically dump tapir resources into a csv file. the solution is totally generic and can dump darwin core (and technically abcd schema, although it's currently untested). the resulting csv files are zip archived and made accessible using a web service. it's a simple approach that has proven to be, at least internally, quite reliable and useful.
for example, several of our caching applications use the web service to harvest csv data from tapirlink resources using the following process:
- download latest csv dump for a resource using the web service.
- flush all locally cached records for the resource.
- bulk load the latest csv data into the cache.
in this way, cached data are always synchronized with the resource and there's no need to track new, deleted, or changed records. as an aside, each time these cached data are queried by the caching application or selected in the user interface, log-only search requests are sent back to the resource.
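The three steps above (download, flush, bulk load) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the Berkeley program itself (which is PHP): the SQLite cache table, the `record_id` and `scientific_name` column names, and the idea that each zip archive holds a single CSV file with a header row are all hypothetical.

```python
import csv
import io
import sqlite3
import zipfile
from urllib.request import urlopen


def download_dump(url):
    """Step 1: fetch the latest zipped CSV dump (the URL layout is an assumption)."""
    with urlopen(url) as response:
        return response.read()


def refresh_cache(conn, resource, zip_bytes):
    """Steps 2 and 3: flush the cached rows for `resource`, then bulk load the dump."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        # Assumption: the archive contains exactly one CSV file with a header row.
        name = archive.namelist()[0]
        text = archive.read(name).decode("utf-8")
    rows = list(csv.DictReader(io.StringIO(text)))

    conn.execute(
        "CREATE TABLE IF NOT EXISTS cache "
        "(resource TEXT, record_id TEXT, scientific_name TEXT)"
    )
    # Flush and reload inside one transaction, so a failed load rolls back
    # to the previous cache contents instead of leaving the cache empty.
    with conn:
        conn.execute("DELETE FROM cache WHERE resource = ?", (resource,))
        conn.executemany(
            "INSERT INTO cache VALUES (?, ?, ?)",
            [(resource, r["record_id"], r["scientific_name"]) for r in rows],
        )
    return len(rows)
```

Because every run replaces the whole resource, a record that disappeared from the provider simply is not reloaded, which is why no change tracking is needed.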
after discussion with renato giovanni and john wieczorek, we've decided that merging this functionality into the tapirlink codebase would benefit the broader community. csv generation support would be declared through capabilities. although incremental harvesting wouldn't be immediately implemented, we could certainly extend the service to include it later.
i'd like to pause here to gauge the consensus, thoughts, concerns, and ideas of others. anyone?
thanks, aaron
2008/5/5 Kevin Richards RichardsK@landcareresearch.co.nz:
I think I agree here.
The harvesting "procedure" is really defined outside the Tapir protocol, is it not? So it is really an agreement between the harvester and the harvestees.
So what is really needed here is the standard procedure for maintaining a "harvestable" dataset and the standard procedure for harvesting that dataset. We have a general rule at Landcare that we never delete records in our datasets - they are either deprecated in favour of another record, and so the resolution of that record would point to the new record, or they are set to a state of "deleted", but are still kept in the dataset, and can be resolved (which would indicate a state of deleted).
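A minimal sketch of the "never delete" rule described above, assuming a simple in-memory dataset. The `Record` class, the status names and the `replaced_by` field are hypothetical and only illustrate the idea; they are not Landcare's actual schema.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Record:
    record_id: str
    status: str = "active"            # "active", "deprecated" or "deleted"
    replaced_by: Optional[str] = None  # set only when status is "deprecated"


def resolve(records: Dict[str, Record], record_id: str) -> Record:
    """Follow deprecation pointers until a non-deprecated record is reached."""
    seen = set()
    current = records[record_id]
    while current.status == "deprecated" and current.replaced_by is not None:
        if current.record_id in seen:
            # Guard against records that deprecate each other in a loop.
            raise ValueError("deprecation cycle at " + current.record_id)
        seen.add(current.record_id)
        current = records[current.replaced_by]
    # A "deleted" record is still returned, so the caller can see its state.
    return current
```

A harvester resolving a deprecated identifier ends up at the replacement record, while a deleted identifier still resolves and reports `status == "deleted"`.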
Kevin
"Renato De Giovanni" renato@cria.org.br 6/05/2008 7:33 a.m.
Hi Markus,
I would suggest creating new concepts for incremental harvesting, either in the data standards themselves or in some new extension. In the case of TAPIR, GBIF could easily check the mapped concepts before deciding between incremental or full harvesting.
Actually it could be just one new concept such as "recordStatus" or "deletionFlag". Or perhaps you might also want to create your own definition of dateLastModified, indicating which set of concepts should be considered when deciding whether something has changed, but I guess this level of granularity would be difficult to support.
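The check on mapped concepts could look something like the sketch below. It assumes the harvester has already extracted the list of mapped concept identifiers from the provider's capabilities response. The two concept URIs are placeholders, since no such concepts have been defined yet.

```python
# Hypothetical concept identifiers; real ones would come from a data
# standard or extension that defines them.
DATE_LAST_MODIFIED = "http://example.org/concepts/dateLastModified"
RECORD_STATUS = "http://example.org/concepts/recordStatus"


def choose_harvest_mode(mapped_concepts, required=(DATE_LAST_MODIFIED, RECORD_STATUS)):
    """Return "incremental" only if every required concept is mapped, else "full"."""
    return "incremental" if set(required) <= set(mapped_concepts) else "full"
```

A provider that maps only dateLastModified would still get a full harvest, because changes could be detected but deletions could not.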
Regards,
Renato
On 5 May 2008 at 11:24, Markus Döring wrote:
Phil, incremental harvesting is not implemented on the GBIF side as far as I am aware. And I don't think that will be a simple thing to implement on the current system. Also, even if we can detect only the changed records since the last harvesting via dateLastModified, we still have no information about deletions. We could have an arrangement saying that you keep deleted records as empty records with just the ID and nothing else (I vaguely remember LSIDs were supposed to work like this too). But that also needs to be supported on your side then, never entirely removing any record. I will have a discussion with the others at GBIF about that.
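To illustrate the "empty record with just the ID" arrangement, here is a rough sketch of how a harvester might apply a batch of changed records to its local cache. The dictionary-based cache and the `id` field name are assumptions made for the example, not part of any existing GBIF system.

```python
def apply_changes(cache, changed_records, id_field="id"):
    """Merge changed records into `cache`, treating ID-only records as deletions."""
    for record in changed_records:
        record_id = record[id_field]
        # A record counts as a deletion marker when nothing but the ID has a value.
        has_content = any(
            value not in (None, "")
            for key, value in record.items()
            if key != id_field
        )
        if has_content:
            cache[record_id] = record      # new or modified record
        else:
            cache.pop(record_id, None)     # deletion marker: drop the cached copy
    return cache
```

This only works if the provider keeps the ID-only record around indefinitely, which is the constraint on the provider side mentioned above.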
Markus
tdwg-tapir mailing list tdwg-tapir@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-tapir
Hi Kevin,
I'm not sure what exactly you have in mind, but I fully realize there's great potential to implement other services on top of a TAPIR service, such as you did with OAI-PMH. I see your implementation as a kind of "protocol rewrite rule" written in a single script.
The LSID mapping is also an interesting idea, although in practice you can only return content, not search on it, right? Anyway, providers are certainly free to implement any new type of local mapping - nothing in the protocol should stop this.
Best Regards, -- Renato
participants (11)
- Aaron D. Steele
- Blum, Stan
- Dave Vieglais
- Kevin Richards
- Markus Döring
- Phil Cryer
- Phil Cryer
- Renato De Giovanni
- Roger Hyam
- Roger Hyam (TDWG)
- trobertson@gbif.org