[tdwg-tapir] Tapir protocol - Harvest methods? - tdwg-tag

[tdwg-tapir] Tapir protocol - Harvest methods?

older
OASIS reference arcitecture for...

Phil Cryer

28 Apr 2008 28 Apr '08

15:39

Just starting with Tapir/DiGIR - I have 2 questions: * I would like to know if the Tapir protocol is the preferred method over DiGIR. We have a DiGIR implementation that we want to move away from, and bring up a Tapir one in its place. Is this normal, or do organizations run both to facilitate their older clients to do harvesting? * What is a method to harvest data from Tapir, and/or DiGIR - we want to do this internally to test our implementation before we open up to the world, how can I do this (we run Windows and Linux as clients) Thank you Phil -- Phil Cryer Open Source Development Missouri Botanical Garden

Attachments:

attachment.html (text/html — 4.0 KB)

Show replies by date

Renato De Giovanni

28 Apr 28 Apr

18:54

Phil, Is the "DiGIR implementation that you want to move away from" just a DiGIR service? Or is it something else? I would only keep a parallel DiGIR service if there are older clients that can only talk to it and for some reason (time/resources) can't be updated. I'm not sure if this is your case. Also, when you said that you want to "test your implementation", did you mean that you want to test a TAPIR service, or is it some other application based on TAPIR? If you just want to test a TAPIR service, you could simply run TapirTester on it instead of developing your own harvester: http://tapir.tdwg.org/tester/ Note: If necessary, the existing tests can be improved. New ones can also be created (TapirTester is open source). Hope this helps, -- Renato On 28 Apr 2008 at 10:39, Phil Cryer wrote:

...

Just starting with Tapir/DiGIR - I have 2 questions:

* I would like to know if the Tapir protocol is the preferred method over DiGIR. We have a DiGIR implementation that we want to move away from, and bring up a Tapir one in its place. Is this normal, or do organizations run both to facilitate their older clients to do harvesting?

* What is a method to harvest data from Tapir, and/or DiGIR -we want to do this internally to test our implementation before we open up to the world, how can I do this (we run Windows and Linux as clients)

Thank you

Phil -- Phil Cryer Open Source Development Missouri Botanical Garden

Phil Cryer

20:22

So we have DiGIR running at Mobot for Tropicos data, and clients hit it to harvest data. I was just wondering if people are still deploying DiGIR at all, or are they just using Tapir by default? It seems to have taken over for DiGIR, and I want to know if that's a 'standard' that we should follow. For testing, yes, we're talking more of performance; make sure our network and server will handle X load. So I guess I want to know more of, how do clients attach to a Tapir server, how do they pull the data from us? Sorry if this is such a newbie question, but I can't understand this aspect from the docs I've read. Thanks for the reply! Phil -----Original Message----- From: tdwg-tapir-bounces@lists.tdwg.org [mailto:tdwg-tapir-bounces@lists.tdwg.org] On Behalf Of Renato De Giovanni Sent: Monday, April 28, 2008 1:54 PM To: tdwg-tapir@lists.tdwg.org Subject: Re: [tdwg-tapir] Tapir protocol - Harvest methods? Phil, Is the "DiGIR implementation that you want to move away from" just a DiGIR service? Or is it something else? I would only keep a parallel DiGIR service if there are older clients that can only talk to it and for some reason (time/resources) can't be updated. I'm not sure if this is your case. Also, when you said that you want to "test your implementation", did you mean that you want to test a TAPIR service, or is it some other application based on TAPIR? If you just want to test a TAPIR service, you could simply run TapirTester on it instead of developing your own harvester: http://tapir.tdwg.org/tester/ Note: If necessary, the existing tests can be improved. New ones can also be created (TapirTester is open source). Hope this helps, -- Renato On 28 Apr 2008 at 10:39, Phil Cryer wrote:

...

Just starting with Tapir/DiGIR - I have 2 questions:

* I would like to know if the Tapir protocol is the preferred method over DiGIR. We have a DiGIR implementation that we want to move away from, and bring up a Tapir one in its place. Is this normal, or do organizations run both to facilitate their older clients to do harvesting?

* What is a method to harvest data from Tapir, and/or DiGIR -we want to do this internally to test our implementation before we open up to the world, how can I do this (we run Windows and Linux as clients)

Thank you

Phil -- Phil Cryer Open Source Development Missouri Botanical Garden

_______________________________________________ tdwg-tapir mailing list tdwg-tapir@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-tapir

Blum, Stan

21:02

Phil, TAPIR was intended to be a unification of DiGIR and BioCASE. There are a few implementations of providers but fewer instances of portals built on TAPIR. Networks built on DiGIR may eventually switch to TAPIR, but that remains to be seen. DiGIR and BioCASE were designed for distributed queries, not really harvesting. I understand harvesting can be done more simply and efficiently by other approaches, such as OAI-PMH. If the sensibilities of data providers evolves to accept and allow harvesting (which seems likely), we may see "networks" built on that architecture, instead of distributed queries. If your only goal is to provide data to GBIF, I would suggest installing TAPIR (unless Tim Robertson tells you something else). If you are concerned about providing data to other networks, like www.SERNEC.org, you'll need a DiGIR provider, too. (Such is the nature of technical transition.) -Stan Stanley D. Blum, Ph.D. Research Information Manager California Academy of Sciences 875 Howard St. San Francisco, CA +1 (415) 321-8183 -----Original Message----- From: tdwg-tapir-bounces@lists.tdwg.org [mailto:tdwg-tapir-bounces@lists.tdwg.org] On Behalf Of Phil Cryer Sent: Monday, April 28, 2008 1:22 PM To: Renato De Giovanni; tdwg-tapir@lists.tdwg.org Subject: RE: [tdwg-tapir] Tapir protocol - Harvest methods? So we have DiGIR running at Mobot for Tropicos data, and clients hit it to harvest data. I was just wondering if people are still deploying DiGIR at all, or are they just using Tapir by default? It seems to have taken over for DiGIR, and I want to know if that's a 'standard' that we should follow. For testing, yes, we're talking more of performance; make sure our network and server will handle X load. So I guess I want to know more of, how do clients attach to a Tapir server, how do they pull the data from us? Sorry if this is such a newbie question, but I can't understand this aspect from the docs I've read. Thanks for the reply! Phil -----Original Message----- From: tdwg-tapir-bounces@lists.tdwg.org [mailto:tdwg-tapir-bounces@lists.tdwg.org] On Behalf Of Renato De Giovanni Sent: Monday, April 28, 2008 1:54 PM To: tdwg-tapir@lists.tdwg.org Subject: Re: [tdwg-tapir] Tapir protocol - Harvest methods? Phil, Is the "DiGIR implementation that you want to move away from" just a DiGIR service? Or is it something else? I would only keep a parallel DiGIR service if there are older clients that can only talk to it and for some reason (time/resources) can't be updated. I'm not sure if this is your case. Also, when you said that you want to "test your implementation", did you mean that you want to test a TAPIR service, or is it some other application based on TAPIR? If you just want to test a TAPIR service, you could simply run TapirTester on it instead of developing your own harvester: http://tapir.tdwg.org/tester/ Note: If necessary, the existing tests can be improved. New ones can also be created (TapirTester is open source). Hope this helps, -- Renato On 28 Apr 2008 at 10:39, Phil Cryer wrote:

...

Just starting with Tapir/DiGIR - I have 2 questions:

* I would like to know if the Tapir protocol is the preferred method over DiGIR. We have a DiGIR implementation that we want to move away from, and bring up a Tapir one in its place. Is this normal, or do organizations run both to facilitate their older clients to do harvesting?

* What is a method to harvest data from Tapir, and/or DiGIR -we want to do this internally to test our implementation before we open up to the world, how can I do this (we run Windows and Linux as clients)

Thank you

Phil -- Phil Cryer Open Source Development Missouri Botanical Garden

_______________________________________________ tdwg-tapir mailing list tdwg-tapir@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-tapir _______________________________________________ tdwg-tapir mailing list tdwg-tapir@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-tapir

Markus Döring

22:34

Phil, from the GBIF side it doesnt matter whether you use DiGIR or TAPIR. Both protocols are currently supported by the GBIF indexer. If you use TapirLink simply mapping to DarwinCore is enough. For other TAPIRlite providers please make sure your service works with the 2 following DarwinCore TAPIR templates found at TDWG: http://rs.tdwg.org/tapir/cs/dwc/1.4/template/dwc_sci_name_range.xml http://rs.tdwg.org/tapir/cs/dwc/1.4/template/dwc_unfiltered_search.xml At GBIF we are currently also thinking about a much simpler provider software tailored for harvesting. That will reduce load on providers enormously while still supporting basic TAPIR capabilities for true distributed queries. We will keep this list informed once we have thought this through. Markus -- Markus Döring, Berlin Senior Software Developer GBIF Secretariat mdoering@gbif.org On 28 Apr, 2008, at 23:02, Blum, Stan wrote:

...

Phil,

TAPIR was intended to be a unification of DiGIR and BioCASE. There are a few implementations of providers but fewer instances of portals built on TAPIR. Networks built on DiGIR may eventually switch to TAPIR, but that remains to be seen. DiGIR and BioCASE were designed for distributed queries, not really harvesting. I understand harvesting can be done more simply and efficiently by other approaches, such as OAI-PMH. If the sensibilities of data providers evolves to accept and allow harvesting (which seems likely), we may see "networks" built on that architecture, instead of distributed queries.

If your only goal is to provide data to GBIF, I would suggest installing TAPIR (unless Tim Robertson tells you something else). If you are concerned about providing data to other networks, like www.SERNEC.org, you'll need a DiGIR provider, too. (Such is the nature of technical transition.)

-Stan

Stanley D. Blum, Ph.D. Research Information Manager California Academy of Sciences 875 Howard St. San Francisco, CA +1 (415) 321-8183

-----Original Message----- From: tdwg-tapir-bounces@lists.tdwg.org [mailto:tdwg-tapir-bounces@lists.tdwg.org] On Behalf Of Phil Cryer Sent: Monday, April 28, 2008 1:22 PM To: Renato De Giovanni; tdwg-tapir@lists.tdwg.org Subject: RE: [tdwg-tapir] Tapir protocol - Harvest methods?

So we have DiGIR running at Mobot for Tropicos data, and clients hit it to harvest data. I was just wondering if people are still deploying DiGIR at all, or are they just using Tapir by default? It seems to have taken over for DiGIR, and I want to know if that's a 'standard' that we should follow.

For testing, yes, we're talking more of performance; make sure our network and server will handle X load. So I guess I want to know more of, how do clients attach to a Tapir server, how do they pull the data from us?

Sorry if this is such a newbie question, but I can't understand this aspect from the docs I've read.

Thanks for the reply!

Phil

-----Original Message----- From: tdwg-tapir-bounces@lists.tdwg.org [mailto:tdwg-tapir-bounces@lists.tdwg.org] On Behalf Of Renato De Giovanni Sent: Monday, April 28, 2008 1:54 PM To: tdwg-tapir@lists.tdwg.org Subject: Re: [tdwg-tapir] Tapir protocol - Harvest methods?

Phil,

Is the "DiGIR implementation that you want to move away from" just a DiGIR service? Or is it something else?

I would only keep a parallel DiGIR service if there are older clients that can only talk to it and for some reason (time/resources) can't be updated. I'm not sure if this is your case.

Also, when you said that you want to "test your implementation", did you mean that you want to test a TAPIR service, or is it some other application based on TAPIR? If you just want to test a TAPIR service, you could simply run TapirTester on it instead of developing your own harvester:

http://tapir.tdwg.org/tester/

Note: If necessary, the existing tests can be improved. New ones can also be created (TapirTester is open source).

Hope this helps, -- Renato

On 28 Apr 2008 at 10:39, Phil Cryer wrote:

...
Just starting with Tapir/DiGIR - I have 2 questions:

* I would like to know if the Tapir protocol is the preferred method over DiGIR. We have a DiGIR implementation that we want to move away from, and bring up a Tapir one in its place. Is this normal, or do organizations run both to facilitate their older clients to do harvesting?

* What is a method to harvest data from Tapir, and/or DiGIR -we want to do this internally to test our implementation before we open up to the world, how can I do this (we run Windows and Linux as clients)

Thank you

Phil -- Phil Cryer Open Source Development Missouri Botanical Garden

_______________________________________________ tdwg-tapir mailing list tdwg-tapir@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-tapir _______________________________________________ tdwg-tapir mailing list tdwg-tapir@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-tapir _______________________________________________ tdwg-tapir mailing list tdwg-tapir@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-tapir

Markus Döring

30 Apr 30 Apr

08:18

Phil, having said that GBIF is happy both with DiGIR and TAPIR I still wanted to raise one important issue: The DiGIR PHP code is rather old now and is not being maintained anymore by anyone. I had problems myself getting it running with PHP5, whereas TapirLink installed in 2 minutes without any problem. So for new provider installations GBIF definitely recommends to use TAPIR over DiGIR. Markus On 29 Apr, 2008, at 24:34, Markus Döring wrote:

...

Phil, from the GBIF side it doesnt matter whether you use DiGIR or TAPIR. Both protocols are currently supported by the GBIF indexer. If you use TapirLink simply mapping to DarwinCore is enough. For other TAPIRlite providers please make sure your service works with the 2 following DarwinCore TAPIR templates found at TDWG:

http://rs.tdwg.org/tapir/cs/dwc/1.4/template/dwc_sci_name_range.xml http://rs.tdwg.org/tapir/cs/dwc/1.4/template/dwc_unfiltered_search.xml

At GBIF we are currently also thinking about a much simpler provider software tailored for harvesting. That will reduce load on providers enormously while still supporting basic TAPIR capabilities for true distributed queries. We will keep this list informed once we have thought this through.

Markus

-- Markus Döring, Berlin Senior Software Developer GBIF Secretariat mdoering@gbif.org

On 28 Apr, 2008, at 23:02, Blum, Stan wrote:

...
Phil,

TAPIR was intended to be a unification of DiGIR and BioCASE. There are a few implementations of providers but fewer instances of portals built on TAPIR. Networks built on DiGIR may eventually switch to TAPIR, but that remains to be seen. DiGIR and BioCASE were designed for distributed queries, not really harvesting. I understand harvesting can be done more simply and efficiently by other approaches, such as OAI-PMH. If the sensibilities of data providers evolves to accept and allow harvesting (which seems likely), we may see "networks" built on that architecture, instead of distributed queries.

If your only goal is to provide data to GBIF, I would suggest installing TAPIR (unless Tim Robertson tells you something else). If you are concerned about providing data to other networks, like www.SERNEC.org, you'll need a DiGIR provider, too. (Such is the nature of technical transition.)

-Stan

Stanley D. Blum, Ph.D. Research Information Manager California Academy of Sciences 875 Howard St. San Francisco, CA +1 (415) 321-8183

-----Original Message----- From: tdwg-tapir-bounces@lists.tdwg.org [mailto:tdwg-tapir-bounces@lists.tdwg.org] On Behalf Of Phil Cryer Sent: Monday, April 28, 2008 1:22 PM To: Renato De Giovanni; tdwg-tapir@lists.tdwg.org Subject: RE: [tdwg-tapir] Tapir protocol - Harvest methods?

So we have DiGIR running at Mobot for Tropicos data, and clients hit it to harvest data. I was just wondering if people are still deploying DiGIR at all, or are they just using Tapir by default? It seems to have taken over for DiGIR, and I want to know if that's a 'standard' that we should follow.

For testing, yes, we're talking more of performance; make sure our network and server will handle X load. So I guess I want to know more of, how do clients attach to a Tapir server, how do they pull the data from us?

Sorry if this is such a newbie question, but I can't understand this aspect from the docs I've read.

Thanks for the reply!

Phil

-----Original Message----- From: tdwg-tapir-bounces@lists.tdwg.org [mailto:tdwg-tapir-bounces@lists.tdwg.org] On Behalf Of Renato De Giovanni Sent: Monday, April 28, 2008 1:54 PM To: tdwg-tapir@lists.tdwg.org Subject: Re: [tdwg-tapir] Tapir protocol - Harvest methods?

Phil,

Is the "DiGIR implementation that you want to move away from" just a DiGIR service? Or is it something else?

I would only keep a parallel DiGIR service if there are older clients that can only talk to it and for some reason (time/resources) can't be updated. I'm not sure if this is your case.

Also, when you said that you want to "test your implementation", did you mean that you want to test a TAPIR service, or is it some other application based on TAPIR? If you just want to test a TAPIR service, you could simply run TapirTester on it instead of developing your own harvester:

http://tapir.tdwg.org/tester/

Note: If necessary, the existing tests can be improved. New ones can also be created (TapirTester is open source).

Hope this helps, -- Renato

On 28 Apr 2008 at 10:39, Phil Cryer wrote:

...
Just starting with Tapir/DiGIR - I have 2 questions:

* I would like to know if the Tapir protocol is the preferred method over DiGIR. We have a DiGIR implementation that we want to move away from, and bring up a Tapir one in its place. Is this normal, or do organizations run both to facilitate their older clients to do harvesting?

* What is a method to harvest data from Tapir, and/or DiGIR -we want to do this internally to test our implementation before we open up to the world, how can I do this (we run Windows and Linux as clients)

Thank you

Phil -- Phil Cryer Open Source Development Missouri Botanical Garden

_______________________________________________ tdwg-tapir mailing list tdwg-tapir@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-tapir _______________________________________________ tdwg-tapir mailing list tdwg-tapir@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-tapir _______________________________________________ tdwg-tapir mailing list tdwg-tapir@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-tapir

Phil Cryer

1 May 1 May

13:28

On Mon, 2008-04-28 at 17:34 -0500, Markus Döring wrote:

...

Phil, from the GBIF side it doesnt matter whether you use DiGIR or TAPIR. Both protocols are currently supported by the GBIF indexer. If you use TapirLink simply mapping to DarwinCore is enough. For other TAPIRlite providers please make sure your service works with the 2 following DarwinCore TAPIR templates found at TDWG:

http://rs.tdwg.org/tapir/cs/dwc/1.4/template/dwc_sci_name_range.xml http://rs.tdwg.org/tapir/cs/dwc/1.4/template/dwc_unfiltered_search.xml

Markus I've gotten DiGIR back in line and will start tracking it to see what kind of usage we're experiencing, after that I want to bring up Tapir, mapping out data via ABCD - after this I will speak to you so we can determine if I have things configured the most efficiently. I'm interested in how we can have the harvester pull only the latest data...I'll think about that. Phil

...

At GBIF we are currently also thinking about a much simpler provider software tailored for harvesting. That will reduce load on providers enormously while still supporting basic TAPIR capabilities for true distributed queries. We will keep this list informed once we have thought this through.

Markus

-- Markus Döring, Berlin Senior Software Developer GBIF Secretariat mdoering@gbif.org

On 28 Apr, 2008, at 23:02, Blum, Stan wrote:

...
Phil,

TAPIR was intended to be a unification of DiGIR and BioCASE. There are a few implementations of providers but fewer instances of portals built on TAPIR. Networks built on DiGIR may eventually switch to TAPIR, but that remains to be seen. DiGIR and BioCASE were designed for distributed queries, not really harvesting. I understand harvesting can be done more simply and efficiently by other approaches, such as OAI-PMH. If the sensibilities of data providers evolves to accept and allow harvesting (which seems likely), we may see "networks" built on that architecture, instead of distributed queries.

If your only goal is to provide data to GBIF, I would suggest installing TAPIR (unless Tim Robertson tells you something else). If you are concerned about providing data to other networks, like www.SERNEC.org, you'll need a DiGIR provider, too. (Such is the nature of technical transition.)

-Stan

Stanley D. Blum, Ph.D. Research Information Manager California Academy of Sciences 875 Howard St. San Francisco, CA +1 (415) 321-8183

-----Original Message----- From: tdwg-tapir-bounces@lists.tdwg.org [mailto:tdwg-tapir-bounces@lists.tdwg.org] On Behalf Of Phil Cryer Sent: Monday, April 28, 2008 1:22 PM To: Renato De Giovanni; tdwg-tapir@lists.tdwg.org Subject: RE: [tdwg-tapir] Tapir protocol - Harvest methods?

So we have DiGIR running at Mobot for Tropicos data, and clients hit it to harvest data. I was just wondering if people are still deploying DiGIR at all, or are they just using Tapir by default? It seems to have taken over for DiGIR, and I want to know if that's a 'standard' that we should follow.

For testing, yes, we're talking more of performance; make sure our network and server will handle X load. So I guess I want to know more of, how do clients attach to a Tapir server, how do they pull the data from us?

Sorry if this is such a newbie question, but I can't understand this aspect from the docs I've read.

Thanks for the reply!

Phil

-----Original Message----- From: tdwg-tapir-bounces@lists.tdwg.org [mailto:tdwg-tapir-bounces@lists.tdwg.org] On Behalf Of Renato De Giovanni Sent: Monday, April 28, 2008 1:54 PM To: tdwg-tapir@lists.tdwg.org Subject: Re: [tdwg-tapir] Tapir protocol - Harvest methods?

Phil,

Is the "DiGIR implementation that you want to move away from" just a DiGIR service? Or is it something else?

I would only keep a parallel DiGIR service if there are older clients that can only talk to it and for some reason (time/resources) can't be updated. I'm not sure if this is your case.

Also, when you said that you want to "test your implementation", did you mean that you want to test a TAPIR service, or is it some other application based on TAPIR? If you just want to test a TAPIR service, you could simply run TapirTester on it instead of developing your own harvester:

http://tapir.tdwg.org/tester/

Note: If necessary, the existing tests can be improved. New ones can also be created (TapirTester is open source).

Hope this helps, -- Renato

On 28 Apr 2008 at 10:39, Phil Cryer wrote:

...
Just starting with Tapir/DiGIR - I have 2 questions:

* I would like to know if the Tapir protocol is the preferred

method

...
over DiGIR. We have a DiGIR implementation that we want to move away from, and bring up a Tapir one in its place. Is this normal, or do organizations run both to facilitate their older clients to do harvesting?

* What is a method to harvest data from Tapir, and/or DiGIR -we want to do this internally to test our implementation before we open up to the world, how can I do this (we run Windows and Linux as clients)

Thank you

Phil -- Phil Cryer Open Source Development Missouri Botanical Garden

_______________________________________________ tdwg-tapir mailing list tdwg-tapir@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-tapir _______________________________________________ tdwg-tapir mailing list tdwg-tapir@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-tapir _______________________________________________ tdwg-tapir mailing list tdwg-tapir@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-tapir

Markus Döring

5 May 5 May

09:24

Phil, incremental harvesting is not implemented on the GBIF side as far as I am aware. And I dont think that will be a simple thing to implement on the current system. Also, even if we can detect only the changed records since the last harevesting via dateLastModified we still have no information about deletions. We could have an arrangement saying that you keep deleted records as empty records with just the ID and nothing else (I vaguely remember LSIDs were supposed to work like this too). But that also needs to be supported on your side then, never entirely removing any record. I will have a discussion with the others at GBIF about that. Markus On 1 May, 2008, at 15:28, Phil Cryer wrote:

...

On Mon, 2008-04-28 at 17:34 -0500, Markus Döring wrote:

...
Phil, from the GBIF side it doesnt matter whether you use DiGIR or TAPIR. Both protocols are currently supported by the GBIF indexer. If you use TapirLink simply mapping to DarwinCore is enough. For other TAPIRlite providers please make sure your service works with the 2 following DarwinCore TAPIR templates found at TDWG:

http://rs.tdwg.org/tapir/cs/dwc/1.4/template/dwc_sci_name_range.xml http://rs.tdwg.org/tapir/cs/dwc/1.4/template/ dwc_unfiltered_search.xml

Markus I've gotten DiGIR back in line and will start tracking it to see what kind of usage we're experiencing, after that I want to bring up Tapir, mapping out data via ABCD - after this I will speak to you so we can determine if I have things configured the most efficiently. I'm interested in how we can have the harvester pull only the latest data...I'll think about that.

Phil

...
At GBIF we are currently also thinking about a much simpler provider software tailored for harvesting. That will reduce load on providers enormously while still supporting basic TAPIR capabilities for true distributed queries. We will keep this list informed once we have thought this through.

Markus

-- Markus Döring, Berlin Senior Software Developer GBIF Secretariat mdoering@gbif.org

On 28 Apr, 2008, at 23:02, Blum, Stan wrote:

...
Phil,

TAPIR was intended to be a unification of DiGIR and BioCASE. There are a few implementations of providers but fewer instances of portals built on TAPIR. Networks built on DiGIR may eventually switch to TAPIR, but that remains to be seen. DiGIR and BioCASE were designed for distributed queries, not really harvesting. I understand harvesting can be done more simply and efficiently by other approaches, such as OAI-PMH. If the sensibilities of data providers evolves to accept and allow harvesting (which seems likely), we may see "networks" built on that architecture, instead of distributed queries.

If your only goal is to provide data to GBIF, I would suggest installing TAPIR (unless Tim Robertson tells you something else). If you are concerned about providing data to other networks, like www.SERNEC.org, you'll need a DiGIR provider, too. (Such is the nature of technical transition.)

-Stan

Stanley D. Blum, Ph.D. Research Information Manager California Academy of Sciences 875 Howard St. San Francisco, CA +1 (415) 321-8183

-----Original Message----- From: tdwg-tapir-bounces@lists.tdwg.org [mailto:tdwg-tapir-bounces@lists.tdwg.org] On Behalf Of Phil Cryer Sent: Monday, April 28, 2008 1:22 PM To: Renato De Giovanni; tdwg-tapir@lists.tdwg.org Subject: RE: [tdwg-tapir] Tapir protocol - Harvest methods?

So we have DiGIR running at Mobot for Tropicos data, and clients hit it to harvest data. I was just wondering if people are still deploying DiGIR at all, or are they just using Tapir by default? It seems to have taken over for DiGIR, and I want to know if that's a 'standard' that we should follow.

For testing, yes, we're talking more of performance; make sure our network and server will handle X load. So I guess I want to know more of, how do clients attach to a Tapir server, how do they pull the data from us?

Sorry if this is such a newbie question, but I can't understand this aspect from the docs I've read.

Thanks for the reply!

Phil

-----Original Message----- From: tdwg-tapir-bounces@lists.tdwg.org [mailto:tdwg-tapir-bounces@lists.tdwg.org] On Behalf Of Renato De Giovanni Sent: Monday, April 28, 2008 1:54 PM To: tdwg-tapir@lists.tdwg.org Subject: Re: [tdwg-tapir] Tapir protocol - Harvest methods?

Phil,

Is the "DiGIR implementation that you want to move away from" just a DiGIR service? Or is it something else?

I would only keep a parallel DiGIR service if there are older clients that can only talk to it and for some reason (time/resources) can't be updated. I'm not sure if this is your case.

Also, when you said that you want to "test your implementation", did you mean that you want to test a TAPIR service, or is it some other application based on TAPIR? If you just want to test a TAPIR service, you could simply run TapirTester on it instead of developing your own harvester:

http://tapir.tdwg.org/tester/

Note: If necessary, the existing tests can be improved. New ones can also be created (TapirTester is open source).

Hope this helps, -- Renato

On 28 Apr 2008 at 10:39, Phil Cryer wrote:

...
Just starting with Tapir/DiGIR - I have 2 questions:

* I would like to know if the Tapir protocol is the preferred

method

...
over DiGIR. We have a DiGIR implementation that we want to move away from, and bring up a Tapir one in its place. Is this normal, or do organizations run both to facilitate their older clients to do harvesting?

* What is a method to harvest data from Tapir, and/or DiGIR -we want to do this internally to test our implementation before we open up to the world, how can I do this (we run Windows and Linux as clients)

Thank you

Phil -- Phil Cryer Open Source Development Missouri Botanical Garden

_______________________________________________ tdwg-tapir mailing list tdwg-tapir@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-tapir _______________________________________________ tdwg-tapir mailing list tdwg-tapir@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-tapir _______________________________________________ tdwg-tapir mailing list tdwg-tapir@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-tapir

--

Renato De Giovanni

19:33

Hi Markus, I would suggest creating new concepts for incremental harvesting, either in the data standards themselves or in some new extension. In the case of TAPIR, GBIF could easily check the mapped concepts before deciding between incremental or full harvesting. Actually it could be just one new concept such as "recordStatus" or "deletionFlag". Or perhaps you could also want to create your own definition for dateLastModified indicating which set of concepts should be considered to see if something has changed or not, but I guess this level of granularity would be difficult to be supported. Regards, -- Renato On 5 May 2008 at 11:24, Markus Döring wrote:

...

Phil, incremental harvesting is not implemented on the GBIF side as far as I am aware. And I dont think that will be a simple thing to implement on the current system. Also, even if we can detect only the changed records since the last harevesting via dateLastModified we still have no information about deletions. We could have an arrangement saying that you keep deleted records as empty records with just the ID and nothing else (I vaguely remember LSIDs were supposed to work like this too). But that also needs to be supported on your side then, never entirely removing any record. I will have a discussion with the others at GBIF about that.

Markus

Roger Hyam

20:41

Hi Markus, Without the notion of incremental harvesting there is little notion of harvesting at all I think. The supplier may as well burn a table to CD and mail it to you (or gzip it or something). The EML model of having a data set (table) bound to a descriptive file is more appropriate than an online data provider one. If supplier 'A' provides 10,000 records this week and then replaces them with 10,001 next week and with 9,999 the week after how many records do we have from the point of view of a data consumer? 30k or just 3 (with ~10k data points in each)? It is a very different way to look at the data than from the original specimen based one that we started with. If the data represents an entomological collection it seems crazy (we are not replacing the specimens each week) if it represents bird sightings it seem a sensible (these may be different studies and are not replacements but separate data sets.). Are we trying to combine two kinds of data that don't fit together very well? I keep coming back to the need to know how people will use the data... All the best, Roger On 5 May 2008, at 20:33, Renato De Giovanni wrote:

...

Hi Markus,

I would suggest creating new concepts for incremental harvesting, either in the data standards themselves or in some new extension. In the case of TAPIR, GBIF could easily check the mapped concepts before deciding between incremental or full harvesting.

Actually it could be just one new concept such as "recordStatus" or "deletionFlag". Or perhaps you could also want to create your own definition for dateLastModified indicating which set of concepts should be considered to see if something has changed or not, but I guess this level of granularity would be difficult to be supported.

Regards, -- Renato

On 5 May 2008 at 11:24, Markus Döring wrote:

...
Phil, incremental harvesting is not implemented on the GBIF side as far as I am aware. And I dont think that will be a simple thing to implement on the current system. Also, even if we can detect only the changed records since the last harevesting via dateLastModified we still have no information about deletions. We could have an arrangement saying that you keep deleted records as empty records with just the ID and nothing else (I vaguely remember LSIDs were supposed to work like this too). But that also needs to be supported on your side then, never entirely removing any record. I will have a discussion with the others at GBIF about that.

Markus

_______________________________________________ tdwg-tapir mailing list tdwg-tapir@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-tapir

Kevin Richards

21:17

I think I agree here. The harvesting "procedure" is really defined outside the Tapir protocol, is it not? So it is really an agreement between the harvester and the harvestees. So what is really needed here is the standard procedure for maintaining a "harvestable" dataset and the standard procedure for harvesting that dataset. We have a general rule at Landcare, that we never delete records in our datasets - they are either deprecated in favour of another record, and so the resolution of that record would point to the new record, or the are set to a state of "deleted", but are still kept in the dataset, and can be resolved (which would indicate a state of deleted). Kevin

...

...
...
"Renato De Giovanni" <renato@cria.org.br> 6/05/2008 7:33 a.m. >>> Hi Markus,

I would suggest creating new concepts for incremental harvesting, either in the data standards themselves or in some new extension. In the case of TAPIR, GBIF could easily check the mapped concepts before deciding between incremental or full harvesting. Actually it could be just one new concept such as "recordStatus" or "deletionFlag". Or perhaps you could also want to create your own definition for dateLastModified indicating which set of concepts should be considered to see if something has changed or not, but I guess this level of granularity would be difficult to be supported. Regards, -- Renato On 5 May 2008 at 11:24, Markus Döring wrote:

...

Phil, incremental harvesting is not implemented on the GBIF side as far as I am aware. And I dont think that will be a simple thing to implement on the current system. Also, even if we can detect only the changed records since the last harevesting via dateLastModified we still have

...

no information about deletions. We could have an arrangement saying

...

that you keep deleted records as empty records with just the ID and

...

nothing else (I vaguely remember LSIDs were supposed to work like this too). But that also needs to be supported on your side then, never entirely removing any record. I will have a discussion with the others at GBIF about that.

Markus

_______________________________________________ tdwg-tapir mailing list tdwg-tapir@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-tapir ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ WARNING: This email and any attachments may be confidential and/or privileged. They are intended for the addressee only and are not to be read, used, copied or disseminated by anyone receiving them in error. If you are not the intended recipient, please notify the sender by return email and delete this message and any attachments. The views expressed in this email are those of the sender and do not necessarily reflect the official views of Landcare Research. http://www.landcareresearch.co.nz ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Aaron D. Steele

13 May 13 May

20:40

at berkeley we've recently prototyped a simple php program that uses an existing tapirlink installation to periodically dump tapir resources into a csv file. the solution is totally generic and can dump darwin core (and technically abcd schema, although it's currently untested). the resulting csv files are zip archived and made accessible using a web service. it's a simple approach that has proven to be, at least internally, quite reliable and useful. for example, several of our caching applications use the web service to harvest csv data from tapirlink resources using the following process: 1) download latest csv dump for a resource using the web service. 2) flush all locally cached records for the resource. 3) bulk load the latest csv data into the cache. in this way, cached data are always synchronized with the resource and there's no need to track new, deleted, or changed records. as an aside, each time these cached data are queried by the caching application or selected in the user interface, log-only search requests are sent back to the resource. after discussion with renato giovanni and john wieczorek, we've decided that merging this functionality into the tapirlink codebase would benefit the broader community. csv generation support would be declared through capabilities. although incremental harvesting wouldn't be immediately implemented, we could certainly extend the service to include it later. i'd like to pause here to gauge the consensus, thoughts, concerns, and ideas of others. anyone? thanks, aaron 2008/5/5 Kevin Richards <RichardsK@landcareresearch.co.nz>:

...

I think I agree here.

The harvesting "procedure" is really defined outside the Tapir protocol, is it not? So it is really an agreement between the harvester and the harvestees.

So what is really needed here is the standard procedure for maintaining a "harvestable" dataset and the standard procedure for harvesting that dataset. We have a general rule at Landcare, that we never delete records in our datasets - they are either deprecated in favour of another record, and so the resolution of that record would point to the new record, or the are set to a state of "deleted", but are still kept in the dataset, and can be resolved (which would indicate a state of deleted).

Kevin

...
...
...
"Renato De Giovanni" <renato@cria.org.br> 6/05/2008 7:33 a.m. >>>

Hi Markus,

I would suggest creating new concepts for incremental harvesting, either in the data standards themselves or in some new extension. In the case of TAPIR, GBIF could easily check the mapped concepts before deciding between incremental or full harvesting.

Actually it could be just one new concept such as "recordStatus" or "deletionFlag". Or perhaps you could also want to create your own definition for dateLastModified indicating which set of concepts should be considered to see if something has changed or not, but I guess this level of granularity would be difficult to be supported.

Regards, -- Renato

On 5 May 2008 at 11:24, Markus Döring wrote:

...
Phil, incremental harvesting is not implemented on the GBIF side as far as I am aware. And I dont think that will be a simple thing to implement on the current system. Also, even if we can detect only the changed records since the last harevesting via dateLastModified we still have no information about deletions. We could have an arrangement saying that you keep deleted records as empty records with just the ID and nothing else (I vaguely remember LSIDs were supposed to work like this too). But that also needs to be supported on your side then, never entirely removing any record. I will have a discussion with the others at GBIF about that.

Markus

_______________________________________________ tdwg-tapir mailing list tdwg-tapir@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-tapir

Please consider the environment before printing this email

WARNING : This email and any attachments may be confidential and/or privileged. They are intended for the addressee only and are not to be read, used, copied or disseminated by anyone receiving them in error. If you are not the intended recipient, please notify the sender by return email and delete this message and any attachments.

The views expressed in this email are those of the sender and do not necessarily reflect the official views of Landcare Research. http://www.landcareresearch.co.nz _______________________________________________ tdwg-tapir mailing list tdwg-tapir@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-tapir

Kevin Richards

21:17

I think this is a great idea. I have thought a bit about how we can "build upon" then tapir protocol and services that currently exist, and this post reminded me of a few that I would like to look at. One in particular is extending the type of data sources that the Tapir configurator tools can connect to - I have done this a little in my TapirDotNET implementation where you can connect a concept to an LSID data source (ie it resolves the LSID and returns the resulting xml as the value for that mapped Tapir concept). But connecting to web services, etc, and also providing a "Tapir API" for the advanced user to programmatically provide data through a Tapir service would also be cool. Any thoughts? Kevin

...

...
...
"Aaron D. Steele" <eightysteele@gmail.com> 14/05/2008 8:40 a.m.

at berkeley we've recently prototyped a simple php program that uses an existing tapirlink installation to periodically dump tapir resources into a csv file. the solution is totally generic and can dump darwin core (and technically abcd schema, although it's currently untested). the resulting csv files are zip archived and made accessible using a web service. it's a simple approach that has proven to be, at least internally, quite reliable and useful.

for example, several of our caching applications use the web service to harvest csv data from tapirlink resources using the following process: 1) download latest csv dump for a resource using the web service. 2) flush all locally cached records for the resource. 3) bulk load the latest csv data into the cache. in this way, cached data are always synchronized with the resource and there's no need to track new, deleted, or changed records. as an aside, each time these cached data are queried by the caching application or selected in the user interface, log-only search requests are sent back to the resource. after discussion with renato giovanni and john wieczorek, we've decided that merging this functionality into the tapirlink codebase would benefit the broader community. csv generation support would be declared through capabilities. although incremental harvesting wouldn't be immediately implemented, we could certainly extend the service to include it later. i'd like to pause here to gauge the consensus, thoughts, concerns, and ideas of others. anyone? thanks, aaron 2008/5/5 Kevin Richards <RichardsK@landcareresearch.co.nz>:

...

I think I agree here.

The harvesting "procedure" is really defined outside the Tapir

...

it not? So it is really an agreement between the harvester and the harvestees.

So what is really needed here is the standard procedure for

...

"harvestable" dataset and the standard procedure for harvesting that dataset. We have a general rule at Landcare, that we never delete records in our datasets - they are either deprecated in favour of another record, and so the resolution of that record would point to the new record, or the are set to a state of "deleted", but are still kept in the dataset, and can be resolved (which would indicate a state of deleted).

Kevin

...
...
...
"Renato De Giovanni" <renato@cria.org.br> 6/05/2008 7:33 a.m.

Hi Markus,

I would suggest creating new concepts for incremental harvesting, either in the data standards themselves or in some new extension. In the case of TAPIR, GBIF could easily check the mapped concepts before deciding between incremental or full harvesting.

Actually it could be just one new concept such as "recordStatus" or "deletionFlag". Or perhaps you could also want to create your own definition for dateLastModified indicating which set of concepts should be considered to see if something has changed or not, but I guess this level of granularity would be difficult to be supported.

Regards, -- Renato

On 5 May 2008 at 11:24, Markus Döring wrote:

...
Phil, incremental harvesting is not implemented on the GBIF side as far as I am aware. And I dont think that will be a simple thing to implement on the current system. Also, even if we can detect only the changed records since the last harevesting via dateLastModified we still have no information about deletions. We could have an arrangement saying that you keep deleted records as empty records with just the ID and nothing else (I vaguely remember LSIDs were supposed to work like

protocol, is maintaining a this

...

...
too). But that also needs to be supported on your side then, never entirely removing any record. I will have a discussion with the others at GBIF about that.

Markus

_______________________________________________ tdwg-tapir mailing list tdwg-tapir@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-tapir

Please consider the environment before printing this email

WARNING : This email and any attachments may be confidential and/or privileged. They are intended for the addressee only and are not to be read, used, copied or disseminated by anyone receiving them in error. If you are not the intended recipient, please notify the sender by return email and delete this message and any attachments.

The views expressed in this email are those of the sender and do not necessarily reflect the official views of Landcare Research. http://www.landcareresearch.co.nz _______________________________________________ tdwg-tapir mailing list tdwg-tapir@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-tapir