[tdwg-tapir] Hosting strategies
Hi Everyone,
There is a requirement that all wrapper-type applications (TAPIR, DiGIR, BioCASe and others) share, but one that I don't think we address.
All instances need to have:
Either a database on a server in a DMZ or at an ISP, with the ability to export data from the production database to the public database and then keep the public database synchronized with changes in the production database (a rough sketch of this export-and-sync step follows below).
Or the ability to provide a secured/restricted connection directly to the production database through the firewall.
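As a rough illustration of the first option, the sketch below copies rows changed since the last export from a production database into a public copy. The "specimen" table, its columns, and the use of sqlite3 files as stand-ins for the real production and public databases are all assumptions for illustration, not part of any existing wrapper installation.

import sqlite3

# Hypothetical sync of rows changed since the last export from a "production"
# database to a "public" copy sitting in the DMZ or at an ISP.
# Table and column names are invented for illustration only.
def sync_changes(prod_path, public_path, last_sync):
    prod = sqlite3.connect(prod_path)
    public = sqlite3.connect(public_path)
    rows = prod.execute(
        "SELECT id, scientific_name, locality, modified "
        "FROM specimen WHERE modified > ?", (last_sync,)
    ).fetchall()
    # Upsert each changed record into the public copy.
    public.executemany(
        "INSERT OR REPLACE INTO specimen (id, scientific_name, locality, modified) "
        "VALUES (?, ?, ?, ?)", rows
    )
    public.commit()
    prod.close()
    public.close()
    return len(rows)

# e.g. sync_changes("production.db", "public.db", "2007-05-01T00:00:00")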
Configuring the wrapper software against a database seems a smaller problem than getting hold of an up-to-date database to configure it against!
Should we have a recommended strategy or best practice for overcoming these problems? Do we have any figures on how they are overcome in the existing BioCASe and DiGIR networks?
Many thanks for your thoughts,
Roger
Hi Everyone, Not really a TAPIR-specific response, but perhaps the right audience. I'm probably stating the obvious, but the simplest way to get around the hassles of running a server and the associated firewall headaches is not to serve the data but instead to push it. By adding an authentication layer, the GBIF REST services could be extended to allow POSTing data rather than just GET (I may be wrong on this - not exactly sure what degree of REST implementation has been done by GBIF). Add in DELETE and UPDATE, and instead of GBIF running harvesters to capture data, contributors could simply push their data when necessary (a rough sketch of such a call follows this message). This would, I expect, be an attractive solution for those data providers that would prefer not to operate servers but would still like to contribute to the global knowledge pool of biodiversity.
More than likely a mixed model would be ideal - with some data sources acting as servers and others pushing their data. How about if those institutions that were comfortable running and maintaining servers also adopted the same complete REST implementation as the hypothetical GBIF mentioned above? And what if the servers were, for the most part, aware of each other and so could act as proxies or mirrors for the other servers (perhaps even automatically replicating content)? The end result would be more of a mesh topology with comparatively high reliability and availability.
It would, I think, be a more scalable solution, and perhaps more maintainable in the long term, since it is relatively simple to update a standalone application that pushes data compared with updating and securing a server. The expense of participating in the networks would drop, and resources could be directed towards operating a few high-quality, high-reliability services for accessing the data.
Just a thought. There are obvious social implications, such as the perception of losing control of one's data - but then it could also be argued that if a provider had the ability to DELETE their records from a server, they would actually have more control over the distribution of their data than they do currently.
cheers, Dave V.
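A push call of the kind described in the message above might look like the sketch below. The endpoint URL, token header, and JSON record format are all assumptions for illustration; as noted, it is not clear what writable REST interface, if any, GBIF actually exposes.

import json
import urllib.request

# Hypothetical push of a single record to an authenticated REST endpoint.
# The URL, header, and payload structure are invented for illustration.
ENDPOINT = "https://example.org/rest/occurrences"
TOKEN = "provider-api-token"

def push_record(record):
    req = urllib.request.Request(
        ENDPOINT,
        data=json.dumps(record).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer " + TOKEN},
        method="POST",  # a DELETE against ENDPOINT + "/<id>" would withdraw a record
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status  # e.g. 201 Created

# e.g. push_record({"catalogNumber": "12345", "scientificName": "Puma concolor"})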
Dave,
I think you are right. Whatever strategy is used there has to be an element of push in it. The production database has to push data to the publicly visible database, which can then be scraped/searched by interested parties. The difference between pushing data to a public database managed by your own institution and one managed by a third party is really quite minor.
I am concerned because I am not sure of the technical facilities and human resources of potential data suppliers.
I wonder if anyone has some figures on this stuff?
All the best,
Roger.
Dave, we are currently following a kind of push strategy with the development of a checklist provider tool for GBIF. The tool converts data in spreadsheets or text files into a TCS-compatible public database with a TAPIR service. That database can sit with the local provider or directly at GBIF.
regards, Wouter
Thanks Wouter,
How are you getting the data into the remote database? Have you designed your own service? Do you post spreadsheets to the server and get it to ingest them?
Roger
Greetings,
For figures on technical facilities and human resources:
I've included the results from a survey we did of the invasive species data providers we're working with. Some information on human resources can be pulled from the results. The providers are basically split, with about a quarter only able to spend one hour a year on technical issues, about a quarter willing to spend whatever it takes, and the rest somewhere in between. They are also spread across a variety of languages, web frameworks, and servers. I feel this makes for a particularly challenging user base.
We have divided our users into:
- Data Commons - the most sophisticated providers, who also give end-users the ability to add data into their systems over the web. A number of these will also harvest data from other providers and offer value-added features on the integrated data sets
- Providers - typical providers just wanting to provide their data and harvest as needed
- Contributors - users with data but without a server, who will need a Data Commons to put their data on the Internet
Web services are new to most of our providers, and I think they will either want to provide their data through a traditional pull service or will need a user interface to deal with the issues of data integration. This might be good content for a user survey to see what the larger GBIF user base is interested in and capable of.
For server configurations:
From talking to folks who have worked with a number of providers, most of them have a single server that is serving data through a firewall to the Internet.
At CSU we integrate all the data we obtain into a single server that goes through a firewall to the Internet (i.e. no mirror). We also have the ability for users to upload text files and Shapefiles into the database (Data Commons). We create maps on the fly, so we have to have all the data in the same schema and heavily index it. The software to convert the various formats into our schema was far more complicated than expected (~3 years for 2 people). We do have 2 other servers, but these are for development and testing of the web software.
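As a rough illustration of the kind of format conversion described above (though vastly simpler than a multi-year effort), the sketch below maps an uploaded CSV with arbitrary column headings onto a fixed internal schema. The column names and synonym table are invented, not taken from the NIISS system.

import csv

# Hypothetical mapping from the column headings providers actually use
# to a single internal schema; real mappings are far messier.
COLUMN_SYNONYMS = {
    "species": "scientific_name",
    "sciname": "scientific_name",
    "lat": "latitude",
    "y": "latitude",
    "lon": "longitude",
    "x": "longitude",
}

def normalise_csv(path):
    # Yield each uploaded row with its headings renamed to the internal schema.
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            yield {COLUMN_SYNONYMS.get(k.strip().lower(), k.strip().lower()): v
                   for k, v in row.items()}

# e.g. records = list(normalise_csv("upload.csv"))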
Jim
PS. Our web system is at www.niiss.org
Hi Roger,
The speciesLink network makes use of regional servers to mirror data from collections that cannot set up a provider service. Some of them still use dial-up connections for instance. The following diagram is not up-to-date but gives an idea of how many collections are using this approach in the network:
http://splink.cria.org.br/manager/pdf/esquema.pdf
We had to define our own protocol to achieve this. It's based on SOAP and has limitations, such as only handling tabular data, but it does its job. We developed client software in Java which is installed on providers' machines and has many interesting features. The newest ones include automatic updates of the software and the possibility of choosing mapping templates for specific collection management systems.
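To give a flavour of the approach - this is not the actual speciesLink protocol, and the element and collection names below are invented - pushing tabular data essentially means serialising rows into an XML payload that the regional server can ingest:

import xml.etree.ElementTree as ET

# Invented payload format: a table of records serialised as XML, roughly the
# kind of thing a SOAP-based push of tabular data has to carry.
def build_payload(collection_code, rows):
    root = ET.Element("dataPush", attrib={"collection": collection_code})
    for row in rows:
        rec = ET.SubElement(root, "record")
        for field, value in row.items():
            ET.SubElement(rec, field).text = str(value)
    return ET.tostring(root, encoding="unicode")

print(build_payload("XYZ", [{"catalogNumber": "123", "scientificName": "Tabebuia alba"}]))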
Best Regards, -- Renato
Renato,
Thanks for that. Great diagram!
If I read it correctly, the number of non-DiGIR (i.e. SOAP) providers is more than double the number of DiGIR providers, so to say that speciesLink is a DiGIR network only describes the public-facing part of it. In a way it is more like a SOAP network with a DiGIR backbone.
I am currently putting together a series of Ant build files with an AntForms interface for day-to-day use. It will effectively do the job of your Java client. I figured it was the easiest way to build configurable pipelines with some common components.
I intend to base the table structure on a TAPIR CNS. The CNS files effectively flatten the ontology into a table in a predictable way - but without the opportunity to have repeating properties. The server-side tables can then have the same schema.
The only downside of this is the need for a separate table for any one-to-many relationship, such as multiple identifications of specimens.
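A very rough sketch of the idea: a flat concept-to-column mapping generates the main table, and a separate child table handles a one-to-many relationship such as repeated identifications. The concept URIs, column names, and table layout here are invented for illustration and do not reproduce the real CNS format or the TDWG concept identifiers.

# Hypothetical concept-to-column mapping of the kind a flattened CNS implies.
FLAT_CONCEPTS = {
    "http://example.org/concepts/catalogNumber": "catalog_number",
    "http://example.org/concepts/scientificName": "scientific_name",
    "http://example.org/concepts/locality": "locality",
}

def flat_table_ddl(table, concept_map):
    # Build a CREATE TABLE statement with one column per mapped concept.
    cols = ",\n  ".join(col + " TEXT" for col in concept_map.values())
    return "CREATE TABLE " + table + " (\n  id INTEGER PRIMARY KEY,\n  " + cols + "\n);"

# One-to-many relationships (e.g. repeated identifications) need their own table.
IDENTIFICATION_DDL = """CREATE TABLE identification (
  id INTEGER PRIMARY KEY,
  specimen_id INTEGER REFERENCES specimen(id),
  scientific_name TEXT,
  identified_by TEXT,
  date_identified TEXT
);"""

print(flat_table_ddl("specimen", FLAT_CONCEPTS))
print(IDENTIFICATION_DDL)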
I may put a wiki page on the TAG wiki about this if I get a chance as it would be good for people to share experiences and know what resources are available.
All the best,
Roger
Dear Roger
We have the end-points for our DiGIR and BioCASE providers from the Royal Museum for Central Africa available here: http://193.190.223.48/
As a rule, we never connect our collection management database directly to GBIF or other networks; it is only available on the intranet. The data for the Internet (websites or networks) are copies that the curators have approved as ready to be shown to the public.
For our DiGIR and BioCASE providers, updates are done as a full new copy replacing the previous one, with all necessary back-ups to avoid losses. These exports have been automated and we are currently working on enhancing this.
Our amphibian data are hosted on virtual servers (using VMware), configured with the requirements needed to be accessible to GBIF and other networks, so that it does not disturb the other services of our museum and we have full freedom to adapt it to our needs.
As we collaborate on several projects, the export on the virtual server is kept in a PostgreSQL database. We have produced SQL views (in this case one for DarwinCore and another one for ABCD) and we use these "virtual tables" to map our fields against the schemas.
This works fine and avoids having too many copies of your data for different purposes. In case of changes we just need to change the script of the view.
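A minimal sketch of the kind of mapping view described above, with invented internal column names and DarwinCore-style output names; sqlite3 stands in here for the PostgreSQL database actually used.

import sqlite3

# Invented internal schema standing in for a collection management export.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE export (cat_no TEXT, taxon TEXT, country_code TEXT)")
conn.execute("INSERT INTO export VALUES ('SPEC-1', 'Hyperolius viridiflavus', 'CD')")

# A "virtual table": column aliases map internal names onto DarwinCore-style
# terms, so the wrapper can be configured against the view, not the raw table.
conn.execute("""
CREATE VIEW dwc_view AS
SELECT cat_no       AS CatalogNumber,
       taxon        AS ScientificName,
       country_code AS Country
FROM export
""")

print(conn.execute("SELECT * FROM dwc_view").fetchall())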
One drawback is that sometimes the visual interfaces of some of the provider packages cannot see the fields of a virtual table, and you have to do the mappings directly in the config files.
We were also told that if the views get too complicated to execute they may slow down the response time. Our views have not been complicated enough for us to experience this so far, and if it proved too slow it should be possible to avoid by caching the created virtual tables rather than executing the script each time.
Hope this helps by showing one way to do it that may not technically be the best solution, but one that managed to respect the relevant security issues, respect our museum's ICT requirements, and not disturb other services that are already running.
If you would like more detailed information on how we did it, with examples of scripts and so on, let us know.
Best regards
Pat
participants (6)
- Dave Vieglais
- Jim Graham
- Patricia Mergen
- Renato De Giovanni
- Roger Hyam
- Wouter Addink