[tdwg-tapir] Hosting strategies

Jim Graham jim at nrel.colostate.edu
Tue May 15 22:35:03 CEST 2007


For figures on technical facilities and human resources:

I've included the results from a survey we did of the invasive species
data providers we're working with.  Some information on human resources
can be pulled from the results. The providers are roughly split: about
1/4 can spend only one hour a year on technical issues, about 1/4 are
willing to spend whatever it takes, and the rest fall in between.  They are
also spread across a variety of languages, web frameworks, and servers.
I feel this makes for a particularly challenging user base.

We have divided our users into:

- Data Commons - the most sophisticated providers who also provide the
ability for end-users to add data into their systems over the web.  A
number of these will also harvest data from other providers and build
some value-added features on the integrated data sets.

- Providers - typical providers just wanting to provide their data and
harvest as needed

- Contributors - users with data but without a server who will need a
Data Common to put their data on the Internet

Web services are new to most of our providers and I think they will
either want to provide their data through a traditional pull service or
will need a user interface to deal with the issues of data integration.
This might be a good topic for a user survey to see what the larger
GBIF user base is interested in and capable of.

For server configurations:

From talking to folks who have worked with a number of providers, most
of them have a single server that is serving data through a firewall to
the Internet.  

At CSU we integrate all the data we obtain into a single server that
goes through a firewall to the Internet (i.e. no mirror).  We also have
the ability for users to upload text files and Shapefiles into the
database (Data Common).  We create maps on the fly so we have to have
all the data in the same schema and heavily index it.  The software to
convert the various formats into our schema was far more complicated
than expected (~3 years for 2 people).  We do have 2 other servers but
these are for development and testing of the web software.  
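
As an illustration of the conversion step described above, here is a
minimal sketch of mapping provider records with different column names
onto one common schema.  The provider names, field names, and sample
values are all hypothetical; they are not taken from the NIISS system,
and a real converter also has to handle units, projections, and file
formats such as Shapefiles.

```python
# Hypothetical common schema: every record ends up with these fields.
COMMON_FIELDS = ("species", "latitude", "longitude")

# Hypothetical per-provider mappings from source column to common field.
FIELD_MAPS = {
    "provider_a": {"SciName": "species", "Lat": "latitude", "Lon": "longitude"},
    "provider_b": {"taxon": "species", "y": "latitude", "x": "longitude"},
}


def to_common_schema(provider, row):
    """Convert one provider row (a dict) into the common schema."""
    mapping = FIELD_MAPS[provider]
    out = {field: None for field in COMMON_FIELDS}
    for source_name, value in row.items():
        target = mapping.get(source_name)
        if target is not None:  # columns without a mapping are dropped
            out[target] = value
    # Coordinates often arrive as text; store them as numbers.
    for coord in ("latitude", "longitude"):
        if out[coord] is not None:
            out[coord] = float(out[coord])
    return out
```

Keeping the mappings in a table rather than in code means a new
provider only needs a new entry, which matters when providers have very
little time to spend on technical issues.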


PS. Our web system is at www.niiss.org

-----Original Message-----
From: tdwg-tapir-bounces at lists.tdwg.org
[mailto:tdwg-tapir-bounces at lists.tdwg.org] On Behalf Of Roger Hyam
Sent: Monday, May 14, 2007 3:40 AM
To: Dave Vieglais
Cc: tdwg-tapir at lists.tdwg.org
Subject: Re: [tdwg-tapir] Hosting strategies


I think you are right. Whatever strategy is used there has to be an  
element of push in it. The production database has to push data to  
the publicly visible database that can then be scraped/searched by  
interested parties. The difference between pushing data to a public  
database that is managed by your own institution or one that is  
managed by a third party is really quite minor.
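
A minimal sketch of that push step, assuming each record has a stable
identifier and that both databases can be viewed as mappings from
identifier to record.  The function names are hypothetical and the
public store is a plain dictionary here; in practice the three lists
would drive create, update, and delete requests against whichever
public database is used, local or third party.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]


def plan_push(production: Dict[str, Record],
              public: Dict[str, Record]) -> Tuple[List[str], List[str], List[str]]:
    """Return the record IDs to create, update, and delete in the public copy."""
    to_create = sorted(k for k in production if k not in public)
    to_update = sorted(k for k in production
                       if k in public and production[k] != public[k])
    to_delete = sorted(k for k in public if k not in production)
    return to_create, to_update, to_delete


def apply_push(production: Dict[str, Record],
               public: Dict[str, Record]) -> Tuple[List[str], List[str], List[str]]:
    """Bring the public copy in line with production and report what changed."""
    to_create, to_update, to_delete = plan_push(production, public)
    for key in to_create + to_update:
        public[key] = dict(production[key])  # copy so the stores stay independent
    for key in to_delete:
        del public[key]
    return to_create, to_update, to_delete
```

The same plan works whether the public database belongs to the
provider's own institution or to a third party; only the code that
carries out the changes differs.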

I am concerned because I am not sure of the technical facilities and  
human resources of potential data suppliers.

I wonder if anyone has some figures on this stuff?

All the best,


On 11 May 2007, at 12:02, Dave Vieglais wrote:

> Hi Everyone,
> Not really a TAPIR specific response, but perhaps the right
> audience.  I'm probably stating the obvious, but the simplest way  
> to get around the hassles of running a server and the associated  
> firewall headaches is not to serve the data but instead to push  
> it.  By adding an authentication layer, it would be an extension to  
> the GBIF REST services to allow POSTing data, rather than just GET  
> (I may be wrong on this - not exactly sure what degree of REST  
> implementation has been done by GBIF).  Add in DELETE and UPDATE  
> and instead of GBIF running harvesters to capture data,  
> contributors could simply push their data when necessary.  This  
> would I expect be an attractive solution for those data providers  
> that would prefer not to operate servers but would still like to  
> contribute to the global knowledge pool of biodiversity.
> More than likely a mixed model may be more ideal - with some data
> sources acting as servers, and others pushing their data.  How  
> about if those institutions that were comfortable running and  
> maintaining servers also adopted the same complete REST  
> implementation as the hypothetical GBIF mentioned above?  And what  
> if the servers were, for the most part, aware of each other and so  
> could act as proxies or mirrors for the other servers (perhaps even  
> automatically replicating content).  The end result would be more  
> of a mesh topology of comparatively high reliability and availability.
> It would I think be a more scalable solution, and perhaps more
> maintainable in the long term, since it is a relatively simple  
> thing to update a standalone application that could push the data  
> compared with updating and securing a server.  The expense of  
> participation in the networks would drop, and resources could be  
> directed towards operation of a few high quality / high reliability  
> services for accessing the data.
> Just a thought.  There are obvious social implications, such as the
> perception of losing control of one's data - but then it could  
> also be argued that if a provider had the ability to DELETE their  
> records from a server, then they actually have more control over  
> the distribution of their data than currently.
> cheers,
>   Dave V.
> On May 11, 2007, at 19:19, Roger Hyam wrote:
>> Hi Everyone,
>> There is a requirement that all wrapper type applications (TAPIR,
>> DiGIR, BioCASe and others) have but that I don't think we address.
>> All instances need to have:
>> Either a database on a server in a DMZ or with an ISP with the
>> ability to export data from the production database to the public  
>> database and then keep changes in the production database  
>> synchronized with the public database.
>> Or the ability to provide a secured/restricted connection directly
>> to production database through the firewall.
>> Configuring the wrapper software against a database seems a
>> smaller problem than getting a handle on an up to date database to  
>> configure it against!
>> Should we have a recommended strategy or best practice for
>> overcoming these problems? Do we have any figures on how they are  
>> overcome in the existing BioCASe and DiGIR networks?
>> Many thanks for your thoughts,
>> Roger
>> _______________________________________________
>> tdwg-tapir mailing list
>> tdwg-tapir at lists.tdwg.org 
>> http://lists.tdwg.org/mailman/listinfo/tdwg-tapir


-------------- next part --------------
A non-text attachment was scrubbed...
Name: SurveyResults.doc
Type: application/msword
Size: 837120 bytes
Desc: not available
Url : http://lists.tdwg.org/pipermail/tdwg-tag/attachments/20070515/0a832194/attachment.doc 
