Re: RDF query and inference in a distributed environment
Dear Richard
I agree with you that several mirror copies are and will be needed as back-ups, preferably well spread geographically. This is exactly the approach of GBIF, which is now in the process of mirroring its services.
However, as highlighted by Bob Morris, there are social but also financial barriers to having all contributing institutions run a "full" mirror. To ensure the participation of all those who are willing, I believe that a distributed system in which each provider can participate with its own part should be kept. Those who have the resources could of course set up full mirrors if this matches their needs and if the providers allow it (there are also IPR issues which some institutions may raise here).
Patricia
Richard Pyle deepreef@BISHOPMUSEUM.ORG wrote:
> Long term what I think might happen is that users have their own triple stores, and as they do queries the results get added to their own triple store and they can make inferences locally that they are interested in. MIT's Piggy Bank project (http://simile.mit.edu/piggy-bank/) is an example of this sort of approach.
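As a rough illustration of the "local triple store" idea quoted above, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the `LocalTripleStore` class, the `ex:childOf` predicate and the taxon names are hypothetical, and a real deployment would use a proper RDF store rather than a Python set. It only shows query results accumulating locally and one simple inference (transitive closure over a single predicate) being made over them.

```python
from typing import Iterable, Set, Tuple

# A triple is (subject, predicate, object), all plain strings in this sketch.
Triple = Tuple[str, str, str]


class LocalTripleStore:
    """Hypothetical in-memory store that accumulates remote query results."""

    def __init__(self) -> None:
        self.triples: Set[Triple] = set()

    def add_results(self, results: Iterable[Triple]) -> None:
        # Merge the results of a remote query into the local store.
        self.triples.update(results)

    def infer_transitive(self, predicate: str) -> int:
        """Add the transitive closure of `predicate`; return how many triples were added."""
        added = 0
        changed = True
        while changed:
            changed = False
            # Snapshot the current links so the set can be modified safely.
            links = [(s, o) for (s, p, o) in self.triples if p == predicate]
            for s, mid in links:
                for mid2, o in links:
                    if mid == mid2 and (s, predicate, o) not in self.triples:
                        self.triples.add((s, predicate, o))
                        added += 1
                        changed = True
        return added


# Two separate "queries" whose results end up in the same local store.
store = LocalTripleStore()
store.add_results([("ex:Chaetodon", "ex:childOf", "ex:Chaetodontidae")])
store.add_results([("ex:Chaetodontidae", "ex:childOf", "ex:Perciformes")])

# Neither provider stated this link; it is inferred locally.
new_count = store.infer_transitive("ex:childOf")
```

The point of the sketch is only that the inference happens on the user's side, over data gathered from several providers, without any provider having to support it.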
With hard drive sizes spiraling skyward, and $/GB ($/TB) spiraling downward.... I'm wondering whether or not the "distributed" system that serves us best might be "distributed mirror copies", rather than distributed complementary data. I've been pushing this approach for taxonomic data for a while, but perhaps it would be useful for other shared data as well (geographic localities, people/agents, publications/references, etc.). Even for specimen data -- where "ownership" is unambiguous -- it seems that as long as the ownership is clearly embedded in the core metadata, there are more fundamental advantages in storing and serving data from multiple data resources, rather than serving it from only one single data resource.
One way to look at it would be "robust caching", with automated update capabilities. The main benefits would be:
1) Large-scale distributed backup of the world's biodata (ensuring perpetuity across a changing technological landscape);
2) Performance and reliability enhancement for local data authority needs;
3) Essentially 100% data availability (like DNS), regardless of which servers are up or down at any given moment;
4) Maximization of distributed work/effort for data "maintenance and repair".
The point is, the technology discussions would focus less on issues of distributed queries, and more on issues of replication/synchronization and data edit authorization protocols.
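To make the "replication/synchronization and data edit authorization" point a little more concrete, here is a minimal Python sketch. All names in it (`Record`, `Mirror`, `apply_edit`, `sync_from`, the mirror and owner identifiers) are hypothetical, and it assumes a very simple rule set that is not taken from any existing system: each record carries its owner and a version number in its metadata, only the owner may edit, and a mirror copies any record for which a peer holds a newer version.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Record:
    record_id: str
    owner: str      # data authority, embedded in the core metadata
    version: int    # assumed to be incremented by the owner on every edit
    payload: str


class Mirror:
    """Hypothetical full mirror holding a copy of every record it has seen."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.records: Dict[str, Record] = {}

    def apply_edit(self, editor: str, record: Record) -> bool:
        """Accept an edit only from the record's owner and only if it is newer."""
        current = self.records.get(record.record_id)
        owner = current.owner if current else record.owner
        if editor != owner or record.owner != owner:
            return False  # edit authorization failed
        if current and record.version <= current.version:
            return False  # stale or duplicate edit
        self.records[record.record_id] = record
        return True

    def sync_from(self, peer: "Mirror") -> int:
        """Pull every record that is missing here or newer on the peer."""
        copied = 0
        for rid, theirs in peer.records.items():
            mine = self.records.get(rid)
            if mine is None or theirs.version > mine.version:
                self.records[rid] = theirs
                copied += 1
        return copied


mirror_a = Mirror("mirror-a")
mirror_b = Mirror("mirror-b")

# The owner publishes a record on one mirror; the other picks it up on sync.
accepted = mirror_a.apply_edit("museum-x", Record("spec-1", "museum-x", 1, "first version"))
copied = mirror_b.sync_from(mirror_a)

# A non-owner trying to edit the mirrored copy is refused.
rejected = mirror_b.apply_edit("someone-else", Record("spec-1", "museum-x", 2, "tampered"))
```

A real protocol would also need to handle deletions, conflicting concurrent edits and authentication of the editor, none of which this sketch attempts.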
Perhaps this would be reaching too far, too soon. But on the other hand, I don't see why implementing a "distributed mirror" system would be any more technically, financially, or socially challenging than implementing a distributed query system for distributed data.
Aloha, Rich
Richard L. Pyle, PhD
Database Coordinator for Natural Sciences
and Associate Zoologist in Ichthyology
Department of Natural Sciences, Bishop Museum
1525 Bernice St., Honolulu, HI 96817
Ph: (808)848-4115, Fax: (808)847-8252
email: deepreef@bishopmuseum.org
http://hbs.bishopmuseum.org/staff/pylerichard.html