Re: RDF query and inference in a distributed environment

3 Jan 2006

Hi Patricia,

Many thanks for the feedback (and thanks also to Bob -- who I neglected to
thank in my previous post).

What do you reckon would be the limiting social and financial factors for
full mirrors?  In social terms, if I'm going to expose my data to the world
anyway (e.g., via DiGIR), then I don't see why I would be socially reluctant
to allow others to mirror the data (provided robust synchronization protocols
are in place -- see my previous response to Bob; and provided data
"ownership" credentials are embedded within the core metadata).

As for financial, I prefaced my original post with the observation of ever
decreasing $/GB for storage space. I suspect that, before TDWG nails down
the GUID protocols, entry-level web servers (of the sort that even the most
modest DiGIR provider would need to establish) will come with nearly a TB of
disk storage space by default. Perhaps the cost of bandwidth will be a
limiting factor? Or maybe DB software capable of managing such large
datasets?

As for IPR -- well, ultimately that applies mostly to specimens.  And again,
assuming that "ownership" metadata remains intact, I see no basis for
increased apprehension about allowing mirrored copies of data records (as
GBIF already does, for example) over and above exposing them in the first
place.

Personally, I don't think the social, legal, or financial barriers are
significantly greater for a mass-mirror paradigm than they are for
distributed complementary data sets.  I suspect the major barriers will be
more technical (i.e., those aforementioned "robust synchronization
protocols").

Aloha,
Rich

-----Original Message-----
From: Taxonomic Databases Working Group GUID Project
[mailto:TDWG-GUID@LISTSERV.NHM.KU.EDU]On Behalf Of Patricia Mergen
Sent: Tuesday, January 03, 2006 10:31 PM
To: TDWG-GUID@LISTSERV.NHM.KU.EDU
Subject: Re: RDF query and inference in a distributed environment

Dear Richard

I agree with you that several mirror copies will be and are needed, preferably
well spread geographically as back-ups. This is exactly the approach of
GBIF, as they are now in the process of mirroring their services.

However, as highlighted by Bob Morris, there are social, but also financial,
barriers to having all contributing institutions run a "full" mirror. In order
to ensure the participation of all those who are willing, I believe that
a distributed system where each provider can participate with its own part
should be kept. Those who have the resources could of course set up full
mirrors if this matches their needs and if this is allowed by the providers
(there are also IPR issues which may be raised here by some institutions).

Patricia

Richard Pyle <deepreef@BISHOPMUSEUM.ORG> wrote:
...
Long term what I think might happen is that users have their own triple
stores, and as they do queries the results get added to their own
triple store and they can make inferences locally that they are
interested in. MIT's Piggy bank project
(http://simile.mit.edu/piggy-bank/) is an example of this sort of
approach.
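
As a rough illustration of the local-triple-store idea above, here is a
minimal Python sketch. It is not how Piggy Bank works internally; the class,
its methods, the "childOf" predicate, and the placeholder taxon names are all
made up for this example. It only shows results from separate queries being
merged locally, and one simple transitive inference being run over them.

```python
# Hypothetical sketch: a local triple store that accumulates query results
# and runs one simple inference rule locally. All names are illustrative.

class LocalTripleStore:
    def __init__(self):
        # Each triple is a (subject, predicate, object) tuple of strings.
        self.triples = set()

    def add_results(self, results):
        """Merge triples returned by a remote query into the local store."""
        self.triples.update(results)

    def infer_transitive(self, predicate):
        """Add (a, p, c) whenever (a, p, b) and (b, p, c) are both present."""
        changed = True
        while changed:
            changed = False
            # Snapshot the current edges so the set can be modified safely.
            edges = [(s, o) for (s, p, o) in self.triples if p == predicate]
            for a, b in edges:
                for c, d in edges:
                    if b == c and (a, predicate, d) not in self.triples:
                        self.triples.add((a, predicate, d))
                        changed = True

    def objects(self, subject, predicate):
        """Return every object linked to `subject` by `predicate`."""
        return {o for (s, p, o) in self.triples
                if s == subject and p == predicate}


store = LocalTripleStore()
# Results of two separate (imagined) remote queries, using placeholder names.
store.add_results({("GenusA", "childOf", "FamilyB")})
store.add_results({("FamilyB", "childOf", "OrderC")})
# Inference that neither remote source could have made on its own.
store.infer_transitive("childOf")
ancestors = store.objects("GenusA", "childOf")
```

After the inference step, `ancestors` holds both "FamilyB" and "OrderC", even
though no single query returned the GenusA-to-OrderC link.
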
With hard drive sizes spiraling skyward, and $/GB ($/TB) spiraling
downward.... I'm wondering whether or not the "distributed" system that
serves us best might be "distributed mirror copies", rather than
distributed complementary data. I've been pushing this approach for
taxonomic data for a while, but perhaps it would be useful for other shared
data as well (geographic localities, people/agents, publications/references,
etc.) Even for specimen data -- where "ownership" is unambiguous -- it
seems that as long as the ownership is clearly embedded in the core
metadata, there are more fundamental advantages in storing and serving data
from multiple data resources, rather than serving it from only one single
data resource.

One way to look at it would be "robust caching", with automated update
capabilities. The main benefits would be:

1) Large-scale distributed backup of the world's biodata (ensuring
perpetuity across a changing technological landscape);
2) Performance and reliability enhancement for local data authority needs;
3) Essentially 100% data availability (like DNS), regardless of which
servers are up or down at any given moment;
4) Maximization of distributed work/effort for data "maintenance and
repair".

The point is, the technology discussions would focus less on issues of
distributed queries, and more on issues of replication/synchronization and
data edit authorization protocols.
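
To make the replication and edit-authorization question a little more
concrete, here is a minimal Python sketch under some loud assumptions: every
record carries its owner and an integer version in its core metadata, only the
owner may edit, and mirrors pull whatever is newer. The Record and Mirror
classes, their fields, and the example GUID and owner strings are hypothetical
and do not correspond to any existing DiGIR or GBIF protocol.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    guid: str      # globally unique identifier (format is an assumption)
    owner: str     # "ownership" credential embedded in the core metadata
    version: int   # increases with every authorized edit
    data: str


class Mirror:
    """One full mirror copy; every name here is illustrative only."""

    def __init__(self):
        self.records = {}

    def apply_edit(self, editor, record):
        """Accept an edit only from the record's owner, and only if newer."""
        current = self.records.get(record.guid)
        if current is not None:
            # Ownership may not change, and only the owner may edit.
            if editor != current.owner or record.owner != current.owner:
                return False
            if record.version <= current.version:
                return False
        elif editor != record.owner:
            return False
        self.records[record.guid] = record
        return True

    def sync_from(self, other):
        """Pull every record that is missing here or newer on `other`."""
        pulled = 0
        for guid, theirs in other.records.items():
            mine = self.records.get(guid)
            if mine is None or theirs.version > mine.version:
                self.records[guid] = theirs
                pulled += 1
        return pulled


a, b = Mirror(), Mirror()
created = a.apply_edit("museum-x", Record("guid:1", "museum-x", 1, "original"))
first_pull = b.sync_from(a)
# An edit by anyone other than the owner is refused on any mirror.
rejected = b.apply_edit("someone-else",
                        Record("guid:1", "museum-x", 2, "tampered"))
updated = a.apply_edit("museum-x", Record("guid:1", "museum-x", 2, "corrected"))
second_pull = b.sync_from(a)
```

This sketch trusts the editor name it is given and trusts the other mirror
during sync; a real protocol would need signed credentials and some conflict
handling, which is exactly where I expect the hard technical work to be.
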

Perhaps this would be reaching too far, too soon. But on the other hand, I
don't see why implementing a "distributed mirror" system would be any more
technically, financially, or socially challenging than implementing a
distributed query system for distributed data.

Aloha,
Rich

Richard L. Pyle, PhD
Database Coordinator for Natural Sciences
and Associate Zoologist in Ichthyology
Department of Natural Sciences, Bishop Museum
1525 Bernice St., Honolulu, HI 96817
Ph: (808)848-4115, Fax: (808)847-8252
email: deepreef@bishopmuseum.org
http://hbs.bishopmuseum.org/staff/pylerichard.html
