RDF query and inference in a distributed environment

Bob Morris ram at CS.UMB.EDU
Tue Jan 3 18:47:05 CET 2006

I'm a great believer in caching, but it comes with one cost that is
trivial technically and not always trivial socially. That is, the data
originator must be willing to enter into a contract (in the software,
not legal, sense) to update the data only at specific times, that is, to
provide a "good until" promise. Otherwise, a client of the cached
version cannot tell whether the data in the cache is still accurate.
Only with some notion of expiration can the cache host determine whether
a query must induce a cache refresh. Often when I've raised this point
with biodiversity data providers, they hold out the position that they
have to be allowed to update the data whenever they need to (as opposed,
say, to waiting a day, a week, or...). I always find this a strange
position from a community which tolerates delays of years before, e.g.,
a taxonomic name is official, because that's how long the journal took
from submission to publication....
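
A minimal sketch of the "good until" idea, in Python, assuming the
originator hands back an expiry timestamp along with each value. The
names ExpiringCache, CachedRecord, fetch and good_until are hypothetical,
chosen only for illustration; this is one possible arrangement, not a
description of any provider's actual software.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class CachedRecord:
    value: str
    good_until: float  # epoch seconds; the originator's "good until" promise


class ExpiringCache:
    """Cache that refreshes an entry only once its promise has run out."""

    def __init__(
        self,
        fetch: Callable[[str], Tuple[str, float]],
        clock: Callable[[], float] = time.time,
    ) -> None:
        # fetch(key) is assumed to return (value, good_until) from the originator.
        self._fetch = fetch
        self._clock = clock
        self._store: Dict[str, CachedRecord] = {}
        self.refreshes = 0  # how many times the originator was contacted

    def get(self, key: str) -> str:
        record = self._store.get(key)
        # Refresh when nothing is cached yet or the promise has expired.
        if record is None or self._clock() >= record.good_until:
            value, good_until = self._fetch(key)
            record = CachedRecord(value, good_until)
            self._store[key] = record
            self.refreshes += 1
        return record.value
```

Within the promised window every query is answered from the cache; the
first query at or after good_until goes back to the originator. Without
the good_until value, the test in get() has nothing to compare against,
which is the whole difficulty.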

There are several functionally equivalent solutions to the technical
problem and this list needn't discuss them (not the least because more
than a few readers are probably tired of my carping about this point).

Richard Pyle wrote:

>>Long term what I think might happen is that users have their own triple
>>stores, and as they do queries the results get added to their own
>>triple store and they can make inferences locally that they are
>>interested in. MIT's Piggy bank project
>>(http://simile.mit.edu/piggy-bank/) is an example of this sort of
>With hard drive sizes spiraling skyward, and $/GB ($/TB) spiraling
>downward.... I'm wondering whether or not the "distributed" system that
>serves us best might be "distributed mirror copies", rather than
>distributed complementary data.  I've been pushing this approach for
>taxonomic data for a while, but perhaps it would be useful for other shared
>data as well (geographic localities, people/agents, publications/references,
>etc.)  Even for specimen data -- where "ownership" is unambiguous -- it
>seems that as long as the ownership is clearly embedded in the core
>metadata, there are more fundamental advantages in storing and serving data
>from multiple data resources, rather than serving it from only one single
>data resource.
>One way to look at it would be "robust caching", with automated update
>capabilities.  The main benefits would be:
>1) Large-scale distributed backup of the world's biodata (ensuring
>perpetuity across a changing technological landscape);
>2) Performance and reliability enhancement for local data authority needs;
>4) Essentially 100% data availability (like DNS), regardless of which
>servers are up or down at any given moment;
[This is not entirely true of the DNS. If no authoritative server is
accessible for a record at the time a DNS record expires, then
applications have to just decide whether they will attempt to use, and
accept as trustworthy, whatever answers at the expired IP address, if
indeed anything does. Not that the DNS attempts to provide any trust
model anyway...]

>3) Maximization of distributed work/effort for data "maintenance and
>The point is, the technology discussions would focus less on issues of
>distributed queries, and more on issues of replication/synchronization and
>data edit authorization protocols.
>Perhaps this would be reaching too far, too soon.  But on the other hand, I
>don't see why implementing a "distributed mirror" system would be any more
>technically, financially, or socially challenging than implementing a
>distributed query system for distributed data.
