RDF query and inference in a distributed environment

Richard Pyle deepreef at BISHOPMUSEUM.ORG
Wed Jan 4 09:22:49 CET 2006


Wow!  Lots of responses this morning. Many thanks to all who responded! Some
replies:

Patricia wrote:

> I agree with you about the logic in this. However, according to my daily
> experience with potential data providers, there is a lot of teaching and
> convincing needed before it is accepted that this does not result
> in the loss of control over one's own data. I agree that to be convincing,
> robust synchronization is needed.

I understand the psychology, and agree that the barrier for "expose your
data online so that everyone can access it" strikes a more palatable chord
than "expose your data online so that everyone can mirror it".

> I agree that for machines and storage it is not that expensive. I was
> more referring to the human resources needed to manage the
> mirror. Smaller institutions do not necessarily have the funds, or cannot
> justify to their hierarchy that staff is devoting time to maintain a
> full mirror containing mainly "references" to information coming from
> other institutions, but it is easier to justify the time spent to
> contribute to the whole with the part directly concerning the institution
...

Agreed -- but I really wonder how much more time would be involved in
setting up an automated mirror, compared with, e.g., a DiGIR provider.  In
my earlier post when I said that barriers would likely be technical, I
didn't mean that technology doesn't exist (it does).  Rather, I meant that
in order to be successful, setting up and maintaining a mirror would have to
be no more technically challenging than establishing a DiGIR provider.  I'm
not sure that's possible (yet).

> Yes, I agree with you that robust synchronization will be needed, but
> as my IT colleague always reminds me, I guess we must not forget that
> setting up an IT infrastructure is most of the time 10% technical
> issues to be solved and 90% of the time solving "human problems
> and barriers" to make it work and be accepted ...

I agree -- but part of overcoming the human problems and barriers is to make
setting up a mirror site relatively easy for a non-specialist IT person.
And this is really a technical challenge.  The other human problems &
barriers (IPR, allowing access to hard-earned data, confidence that data
edit authorization rules will be enforced, etc.) also have some technical
foundation, and also exist for the distributed/complementary approach.

Ricardo wrote:

>     Instead of being mutually exclusive, both approaches you mention are
> complementary. The "distributed complementary data" approach is a
> fundamental part of the infrastructure necessary to build the
> "distributed mirror copies" you propose. That (the "distributed
> complementary data" approach) is essentially what we have in place now
> with DiGIR/BioCase/Tapir as the harvesting protocol and GBIF and other
> institutions as harvesters. The only missing piece of software we need
> to really have "an automated and robust synchronization protocol", in my
> opinion, is some kind of push mechanism to trigger updates in the caches.

Agreed on all points.
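
To make that "missing piece" concrete, here is a rough sketch (in Python,
with invented URLs and field names -- an illustration, not a proposal) of
what a minimal push mechanism might look like: when a record changes, the
provider notifies each registered harvester, and the harvester then
re-fetches the record through its normal pull protocol:

import json
import urllib.request

# Hypothetical notification endpoints for registered harvesters.
HARVESTER_ENDPOINTS = [
    "http://gbif.example.org/notify",
    "http://mirror.example.net/notify",
]

def notify_harvesters(record_guid, changed_at):
    """Tell each harvester that one record changed; they pull the rest."""
    message = json.dumps({
        "guid": record_guid,        # GUID of the changed record
        "changed_at": changed_at,   # ISO 8601 timestamp of the change
        "action": "update",         # could also be "insert" or "delete"
    }).encode("utf-8")
    for url in HARVESTER_ENDPOINTS:
        request = urllib.request.Request(
            url, data=message,
            headers={"Content-Type": "application/json"})
        try:
            urllib.request.urlopen(request, timeout=10)
        except OSError:
            # A dead harvester just misses the push; it can catch up
            # later with an ordinary scheduled pull.
            pass

notify_harvesters("urn:lsid:example.org:specimen:12345",
                  "2006-01-04T09:22:49Z")

The nice property of "push to notify, pull to fetch" is that caches stay
fresh without the provider having to know anything about how each
harvester stores its copy.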

>     However, I think it is not very useful to try to standardize how the
> distributed mirror copies should be built and organized.

Here's where I disagree somewhat.  Standardized mirror protocols would be a
fundamental necessity of the paradigm I described; the paradigm simply would
not work without such standardization.  Non-standard means non-interoperable,
and therefore no automated synchronization. What makes it useful is that
every participating organization has local (=high-performance) access to a
full biodiversity dataset. Another thing that makes it useful is that,
rather than needing *all* data providers to be fully functional at any given
moment for 100% data retrieval, only one mirror site needs to be functional.

I understand this is the role that GBIF already serves -- and perhaps that's
all that's really needed.
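
Just to illustrate the level of standardization I have in mind: every
mirror would need to agree on something like a common change-log entry, so
that replaying another site's log becomes a purely mechanical operation.
A minimal sketch, with invented field names (this is not a proposed TDWG
standard):

from dataclasses import dataclass

@dataclass
class ChangeLogEntry:
    guid: str       # globally unique identifier for the record
    owner: str      # institution with edit authority over the record
    sequence: int   # per-provider, monotonically increasing version number
    action: str     # "insert", "update", or "delete"
    payload: dict   # the record itself (e.g., Darwin Core fields)

def apply_entry(local_store, entry):
    """Apply one change-log entry to a local mirror, idempotently."""
    current = local_store.get(entry.guid)
    # Skip entries already applied, so replaying a log is harmless.
    if current is not None and current["sequence"] >= entry.sequence:
        return
    if entry.action == "delete":
        local_store.pop(entry.guid, None)
    else:
        local_store[entry.guid] = {"owner": entry.owner,
                                   "sequence": entry.sequence,
                                   "data": entry.payload}

If every warehouse can both emit and consume one such format,
synchronization between any pair of mirrors reduces to shipping log
entries back and forth.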

> A (socially)
> decentralized scheme would work better in this case: data providers
> would make available data they create and that is under their direct
> custody and individual harvesters would be free to look at the metadata
> being served and create their own caches, selectively harvesting only
> the information that is relevant to the services they intend to provide.

There really should be no difference, as long as data edit authorization
protocols are as robust as the synchronization protocols.  If I'm exposing
my specimen data for the world to access (e.g., via DiGIR/BioCase/Tapir),
then it shouldn't bother me that GBIF is mirroring it -- as long as GBIF
includes metadata about data ownership, and as long as the only people who
can edit the data are people whom I authorize to do so.  And if GBIF is
mirroring it, what difference does it make if 100, or 1,000, or 10,000
other servers mirror it (under the same assumptions)?  The two things that are
important to me are: 1) corrections to data get propagated quickly to
mirrors; and 2) only people who I authorize to edit my data may do so.
Other than that, it doesn't really matter if the data are only on my
server's hard drive, or are replicated on GBIF's hard drive, or are
replicated on 10,000 server hard drives.
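
In code, the second requirement is just a gate in front of every write.
A sketch with bare usernames for brevity (all identifiers and sample data
below are invented; a real system would presumably authenticate with
signatures or certificates rather than names):

# Owner institution -> editors that owner has authorized (hypothetical).
AUTHORIZED_EDITORS = {
    "bishopmuseum.org": {"deepreef", "collections_manager"},
}

def accept_edit(record, editor, new_data):
    """A mirror applies an edit only if the record's owner authorized it."""
    if editor not in AUTHORIZED_EDITORS.get(record["owner"], set()):
        return False                 # unauthorized: reject; mirror unchanged
    record["data"].update(new_data)  # authorized: apply the change
    record["sequence"] += 1          # bump the version so mirrors re-sync
    return True

record = {"owner": "bishopmuseum.org", "sequence": 1,
          "data": {"scientificName": "Centropyge boylei"}}
accept_edit(record, "deepreef", {"locality": "corrected locality"})  # True
accept_edit(record, "stranger", {"locality": "bogus"})               # False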

But the 10,000 hard drives option gives me two advantages: 1) confidence
that the data will survive even continental-scale catastrophes; and 2)
assurance that my server's hard drive (one of the 10,000) has local and
immediate access to everyone else's data (for use as authorities, etc.).

Rod Page wrote:

[an excellent and thought-provoking post, making important, if
uncomfortable, observations]

> What is present in bioinformatics are large numbers of "value added"
> databases that take GenBank,  PubMed, etc. and do neat things with
> them. This is possible because you can download the entire database.

Exactly!  The real value of portals to end users is the value-added "neat
things". I think we ought to encourage as many enterprising biology/data
nerds as possible to build such things.

> Each one of these value added databases does need to deal with the
> issue of what happens when GenBank (say) changes, but because GenBank
> has well defined releases, essentially they can grab a new copy of the
> data, update their local copy, and regenerate their database.

Wouldn't it be so much nicer if the protocols were in place to automatically
propagate updates to the value-added sites in real time, so that the
aggregated databases were effectively self-maintaining?  I guess the
fundamental question of this whole discussion ultimately boils down to
"push" vs. "pull" paradigms.

> In summary, I think the issue raised by Rich is important, but is one
> to be addressed by whoever takes on the task of assembling a data
> warehouse from the individual providers. Of course, once providers make
> their data available, anybody can do this...

Yes, but anybody (everybody) has to rebuild it themselves from scratch.  What
I'm suggesting is to use the TDWG/GBIF/GUID standards community to come up
with standard protocols to lower the technological bar of becoming an
aggregator -- so much so that it becomes the real foundation of data
distribution, dissemination, and access (and makes it feasible for the
enterprising biology/data nerds to make good use of their late night
hours....)

In a later post, Rod wrote:

> I think there is a continuum of possibilities:
>
> 1. Pure distributed - query always sent to remote sources, no data ever
> held locally
>
> 2. Distributed with cache - results from a user query are cached
> (either for a limited time, or until source sends a message saying its
> data has been updated).
>
> 3. Distributed with partial local copy - some information harvested
> from sources stored locally (e.g., metadata), detailed information only
> held by sources.
>
> 4. Not distributed (data warehouse) - all data from distributed sources
> held locally (harvested), with periodic updates.

I would add to this list:

5. Distributed Warehouses -- all data are mirrored across many warehouses,
automatically kept synchronized in real time. Queries can be sent to any
one warehouse.
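
From the client's point of view, the appeal of option 5 is that any single
warehouse can answer any query, so availability reduces to a simple
failover loop. A sketch (mirror URLs and the query interface are invented
for illustration):

import urllib.error
import urllib.parse
import urllib.request

# Hypothetical warehouse mirrors, each holding the full synchronized dataset.
MIRRORS = [
    "http://warehouse1.example.org/search",
    "http://warehouse2.example.net/search",
    "http://warehouse3.example.edu/search",
]

def query_any_mirror(scientific_name):
    """Return results from the first reachable mirror; all hold the same data."""
    params = urllib.parse.urlencode({"name": scientific_name})
    for base in MIRRORS:
        try:
            with urllib.request.urlopen(base + "?" + params, timeout=10) as r:
                return r.read()
        except urllib.error.URLError:
            continue   # this mirror is down; any other will do just as well
    raise RuntimeError("no mirror reachable")

Only one of the mirrors needs to be up for the query to succeed, which is
exactly the availability argument I made above.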

Also, Rod's reference to "The Cathedral and the Bazaar" is excellent and
highly relevant. The Hawaii Biological Survey is a good example of this in
our community.

Chuck Miller wrote:

> To cache or mirror data in multiple locations, the dilemma lies
> in the simplest question: "Where is my data?"  A cache or mirror
> from a technologist's perspective is just a technical trick, identical
> to the original, dependent upon the master version, no big deal.
> But, simplistically, the data may be actually located in another
> country, no matter how it got there, and some politicians could
> misconstrue the whole thing, if not properly negotiated and agreed
> upon in advance.  In my experience, negotiating such agreements
> is more work than the technical development.

I agree the political issues may be show-stoppers -- but mostly because of
unwarranted paranoia on the part of the data owners (unfortunate, but
certainly real).  But the reality is, any data exposed on a distributed
provider (indeed, any data published in any form -- electronic or paper) is
already "out there", and located in many different countries.  Data
contained in a printed publication simultaneously exist in many countries.
The only difference with the distributed warehouse approach is that the data
are much, MUCH more powerful.  Perhaps the real political barrier is putting
so much power into so many hands. Sort of like the World Wide Web itself.

> I think the distributed nature of DiGIR was critical to selling
> it at the start of GBIF.  The original design assured providers that
> their source data would "stay" in their country and not be wholesale
> copied somewhere else.  It's hard to say what the political effect
> of creating mirrors would be.

Agreed on all points.  There was a time when the world wasn't even ready
for the DiGIR approach.  Perhaps it's too soon to push it to the next
level.

Bear in mind, though, that all these IPR/political issues relate mostly to
specimen data.  The real value of the "distributed warehouses" approach to
the warehouse providers themselves lies in the things that would
traditionally be thought of as "authorities" (taxon names, reference
citations, geographic gazetteers, etc.).  Maybe the distributed warehouse
approach will work well for these kinds of data, but not for specimen data?

O.K., I should probably stop here and get to work....

Aloha,
Rich

Richard L. Pyle, PhD
Database Coordinator for Natural Sciences
  and Associate Zoologist in Ichthyology
Department of Natural Sciences, Bishop Museum
1525 Bernice St., Honolulu, HI 96817
Ph: (808)848-4115, Fax: (808)847-8252
email: deepreef at bishopmuseum.org
http://hbs.bishopmuseum.org/staff/pylerichard.html



