Re: RDF query and inference in a distributed environment

5 Jan 2006

Coming from an IT background rather than a taxonomic background, I have never understood the strong sense of "ownership" that people/scientists have over their data.  This seems rather short-sighted to me - people with these concerns must have given some thought to how "their" data will be maintained, exposed and used in the long term?  I can understand their concerns, but there must be a solution, otherwise their data will be no one's concern in 50 years' time when it disappears from existence.

Another thought I had about data caching systems.  Say you want to search the cached/centralised copy of the data (e.g. a GBIF cache).  A list of results is returned, then you decide you want to view more details of one of the results, so you follow a link to the associated data (in theory via the GUID system we are discussing).  You would then be viewing the details of the selected record at the location the GUID resolves to - this would always be the same location, as a GUID only resolves to a single location.  Is this correct, or would the intention be to view the cached details of the selected record (which would require a separate ID for all the cached records)?  It's this navigation through the caches/repositories that I am not quite sure about.
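
To make the two navigation paths concrete, here is a minimal Python sketch. It assumes (this is only one possible design, not something GBIF or any GUID specification prescribes) that the cache stores each record under the provider's own GUID, so no separate cache ID is needed. The names Record, Cache and RESOLVER, and the example URN and URL, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Record:
    guid: str   # identifier assigned by the data provider
    data: dict  # the record's fields as harvested


# Hypothetical resolver table: each GUID maps to exactly one location.
RESOLVER = {
    "urn:example:specimen:123": "https://provider.example.org/specimen/123",
}


class Cache:
    """A central cache that keys its copies on the provider's GUID."""

    def __init__(self):
        self._records = {}

    def store(self, record):
        self._records[record.guid] = record

    def search(self, term):
        # Search runs against the cached copies only.
        return [r for r in self._records.values()
                if term in r.data.get("name", "")]

    def cached_details(self, guid):
        # Path 1: show the cached copy, looked up by the same GUID.
        return self._records[guid]


def authoritative_location(guid):
    # Path 2: resolve the GUID to the provider's single location.
    return RESOLVER[guid]


cache = Cache()
cache.store(Record("urn:example:specimen:123", {"name": "Centropyge loricula"}))
hit = cache.search("Centropyge")[0]
print(cache.cached_details(hit.guid).data)
print(authoritative_location(hit.guid))
```

Under this assumption the search result carries the GUID, and the portal can choose per request whether to show its cached copy or send the user to the resolved location.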

Kevin
...
...
...
deepreef@BISHOPMUSEUM.ORG 5/01/2006 8:22 a.m. >>>
Wow!  Lots of responses this morning. Many thanks to all who responded! Some
replies:

Patricia wrote:
...
I agree with you about the logic in this. However, according to my daily
experience with potential data providers, a lot of teaching and
convincing is needed before this logic is accepted and it is understood
that it does not result in the loss of control over one's own data. I agree
that, to be convincing, a robust synchronization is needed.
I understand the psychology, and agree that "expose your
data online so that everyone can access it" strikes a more palatable chord
than "expose your data online so that everyone can mirror it".
...
I agree that for machines and storage it is not that expensive. I was
referring more to the human resources needed to manage the
mirror. Smaller institutions do not necessarily have the funds, or cannot
justify to their hierarchy that staff devote time to maintaining a
full mirror containing mainly "references" to information coming from
other institutions, but it is easier to justify the time spent
contributing to the whole the part that directly concerns the institution
...
Agreed -- but I wonder really how much more time would be involved in
setting up an automated mirror, compared with, e.g., a DiGIR provider.  In
my earlier post when I said that barriers would likely be technical, I
didn't mean that technology doesn't exist (it does).  Rather, I meant that
in order to be successful, setting up and maintaining a mirror would have to
be no more technically challenging than establishing a DiGIR provider.  I'm
not sure that's possible (yet).
...
Yes, I agree with you that robust synchronization will be needed, but
as my IT colleague always reminds me, I guess we must not forget that
setting up an IT infrastructure is most of the time 10% technical
issues to be solved and 90% solving "human problems
and barriers" to make it work and accepted ...
I agree -- but part of overcoming the human problems and barriers is to make
setting up a mirror site relatively easy for a non-specialist IT person.
And this is really a technical challenge.  The other human problems &
barriers (IPR, allowing access to hard-earned data, confidence that data
edit authorization rules will be enforced, etc.) also have some technical
foundation, and also exist for the distributed/complementary approach.

Ricardo wrote:
...
Rather than being mutually exclusive, the two approaches you mention are
complementary. The "distributed complementary data" approach is a
fundamental part of the infrastructure necessary to build the
"distributed mirror copies" you propose. That (the "distributed
complementary data" approach) is essentially what we have in place now
with DiGIR/BioCase/Tapir as the harvesting protocol and GBIF and other
institutions as harvesters. The only missing piece of software we need
to really have "an automated and robust synchronization protocol", in my
opinion is some kind of push mechanism to trigger updates in the caches.
Agreed on all points.
...
However, I think it is not very useful to try to standardize how the
distributed mirror copies should be built and organized.
Here's where I disagree somewhat.  Standardized mirror protocols would be a
fundamental necessity of the paradigm I described. The paradigm would simply
not work without such standardization.  Non-standard means non-interactive,
and therefore no automated synchronization. What makes it useful is that
every participating organization has local (=high performance) access to a
full biodiversity dataset. Another thing that makes it useful is that,
rather than needing *all* data providers to be fully functional at any given
moment for 100% data retrieval, only one mirror site needs to be functional.

I understand this is the role that GBIF already serves -- and perhaps that's
all that's really needed.
...
A (socially)
decentralized schema would work better in this case: data providers
would make available data they create and that is under their direct
custody and individual harvesters would be free to look at the metadata
being served and create their own caches, selectively harvesting only
the information that is relevant to the services they intend to provide.
There really should be no difference, as long as data edit authorization
protocols are as robust as the synchronization protocols.  If I'm exposing
my specimen data for the world to access (e.g., via DiGIR/BioCase/Tapir),
then it shouldn't bother me that GBIF is mirroring it -- as long as GBIF
includes metadata about data ownership, and as long as the only people who
can edit the data are people who I authorize to do so.  And if GBIF is
mirroring it, what difference is there if there are 100, or 1000, or 10,000
other servers mirroring it (same assumptions).  The two things that are
important to me are: 1) corrections to data get propagated quickly to
mirrors; and 2) only people who I authorize to edit my data may do so.
Other than that, it doesn't really matter if the data are only on my
server's hard drive, or are replicated on GBIF's hard drive, or are
replicated on 10,000 server hard drives.

But the 10,000 hard drives option gives me two advantages: 1) confidence
that the data will survive even continental-scale catastrophes; and 2) it
means that my server's hard drive (one of the 10,000) has local and
immediate access to everyone else's data (for use as authorities, etc.)
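
The two conditions above (fast propagation of corrections, and edits only by people the owner authorizes) can be sketched in a few lines of Python. This is only an illustration under strong simplifying assumptions: all mirrors live in one process, and a single shared registry records who owns and who may edit each record. The class MirrorNetwork and its methods are hypothetical; a real protocol would need networking, authentication and conflict handling.

```python
class MirrorNetwork:
    """Toy model: every accepted change is copied to every mirror."""

    def __init__(self, mirror_count):
        self.mirrors = [dict() for _ in range(mirror_count)]
        self.owners = {}   # guid -> owner of the record
        self.editors = {}  # owner -> users the owner has authorized

    def authorize(self, owner, user):
        self.editors.setdefault(owner, set()).add(user)

    def publish(self, owner, guid, data):
        if self.owners.get(guid, owner) != owner:
            raise PermissionError(f"{guid} already belongs to someone else")
        self.owners[guid] = owner
        self._propagate(guid, owner, data)

    def edit(self, user, guid, data):
        owner = self.owners[guid]
        # Condition 2: only the owner or an authorized user may edit.
        if user != owner and user not in self.editors.get(owner, set()):
            raise PermissionError(f"{user} may not edit {guid}")
        self._propagate(guid, owner, data)

    def _propagate(self, guid, owner, data):
        # Condition 1: corrections reach all mirrors at once, and each
        # copy keeps the ownership metadata alongside the data.
        for mirror in self.mirrors:
            mirror[guid] = {"owner": owner, "data": dict(data)}


net = MirrorNetwork(3)
net.publish("rich", "g1", {"name": "Centropyge loricula"})
net.authorize("rich", "kevin")
net.edit("kevin", "g1", {"name": "Centropyge loriculus"})
```

With these rules in place, the number of mirrors (3 or 10,000) changes nothing about who controls the record.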

Rod Page wrote:

[an excellent and thought-provoking post, making important, if
uncomfortable, observations]
...
What is present in bioinformatics are large numbers of "value added"
databases that take GenBank,  PubMed, etc. and do neat things with
them. This is possible because you can download the entire database.
Exactly!  The real value of portals to end users is the value-added "neat
things". I think we ought to encourage as many enterprising biology/data
nerds as possible to build such things.
...
Each one of these value added databases does need to deal with the
issue of what happens when GenBank (say) changes, but because GenBank
has well defined releases, essentially they can grab a new copy of the
data, update their local copy, and regenerate their database.
Wouldn't it be so much nicer if the protocols were in place to automatically
propagate updates to the value-added sites in real time, so that the
aggregated databases were effectively self-maintaining?  I guess the
fundamental question of this whole discussion ultimately boils down to
"push" vs. "pull" paradigms.
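
The difference between the two paradigms can be shown with a small Python sketch. It is a toy, in-process model, not any existing DiGIR/TAPIR mechanism: Provider, Aggregator, on_push and pull are hypothetical names, and a version counter stands in for GenBank-style releases.

```python
class Provider:
    """Holds the master copy and notifies subscribers on change."""

    def __init__(self):
        self.records = {}
        self.version = 0
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def update(self, guid, data):
        self.records[guid] = data
        self.version += 1
        # Push: every subscriber hears about the change immediately.
        for notify in self._subscribers:
            notify(guid, data, self.version)


class Aggregator:
    """A value-added site keeping a local copy of the provider's data."""

    def __init__(self):
        self.records = {}
        self.version = 0

    def on_push(self, guid, data, version):
        # Push paradigm: apply the single change that was announced.
        self.records[guid] = data
        self.version = version

    def pull(self, provider):
        # Pull paradigm: check for a new version and re-copy everything.
        if provider.version != self.version:
            self.records = dict(provider.records)
            self.version = provider.version


provider = Provider()
pushed, pulled = Aggregator(), Aggregator()
provider.subscribe(pushed.on_push)
provider.update("g1", {"name": "Centropyge loricula"})
# 'pushed' is already current; 'pulled' stays stale until it asks.
pulled.pull(provider)
```

In this sketch the pushed aggregator is self-maintaining, while the pulled one is only as fresh as its last harvest.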
...
In summary, I think the issue raised by Rich is important, but is one
to be addressed by whoever takes on the task of assembling a data
warehouse from the individual providers. Of course, once providers make
their data available, anybody can do this...
Yes, but anybody (everybody) has to rebuild it themselves from scratch.  What
I'm suggesting is to use the TDWG/GBIF/GUID standards community to come up
with standard protocols to lower the technological bar of becoming an
aggregator -- so much so that it becomes the real foundation of data
distribution, dissemination, and access (and makes it feasible for the
enterprising biology/data nerds to make good use of their late night
hours....)

In a later post, Rod wrote:
...
I think there is a continuum of possibilities:
1. Pure distributed - query always sent to remote sources, no data ever
held locally
2. Distributed with cache - results from a user query are cached
(either for a limited time, or until source sends a message saying its
data has been updated).
3. Distributed with partial local copy - some information harvested
from sources stored locally (e.g., metadata), detailed information only
held by sources.
4. Not distributed (data warehouse) - all data from distributed sources
held locally (harvested), with periodic updates.
I would add to this list:

5. Distributed Warehouses -- all data is mirrored across many warehouses,
automatically kept in synchronization in real time. Queries are sent to any
one warehouse.
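
As an illustration of option 2 in the list (cached query results that expire either after a time limit or when the source announces an update), here is a minimal Python sketch. QueryCache and its methods are hypothetical names, and the injectable clock is only there to make the behaviour easy to check.

```python
import time


class QueryCache:
    """Caches query results per source, with two expiry routes."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # query -> (source, stored_at, results)

    def put(self, query, source, results):
        self._entries[query] = (source, self._clock(), results)

    def get(self, query):
        entry = self._entries.get(query)
        if entry is None:
            return None
        source, stored_at, results = entry
        # Route 1: the entry is only valid for a limited time.
        if self._clock() - stored_at > self._ttl:
            del self._entries[query]
            return None
        return results

    def invalidate_source(self, source):
        # Route 2: the source says its data changed, so drop
        # every cached result that came from it.
        stale = [q for q, (s, _, _) in self._entries.items() if s == source]
        for q in stale:
            del self._entries[q]


cache = QueryCache(ttl_seconds=3600)
cache.put("genus=Centropyge", "provider-a", ["g1", "g2"])
cache.invalidate_source("provider-a")
```

A cache miss (None) means the query has to be sent to the remote source again, which is where this option shades back into option 1.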

Also, Rod's reference to "The Cathedral and the Bazaar" is excellent and
highly relevant. The Hawaii Biological Survey is a good
example of this in our community.

Chuck Miller wrote:
...
To cache or mirror data in multiple locations, the dilemma lies
in the simplest question: "Where is my data?"  A cache or mirror
from a technologist's perspective is just a technical trick, identical
to the original, dependent upon the master version, no big deal.
But, simplistically, the data may be actually located in another
country, no matter how it got there, and some politicians could
misconstrue the whole thing, if not properly negotiated and agreed
upon in advance.  In my experience, negotiating such agreements
is more work than the technical development.
I agree the political issues may be show-stoppers -- but mostly because of
unwarranted paranoia on the part of the data owners (unfortunate, but
certainly real).  But the reality is, any data exposed on a distributed
provider (indeed, any data published in any form -- electronic or paper) is
already "out there", and located in many different countries.  Data
contained in a printed publication simultaneously exist in many countries.
The only difference with the distributed warehouse approach is that the data
are much, MUCH more powerful.  Perhaps the real political barrier is putting
so much power into so many hands. Sort of like the World Wide Web itself.
...
I think the distributed nature of DiGIR was critical to selling
it at the start of GBIF.  The original design assured providers that
their source data would "stay" in their country and not be wholesale
copied somewhere else.  It's hard to say what the political effect
of creating mirrors would be.
Agreed on all points.  There was a time when the world wasn't even
ready for the DiGIR approach.  Perhaps it's too soon to push it to the next
level.

Bear in mind, though, that all these IPR/political issues relate mostly to
specimen data.  The real value of the "distributed warehouses" approach to
the warehouse providers themselves lies in the things that would traditionally
be thought of as "authorities" (taxon names, reference citations, geographic
gazetteers, etc.)  Maybe the distributed warehouse approach will work well
for these kinds of data, but not for specimen data?

O.K., I should probably stop here and get to work....

Aloha,
Rich

Richard L. Pyle, PhD
Database Coordinator for Natural Sciences
  and Associate Zoologist in Ichthyology
Department of Natural Sciences, Bishop Museum
1525 Bernice St., Honolulu, HI 96817
Ph: (808)848-4115, Fax: (808)847-8252
email: deepreef@bishopmuseum.org
http://hbs.bishopmuseum.org/staff/pylerichard.html


Kevin Richards