RDF query and inference in a distributed environment

Wed Jan 4 09:50:51 CET 2006

Querying a central database is certainly technically easier and more
responsive than querying distributed data. Mirroring servers is a mature
technology. And disk drives get cheaper every day.

But in addition to the social, technical and financial factors that limit
the ability to implement distributed caches or full mirrors, there is also
the political factor, which GBIF has spent great effort successfully
addressing.  As a global entity, GBIF cannot exist without the political
support from its members, which leads in turn to financial support.  Most of
the members of GBIF are sovereign nations that seek to preserve and protect
their country's assets.  Some protect more than others.

When data is cached or mirrored in multiple locations, the dilemma lies in
the simplest question: "Where is my data?"  From a technologist's
perspective, a cache or mirror is just a technical trick: identical to the
original, dependent upon the master version, no big deal.  But, put simply,
the data may actually be located in another country, no matter how it got
there, and some politicians could misconstrue the whole thing if it is not
properly negotiated and agreed upon in advance.  In my experience,
negotiating such agreements is more work than the technical development.

I think the distributed nature of DiGIR was critical to selling it at the
start of GBIF.  The original design assured providers that their source data
would "stay" in their country and not be wholesale copied somewhere else.
It's hard to say what the political effect of creating mirrors would be.

Chuck Miller

Missouri Botanical Garden

  _____

From: Patricia Mergen [mailto:p_mergen at YAHOO.COM]
Sent: Wednesday, January 04, 2006 5:38 AM
To: TDWG-GUID at LISTSERV.NHM.KU.EDU
Subject: Re: [TDWG-GUID] RDF query and inference in a distributed
environment

Dear Rich

Richard Pyle <deepreef at BISHOPMUSEUM.ORG> wrote:

Hi Patricia,

Many thanks for the feedback (and thanks also to Bob -- who I neglected to
thank in my previous post).

What do you reckon would be the limiting social and financial factors for
full mirrors? In social terms, if I'm going to expose my data to the world
anyway (e.g., via DiGIR), then I don't see why I would be socially reluctant
to allow others to mirror the data (provided robust synchronization protocols
are in place -- see my previous response to Bob; and provided data
"ownership" credentials are embedded within the core metadata).

     I agree with you about the logic in this. However, according to my daily
experience with potential data providers, a lot of teaching and convincing
is needed before they accept that this does not result in the loss of
control over their own data. I agree that, to be convincing, robust
synchronization is needed.

As for financial, I prefaced my original post with the observation of ever
decreasing $/GB for storage space. I suspect that, before TDWG nails down
the GUID protocols, entry-level web servers (of the sort that even the most
modest DiGIR provider would need to establish) will come with nearly a TB of
disk storage space by default. Perhaps the cost of bandwidth will be a
limiting factor? Or maybe DB software capable of managing such large
datasets?

     I agree that machines and storage are not that expensive. I was
referring more to the human resources needed to manage the mirror.
Smaller institutions do not necessarily have the funds, or cannot justify to
their hierarchy that staff devote time to maintaining a full mirror
containing mainly "references" to information coming from other
institutions. It is easier to justify the time spent contributing to
the whole with the part that directly concerns the institution ...

As for IPR -- well, ultimately that applies mostly to specimens. And again,
assuming that "ownership" metadata remains intact, I see no basis for
increased apprehension about allowing mirrored copies of data records (as
GBIF already does, for example) over and above exposing them in the first
place.

Yes, I agree with you here too, but as said before, this needs teaching and
convincing ...

Personally, I don't think the social, legal, or financial barriers are
significantly greater for a mass-mirror paradigm than they are for
distributed complementary data sets. I suspect the major barriers will be
more technical (i.e., those aforementioned "robust synchronization
protocols").

Yes, I agree with you that robust synchronization will be needed, but as my
IT colleague always reminds me, we must not forget that setting up an IT
infrastructure is, most of the time, 10% technical issues to be solved and
90% solving "human problems and barriers" to make it work and be
accepted ...

Pat

Aloha,
Rich

-----Original Message-----
From: Taxonomic Databases Working Group GUID Project
[mailto:TDWG-GUID at LISTSERV.NHM.KU.EDU]On Behalf Of Patricia Mergen
Sent: Tuesday, January 03, 2006 10:31 PM
To: TDWG-GUID at LISTSERV.NHM.KU.EDU
Subject: Re: RDF query and inference in a distributed environment

Dear Richard

I agree with you that several mirror copies are and will be needed, preferably
well spread geographically as back-ups. This is exactly the approach of
GBIF, as they are now in the process of mirroring their services.

However, as highlighted by Bob Morris, there are social, but also financial,
barriers to having all contributing institutions run a "full" mirror. In order
to ensure the participation of all those who are willing, I believe that
a distributed system where each provider can participate with its part
should be kept. Those who have the resources could of course set up full
mirrors if this matches their needs and if this is allowed by the providers
(there are also IPR issues which may be raised here by some institutions).

Patricia

Richard Pyle wrote:
> Long term what I think might happen is that users have their own triple
> stores, and as they do queries the results get added to their own
> triple store and they can make inferences locally that they are
> interested in. MIT's Piggy bank project
> (http://simile.mit.edu/piggy-bank/) is an example of this sort of
> approach.

With hard drive sizes spiraling skyward, and $/GB ($/TB) spiraling
downward.... I'm wondering whether or not the "distributed" system that
serves us best might be "distributed mirror copies", rather than
distributed complementary data. I've been pushing this approach for
taxonomic data for a while, but perhaps it would be useful for other shared
data as well (geographic localities, people/agents, publications/references,
etc.) Even for specimen data -- where "ownership" is unambiguous -- it
seems that as long as the ownership is clearly embedded in the core
metadata, there are more fundamental advantages in storing and serving data
from multiple data resources, rather than serving it from only one single
data resource.

One way to look at it would be "robust caching", with automated update
capabilities. The main benefits would be:

1) Large-scale distributed backup of the world's biodata (ensuring
perpetuity across a changing technological landscape);
2) Performance and reliability enhancement for local data authority needs;
3) Essentially 100% data availability (like DNS), regardless of which
servers are up or down at any given moment;
4) Maximization of distributed work/effort for data "maintenance and
repair".

The point is, the technology discussions would focus less on issues of
distributed queries, and more on issues of replication/synchronization and
data edit authorization protocols.
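
The replication and edit-authorization concerns named above can be
illustrated with a minimal sketch. It is not part of the original thread and
does not describe DiGIR, GBIF, or any TDWG protocol; the `Record` and `Store`
classes, the integer `version` counter, and the owner-only edit rule are all
assumptions chosen purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Record:
    guid: str      # hypothetical identifier; no real GUID scheme is implied
    owner: str     # "ownership" carried in the core metadata
    version: int   # assumed counter, incremented by the owner on each edit
    data: str


@dataclass
class Store:
    name: str
    records: dict = field(default_factory=dict)

    def edit(self, guid, data, editor):
        """Create or update a record; only the owner may change an existing one."""
        current = self.records.get(guid)
        if current is not None and current.owner != editor:
            raise PermissionError(f"{editor} does not own {guid}")
        version = current.version + 1 if current else 1
        self.records[guid] = Record(guid, editor, version, data)

    def sync_from(self, source):
        """Copy records that are missing here or newer at the source.

        Returns the number of records copied.
        """
        updated = 0
        for guid, rec in source.records.items():
            mine = self.records.get(guid)
            if mine is None or rec.version > mine.version:
                self.records[guid] = rec  # owner metadata travels with the copy
                updated += 1
        return updated


# Hypothetical usage: one provider, one mirror.
provider = Store("provider-a")
mirror = Store("mirror-b")
provider.edit("example-guid-1", "first version", editor="provider-a")
mirror.sync_from(provider)  # the mirror now holds an owner-tagged copy
```

A single version counter only works if one party issues all edits to a
record, which is what the ownership check enforces in this sketch; anything
allowing edits at several mirrors would need a more elaborate scheme.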

Perhaps this would be reaching too far, too soon. But on the other hand, I
don't see why implementing a "distributed mirror" system would be any more
technically, financially, or socially challenging than implementing a
distributed query system for distributed data.

Aloha,
Rich

Richard L. Pyle, PhD
Database Coordinator for Natural Sciences
and Associate Zoologist in Ichthyology
Department of Natural Sciences, Bishop Museum
1525 Bernice St., Honolulu, HI 96817
Ph: (808)848-4115, Fax: (808)847-8252
email: deepreef at bishopmuseum.org
http://hbs.bishopmuseum.org/staff/pylerichard.html
