RDF query and inference in a distributed environment

Roger Hyam roger at TDWG.ORG
Wed Jan 4 13:31:01 CET 2006


Dear all,

There have been some interesting ideas put forward in this thread. The
future is bright, but nearly all the proposed mechanisms rely on some
basic things we don't have yet. Even if we are only considering
collecting data from a handful of data sources and caching it until
we go for a coffee break:

   1. Heterogeneous data from multiple sources must fit into a single
      data model (our cache's data model).
   2. New data that we have never come across before needs to fit in the
      same model.
   3. Data from different but overlapping domains needs to be linked
      semantically.
   4. Real world objects need GUIDs so we can link them.

To achieve this pick 'n' mix approach to data, we need it all to follow a
single low-level schema. An RDF approach, backed up with an ontology in
OWL or some other technology, is the strongest contender for being able
to do this.
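
To make this concrete, here is a rough sketch of what a single low-level
schema buys us, using Python and the rdflib library (the GUID, vocabulary
and values are all made up for illustration):

    # One triple store acts as the cache's single data model.
    from rdflib import Graph, Literal, Namespace, URIRef

    EX = Namespace("http://example.org/terms/")  # stand-in vocabulary
    g = Graph()

    # Statements from one source about a real-world object with a GUID.
    specimen = URIRef("urn:lsid:example.org:specimen:12345")
    g.add((specimen, EX.scientificName, Literal("Quercus robur")))
    g.add((specimen, EX.collector, Literal("R. Hyam")))

    # Statements from a second, previously unseen source slot into the
    # same model with no schema change -- they are just more triples.
    g.add((specimen, EX.decimalLatitude, Literal("55.95")))

    for s, p, o in g:
        print(s, p, o)

Because everything is reduced to triples keyed on shared GUIDs, merging in
a new source becomes a union of statements rather than a schema migration.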

We haven't done this yet, so the only way to really join data together is
in a big data warehouse like GBIF, where Donald and his team have to work
hard keeping track of the different data standards and mapping them onto
their own schema, which we can then search. Rod's recent post also gives
a good example of the work involved in doing this kind of thing for
TreeBASE. I am not saying warehouses will go away with a more RDF-like
approach, but it will become less onerous to collect pools of data to
work with.
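
For comparison, the warehouse route means maintaining a hand-written
mapping per data standard, along these lines (the format names and field
names here are invented):

    # Each provider standard needs its own mapper onto the warehouse
    # schema; the formats and fields below are purely illustrative.
    def map_format_a(rec):
        return {"name": rec["ScientificName"], "lat": rec["Latitude"]}

    def map_format_b(rec):
        return {"name": rec["FullName"], "lat": rec["DecimalLat"]}

    MAPPERS = {"format-a": map_format_a, "format-b": map_format_b}

    def load(fmt, records):
        # Every new or revised standard means writing and maintaining
        # another mapper by hand -- this is the onerous part.
        return [MAPPERS[fmt](r) for r in records]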

Roger


Roderic Page wrote:
> I wonder whether at some point we need to think carefully about why we
> have a distributed model in the first place. Is it the best choice? Did
> we choose it, or has it emerged primarily because data providers
> want/need to keep control over "their" data?
>
> In the context of bioinformatics, I'm not aware of any large scale
> distributed environments that actually work (probably ignorance on my
> part). The largest databases (GenBank, PubMed, EMBL, DDBJ) have all the
> data stored centrally, and mirror at least some of it (e.g., sequences
> are mirrored between GenBank, EMBL, and DDBJ). Issues of
> synchronisation are therefore largely limited to these big databases,
> and hence manageable.
>
> My sense is that there is a lot of computer science research on
> federated databases, but few actual large-scale systems that people
> make regular use of.
>
> What is present in bioinformatics are large numbers of "value added"
> databases that take GenBank, PubMed, etc. and do neat things with
> them. This is possible because you can download the entire database.
> Each one of these value added databases does need to deal with the
> issue of what happens when GenBank (say) changes, but because GenBank
> has well defined releases, essentially they can grab a new copy of the
> data, update their local copy, and regenerate their database.
>
> Having access to all the data makes all kinds of things possible which
> are harder to do if the data is distributed. I'd argue that part of the
> success of bioinformatics is because of data availability.
>
> Hence, my own view (at least today) is:
>
> 1. Individual data providers manage their own data, and also make
> their data available in the following ways:
> i) provide GUIDs and metadata (e.g., LSIDs)
> ii) provide a basic, standard search web service
> iii) provide their own web interface
> iv) periodically provide complete dump of data
>
> 2. GBIF (or some equivalent) takes on the job of harvesting all
> providers, building a warehouse, and making that data available through
> i) web interface
> ii) web services
> iii) complete data dump
>
> 3. Researchers can decide how best to make use of this data. They may
> wish to get a complete "GBIF" dump and install that locally, or query
> GBIF, or they may wish to query the individual providers for the most
> up to date information, or some mixture of this.
>
> Issues of synchronisation are dealt with by GBIF and its providers,
> which I think essentially amounts to having versioning and release
> numbers (but I'm probably being naive).
>
> Probably 1iv and 2iii are going to cause some issues, and this is off
> topic, but if bioinformatics is anything to go by, if we don't make our
> data available in bulk we are tying our own hands. However, this is
> obviously something each individual provider will have to decide upon
> themselves.
>
> My other feeling is that from the point of view of end users (and I
> class myself as one) the real game will be services, especially search
> (think "Google Life"). And my feeling is that this won't work if
> queries are done in a distributed fashion -- the Web is supremely
> distributed, but Google doesn't query the Web, it queries its local
> copy.
>
> In summary, I think the issue raised by Rich is important, but is one
> to be addressed by whoever takes on the task of assembling a data
> warehouse from the individual providers. Of course, once providers make
> their data available, anybody can do this...
>
> Regards
>
> Rod
>
>
> ------------------------------------------------------------------------
> Professor Roderic D. M. Page
> Editor, Systematic Biology
> DEEB, IBLS
> Graham Kerr Building
> University of Glasgow
> Glasgow G12 8QP
> United Kingdom
>
> Phone:    +44 141 330 4778
> Fax:      +44 141 330 2792
> email:    r.page at bio.gla.ac.uk
> web:      http://taxonomy.zoology.gla.ac.uk/rod/rod.html
> reprints: http://taxonomy.zoology.gla.ac.uk/rod/pubs.html
>
> Subscribe to Systematic Biology through the Society of Systematic
> Biologists Website:  http://systematicbiology.org
> Search for taxon names at http://darwin.zoology.gla.ac.uk/~rpage/portal/
> Find out what we know about a species at http://ispecies.org
>
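
As an aside, the harvest step in Rod's points 1 and 2 boils down to
something like the sketch below (the endpoints and record layout are
hypothetical):

    # Pull each provider's periodic dump and rebuild the warehouse copy.
    import json
    import urllib.request

    PROVIDERS = [
        "http://provider-a.example.org/dump.json",
        "http://provider-b.example.org/dump.json",
    ]

    def harvest():
        warehouse = {}  # GUID -> record: the warehouse's local copy
        for url in PROVIDERS:
            with urllib.request.urlopen(url) as resp:
                for record in json.load(resp):
                    # Records carry their own GUIDs, so merging providers
                    # is keying on the GUID, not matching heuristics.
                    warehouse[record["guid"]] = record
        return warehouse

Versioning then amounts to re-running harvest() against each provider's
latest release.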


--

-------------------------------------
 Roger Hyam
 Technical Architect
 Taxonomic Databases Working Group
-------------------------------------
 http://www.tdwg.org
 roger at tdwg.org
 +44 1578 722782
-------------------------------------

