RDF query and inference in a distributed environment
r.page at BIO.GLA.AC.UK
Wed Jan 4 12:50:58 CET 2006
I wonder whether at some point we need to think carefully about why we
have a distributed model in the first place. Is it the best choice? Did
we choose it, or has it emerged primarily because data providers
want/need to keep control over "their" data?
In the context of bioinformatics, I'm not aware of any large-scale
distributed environments that actually work (probably ignorance on my
part). The largest databases (GenBank, PubMed, EMBL, DDBJ) have all the
data stored centrally, and mirror at least some of it (e.g., sequences
are mirrored between GenBank, EMBL, and DDBJ). Hence, issues of
synchronisation are largely limited to these big databases, and hence
My sense is that there is a lot of computer science research on
federated databases, but few actual large-scale systems that people
make regular use of.
What is present in bioinformatics are large numbers of "value added"
databases that take GenBank, PubMed, etc. and do neat things with
them. This is possible because you can download the entire database.
Each one of these value added databases does need to deal with the
issue of what happens when GenBank (say) changes, but because GenBank
has well defined releases, essentially they can grab a new copy of the
data, update their local copy, and regenerate their database.
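To make the "grab a new copy and regenerate" cycle concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the class name, the idea that a release is a plain integer, and the toy "derived" table (sequence lengths per accession) do not correspond to any real GenBank tool or API.

```python
from dataclasses import dataclass, field


@dataclass
class ValueAddedDatabase:
    """Hypothetical derived database built from a full upstream dump."""

    upstream_release: int = 0  # release number of the dump we last built from
    derived: dict = field(default_factory=dict)

    def sync(self, release: int, dump: dict) -> bool:
        """Rebuild from a newer dump; return True if a rebuild happened."""
        if release <= self.upstream_release:
            return False  # local copy is already current, nothing to do
        # Regenerate everything from the full dump rather than patching in place.
        self.derived = {accession: len(seq) for accession, seq in dump.items()}
        self.upstream_release = release
        return True


db = ValueAddedDatabase()
rebuilt_first = db.sync(150, {"A00001": "ACGT", "A00002": "ACGTAC"})
rebuilt_again = db.sync(150, {"A00001": "ACGT"})  # same release, so ignored
```

The point of the sketch is that well-defined release numbers reduce synchronisation to a single comparison followed by a full rebuild.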
Having access to all the data makes all kinds of things possible which
are harder to do if the data is distributed. I'd argue that part of the
success of bioinformatics is because of data availability.
Hence, my own view (at least today) is:
1. Individual data providers manage their own data, and also make
available their data in the following ways:
i) provide GUIDs and metadata (e.g., LSIDs)
ii) provide a basic, standard search web service
iii) provide their own web interface
iv) periodically provide a complete dump of data
2. GBIF (or some equivalent) takes on the job of harvesting all
providers, building a warehouse, and making that data available through
i) web interface
ii) web services
iii) complete data dump
3. Researchers can decide how best to make use of this data. They may
wish to get a complete "GBIF" dump and install that locally, or query
GBIF, or they may wish to query the individual providers for the most
up to date information, or some mixture of this.
Issues of synchronisation are dealt with by GBIF and its providers,
which I think essentially amounts to having versioning and release
numbers (but I'm probably being naive).
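As a rough illustration of what "versioning and release numbers" could look like for a harvester, here is a small Python sketch. It is only a sketch under assumptions: the names ProviderDump and Warehouse, the integer versions, and the example LSID are all made up, and nothing here reflects how GBIF actually works.

```python
from dataclasses import dataclass, field


@dataclass
class ProviderDump:
    """Hypothetical complete dump published by one data provider."""

    provider: str
    version: int
    records: dict  # GUID -> metadata


@dataclass
class Warehouse:
    """Hypothetical central warehouse assembled from provider dumps."""

    release: int = 0
    seen: dict = field(default_factory=dict)  # provider -> version harvested
    records: dict = field(default_factory=dict)  # GUID -> (provider, metadata)

    def harvest(self, dumps) -> int:
        """Load any dumps newer than what we hold; return the release number."""
        changed = False
        for dump in dumps:
            if self.seen.get(dump.provider, -1) >= dump.version:
                continue  # this provider has not published anything new
            # Replace the provider's old records wholesale with the new dump.
            self.records = {
                guid: rec
                for guid, rec in self.records.items()
                if rec[0] != dump.provider
            }
            for guid, metadata in dump.records.items():
                self.records[guid] = (dump.provider, metadata)
            self.seen[dump.provider] = dump.version
            changed = True
        if changed:
            self.release += 1  # a new warehouse release only when data changed
        return self.release


dump = ProviderDump(
    "herbarium-a",  # made-up provider name
    1,
    {"urn:lsid:example.org:specimen:1": {"name": "Quercus robur"}},
)
warehouse = Warehouse()
first = warehouse.harvest([dump])
second = warehouse.harvest([dump])  # unchanged provider, so no new release
```

A researcher who installed the warehouse dump locally would then only need to compare their release number with the current one to know whether they are out of date.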
Probably 1iv and 2iii are going to cause some issues, and this is off
topic, but if bioinformatics is anything to go by, if we don't make our
data available in bulk we are tying our own hands. However, this is
obviously something each individual provider will have to decide upon
My other feeling is that from the point of view of end users (and I
class myself as one) the real game will be services, especially search
(think "Google Life"). And my feeling is that this won't work if
queries are done in a distributed fashion -- the Web is supremely
distributed, but Google doesn't query the Web, it queries its local
In summary, I think the issue raised by Rich is important, but is one
to be addressed by whoever takes on the task of assembling a data
warehouse from the individual providers. Of course, once providers make
their data available, anybody can do this...
Professor Roderic D. M. Page
Editor, Systematic Biology
Graham Kerr Building
University of Glasgow
Glasgow G12 8QP
Phone: +44 141 330 4778
Fax: +44 141 330 2792
email: r.page at bio.gla.ac.uk
Subscribe to Systematic Biology through the Society of Systematic
Biologists Website: http://systematicbiology.org
Search for taxon names at http://darwin.zoology.gla.ac.uk/~rpage/portal/
Find out what we know about a species at http://ispecies.org