RDF query and inference in a distributed environment

Wed Jan 4 13:23:02 CET 2006

Species 2000 uses a federated data model. Whether that counts as
'large scale' or not I don't know. I believe that even getting
permission for a data cache from the various providers was a painful
exercise.
Sally

> I wonder whether at some point we need to think carefully about why we
> have a distributed model in the first place. Is it the best choice? Did
> we choose it, or has it emerged primarily because data providers
> want/need to keep control over "their" data?
>
> In the context of bioinformatics, I'm not aware of any large scale
> distributed environments that actually work (probably ignorance on my
> part). The largest databases (GenBank, PubMed, EMBL, DDBJ) have all the
> data stored centrally, and mirror at least some of it (e.g., sequences
> are mirrored between GenBank, EMBL, and DDBJ). Issues of
> synchronisation are therefore largely limited to these big databases,
> and hence manageable.
>
> My sense is that there is a lot of computer science research on
> federated databases, but few actual large-scale systems that people
> make regular use of.
>
> What is present in bioinformatics are large numbers of "value added"
> databases that take GenBank, PubMed, etc. and do neat things with
> them. This is possible because you can download the entire database.
> Each one of these value added databases does need to deal with the
> issue of what happens when GenBank (say) changes, but because GenBank
> has well defined releases, essentially they can grab a new copy of the
> data, update their local copy, and regenerate their database.
>
> Having access to all the data makes all kinds of things possible which
> are harder to do if the data is distributed. I'd argue that part of the
> success of bioinformatics is because of data availability.
>
> Hence, my own view (at least today) is:
>
> 1. Individual data providers manage their own data, and also make
> available their data in the following ways:
> i) provide GUIDs and metadata (e.g., LSIDs)
> ii) provide a basic, standard search web service
> iii) provide their own web interface
> iv) periodically provide a complete dump of data
>
> 2. GBIF (or some equivalent) takes on the job of harvesting all
> providers, building a warehouse, and making that data available through
> i) web interface
> ii) web services
> iii) complete data dump
>
> 3. Researchers can decide how best to make use of this data. They may
> wish to get a complete "GBIF" dump and install that locally, or query
> GBIF, or they may wish to query the individual providers for the most
> up to date information, or some mixture of this.
>
> Issues of synchronisation are dealt with by GBIF and its providers,
> which I think essentially amounts to having versioning and release
> numbers (but I'm probably being naive).
>
> Probably 1iv and 2iii are going to cause some issues, and this is off
> topic, but if bioinformatics is anything to go by, if we don't make our
> data available in bulk we are tying our own hands. However, this is
> obviously something each individual provider will have to decide upon
> themselves.
>
> My other feeling is that from the point of view of end users (and I
> class myself as one) the real game will be services, especially search
> (think "Google Life"). And my feeling is that this won't work if
> queries are done in a distributed fashion -- the Web is supremely
> distributed, but Google doesn't query the Web, it queries its local
> copy.
>
> In summary, I think the issue raised by Rich is important, but is one
> to be addressed by whoever takes on the task of assembling a data
> warehouse from the individual providers. Of course, once providers make
> their data available, anybody can do this...
>
> Regards
>
> Rod
>
>
> ------------------------------------------------------------------------
> Professor Roderic D. M. Page
> Editor, Systematic Biology
> DEEB, IBLS
> Graham Kerr Building
> University of Glasgow
> Glasgow G12 8QP
> United Kingdom
>
> Phone:    +44 141 330 4778
> Fax:      +44 141 330 2792
> email:    r.page at bio.gla.ac.uk
> web:      http://taxonomy.zoology.gla.ac.uk/rod/rod.html
> reprints: http://taxonomy.zoology.gla.ac.uk/rod/pubs.html
>
> Subscribe to Systematic Biology through the Society of Systematic
> Biologists Website:  http://systematicbiology.org
> Search for taxon names at http://darwin.zoology.gla.ac.uk/~rpage/portal/
> Find out what we know about a species at http://ispecies.org

*** Sally Hinchcliffe
*** Computer section, Royal Botanic Gardens, Kew
*** tel: +44 (0)20 8332 5708
*** S.Hinchcliffe at rbgkew.org.uk
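
Editorial sketch, not part of the original messages: the harvest-and-release model outlined in points 1 to 3 of the quoted message can be illustrated roughly as follows. Providers publish complete dumps with a release number, a central warehouse refreshes its local copy only when a newer release appears, and searches run against that local copy rather than against the providers. The class names `Provider` and `Warehouse`, their methods, and the example LSID in the usage note are all hypothetical; nothing here reflects an actual GBIF or Species 2000 interface.

```python
from dataclasses import dataclass, field


@dataclass
class Provider:
    """Hypothetical data provider that publishes numbered complete dumps."""
    name: str
    release: int = 0
    records: dict = field(default_factory=dict)  # GUID -> metadata text

    def publish(self, records):
        """Publish a new complete dump and bump the release number."""
        self.records = dict(records)
        self.release += 1

    def dump(self):
        """Return (release number, complete copy of the data)."""
        return self.release, dict(self.records)


class Warehouse:
    """Hypothetical harvester holding a local copy of each provider's dump."""

    def __init__(self):
        self.releases = {}  # provider name -> last harvested release
        self.store = {}     # provider name -> {GUID: metadata text}

    def harvest(self, provider):
        """Refresh the local copy only if the provider has a newer release."""
        release, records = provider.dump()
        if release <= self.releases.get(provider.name, 0):
            return False  # nothing new since the last harvest
        self.store[provider.name] = records
        self.releases[provider.name] = release
        return True

    def search(self, term):
        """Query the local copy; no provider is contacted at query time."""
        term = term.lower()
        return sorted(
            guid
            for records in self.store.values()
            for guid, text in records.items()
            if term in text.lower()
        )
```

For example, after `kew = Provider("kew")` and `kew.publish({"urn:lsid:example.org:names:1": "Quercus robur"})`, a first call to `Warehouse().harvest(kew)` copies the dump, and a second call on the same warehouse is a no-op until the provider publishes again. Replacing the whole local copy per release mirrors the "grab a new copy and regenerate" approach described for the value-added GenBank databases; it deliberately ignores incremental updates.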