Re: RDF query and inference in a distributed environment
Rod,
This is an excellent question. It is quite clear that much of what we want to do would be much easier, more efficient and more authoritative if we established a data-warehousing model for our data. To my mind, it should not be impossible to develop such a system while still giving data providers the right and ability to pull their data at any subsequent date. (Indeed it may be simpler to assure this possibility in a centralised model than it is with a distributed model in which users will inevitably try to make their own local caches.)
The driver for the distributed model has been precisely that many data providers have stated the concern that they should maintain control over their data. The current GBIF data index is an attempt to balance the needs of users with these concerns from providers. I am sure that we will continue to review the situation, but for now my hope is that we will be able to follow just about exactly the path you describe below. I see this work on GUIDs as an important step in making it possible to do this much better than we have been able to do up to now.
All of this will however critically require a careful review of the real expectations, fears and needs of our data providers. They have up to this point not given us any explicit mandate to offer complete data dumps. We need therefore to work with them to develop a legitimate and acceptable strategy.
Donald
---------------------------------------------------------------
Donald Hobern (dhobern@gbif.org)
Programme Officer for Data Access and Database Interoperability
Global Biodiversity Information Facility Secretariat
Universitetsparken 15, DK-2100 Copenhagen, Denmark
Tel: +45-35321483 Mobile: +45-28751483 Fax: +45-35321480
---------------------------------------------------------------

-----Original Message-----
From: Taxonomic Databases Working Group GUID Project [mailto:TDWG-GUID@LISTSERV.NHM.KU.EDU] On Behalf Of Roderic Page
Sent: 04 January 2006 13:51
To: TDWG-GUID@LISTSERV.NHM.KU.EDU
Subject: Re: RDF query and inference in a distributed environment
I wonder whether at some point we need to think carefully about why we have a distributed model in the first place. Is it the best choice? Did we choose it, or has it emerged primarily because data providers want/need to keep control over "their" data?
In the context of bioinformatics, I'm not aware of any large scale distributed environments that actually work (probably ignorance on my part). The largest databases (GenBank, PubMed, EMBL, DDBJ) have all the data stored centrally, and mirror at least some of it (e.g., sequences are mirrored between GenBank, EMBL, and DDBJ). Issues of synchronisation are therefore largely limited to these big databases, and hence manageable.
My sense is that there is a lot of computer science research on federated databases, but few actual large-scale systems that people make regular use of.
What is present in bioinformatics are large numbers of "value added" databases that take GenBank, PubMed, etc. and do neat things with them. This is possible because you can download the entire database. Each one of these value added databases does need to deal with the issue of what happens when GenBank (say) changes, but because GenBank has well defined releases, essentially they can grab a new copy of the data, update their local copy, and regenerate their database.
Having access to all the data makes all kinds of things possible which are harder to do if the data is distributed. I'd argue that part of the success of bioinformatics is because of data availability.
Hence, my own view (at least today) is:
1. Individual data providers manage their own data, and also make available their data in the following ways:
   i) provide GUIDs and metadata (e.g., LSIDs)
   ii) provide a basic, standard search web service
   iii) provide their own web interface
   iv) periodically provide complete dump of data

2. GBIF (or some equivalent) takes on the job of harvesting all providers, building a warehouse, and making that data available through
   i) web interface
   ii) web services
   iii) complete data dump
3. Researchers can decide how best to make use of this data. They may wish to get a complete "GBIF" dump and install that locally, or query GBIF, or they may wish to query the individual providers for the most up to date information, or some mixture of this.
Issues of synchronisation are dealt with by GBIF and its providers, which I think essentially amounts to having versioning and release numbers (but I'm probably being naive).
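To make the "versioning and release numbers" idea concrete, here is a minimal sketch in Python of how a warehouse might decide whether to reload a provider's dump. Everything in it is an assumption made for illustration: the names ProviderDump and Warehouse, the integer release number, and the rule that a new release fully replaces the provider's previous records. It is not a description of how GBIF or any provider actually works.

```python
from dataclasses import dataclass, field


@dataclass
class ProviderDump:
    """A complete data dump from one provider (hypothetical shape)."""
    provider: str
    release: int                # assumed to increase with every new dump
    records: dict[str, dict]    # GUID -> record metadata


@dataclass
class Warehouse:
    """Central copy of harvested records, keyed by GUID (hypothetical)."""
    records: dict[str, dict] = field(default_factory=dict)
    owners: dict[str, str] = field(default_factory=dict)    # GUID -> provider
    releases: dict[str, int] = field(default_factory=dict)  # provider -> release held

    def harvest(self, dump: ProviderDump) -> bool:
        """Load a dump only if it is newer than the one held; True if loaded."""
        if self.releases.get(dump.provider, -1) >= dump.release:
            return False  # nothing new from this provider

        # Drop everything previously loaded from this provider, so records
        # the provider has withdrawn also disappear from the warehouse.
        stale = [g for g, p in self.owners.items() if p == dump.provider]
        for guid in stale:
            del self.records[guid]
            del self.owners[guid]

        for guid, record in dump.records.items():
            self.records[guid] = record
            self.owners[guid] = dump.provider
        self.releases[dump.provider] = dump.release
        return True
```

Replacing a provider's records wholesale on each release is the simplest policy and mirrors the "grab a new copy and regenerate" approach described above for GenBank; an incremental diff keyed on GUIDs would be an alternative where dumps are large.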
Probably 1iv and 2iii are going to cause some issues, and this is off topic, but if bioinformatics is anything to go by, if we don't make our data available in bulk we are tying our own hands. However, this is obviously something each individual provider will have to decide upon themselves.
My other feeling is that from the point of view of end users (and I class myself as one) the real game will be services, especially search (think "Google Life"). And my feeling is that this won't work if queries are done in a distributed fashion -- the Web is supremely distributed, but Google doesn't query the Web, it queries its local copy.
In summary, I think the issue raised by Rich is important, but is one to be addressed by whoever takes on the task of assembling a data warehouse from the individual providers. Of course, once providers make their data available, anybody can do this...
Regards
Rod
----------------------------------------
Professor Roderic D. M. Page
Editor, Systematic Biology
DEEB, IBLS
Graham Kerr Building
University of Glasgow
Glasgow G12 8QP
United Kingdom

Phone: +44 141 330 4778
Fax: +44 141 330 2792
email: r.page@bio.gla.ac.uk
web: http://taxonomy.zoology.gla.ac.uk/rod/rod.html
reprints: http://taxonomy.zoology.gla.ac.uk/rod/pubs.html

Subscribe to Systematic Biology through the Society of Systematic Biologists Website: http://systematicbiology.org
Search for taxon names at http://darwin.zoology.gla.ac.uk/~rpage/portal/
Find out what we know about a species at http://ispecies.org