Re: RDF query and inference in a distributed environment
Coming from an IT background rather than a taxonomic background, I have never understood the strong "ownership" that people/scientists feel for their data. This seems rather short-sighted to me: people with these concerns must have given some thought to how to maintain/expose/use "their" data in the long-term future? I can understand their concerns, but there must be a solution, otherwise their data will be no one's concern in 50 years' time when it disappears from existence.

Another thought I had, about data caching systems. Say you want to search the cached/centralised copy of the data (e.g. a GBIF cache). A list of results is returned, then you decide you want to view more details of one of the results, so you follow a link off to the associated data (this would theoretically be by using the GUID system we are discussing). This would result in viewing the details of the selected record at the location the GUID resolves to. This would always be the same location, as a GUID only resolves to a single location. Is this correct, or would the intention here be to view the cached details of the selected record (which would require a separate ID for all the cached records)? It's this navigation through the caches/repositories that I am not quite sure about.

Kevin
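To make the two navigation options concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the `urn:lsid:example.org:...` identifier, the `SOURCE` and `CACHE` dictionaries, the placeholder taxon name and the function names are for illustration only and are not taken from any GBIF or GUID specification. It assumes the cache indexes its copies by the same GUID the provider issued, in which case a cached view would not need a separate identifier.

```python
# Hypothetical provider: the authoritative records, keyed by GUID.
SOURCE = {
    "urn:lsid:example.org:specimens:123": {"taxon": "Aus bus", "depth_m": 120},
}

# Hypothetical cache: a partial copy, indexed by the same GUID (assumption).
CACHE = {
    "urn:lsid:example.org:specimens:123": {"taxon": "Aus bus"},
}


def search_cache(term):
    """Return the GUIDs of cached records whose taxon contains the term."""
    return [
        guid for guid, rec in CACHE.items()
        if term.lower() in rec["taxon"].lower()
    ]


def resolve(guid):
    """Option 1: follow the GUID to its single authoritative location."""
    return SOURCE[guid]


def view_cached(guid):
    """Option 2: show the cached copy, looked up by the same GUID."""
    return CACHE[guid]


hits = search_cache("aus")
print(resolve(hits[0]))      # full record held by the provider
print(view_cached(hits[0]))  # whatever subset the cache holds
```

Under that assumption the choice between the two views is a matter of which service the link points at, not of which identifier is used.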
deepreef@BISHOPMUSEUM.ORG 5/01/2006 8:22 a.m. >>>
Wow! Lots of responses this morning. Many thanks to all who responded! Some replies:

Patricia wrote:
I agree with you about the logic in this. However, according to my daily experience with potential data providers, a lot of teaching and convincing is needed before they accept that this does not result in the loss of control over their own data. I agree that, to be convincing, robust synchronization is needed.
I understand the psychology, and agree that "expose your data online so that everyone can access it" strikes a more palatable chord than "expose your data online so that everyone can mirror it".
I agree that for machines and storage it is not that expensive. I was referring more to the human resources needed to manage the mirror. Smaller institutions do not necessarily have the funds, or cannot justify to their hierarchy that staff are devoting time to maintaining a full mirror containing mainly "references" to information coming from other institutions, but it is easier to justify the time spent contributing to the whole with the part directly concerning the institution ...
Agreed -- but I wonder really how much more time would be involved in setting up an automated mirror, compared with, e.g., a DiGIR provider. In my earlier post when I said that barriers would likely be technical, I didn't mean that technology doesn't exist (it does). Rather, I meant that in order to be successful, setting up and maintaining a mirror would have to be no more technically challenging than establishing a DiGIR provider. I'm not sure that's possible (yet).
Yes, I agree with you that robust synchronization will be needed but, as my IT colleague always reminds me, I guess we must not forget that setting up an IT infrastructure is most of the time 10% technical issues to be solved and 90% of the time solving "human problems and barriers" to make it work and accepted ...
I agree -- but part of overcoming the human problems and barriers is to make setting up a mirror site relatively easy for a non-specialist IT person. And this is really a technical challenge. The other human problems & barriers (IPR, allowing access to hard-earned data, confidence that data edit authorization rules will be enforced, etc.) also have some technical foundation, and also exist for the distributed/complementary approach.

Ricardo wrote:
Rather than being mutually exclusive, the two approaches you mention are complementary. The "distributed complementary data" approach is a fundamental part of the infrastructure necessary to build the "distributed mirror copies" you propose. That (the "distributed complementary data" approach) is essentially what we have in place now with DiGIR/BioCase/Tapir as the harvesting protocol and GBIF and other institutions as harvesters. The only missing piece of software we need to really have "an automated and robust synchronization protocol", in my opinion, is some kind of push mechanism to trigger updates in the caches.
Agreed on all points.
However, I think it is not very useful to try to standardize how the distributed mirror copies should be built and organized.
Here's where I disagree somewhat. Standardized mirror protocols would be a fundamental necessity of the paradigm I described. The paradigm would simply not work without such standardization. Non-standard means non-interactive, and therefore no automated synchronization. What makes it useful is that every participating organization has local (=high performance) access to a full biodiversity dataset. Another thing that makes it useful is that, rather than needing *all* data providers to be fully functional at any given moment for 100% data retrieval, only one mirror site needs to be functional. I understand this is the role that GBIF already serves -- and perhaps that's all that's really needed.
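The availability point can be illustrated with a small Python sketch. It is not a proposed protocol: `make_mirror`, `query_any` and `MirrorDown` are hypothetical names, and each "mirror" is just an in-memory function standing in for a remote server that is assumed to hold a full copy of the data.

```python
class MirrorDown(Exception):
    """Raised by a hypothetical mirror that cannot answer right now."""


def make_mirror(records, up=True):
    """Build a stand-in mirror: a function answering queries from a full copy."""
    def query(name):
        if not up:
            raise MirrorDown("mirror unavailable")
        return records.get(name)
    return query


def query_any(mirrors, name):
    """Try each mirror in turn; the first one that responds answers the query."""
    for mirror in mirrors:
        try:
            return mirror(name)
        except MirrorDown:
            continue
    raise MirrorDown("no mirror reachable")


full_copy = {"Aus bus": "record held by provider A"}
mirrors = [make_mirror(full_copy, up=False), make_mirror(full_copy, up=True)]
print(query_any(mirrors, "Aus bus"))
```

Because every mirror holds the whole dataset, the query succeeds as long as any single mirror is up. In the purely distributed case, the same query would need every relevant provider to be reachable.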
A (socially) decentralized schema would work better in this case: data providers would make available data they create and that is under their direct custody, and individual harvesters would be free to look at the metadata being served and create their own caches, selectively harvesting only the information that is relevant to the services they intend to provide.
There really should be no difference, as long as data edit authorization protocols are as robust as the synchronization protocols. If I'm exposing my specimen data for the world to access (e.g., via DiGIR/BioCase/Tapir), then it shouldn't bother me that GBIF is mirroring it -- as long as GBIF includes metadata about data ownership, and as long as the only people who can edit the data are people who I authorize to do so. And if GBIF is mirroring it, what difference is there if there are 100, or 1000, or 10,000 other servers mirroring it (same assumptions)? The two things that are important to me are: 1) corrections to data get propagated quickly to mirrors; and 2) only people who I authorize to edit my data may do so. Other than that, it doesn't really matter if the data are only on my server's hard drive, or are replicated on GBIF's hard drive, or are replicated on 10,000 server hard drives. But the 10,000 hard drives option gives me two advantages: 1) confidence that the data will survive even continental-scale catastrophes; and 2) it means that my server's hard drive (one of the 10,000) has local and immediate access to everyone else's data (for use as authorities, etc.)

Rod Page wrote: [an excellent and thought-provoking post, making important, if uncomfortable, observations]
What is present in bioinformatics are large numbers of "value added" databases that take GenBank, PubMed, etc. and do neat things with them. This is possible because you can download the entire database.
Exactly! The real value of portals to end users is the value-added "neat things". I think we ought to encourage as many enterprising biology/data nerds as possible to build such things.
Each one of these value added databases does need to deal with the issue of what happens when GenBank (say) changes, but because GenBank has well defined releases, essentially they can grab a new copy of the data, update their local copy, and regenerate their database.
Wouldn't it be so much nicer if the protocols were in place to automatically propagate updates to the value-added sites in real time, so that the aggregated databases were effectively self-maintaining? I guess the fundamental question of this whole discussion ultimately boils down to "push" vs. "pull" paradigms.
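The "push" vs. "pull" distinction can be sketched in a few lines of Python. This is an in-memory illustration only: `Provider`, `PullAggregator` and `PushAggregator` are hypothetical classes, the version counter loosely stands in for a release-style refresh, and the callback stands in for whatever notification message a real synchronization protocol would define.

```python
class Provider:
    """Hypothetical data provider that owns the authoritative records."""

    def __init__(self):
        self.records = {}
        self.version = 0
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def update(self, guid, data):
        self.records[guid] = data
        self.version += 1
        # Push: notify every registered aggregator immediately.
        for notify in self.subscribers:
            notify(guid, data)


class PullAggregator:
    """Copies the provider's data only when poll() is called."""

    def __init__(self, provider):
        self.provider = provider
        self.copy = {}
        self.seen_version = 0

    def poll(self):
        # Pull: re-fetch everything if the provider has changed since last time.
        if self.provider.version != self.seen_version:
            self.copy = dict(self.provider.records)
            self.seen_version = self.provider.version


class PushAggregator:
    """Receives each change as the provider makes it."""

    def __init__(self, provider):
        self.copy = {}
        provider.subscribe(self.on_change)

    def on_change(self, guid, data):
        self.copy[guid] = data


provider = Provider()
pull, push = PullAggregator(provider), PushAggregator(provider)
provider.update("guid-1", "corrected locality")
print(push.copy)  # already holds the correction
print(pull.copy)  # still empty until the next poll
pull.poll()
print(pull.copy)  # now matches the provider
```

The pull aggregator is only as fresh as its last poll, while the push aggregator depends on the provider knowing who its subscribers are, which is the part that would need a shared protocol.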
In summary, I think the issue raised by Rich is important, but is one to be addressed by whoever takes on the task of assembling a data warehouse from the individual providers. Of course, once providers make their data available, anybody can do this...
Yes, but anybody (everybody) has to rebuild it themselves from scratch. What I'm suggesting is to use the TDWG/GBIF/GUID standard community to come up with standard protocols to lower the technological bar of becoming an aggregator -- so much so that it becomes the real foundation of data distribution, dissemination, and access (and makes it feasible for the enterprising biology/data nerds to make good use of their late night hours....)

In a later post, Rod wrote:
I think there is a continuum of possibilities:
1. Pure distributed - query always sent to remote sources, no data ever held locally
2. Distributed with cache - results from a user query are cached (either for a limited time, or until source sends a message saying its data has been updated).
3. Distributed with partial local copy - some information harvested from sources stored locally (e.g., metadata), detailed information only held by sources.
4. Not distributed (data warehouse) - all data from distributed sources held locally (harvested), with periodic updates.
I would add to this list:
5. Distributed Warehouses -- all data is mirrored across many warehouses, automatically kept in synchronization in real time. Queries are sent to any one warehouse.

Also, Rod's reference to "The Cathedral and the Bazaar" is excellent and highly relevant. The Hawaii Biological Survey is a good example of this in our community.

Chuck Miller wrote:
To cache or mirror data in multiple locations, the dilemma lies in the simplest question: "Where is my data?" A cache or mirror from a technologist's perspective is just a technical trick, identical to the original, dependent upon the master version, no big deal. But, simplistically, the data may be actually located in another country, no matter how it got there, and some politicians could misconstrue the whole thing, if not properly negotiated and agreed upon in advance. In my experience, negotiating such agreements is more work than the technical development.
I agree the political issues may be show-stoppers -- but mostly because of unwarranted paranoia on the part of the data owners (unfortunate, but certainly real). But the reality is, any data exposed on a distributed provider (indeed, any data published in any form -- electronic or paper) is already "out there", and located in many different countries. Data contained in a printed publication simultaneously exist in many countries. The only difference with the distributed warehouse approach is that the data are much, MUCH more powerful. Perhaps the real political barrier is putting so much power into so many hands. Sort of like the World Wide Web itself.
I think the distributed nature of DiGIR was critical to selling it at the start of GBIF. The original design assured providers that their source data would "stay" in their country and not be wholesale copied somewhere else. It's hard to say what the political effect of creating mirrors would be.
Agreed on all points. There was a time when the world wasn't even ready for the DiGIR approach. Perhaps it's too soon to push it to the next level. Bear in mind, though, that all these IPR/political issues relate mostly to specimen data. The real value of the "distributed warehouses" approach to the warehouse providers themselves lies in the things that would traditionally be thought of as "authorities" (taxon names, reference citations, geographic gazetteers, etc.) Maybe the distributed warehouse approach will work well for these kinds of data, but not for specimen data?

O.K., I should probably stop here and get to work....

Aloha,
Rich

Richard L. Pyle, PhD
Database Coordinator for Natural Sciences
and Associate Zoologist in Ichthyology
Department of Natural Sciences, Bishop Museum
1525 Bernice St., Honolulu, HI 96817
Ph: (808)848-4115, Fax: (808)847-8252
email: deepreef@bishopmuseum.org
http://hbs.bishopmuseum.org/staff/pylerichard.html