Hi Rich,
Richard Pyle wrote:
With hard drive sizes spiraling skyward, and $/GB ($/TB) spiraling downward.... I'm wondering whether or not the "distributed" system that serves us best might be "distributed mirror copies", rather than distributed complementary data.
Rather than being mutually exclusive, the two approaches you mention are complementary. The "distributed complementary data" approach is a fundamental part of the infrastructure needed to build the "distributed mirror copies" you propose. That approach is essentially what we have in place now, with DiGIR/BioCase/Tapir as the harvesting protocols and GBIF and other institutions as harvesters. The only missing piece of software we need to really have "an automated and robust synchronization protocol" is, in my opinion, some kind of push mechanism to trigger updates in the caches.
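Just to make the idea concrete, here is a minimal sketch of such a push mechanism, assuming the harvester exposes a notification endpoint (the URL and payload shape below are my invention, not part of any existing DiGIR/BioCase/Tapir specification):

    import json
    import urllib.request

    # Hypothetical endpoint where a cache listens for change
    # notifications; URL and payload are assumptions for illustration.
    HARVESTER_NOTIFY_URL = "http://cache.example.org/notify"

    def notify_update(record_guid, change="modified"):
        """Tell a downstream cache that one record changed, so it can
        re-harvest just that record instead of waiting for a full crawl."""
        payload = json.dumps({"guid": record_guid, "change": change})
        request = urllib.request.Request(
            HARVESTER_NOTIFY_URL,
            data=payload.encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(request) as response:
            return response.status  # 200 = cache queued a re-harvest

    # e.g. notify_update("urn:lsid:example.org:specimens:12345")

The point is only that providers push a small "this GUID changed" message, and each cache decides when and how to re-harvest.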
However, I don't think it is very useful to try to standardize how the distributed mirror copies should be built and organized, nor do I think that having exact copies of all metadata everywhere is very helpful. With an RDF approach to metadata, it may not be feasible or desirable to harvest every single triple there is. A (socially) decentralized scheme would work better in this case: data providers would make available the data they create and hold under their direct custody, and individual harvesters would be free to look at the metadata being served and build their own caches, selectively harvesting only the information relevant to the services they intend to provide.
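As an illustration of what such selective harvesting could look like, here is a small sketch using rdflib (the provider URL and the choice of predicates are assumptions for the example):

    from rdflib import Graph, Namespace

    DC = Namespace("http://purl.org/dc/elements/1.1/")

    # Hypothetical provider document, for illustration only.
    full = Graph()
    full.parse("http://provider.example.org/metadata.rdf")

    # Selective harvest: keep only the predicates this cache cares
    # about (here titles and creators), ignoring all other triples.
    wanted = {DC.title, DC.creator}
    cache = Graph()
    for s, p, o in full:
        if p in wanted:
            cache.add((s, p, o))

    cache.serialize(destination="local_cache.ttl", format="turtle")

Each harvester would pick its own "wanted" set according to the services it intends to provide.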
One important point in the GUID discussion is to make sure that IPR statements are part of the metadata and that ownership is unambiguous, regardless of whether or not a harvester leaves that information intact. As long as the aggregator keeps a reference to the original record (using its GUID), any of the GUID technologies under evaluation lets you get back to the ownership information.
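A tiny example of that traceability, again with rdflib (the GUIDs and rights text are made up):

    from rdflib import Graph, Literal, Namespace, URIRef

    DC = Namespace("http://purl.org/dc/elements/1.1/")

    # Hypothetical GUIDs: the provider's original record and the
    # aggregator's cached copy of it.
    original = URIRef("urn:lsid:provider.example.org:specimens:12345")
    copy = URIRef("http://aggregator.example.org/records/98765")

    g = Graph()
    # The copy keeps a pointer back to the original record...
    g.add((copy, DC.source, original))
    # ...while the provider's metadata carries the IPR statement.
    g.add((original, DC.rights,
           Literal("(c) Example Museum, all rights reserved")))

    # Even if the aggregator dropped the rights text from its copy,
    # following the source link back to the original GUID recovers
    # the ownership information.
    for _, _, src in g.triples((copy, DC.source, None)):
        for _, _, rights in g.triples((src, DC.rights, None)):
            print(copy, "derives from", src, "-", rights)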
Regards,
Ricardo