RDF query and inference in a distributed environment

Wed Jan 4 11:14:14 CET 2006

Kevin Richards wrote:

> Coming from an IT background rather than a taxonomic background,
> I have never understood the strong "ownership" of data that
> people/scientists have for their data.  This seems rather
> short-sighted to me - people with these concerns must have some thought
> about how to maintain/expose/use "their" data in the long term
> future?  I can understand their concerns, but there must be a
> solution, otherwise their data will be no one's concern in 50
> years' time when it disappears from existence.

Philosophically, I agree with you 100%. As a scientist myself, I strongly
believe that scientific data (ESPECIALLY from government-funded research)
belongs in the public domain.  But there are socio-political realities that
must be dealt with -- and, as pointed out by Chuck, Patricia, and others,
dealt with carefully and with sensitivity. It's really the main reason we're
not further along than we currently are.

> Another thought I had about data caching systems.  Say you
> want to search the cached/centralised copy of the data (eg a
> GBIF cache).  A list of results is returned, then you decide
> you want to view more details of one of the results, so you
> follow a link off to the associated data (this would
> theoretically be by using the GUID system we are discussing).
> This would result in viewing the details of the selected
> record at the location where the GUID resolves to - this
> would always be the same location as a GUID only resolves
> to a single location.

I'm confused.  I thought that a GUID resolves to a single data record.  The
location of that data record seems to me to be an issue of resolution, not
(necessarily) an intrinsic component of the GUID<->data relationship per
se.  I don't understand why -- from a purely technological perspective -- a
GUID must be resolved through a single designated resolution service, as
opposed to any of a multitude of resolution services that conform to a
standard protocol. The main issue would be the confidence/certainty that the
same GUID would be resolved by any one of the services to the exact same
data -- which comes back to the robustness/reliability of the automated
synchronization protocols.
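
To make the consistency concern concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the resolver names, the identifier "guid-0001", and the idea of comparing SHA-256 digests of the returned records. It is not a description of any existing GUID resolution protocol.

```python
import hashlib

# Hypothetical in-memory stand-ins for independent resolution services.
# Each maps a GUID to that service's copy of the data record.
RECORD = "Centropyge boylei Pyle & Randall 2001"
resolvers = {
    "resolver_a": {"guid-0001": RECORD},
    "resolver_b": {"guid-0001": RECORD},
    "resolver_c": {"guid-0001": RECORD + " (stale copy)"},
}


def digest(record: str) -> str:
    """Fingerprint a record so that copies can be compared cheaply."""
    return hashlib.sha256(record.encode("utf-8")).hexdigest()


def resolve_everywhere(guid: str, services: dict) -> dict:
    """Return {service name: record digest} for each service that knows the GUID."""
    return {
        name: digest(store[guid])
        for name, store in services.items()
        if guid in store
    }


def is_consistent(guid: str, services: dict) -> bool:
    """True when every service that resolves the GUID returns identical data."""
    return len(set(resolve_everywhere(guid, services).values())) <= 1
```

With the sample data, `is_consistent("guid-0001", resolvers)` is false because `resolver_c` holds an unsynchronized copy; restricted to `resolver_a` and `resolver_b`, the check passes. That is the sense in which the reliability of synchronization, rather than the number of resolvers, is the real issue.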

> Is this correct, or would the intention here be to view
> the cached details of the selected record (which would
> require a separate ID for all the cached records)?

I guess it comes back to what the GUID is attached to (i.e., defined as
representing).  I view the GUID as a universally adopted surrogate
representation of a specified collection of data (information).  A more
technically rigorous interpretation of a GUID would be as a data-record
*instance* identifier. In the former case, the taxon-name data object
"Centropyge boylei Pyle & Randall 2001 [etc...]" and the specimen data
object "BPBM 35041 [etc...]" would each be represented by a single GUID,
regardless of how many instances of the bit-equivalent data records existed
on various servers around the world (e.g., my laptop computer, the Catalog
of Fishes, GBIF, ITIS, FishBase, Species2000, etc.).  In the latter case,
each of these two data objects (the taxon name and the specimen) would have
a different GUID for each database/server that a copy of the record for the
data objects lived on.
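
The difference between the two interpretations can be sketched in Python as follows. The use of name-based UUIDs, the choice of namespace, and the short server list are all assumptions made purely for illustration; they are not a proposal for how GUIDs should actually be assigned.

```python
import uuid

# Illustrative subset of the servers and data objects mentioned above.
SERVERS = ["laptop", "Catalog of Fishes", "GBIF"]
OBJECTS = ["Centropyge boylei Pyle & Randall 2001", "BPBM 35041"]

# Arbitrary namespace chosen for this sketch (an assumption).
NAMESPACE = uuid.NAMESPACE_URL


def object_guid(data_object: str) -> uuid.UUID:
    """Former case: one GUID per data object, wherever its copies live."""
    return uuid.uuid5(NAMESPACE, data_object)


def instance_guid(data_object: str, server: str) -> uuid.UUID:
    """Latter case: one GUID per record instance, i.e. per (server, object) pair."""
    return uuid.uuid5(NAMESPACE, f"{server}/{data_object}")


# Distinct identifiers needed under each interpretation.
object_level = {object_guid(o) for o in OBJECTS for _ in SERVERS}
instance_level = {instance_guid(o, s) for o in OBJECTS for s in SERVERS}
```

Under the object-level reading, the two data objects need two identifiers no matter how many servers hold copies. Under the instance-level reading, the same two objects on three servers need six, and the count grows with every new copy.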

Aloha,
Rich