RDF query and inference in a distributed environment

Wed Jan 4 13:28:09 CET 2006

Having to search the GBIF Index first and then query each individual record at its source makes it almost impossible to do any medium- or large-scale data integration.

Including more information within the "Index" in cached form would allow a greater range of studies that involve large-scale data integration.  For example, species modelling can be done with name, latitude/longitude, etc., but a study of a particular collector could not be done if the collector's name were not included.

So the list of 4(5) methods for data query is not that simple, as there are data that will fit into more than one category: some can be warehoused/mirrored, whereas other data may not be and can only be searched via a second query to the source.  One issue that needs to be discussed is what data need to (or should) be included in the caches/mirrored warehouses etc., and what need not be stored there (at this stage).
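
To make the cached-versus-second-query distinction concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the field names, the CACHED_FIELDS set, and the fetch_from_source callback are hypothetical and do not describe how the GBIF Index or any DiGIR provider actually works.

```python
# Hypothetical sketch: the field names, CACHED_FIELDS and fetch_from_source
# are illustrative assumptions, not part of GBIF, DiGIR or any real API.
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]

# Fields assumed to be held in the central cache ("Index").
CACHED_FIELDS = {"guid", "scientific_name", "latitude", "longitude"}


def query(
    cache: Iterable[Record],
    wanted_fields: List[str],
    fetch_from_source: Callable[[str], Record],
) -> List[Record]:
    """Answer from the cache alone when every wanted field is cached;
    otherwise follow each record's GUID back to its source."""
    needs_source = not set(wanted_fields) <= CACHED_FIELDS
    results = []
    for cached in cache:
        record = dict(cached)
        if needs_source:
            # Second query: one round trip to the provider per record.
            record.update(fetch_from_source(str(cached["guid"])))
        results.append({field: record.get(field) for field in wanted_fields})
    return results


# Made-up sample data standing in for a cache and a remote provider.
sample_cache = [
    {"guid": "urn:example:1", "scientific_name": "Genus species",
     "latitude": -35.3, "longitude": 149.1},
]
sample_source = {"urn:example:1": {"collector": "A. Collector"}}
source_calls: List[str] = []


def sample_fetch(guid: str) -> Record:
    """Stand-in for a remote lookup; records each call it receives."""
    source_calls.append(guid)
    return sample_source[guid]
```

In this sketch a modelling query for name and coordinates never touches the provider, while a query for the collector costs one provider request per record. That per-record cost is what makes large-scale integration impractical when a field is left out of the cache.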

But we are getting off track from the discussion of technical issues wrt GUIDs etc.  Whatever method is used, we will need GUIDs of some sort, and this discussion has been very interesting and informative.

It has taken many years to bring the biological collection community this far (and still many institutions are reluctant to make their data available via GBIF) - to even propose that ALL their data be warehoused in mirror sites around the world would, I fear, cause many to withdraw even the data they are making available now. As desirable as I believe it is to make as much of these data freely available as possible, we have to be careful or we may lose much of what we have already gained.  It is a political/social issue, and as such is a very difficult one to negotiate.

Cheers

Arthur D. Chapman

>>>From Kevin Richards <RichardsK at LANDCARERESEARCH.CO.NZ> on 4 Jan 2006:

> Coming from an IT background rather than a taxonomic background, I have
> never understood the strong "ownership" of data that people/scientists
> have for their data.  This seems rather short-sighted to me - people with
> these concerns must have some thought about how to maintain/expose/use
> "their" data in the long term future?  I can understand their concerns,
> but there must be a solution, otherwise their data will be no one's
> concern in 50 years time when it disappears from existence.
>
> Another thought I had about data caching systems.  Say you want to
> search the cached/centralised copy of the data (eg a GBIF cache).  A list
> of results is returned, then you decide you want to view more details of
> one of the results, so you follow a link off to the associated data (this
> would theoretically be by using the GUID system we are discussing).
> This would result in viewing the details of the selected record at the
> location where the GUID resolves to - this would always be the same
> location as a GUID only resolves to a single location.  Is this correct,
> or would the intention here be to view the cached details of the selected
> record (which would require a separate ID for all the cached records)?
> It's this navigation through the caches/repositories that I am not quite
> sure of how it will work?
>
> Kevin
>
> >>> deepreef at BISHOPMUSEUM.ORG 5/01/2006 8:22 a.m. >>>
>
> Wow!  Lots of responses this morning. Many thanks to all who responded!
> Some replies:
>
> Patricia wrote:
>
> > I agree with you about the logic in this. However, according to my daily
> > experience with potential data providers, there is a lot of teaching and
> > convincing needed to make this logic accepted that this does not result
> > in the loss of control over own data. I agree that to be convincing
> > a robust synchronization is needed.
>
> I understand the psychology, and agree that the barrier for "expose your
> data online so that everyone can access it" strikes a more palatable chord
> than "expose your data online so that everyone can mirror it".
>
> > I agree that for machines and storage it is not that expensive. I was
> > more referring to the human resources needed to manage the
> > mirror. Smaller institutions do not necessarily have the funds or cannot
> > justify to their hierarchy that staff is devoting time to maintain a
> > full mirror containing mainly "references" to information coming from
> > other institutions, but it is easier to justify the time spent to
> > contribute to the whole with the part directly concerning the
> > institution ...
>
> Agreed -- but I wonder really how much more time would be involved in
> setting up an automated mirror, compared with, e.g., a DiGIR provider.  In
> my earlier post when I said that barriers would likely be technical, I
> didn't mean that technology doesn't exist (it does).  Rather, I meant that
> in order to be successful, setting up and maintaining a mirror would have
> to be no more technically challenging than establishing a DiGIR provider.
> I'm not sure that's possible (yet).
>
> > Yes I agree with you that robust synchronization will be needed but
> > as my IT colleague always reminds me, I guess we must not forget that
> > setting up an IT infrastructure is most of the time 10% technical
> > issues to be solved and 90% of the time solving "human problems
> > and barriers" to make it work and accepted ...
>
> I agree -- but part of overcoming the human problems and barriers is to
> make setting up a mirror site relatively easy for a non-specialist IT
> person. And this is really a technical challenge.  The other human
> problems & barriers (IPR, allowing access to hard-earned data, confidence
> that data edit authorization rules will be enforced, etc.) also have some
> technical foundation, and also exist for the distributed/complementary
> approach.
>
> Ricardo wrote:
>
> >     Instead of being mutually exclusive, both approaches you mention are
> > complementary. The "distributed complementary data" approach is a
> > fundamental part of the infrastructure necessary to build the
> > "distributed mirror copies" you propose. That (the "distributed
> > complementary data" approach) is essentially what we have in place now
> > with DiGIR/BioCase/Tapir as the harvesting protocol and GBIF and other
> > institutions as harvesters. The only missing piece of software we need
> > to really have "an automated and robust synchronization protocol", in my
> > opinion, is some kind of push mechanism to trigger updates in the caches.
>
> Agreed on all points.
>
> >     However, I think it is not very useful to try to standardize how the
> > distributed mirror copies should be built and organized.
>

=== message truncated ===