[Tdwg-tag] Triple store debate ...

Fri Mar 31 15:43:44 CEST 2006

Donald writes:
> 2. Please don't judge GBIF's goals on the limited achievements of the
> current data portal (which will be discarded as soon as I can replace it -
> some time later this year).  The aim is certainly to provide clear "trust
> networks" (all the way back to original sources) and to allow all data to be
> filtered by such criteria.  

I did not mean to belittle anything GBIF has achieved - I probably should have 
made that clearer. I am not arguing against GBIF's or your achievements in 
using big data providers; I am just trying to express my thought that part (and 
I believe a big part) of the future may lie in something less organized: 
document-centric rather than institutional-database-provider-centric.

This has some implications for the RDF debate, in particular where we argue 
that big data providers will set up the conversion tools to publish excerpts of 
their proprietary data structure in RDF, and where we argue that RDF import use 
cases are relevant mostly for aggregators/indexing services. 

Your point about trust-networks is excellent (and I did participate in your 
survey asking for suggestions about the data portal...)

> retrieve information.  Again I'd really like to receive your comments on how
> e.g. the international pool of SDD data might best be handled.

I have no easy answer to this. Not only is SDD implemented only in beta stage, 
but the Delta/Lucid/DeltaAccess documents which could currently be expressed in 
SDD are also rarely made available - partly because there seems to be not 
enough value in making them available (which using them through GBIF could 
change in the future), partly because people have reservations about making 
their work available.

> As a side issue, I'm not sure how easy it really would be for us to use an
> RDBMS-based approach to support the integration of all of the disparate and
> relevant information which is (as you say) scattered through so many
> sources.  A world in which anyone can annotate any data element would seem
> much more suitable.

Perhaps indeed. I just cannot think it through yet; it seems to blow my mind. I 
feel a major point is what Steven said about the CBD being the real unit of 
information, not the triples. This rings a bell in me, but I cannot hear it 
loud enough yet. I am sorry if this is causing confusing posts from me.

An example: In SDD we think modifiers are very important.

"Flower red"
and 
"Flowers almost never red"

could in RDF be:

a) Species - FlowerColor - red

b) Species - FlowerColor - red
   ReificationOfTheAbove - Modifier - "almost never"

Getting this as independent, extensible tuples (getting the first, but not the 
second) is not real information. The whole, the "CBD", is the unit of 
information which I can criticize, reject, approve.
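
To make this concrete, here is a rough sketch in Python, not a real RDF 
implementation. It assumes that "CBD" means a concise bounded description and 
only handles the reification part of that idea (a full definition also follows 
blank nodes). The triples are plain tuples; "_:stmt1" and the helper names 
reifications_of and bounded_description are made up for illustration.

```python
# Example b) as plain (subject, predicate, object) tuples.
# "_:stmt1" is a hypothetical blank node standing for "ReificationOfTheAbove".
TRIPLES = [
    ("Species", "FlowerColor", "red"),
    ("_:stmt1", "rdf:type", "rdf:Statement"),
    ("_:stmt1", "rdf:subject", "Species"),
    ("_:stmt1", "rdf:predicate", "FlowerColor"),
    ("_:stmt1", "rdf:object", "red"),
    ("_:stmt1", "Modifier", "almost never"),
]


def reifications_of(triple, triples):
    """Return the nodes that reify exactly the given triple."""
    statements = {s for s, p, o in triples
                  if p == "rdf:type" and o == "rdf:Statement"}
    found = []
    for node in sorted(statements):
        props = {p: o for s, p, o in triples if s == node}
        reified = (props.get("rdf:subject"),
                   props.get("rdf:predicate"),
                   props.get("rdf:object"))
        if reified == tuple(triple):
            found.append(node)
    return found


def bounded_description(subject, triples):
    """Collect the subject's triples plus all triples of their reifications."""
    result = []
    for triple in triples:
        if triple[0] == subject:
            result.append(triple)
            for node in reifications_of(triple, triples):
                result.extend(t for t in triples if t[0] == node)
    return result


# Asking only for triples whose subject is "Species" loses the modifier.
naive = [t for t in TRIPLES if t[0] == "Species"]
# The bounded description keeps the statement and its modifier together.
whole = bounded_description("Species", TRIPLES)
```

Here naive contains only "Species - FlowerColor - red", which reads like 
example a), while whole also carries the "almost never" modifier. That 
difference is what I mean by the whole being the unit of information.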

Similarly, in taxon concepts expressed through character circumscription, the 
concept that can be analyzed or criticized is only the total of all descriptive 
statements, nothing less.

Now, in the xml-schema world, this boundary is assumed (though nowhere 
guaranteed) to be a document. However, in SDD we ran into exactly the opposite 
problem: we did mean to extend across servers and documents (although in a 
less atomized way than RDF). RDF would be a solution for these problems - 
perhaps we will understand the problems better if we better understand how to 
define and refer to CBDs when using RDF? I do not understand this yet.

Donald writes:
> The database you construct to support
> efficient queries need not be the same as the one that I construct, or the
> object model inside someone else's application.  The critical issue is how
> easy it is for two parties to exchange the set of objects and properties
> that they wish to share.

If you want interoperability of documents, you have to be able to map imported 
data losslessly into your inner information model. I feel that it will be very 
difficult to import data into your (permanent, editable) data store unless you 
at least use a very similar basic object ontology and a similar concept of 
cardinality constraints. Currently a major problem in consuming DarwinCore flat 
structures is that you are left to guess about relationships between multiple 
element instances. Better "boxing" in object types of DwC clearly overcomes 
this, but if you have two different boxing models (object ontology, internal 
information model) the problem probably appears worse than before.
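
A small Python sketch of the guessing problem, under stated assumptions: the 
element names "IdentifiedBy" and "DateIdentified" and all values are 
hypothetical and only stand in for repeated elements in a flat record; 
candidate_pairings is an illustrative helper, not part of any DwC tool.

```python
from itertools import permutations

# Flat record: repeated elements, nothing says which values belong together.
flat = [
    ("IdentifiedBy", "Smith"),
    ("IdentifiedBy", "Jones"),
    ("DateIdentified", "1998"),
    ("DateIdentified", "2004"),
]


def candidate_pairings(record, left, right):
    """List every way the repeated elements could be matched up."""
    lefts = [value for name, value in record if name == left]
    rights = [value for name, value in record if name == right]
    return [list(zip(lefts, order)) for order in permutations(rights)]


# Boxed record: each identification is its own object, nothing is guessed.
boxed = [
    {"IdentifiedBy": "Smith", "DateIdentified": "1998"},
    {"IdentifiedBy": "Jones", "DateIdentified": "2004"},
]

guesses = candidate_pairings(flat, "IdentifiedBy", "DateIdentified")
```

With two instances of each element the consumer already has two equally 
plausible readings in guesses, and the number grows factorially with more 
instances. The boxed form removes the guess, but only for consumers that share 
the same boxing model.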

Gregor
----------------------------------------------------------
Gregor Hagedorn (G.Hagedorn at bba.de)
Institute for Plant Virology, Microbiology, and Biosafety
Federal Research Center for Agriculture and Forestry (BBA)
Königin-Luise-Str. 19           Tel: +49-30-8304-2220
14195 Berlin, Germany           Fax: +49-30-8304-2203