Roger wrote:
No one would claim to design a conventional relational structure (or perhaps generate one from an XML Schema?) that was guaranteed to perform equally quickly for any arbitrary query. All real-world relational schemas are optimized for a particular purpose. When people ...
I would not expect a triple store to behave well for all possible queries. However, my feeling is that a triple store in an RDBMS will very soon hit a ceiling. Essentially it is an "overnormalized" model, i.e. the parts of an entity are themselves treated as independent entities.
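To illustrate what I mean by "overnormalized", here is a rough sketch in Python using SQLite from the standard library; the table and column names are of course only invented for illustration, not any real schema. In the conventional model one row carries all attributes of a specimen; in the generic triple table every attribute becomes a row of its own, so even a simple two-attribute question already needs one self-join per attribute:

import sqlite3

con = sqlite3.connect(":memory:")

# Conventional model: one row per specimen, one column per attribute.
con.execute("""CREATE TABLE specimen (
    id INTEGER PRIMARY KEY, taxon_name TEXT, country TEXT, year INTEGER)""")
con.execute("INSERT INTO specimen VALUES (1, 'Puccinia graminis', 'Germany', 1998)")

# "Overnormalized" triple model: every attribute is an independent row.
con.execute("CREATE TABLE triple (subject TEXT, predicate TEXT, object TEXT)")
con.executemany("INSERT INTO triple VALUES (?, ?, ?)", [
    ("specimen:1", "taxon_name", "Puccinia graminis"),
    ("specimen:1", "country",    "Germany"),
    ("specimen:1", "year",       "1998"),
])

# The same question, two ways: one table lookup vs. a self-join per attribute.
print(con.execute(
    "SELECT id FROM specimen WHERE taxon_name = ? AND country = ?",
    ("Puccinia graminis", "Germany")).fetchall())

print(con.execute("""
    SELECT t1.subject
    FROM triple t1 JOIN triple t2 ON t1.subject = t2.subject
    WHERE t1.predicate = 'taxon_name' AND t1.object = 'Puccinia graminis'
      AND t2.predicate = 'country'    AND t2.object = 'Germany'""").fetchall())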
With RDBMSs (based on experience more than theory) I can say that a good relational model holds up for an astonishingly wide range of queries. That is, I only exceptionally "optimize" the model for particular queries (largely by adding additional indices), and my experience is that the query optimizer (exactly because I do not tell it how to solve the query) works well even with rather unexpected queries.
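As a small sketch of what such an "optimization" amounts to (again with invented names): the only change is an additional index; the query itself stays declarative and the optimizer decides whether and how to use it.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE specimen (
    id INTEGER PRIMARY KEY, taxon_name TEXT, country TEXT, year INTEGER)""")

# The only "optimization": an additional index; no query is rewritten.
con.execute("CREATE INDEX idx_specimen_taxon ON specimen (taxon_name)")

# The optimizer, not the application, decides how to answer the query.
for row in con.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM specimen WHERE taxon_name = ?",
        ("Puccinia graminis",)):
    print(row)  # reports e.g. a search of specimen using idx_specimen_taxon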
As an example: Donald would be mad to build the GBIF data portal as a basic triple store, because 90% of queries are currently going to be the same, i.e. by taxon name and geographical area. ...
I agree that any hypothetical RDF-based GBIF portal is a bad candidate for implementation over a triple store. To build something like a GBIF portal, I'd use a design that is a combination of an indexer and an aggregator. In my mind, an aggregator is a piece of software that harvests RDF from providers and stores it in a persistent cache. An indexer either harvests RDF or scans an aggregator's cache in order to build indexes that can be used to rapidly answer simple queries. ...
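A rough sketch of this aggregator/indexer split, assuming the rdflib library (the provider URL and the indexed Darwin Core term are only placeholders, not a real GBIF interface): the aggregator keeps the harvested RDF verbatim in a persistent cache, while the indexer extracts just the few properties needed to answer simple queries quickly.

from rdflib import Graph, URIRef

PROVIDERS = ["http://provider.example.org/collection.rdf"]  # placeholder
SCIENTIFIC_NAME = URIRef("http://rs.tdwg.org/dwc/terms/scientificName")

def aggregate(providers):
    """Harvest RDF from each provider into one persistent cache graph."""
    cache = Graph()
    for url in providers:
        cache.parse(url, format="xml")                     # RDF/XML harvest
    cache.serialize(destination="cache.nt", format="nt")   # persistent cache
    return cache

def build_index(cache):
    """Scan the cache and build a simple lookup: name -> subject URIs."""
    index = {}
    for subject, _, name in cache.triples((None, SCIENTIFIC_NAME, None)):
        index.setdefault(str(name), []).append(subject)
    return index

# Simple queries (e.g. "records for this taxon name") hit only the index;
# the full RDF stays untouched in the aggregator's cache.
# index = build_index(aggregate(PROVIDERS))
# print(index.get("Puccinia graminis", []))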
This discussion is exactly about the one-way-street use case for data that I can see RDF being good for. However, in the case of taxonomic and descriptive data, would it not be absurd to be unable to build upon existing data?
The underlying assumption in GBIF seems to be that knowledge is institutionalized, and that adding knowledge is done only within the institution. I believe that this is true for specimens, and may be desirable to become true for names. However, these examples are rather the exception (and essentially boring infrastructure cases - no one is interested in specimens or names per se).
The assumption does not hold true for the truly relevant forms of knowledge about biology, including species concepts, taxonomic revisions, organism properties and interactions, and identification. Here knowledge has for centuries been expressed in short articles or monographs. The goal of many of us (SDD and others) is to move this into digital form and to become able to share this knowledge. That means that I must be able to write software that fully captures all data in a document - and a triple store seems to be the only way to handle RDF-based data.
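A small sketch of what "fully captures all data" means in practice, again assuming rdflib and a placeholder document URL: a generic triple store keeps every statement of a document without any schema-specific code, so the document can be written back out without loss.

from rdflib import Graph

doc = Graph()
doc.parse("http://example.org/treatment.rdf", format="xml")  # placeholder URL

# Every statement is retained as a (subject, predicate, object) triple,
# whatever vocabulary the document uses ...
for s, p, o in doc:
    print(s, p, o)

# ... so the document can be re-serialized without losing data.
doc.serialize(destination="captured.rdf", format="xml")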
--- Personally I think that the current data-provider/data-aggregator mode of GBIF is already counterproductive in the case of species data. In my scientific work I am unable to make use of GBIF information, because it no longer corresponds to interpretable trust networks, but refers to uninterpretable aggregators and super-aggregators like Species 2000. ---
This is not to say that RDF is not the right choice for exactly the knowledge documents that TAxMLit, SDD and others want to share (i.e. edited on both sides, with different, TDWG-standard-compatible software ...). But I am worried that the generality of RDF *may* make it impossible to limit the number of options, choices, etc. that a piece of software has to deal with, and may ONLY allow super-generalized software like reasoning engines and triple stores.
Gregor

----------------------------------------------------------
Gregor Hagedorn (G.Hagedorn@bba.de)
Institute for Plant Virology, Microbiology, and Biosafety
Federal Research Center for Agriculture and Forestry (BBA)
Königin-Luise-Str. 19          Tel: +49-30-8304-2220
14195 Berlin, Germany          Fax: +49-30-8304-2203