Roger wrote:
No one would claim to design a conventional relational structure (or perhaps generate one from an XML Schema?) that was guaranteed to perform equally quickly for any arbitrary query. All real-world relational schemas are optimized for a particular purpose. When people ...
I would not expect a triple store to behave well for all possible queries. However, my feeling is that a triple store in an RDBMS will very soon hit a ceiling. Essentially it is an "overnormalized" model, i.e. the parts of an entity are themselves treated as independent entities.
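To illustrate what I mean by "overnormalized", here is a rough sketch in Python using SQLite from the standard library; the table and column names are of course only invented for illustration, not any real schema. In the conventional model one row carries all attributes of a specimen; in the generic triple table every attribute becomes a row of its own, so even a simple two-attribute question already needs one self-join per attribute:

import sqlite3

con = sqlite3.connect(":memory:")

# Conventional model: one row per specimen, one column per attribute.
con.execute("""CREATE TABLE specimen (
    id INTEGER PRIMARY KEY, taxon_name TEXT, country TEXT, year INTEGER)""")
con.execute("INSERT INTO specimen VALUES (1, 'Puccinia graminis', 'Germany', 1998)")

# "Overnormalized" triple model: every attribute is an independent row.
con.execute("CREATE TABLE triple (subject TEXT, predicate TEXT, object TEXT)")
con.executemany("INSERT INTO triple VALUES (?, ?, ?)", [
    ("specimen:1", "taxon_name", "Puccinia graminis"),
    ("specimen:1", "country",    "Germany"),
    ("specimen:1", "year",       "1998"),
])

# The same question, two ways: one table lookup vs. a self-join per attribute.
print(con.execute(
    "SELECT id FROM specimen WHERE taxon_name = ? AND country = ?",
    ("Puccinia graminis", "Germany")).fetchall())

print(con.execute("""
    SELECT t1.subject
    FROM triple t1 JOIN triple t2 ON t1.subject = t2.subject
    WHERE t1.predicate = 'taxon_name' AND t1.object = 'Puccinia graminis'
      AND t2.predicate = 'country'    AND t2.object = 'Germany'""").fetchall())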
With RDBMSs (based on experience more than theory) I can say that a good relational model holds up for an astonishingly wide range of queries. That is, I only exceptionally "optimize" the model for particular queries (largely by adding additional indices), and my experience is that the query optimizer (exactly because I do not tell it how to solve the query) works well even with rather unexpected queries.
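As a small sketch of what such an "optimization" amounts to (again with invented names): the only change is an additional index; the query itself stays declarative and the optimizer decides whether and how to use it.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE specimen (
    id INTEGER PRIMARY KEY, taxon_name TEXT, country TEXT, year INTEGER)""")

# The only "optimization": an additional index; no query is rewritten.
con.execute("CREATE INDEX idx_specimen_taxon ON specimen (taxon_name)")

# The optimizer, not the application, decides how to answer the query.
for row in con.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM specimen WHERE taxon_name = ?",
        ("Puccinia graminis",)):
    print(row)  # reports e.g. a search of specimen using idx_specimen_taxon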
As an example: Donald would be mad to build the GBIF data portal as a basic triple store, because 90% of queries are currently going to be the same, i.e. by taxon name and geographical area. ...
I agree that any hypothetical RDF-based GBIF portal is a bad candidate for implementation over a triple store. To build something like a GBIF portal, I'd use a design that is a combination of an indexer and an aggregator. In my mind, an aggregator is a piece of software that harvests RDF from providers and stores it in a persistent cache. An indexer either harvests RDF or scans an aggregator's cache in order to build indexes that can be used to rapidly answer simple queries. ...
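A rough sketch of this aggregator/indexer split, assuming the rdflib library (the provider URL and the indexed Darwin Core term are only placeholders, not a real GBIF interface): the aggregator keeps the harvested RDF verbatim in a persistent cache, while the indexer extracts just the few properties needed to answer simple queries quickly.

from rdflib import Graph, URIRef

PROVIDERS = ["http://provider.example.org/collection.rdf"]  # placeholder
SCIENTIFIC_NAME = URIRef("http://rs.tdwg.org/dwc/terms/scientificName")

def aggregate(providers):
    """Harvest RDF from each provider into one persistent cache graph."""
    cache = Graph()
    for url in providers:
        cache.parse(url, format="xml")                     # RDF/XML harvest
    cache.serialize(destination="cache.nt", format="nt")   # persistent cache
    return cache

def build_index(cache):
    """Scan the cache and build a simple lookup: name -> subject URIs."""
    index = {}
    for subject, _, name in cache.triples((None, SCIENTIFIC_NAME, None)):
        index.setdefault(str(name), []).append(subject)
    return index

# Simple queries (e.g. "records for this taxon name") hit only the index;
# the full RDF stays untouched in the aggregator's cache.
# index = build_index(aggregate(PROVIDERS))
# print(index.get("Puccinia graminis", []))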
This discussion is exactly about the one-way-street use case for data that I can see RDF being good for. However, in the case of taxonomic and descriptive data, would it not be absurd to be unable to build upon existing data?
The underlying assumption in GBIF seems to be that knowledge is institutionalized, and that adding knowledge is done only within the institution. I believe that this is true for specimens, and may be desirable to become true for names. However, these examples are rather the exception (and essentially boring infrastructure cases - no one is interested in specimens or names per se).
The assumption does not hold true for the truly relevant forms of knowledge about biology, including species concepts, taxonomic revisions, organism properties and interactions, and identification. Here knowledge has for centuries been expressed in short articles or monographs. The goal of many of us (SDD and others) is to move this into digital form and to become able to share this knowledge. That means that I must be able to write software that fully captures all data in a document - and a triple store seems to be the only way to handle RDF-based data.
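A small sketch of what "fully captures all data" means in practice, again assuming rdflib and a placeholder document URL: a generic triple store keeps every statement of a document without any schema-specific code, so the document can be written back out without loss.

from rdflib import Graph

doc = Graph()
doc.parse("http://example.org/treatment.rdf", format="xml")  # placeholder URL

# Every statement is retained as a (subject, predicate, object) triple,
# whatever vocabulary the document uses ...
for s, p, o in doc:
    print(s, p, o)

# ... so the document can be re-serialized without losing data.
doc.serialize(destination="captured.rdf", format="xml")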
--- Personally I think that the current data-provider/data-aggregator mode of GBIF is already counterproductive in the case of species data. In my scientific work I am unable to make use of GBIF information, because it no longer corresponds to interpretable trust networks, but refers to uninterpretable aggregators and super-aggregators like Species 2000. ---
This is not to say that RDF is not the right choice for exactly the knowledge documents that TAxMLit, SDD and others want to share (i.e. edited on both sides, with different, TDWG-standard-compatible software ...). But I am worried that the generality of RDF *may* make it impossible to limit the number of options, choices, etc. that a piece of software has to deal with, and may ONLY allow super-generalized software like reasoning engines and triple stores.
Gregor

----------------------------------------------------------
Gregor Hagedorn (G.Hagedorn@bba.de)
Institute for Plant Virology, Microbiology, and Biosafety
Federal Research Center for Agriculture and Forestry (BBA)
Königin-Luise-Str. 19          Tel: +49-30-8304-2220
14195 Berlin, Germany          Fax: +49-30-8304-2203