Gregor,

Thanks as ever for your thoughts. I think that (at the risk of digressing too far) I should comment on the points at which your remarks relate to GBIF.

1. I repeat what I have said many times before. TDWG must not optimise its standards either for what GBIF is today or for what we may like it to become if such optimisation will in any way be harmful for other purposes. We need a good information architecture which will support the widest possible range of applications.

2. Please don't judge GBIF's goals on the limited achievements of the current data portal (which will be discarded as soon as I can replace it - some time later this year). The aim is certainly to provide clear "trust networks" (all the way back to original sources) and to allow all data to be filtered by such criteria.

3. Right now we are offering access to very little data that I would characterise as "species data". I would have been massively surprised if the current portal were providing you with anything valuable for your work. I'm really looking for guidance from people such as yourself who have clear expectations of how data should be presented for their purposes. Please give your thoughts.

4. I would also say that the level of aggregation that may be appropriate for specimen/observation data (to facilitate rapid search by taxonomic, geospatial and temporal criteria) is unlikely to be the same for many other classes of information. In such cases I expect central services such as GBIF to act much more clearly as brokers that help users find and retrieve information. Again, I'd really like to receive your comments on how e.g. the international pool of SDD data might best be handled.

I'll also comment briefly again on the general subject of RDF, since I get the impression that some people think I (or GBIF) have a secret agenda in this area. As far as I am concerned, the following are the things I would really like to see as elements in the overall TDWG architecture.
Most other things could be worked out in just about any way that will work and I would still be very happy.

1. Clear, well-understood ontology of primary data classes
2. Data modelled as extensible sets of properties for these classes (more Darwin Core-like, rather than as monolithic documents)
3. Modelling in a 'neutral' language such as UML with clearly defined mechanisms for generating actual working representations
4. A well-defined way to represent data in RDF for situations which need it (e.g. LSID metadata)
5. An LSID identifier property available for use with any object
6. A clear path to allow us to use TAPIR queries to perform searches for data objects (much simpler with objects like these than for whole documents)

For number 2 above, RDF or an OWL language would certainly be a good fit, but I know that a Darwin Core-like (GML-like) approach could easily give us what we need and I would be thrilled with any approach that met this criterion.

As a side issue, I'm not sure how easy it really would be for us to use an RDBMS-based approach to support the integration of all of the disparate and relevant information which is (as you say) scattered through so many sources. A world in which anyone can annotate any data element would seem much more suitable.

I'd also like to emphasise that the choice of a TDWG representation for data (XML schema, RDF, whatever) should serve the needs of data exchange and will not necessarily be the appropriate way to store the data at either end (any more than we would expect all collection databases to have a flat table which looks like Darwin Core). The database you construct to support efficient queries need not be the same as the one that I construct, or the object model inside someone else's application. The critical issue is how easy it is for two parties to exchange the set of objects and properties that they wish to share.
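Points 2, 4 and 5 above, together with the distinction between exchange format and local storage, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `DataObject` class, the property names, and the example LSID are hypothetical, and the triples are plain Python tuples rather than output from any real RDF library.

```python
from dataclasses import dataclass, field


@dataclass
class DataObject:
    """An object of some primary data class with an open-ended property set."""
    lsid: str                      # hypothetical LSID used as the identifier
    data_class: str                # e.g. "Specimen"; class names are assumed
    properties: dict = field(default_factory=dict)

    def to_triples(self):
        """Exchange view: one (subject, predicate, object) tuple per property."""
        triples = [(self.lsid, "type", self.data_class)]
        triples += [(self.lsid, name, value)
                    for name, value in self.properties.items()]
        return triples

    def to_row(self, columns):
        """Storage view: a flat row for whatever columns the receiver keeps."""
        return tuple(self.properties.get(column) for column in columns)


specimen = DataObject(
    lsid="urn:lsid:example.org:specimens:12345",   # made-up identifier
    data_class="Specimen",
    properties={"scientificName": "Puma concolor", "country": "Chile"},
)
```

The same object yields triples for exchange and a flat row for a local table, so neither party has to store the data in the shape in which it was exchanged; a property the receiver does not know about is simply ignored, and one it expects but does not receive comes back as `None`.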
Thanks,

Donald

---------------------------------------------------------------
Donald Hobern (dhobern@gbif.org)
Programme Officer for Data Access and Database Interoperability
Global Biodiversity Information Facility Secretariat
Universitetsparken 15, DK-2100 Copenhagen, Denmark
Tel: +45-35321483   Mobile: +45-28751483   Fax: +45-35321480
---------------------------------------------------------------

-----Original Message-----
From: Tdwg-tag-bounces@lists.tdwg.org [mailto:Tdwg-tag-bounces@lists.tdwg.org] On Behalf Of Gregor Hagedorn
Sent: 30 March 2006 19:29
To: Tdwg-tag@lists.tdwg.org
Subject: [Tdwg-tag] Triple store debate ...

Roger wrote:
No one would claim to design a conventional relational structure (or perhaps generate one from an XML Schema?) that was guaranteed to perform equally quickly for any arbitrary query. All real world relational schemas are optimized for a particular purpose. When people
I would not expect a triple store to behave well for all possible queries. However, my feeling is that a triple store in an RDBMS will very soon hit a ceiling. Essentially it is an "overnormalized" model, i.e. the parts of an entity are considered independent entities themselves. With an RDBMS I can say (based on experience more than theory) that a good relational model holds for an astonishingly wide range of queries. That is, I only exceptionally "optimize" the model for queries (largely by adding additional indices), and my experience is that the query optimizer (exactly because I do not tell it how to solve the query) works well with rather unexpected queries.
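The "overnormalized" point can be made concrete with a small sketch that stores the same records twice in SQLite, once as a conventional table and once as a generic triple table. The table names, column names and sample records are invented for illustration and come from no actual GBIF or TDWG schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE specimen (id TEXT PRIMARY KEY, taxon TEXT, country TEXT, year INTEGER);
    CREATE TABLE triple (subject TEXT, predicate TEXT, object TEXT);
""")

# Made-up sample records, loaded into both layouts.
records = [
    ("s1", "Puma concolor", "Chile", 1998),
    ("s2", "Puma concolor", "Peru", 2003),
    ("s3", "Lynx lynx", "Norway", 2001),
]
conn.executemany("INSERT INTO specimen VALUES (?, ?, ?, ?)", records)
for sid, taxon, country, year in records:
    conn.executemany(
        "INSERT INTO triple VALUES (?, ?, ?)",
        [(sid, "taxon", taxon), (sid, "country", country), (sid, "year", str(year))],
    )


def query_relational(taxon, country):
    """One table, one WHERE clause; the optimizer picks the access path."""
    rows = conn.execute(
        "SELECT id FROM specimen WHERE taxon = ? AND country = ?",
        (taxon, country),
    )
    return sorted(row[0] for row in rows)


def query_triples(taxon, country):
    """Each extra criterion costs another self-join on the triple table."""
    rows = conn.execute(
        """
        SELECT t1.subject
        FROM triple AS t1
        JOIN triple AS t2 ON t2.subject = t1.subject
        WHERE t1.predicate = 'taxon' AND t1.object = ?
          AND t2.predicate = 'country' AND t2.object = ?
        """,
        (taxon, country),
    )
    return sorted(row[0] for row in rows)
```

Both functions return the same identifiers, but the triple version needs one self-join per property involved, and every value is stored as text, so typed comparisons (for instance on `year`) need extra casting.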
As an example: Donald would be mad to build the GBIF data portal as a basic triple store because 90% of queries are currently going to be the same i.e. by taxon name and geographical area. ...
I agree that any hypothetical RDF-based GBIF portal is a bad candidate for implementation over a triple store. To build something like a GBIF portal, I'd use a design that is a combination of an indexer and an aggregator. In my mind, an aggregator is a piece of software that harvests RDF from providers and stores it in a persistent cache. An indexer either harvests RDF or scans an aggregator's cache in order to build indexes that can be used to rapidly answer simple queries. ...
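The aggregator/indexer split described above might be sketched as follows. This is only an illustration under stated assumptions: the `Aggregator` and `Indexer` classes are hypothetical, providers are modelled as plain callables returning (subject, predicate, object) tuples, and the cache is an in-memory dict rather than a persistent store.

```python
from collections import defaultdict


class Aggregator:
    """Harvests triples from providers and keeps them in a cache."""

    def __init__(self):
        self.cache = {}  # provider name -> list of harvested triples

    def harvest(self, name, provider):
        # 'provider' is any callable returning an iterable of triples.
        self.cache[name] = list(provider())


class Indexer:
    """Scans an aggregator's cache and indexes a few chosen predicates."""

    def __init__(self, predicates):
        self.predicates = set(predicates)
        self.index = defaultdict(set)  # (predicate, value) -> {(provider, subject)}

    def scan(self, aggregator):
        for name, triples in aggregator.cache.items():
            for subject, predicate, obj in triples:
                if predicate in self.predicates:
                    self.index[(predicate, obj)].add((name, subject))

    def lookup(self, **criteria):
        """Return (provider, subject) pairs matching every criterion."""
        hits = [self.index.get(item, set()) for item in criteria.items()]
        return set.intersection(*hits) if hits else set()
```

A lookup such as `indexer.lookup(taxon="Puma concolor", country="Chile")` is answered from the index alone, without touching the cached triples. Keeping the provider name alongside each subject means every hit can still be traced back to its source.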
This discussion covers exactly the one-way-street use case of data for which I can see that RDF is good. However, in the case of taxonomic and descriptive data, would it not be absurd to be unable to build upon existing data? The underlying assumption in GBIF seems to be that knowledge is institutionalized, and that adding knowledge is done only within the institution. I believe that this is true for specimens, and that it may be desirable for it to become true for names. However, these examples are rather the exception (and essentially boring infrastructure cases - no-one is interested in specimens or names per se). The assumption does not hold true for the truly relevant forms of knowledge on biology, including species concepts, taxonomic revisions, organism properties and interactions, and identification. Here knowledge has for centuries been expressed in short articles or monographs. The goal of many of us (SDD and others) is to move this into digital form, and to become able to share this knowledge. That means that I must be able to write software that fully captures all data in a document - and a triple store seems to be the only way to handle RDF-based data.

---

Personally I think that the current data-provider/data-aggregator mode of GBIF is already counterproductive in the case of species data. In my scientific work I am unable to make use of GBIF information, because it no longer corresponds to interpretable trust networks, but refers to uninterpretable aggregators and superaggregators like Species 2000.

---

This is not to say that RDF is not the right choice for exactly the knowledge documents that TAxMLit, SDD and others want to share (i.e. edit on two sides, with different, TDWG-standard compatible software...). But I am worried that the generality of RDF *may* make it impossible to limit the number of options, choices etc. that software has to deal with, and that it ONLY allows super-generalized software like reasoning engines and triple stores.
Gregor
----------------------------------------------------------
Gregor Hagedorn (G.Hagedorn@bba.de)
Institute for Plant Virology, Microbiology, and Biosafety
Federal Research Center for Agriculture and Forestry (BBA)
Königin-Luise-Str. 19           Tel: +49-30-8304-2220
14195 Berlin, Germany           Fax: +49-30-8304-2203
_______________________________________________
Tdwg-tag mailing list
Tdwg-tag@lists.tdwg.org
http://lists.tdwg.org/mailman/listinfo/tdwg-tag_lists.tdwg.org