Gregor,
Thanks as ever for your thoughts. I think that (at the risk of digressing too far) I should make a few comments on the points at which your comments relate to GBIF.
1. I repeat what I have said many times before. TDWG must not optimise its standards either for what GBIF is today or for what we may like it to become if such optimisation will in any way be harmful for other purposes. We need a good information architecture which will support the widest possible range of applications.
2. Please don't judge GBIF's goals on the limited achievements of the current data portal (which will be discarded as soon as I can replace it - some time later this year). The aim is certainly to provide clear "trust networks" (all the way back to original sources) and to allow all data to be filtered by such criteria.
3. Right now we are offering access to very little data that I would characterise as "species data". I would have been massively surprised if the current portal were providing you with anything valuable for your work. I'm really looking for guidance from people such as yourself who have clear expectations of how data should be presented for their purposes. Please give your thoughts.
4. I would also say that the level of aggregation that may be appropriate for specimen/observation data (to facilitate rapid search by taxonomic, geospatial and temporal criteria) is unlikely to be the same for many other classes of information. In such cases I expect central services such as GBIF to act much more clearly as brokers that help users find and retrieve information. Again I'd really like to receive your comments on how e.g. the international pool of SDD data might best be handled.
I'll also comment briefly again on the general subject of RDF, since I get the impression that some people think I (or GBIF) have a secret agenda in this area.
As far as I am concerned, the following are the things I would really like to see as elements in the overall TDWG architecture. Most other things could be worked out in just about any way that will work and I would still be very happy.
1. Clear, well-understood ontology of primary data classes
2. Data modelled as extensible sets of properties for these classes (more Darwin Core-like, rather than as monolithic documents)
3. Modelling in a 'neutral' language such as UML, with clearly defined mechanisms for generating actual working representations
4. A well-defined way to represent data in RDF for situations which need it (e.g. LSID metadata)
5. An LSID identifier property available for use with any object
6. A clear path to allow us to use TAPIR queries to perform searches for data objects (much simpler with objects like these than for whole documents)
For number 2 above, RDF or an OWL language would certainly be a good fit, but I know that a Darwin Core-like (GML-like) approach could easily give us what we need and I would be thrilled with any approach that met this criterion.
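To make number 2 (together with numbers 4 and 5) concrete, here is a minimal sketch in Python with rdflib; the vocabulary namespace, the class and property names and the LSID are all invented for illustration, not proposed terms:

# A record expressed as an open-ended set of properties attached to an
# identified object, serializable as RDF when a situation calls for it.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

TDWG = Namespace("http://example.org/tdwg/terms/")        # hypothetical vocabulary
record = URIRef("urn:lsid:example.org:specimens:12345")   # number 5: LSID as identifier

g = Graph()
g.add((record, RDF.type, TDWG.SpecimenRecord))            # number 1: a primary data class
g.add((record, TDWG.scientificName, Literal("Abies alba")))
g.add((record, TDWG.country, Literal("Denmark")))
# Extensibility: further properties can be attached without changing any schema.
g.add((record, TDWG.collectorNotes, Literal("Collected at forest edge")))

print(g.serialize(format="turtle"))                       # number 4: RDF where needed

The same set of properties could equally be carried in a Darwin Core-style XML document; the point is only that the record is an identified object with an open set of properties rather than a monolithic document.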
As a side issue, I'm not sure how easy it really would be for us to use an RDBMS-based approach to support the integration of all of the disparate and relevant information which is (as you say) scattered through so many sources. A world in which anyone can annotate any data element would seem much more suitable.
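To illustrate what I mean, a short sketch (again Python with rdflib, with the same invented names): a third party can annotate someone else's record simply by asserting new statements about its identifier, without any change to the original provider's database.

from rdflib import Graph, Literal, Namespace, URIRef

TDWG = Namespace("http://example.org/tdwg/terms/")        # hypothetical vocabulary
record = URIRef("urn:lsid:example.org:specimens:12345")   # identifier minted elsewhere

annotation = Graph()                                      # lives in the annotator's own store
annotation.add((record, TDWG.georeferenceRemark,
                Literal("Coordinates corrected from the original label")))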
I'd also like to emphasise that the choice of a TDWG representation for data (XML schema, RDF, whatever) should serve the needs of data exchange and will not necessarily be the appropriate way to store the data at either end (any more than we would expect all collection databases to have a flat table which looks like Darwin Core). The database you construct to support efficient queries need not be the same as the one that I construct, or the object model inside someone else's application. The critical issue is how easy it is for two parties to exchange the set of objects and properties that they wish to share.
Thanks,
Donald
---------------------------------------------------------------
Donald Hobern (dhobern@gbif.org)
Programme Officer for Data Access and Database Interoperability
Global Biodiversity Information Facility Secretariat
Universitetsparken 15, DK-2100 Copenhagen, Denmark
Tel: +45-35321483   Mobile: +45-28751483   Fax: +45-35321480
---------------------------------------------------------------
-----Original Message-----
From: Tdwg-tag-bounces@lists.tdwg.org [mailto:Tdwg-tag-bounces@lists.tdwg.org] On Behalf Of Gregor Hagedorn
Sent: 30 March 2006 19:29
To: Tdwg-tag@lists.tdwg.org
Subject: [Tdwg-tag] Triple store debate ...
Roger wrote:
No one would claim to design a conventional relational structure (or perhaps generate one from an XML Schema?) that was guaranteed to perform equally quickly for any arbitrary query. All real world relational schemas are optimized for a particular purpose. When people ...
I would not expect a triple store to behave well for all possible queries. However, my feeling is that a triple store in an RDBMS will very soon hit the ceiling. Essentially it is an "overnormalized" model, i.e. the parts of an entity are treated as independent entities themselves.
With RDBMSs (based on experience more than theory) I can say that a good relational model holds up for an astonishingly wide range of queries. That is, I only exceptionally "optimize" the model for queries (largely by adding additional indices), and my experience is that the query optimizer (exactly because I do not tell it how to solve the query) works well even with rather unexpected queries.
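To make the "overnormalized" point concrete, a deliberately small sketch (Python with SQLite; table and column names are invented for illustration): the query that is a single WHERE clause on a conventional table becomes a chain of self-joins on a generic triple table, one join per filter criterion.

import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Conventional relational form: one row per specimen record.
cur.execute("CREATE TABLE specimen (id TEXT, taxon TEXT, country TEXT)")
cur.execute("INSERT INTO specimen VALUES ('s1', 'Abies alba', 'DK')")

# Triple-store form: every property of the record becomes its own row.
cur.execute("CREATE TABLE triple (subject TEXT, predicate TEXT, object TEXT)")
cur.executemany("INSERT INTO triple VALUES (?, ?, ?)",
                [("s1", "taxon", "Abies alba"), ("s1", "country", "DK")])

# Two filter criteria already require a self-join; n criteria require n joined copies.
print(cur.execute("""
    SELECT t1.subject
    FROM triple t1
    JOIN triple t2 ON t2.subject = t1.subject
    WHERE t1.predicate = 'taxon'   AND t1.object = 'Abies alba'
      AND t2.predicate = 'country' AND t2.object = 'DK'
""").fetchall())

# The conventional table answers the same question with a single WHERE clause.
print(cur.execute(
    "SELECT id FROM specimen WHERE taxon = 'Abies alba' AND country = 'DK'"
).fetchall())

With many criteria (or many millions of triples) the optimizer has far less structure to work with than in a well-designed relational model.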
As an example: Donald would be mad to build the GBIF data portal as a basic triple store because 90% of queries are currently going to be the same i.e. by taxon name and geographical area. ...
I agree that any hypothetical RDF-based GBIF portal is a bad candidate for implementation over a triple store. To build something like a GBIF portal, I'd use a design that is a combination of an indexer and an aggregator. In my mind, an aggregator is a piece of software that harvests RDF from providers and stores it in a persistent cache. An indexer either harvests RDF or scans an aggregator's cache in order to build indexes that can be used to rapidly answer simple queries. ...
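As a sketch of that aggregator/indexer split (Python with rdflib; the provider URL and the predicate used for indexing are invented): the aggregator caches harvested RDF unchanged, while the indexer scans the cache to build a simple lookup by name.

from collections import defaultdict
from rdflib import Graph, URIRef

PROVIDERS = ["http://provider.example.org/records.rdf"]          # hypothetical endpoint
NAME = URIRef("http://example.org/tdwg/terms/scientificName")    # hypothetical term

cache = {}                   # aggregator: provider URL -> harvested graph, kept verbatim
index = defaultdict(set)     # indexer: scientific name -> record URIs

def harvest(url):
    """Aggregator step: fetch the provider's RDF and store it unchanged."""
    g = Graph()
    g.parse(url)             # rdflib retrieves and parses the document
    cache[url] = g

def build_index():
    """Indexer step: scan the cache and note which subjects carry which names."""
    for g in cache.values():
        for subject, _, name in g.triples((None, NAME, None)):
            index[str(name)].add(subject)

# for url in PROVIDERS: harvest(url)
# build_index()
# A portal query by name is then a dictionary lookup, while the full RDF
# stays available in the cache for detailed retrieval.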
This discussion is exactly the one-way-street use case for data which I can see RDF being good for. However, in the case of taxonomic and descriptive data, would it not be absurd to be unable to build upon existing data?
The underlying assumption in GBIF seems to be that knowledge is institutionalized, and adding knowledge is done only within the institution. I believe that this is true for specimens, and may be desirable to become true for names. However, these examples are rather the exception (and essentially boring infrastructure cases - no-one is interested in specimens or names per se).
The assumption does not hold true for the truly relevant forms of knowledge about biology, including species concepts, taxonomic revisions, organism properties and interactions, and identification. Here knowledge has for centuries been expressed in short articles or monographs. The goal of many of us (SDD and others) is to move this into digital form and to become able to share this knowledge. That means that I must be able to write software that fully captures all data in a document - and a triple store seems to be the only way to handle RDF-based data.
--- Personally I think that the current data-provider/data-aggregator mode of GBIF is already counterproductive in the case of species data. In my scientific work I am unable to make use of GBIF information, because it no longer corresponds to interpretable trust networks, but refers to uninterpretable aggregators and superaggregators like Species 2000. ---
This is not to say that RDF is not the right choice for exactly the knowledge documents that taXMLit, SDD and others want to share (i.e. edit on two sides, with different, TDWG-standard-compatible software...). But I am worried that the generality of RDF *may* make it impossible to limit the number of options, choices etc. that software has to deal with, and may ONLY allow super-generalized software like reasoning engines and triple stores.
Gregor
----------------------------------------------------------
Gregor Hagedorn (G.Hagedorn@bba.de)
Institute for Plant Virology, Microbiology, and Biosafety
Federal Research Center for Agriculture and Forestry (BBA)
Königin-Luise-Str. 19           Tel: +49-30-8304-2220
14195 Berlin, Germany           Fax: +49-30-8304-2203