Hi Roger,
Thanks for the excellent reply. You raise some serious concerns and touch on some of the issues that Robert Gales, Dave Vieglais, and I have been investigating.
Roger Hyam wrote:
Steve,
I find the triple store debate interesting because the goal posts seem to shift.
Perhaps the goal posts seem to shift because I haven't clearly made my point. We've been envisioning many different services that might participate in a semantic network. These include but aren't limited to providers, indexers, aggregators/mirrors, LSID authorities, analysis services and translation services. In addition to these we see a bunch of web and desktop applications that use these services. Only some of these services and applications are good candidates for implementation over a triple store. I'll address that more fully in a minute. Despite the fact that not all services will be backed by them, we see triple stores playing an important role.
No one would claim to design a conventional relational structure (or perhaps generate one from an XML Schema?) that was guaranteed to perform equally quickly for any arbitrary query. All real-world relational schemas are optimized for a particular purpose. When people talk about triple stores they expect to be able to ask them *anything* and get a responsive answer - when that is never expected of a relational db.
I certainly never meant to imply that any arbitrary query could be guaranteed answerable in a reasonable amount of time. As you point out, it's well known that this is not true of either relational databases that use SQL or of triple stores that use SPARQL. However, since SPARQL query-enabled triple stores may play a significant role in some of the services I listed above, I'm interested in measuring relative performance of different triple store implementations. Where we decide to use them, we ought to use the fastest stores available.
As an example: Donald would be mad to build the GBIF data portal as a basic triple store because 90% of queries are currently going to be the same, i.e. by taxon name and geographical area. Even if there is a triple store out the back someplace, one is going to want to optimize for the most common queries. If you want to ask something weird, you are going to have to wait!
I agree that any hypothetical RDF-based GBIF portal is a bad candidate for implementation over a triple store. To build something like a GBIF portal, I'd use a design that is a combination of an indexer and an aggregator. In my mind, an aggregator is a piece of software that harvests RDF from providers and stores it in a persistent cache. An indexer either harvests RDF or scans an aggregator's cache in order to build indexes that can be used to rapidly answer simple queries. Index services don't provide a SPARQL query interface, so they don't need to be implemented over a triple store. Instead, an index service could be backed by a field-enabled on-disk inverted index. This is the same technology that backs search engines and *it is very different from a general-purpose database*. So an index service is a kind of search engine. A hypothetical GBIF index service could be designed to index only the scientific name and geography fields of the concise bounded descriptions (metadata objects) that represent specimens.
The query interface for an indexer is much like a search engine except that the query string can be a boolean expression of field/query-term pairs. For example, class:"DarwinCoreSpecimen" AND scientificName:"Anthophora linsleyi" AND country:"US"
Also like a search engine, the result of an index service query is a list of URIs. With an indexer these URIs are the LSIDs of the matching CBDs (metadata objects). If the client of a hypothetical GBIF index service wants RDF returned instead of a list of LSIDs, it can fetch the corresponding RDF chunks from the aggregator or resolve each LSID against its authority.
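To make that concrete, here's a rough sketch (in Java, with hypothetical class and field names; a real index service would sit on an on-disk engine such as Lucene rather than in-memory maps) of how the boolean field/term query above maps onto simple set intersections over LSIDs:

    import java.util.*;

    // Minimal sketch of a field-enabled inverted index: each (field, term) pair
    // maps to the set of LSIDs whose CBD contains that value.
    public class SpecimenIndex {

        private final Map<String, Map<String, Set<String>>> index = new HashMap<>();

        // Called while scanning the aggregator's cache of harvested RDF.
        public void add(String lsid, String field, String term) {
            index.computeIfAbsent(field, f -> new HashMap<>())
                 .computeIfAbsent(term.toLowerCase(), t -> new HashSet<>())
                 .add(lsid);
        }

        // Boolean AND of field/term pairs, e.g. class=DarwinCoreSpecimen,
        // scientificName=Anthophora linsleyi, country=US; returns matching LSIDs.
        public Set<String> search(Map<String, String> clauses) {
            Set<String> result = null;
            for (Map.Entry<String, String> clause : clauses.entrySet()) {
                Set<String> hits = index
                    .getOrDefault(clause.getKey(), Collections.emptyMap())
                    .getOrDefault(clause.getValue().toLowerCase(), Collections.emptySet());
                if (result == null) {
                    result = new HashSet<>(hits);
                } else {
                    result.retainAll(hits);   // intersect: AND semantics
                }
            }
            return result == null ? Collections.emptySet() : result;
        }
    }

Calling search() with the three field/term pairs from the example above would return the LSIDs of the matching specimen records, which the client can then resolve or hand to the aggregator.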
This is a scalable design built using well-understood search engine technology. It allows one to perform simple searches very rapidly, but it has a downside that I'll address below.
Enabling a client to ask an arbitrary query of a data store with no knowledge of the underlying structure (only the semantics) and guaranteeing response times seems, to me, to be a general problem of any system - whether we are in a triple based world or a XML Schema based one. It also seems to be one we don't need to answer.
I imagine that the people who are looking at optimizing triple stores are looking at using the queries to build 'clever' indexes that amount to separate tables for triple patterns that occur regularly a little like MS SQL Server does with regular indexes. But then this is just me speculating.
We have to accept that data publishers are only going to offer a limited range of queries of their data. Complex queries have to be answered by gathering a subset of data (probably from several publishers) locally or in a grid and then querying that in interesting ways. Triple stores would be great for this local cache as it will be smaller and can sit in memory etc. The way to get data into these local caches is by making sure the publishers supply it in RDF using common vocabularies - even if they don't care a fig about RDF and are just using an XML Schema which has an RDF mapping.
I understand where you're coming from here and I sympathize with your desire to keep the barrier to entry low for data providers. However, I hope we can aim higher than this, for several reasons.
The first is pragmatic: if data providers offer only limited query capabilities, they become more difficult to use. As an example, imagine that a data provider serves specimens and supports queries by taxonomic rank (genus, species, or subspecies) as well as by country. If I want all specimens collected in Kansas, I have to query for and download all specimens for the US and then filter out results from the other 49 states before I can do anything meaningful with the data. Likewise, if I want all specimens for a particular order like Hymenoptera, I'm forced to run several queries for genera that I think fall under it, then aggregate the results and filter out any false positives before I can actually use the data. To sum up this first point, providing only minimal query capabilities on providers can increase the number of queries needed to perform a search and can lead to excessive traffic, not to mention inconvenience for the users of providers. The same criticism applies to an index service: if a field you're interested in is not indexed, you can't query on it.
You might argue that no one would restrict searches to only those three ranks. That argument brings me to my second point. When you place restrictions on the *type* of queries supported by a provider it's quite easy to inadvertently prevent a large number of simple but useful queries. The two example queries above (state = Kansas, order = Hymenoptera) are both simple searches yet they were not directly allowed by the provider. If the reason behind restricting queries is to lighten the load on providers and enable rapid response times, then, in the examples above, the client was frustrated (because she had to find an alternative method to get the desired data) and the goal was not met because the provider incurred the expense of transmitting a huge amount of data in the first case and handling many separate queries in the second.
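To illustrate what I mean by letting clients pose the query they actually want, here's roughly what the Kansas example could look like as a SPARQL query run through Jena (the toolkit behind DiGIR2; package names below are those of current Apache Jena). The dwc: namespace and property names are placeholders rather than an agreed vocabulary, and the empty model merely stands in for a provider's store:

    import org.apache.jena.query.Query;
    import org.apache.jena.query.QueryExecution;
    import org.apache.jena.query.QueryExecutionFactory;
    import org.apache.jena.query.QueryFactory;
    import org.apache.jena.query.ResultSet;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;

    public class KansasQueryExample {
        public static void main(String[] args) {
            // Placeholder vocabulary -- not an agreed TDWG namespace.
            String sparql =
                "PREFIX dwc: <http://example.org/darwincore#> " +
                "SELECT ?specimen ?name WHERE { " +
                "  ?specimen dwc:stateProvince \"Kansas\" . " +
                "  ?specimen dwc:scientificName ?name . " +
                "}";

            Model model = ModelFactory.createDefaultModel();   // stands in for the provider's data
            Query query = QueryFactory.create(sparql);
            try (QueryExecution qe = QueryExecutionFactory.create(query, model)) {
                ResultSet results = qe.execSelect();
                while (results.hasNext()) {
                    System.out.println(results.next().get("specimen"));
                }
            }
        }
    }

The point is not the syntax but that the client, not the provider, decides which properties to constrain.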
To me, the major benefit of RDF is that it gives us a flexible set of data structures that can be extended and expanded over time. RDF makes it easier to interoperate with new data models within our domain, such as descriptive data and taxon concepts. It could also make it easier to integrate our data with data sets from other disciplines such as geology, physical oceanography, or climatology. One consequence of this is that we can't reliably know now what queries will be most beneficial in the future. Right now the low-hanging fruit is specimen queries by taxa and gross locality, but will this be true tomorrow? Building and deploying general-use data provider software is expensive, but if we restrict queries on providers then we end up needing many different domain-specific data provider packages (one for specimens, one for names, etc.). Are we going to disallow the same provider from serving both specimens and descriptive data at the same time? I'd hate to see us restrict queries to a limited set now only to find that we have to change this set in the future.
So how do we guarantee reasonable performance on providers? I don't think we need to. Performance is only one criterion for queries; the other two important ones are precision and recall. If users of a particular application think performance is more important than precision, then the application should be configured to use an index service. This is a great solution for portal-style browsing applications. If, on the other hand, the application is designed to collect data for an analysis such as niche modeling, then precision is much more important than performance and it should be allowed to pose complicated queries to providers.
I think what's important is making sure that providers do not suffer inadvertent denial-of-service attacks from overeager clients posing too many simultaneous long-running queries. The best way to prevent this is to let providers limit the number of query threads they will service and cap the time they will allocate to any single query (say, 5 minutes). If they can't respond in that time, they ought to send an HTTP 408 (Request Timeout) response. This is a moderate level of protection for providers that still allows us the flexibility to pose new queries over expanded data models in the future without rewriting or redeploying code.
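A minimal sketch of that protection, assuming a plain Java provider front end (all names here are hypothetical, not DiGIR2 code): a small fixed worker pool caps concurrency, and a timed Future.get() abandons queries that exceed their budget so the provider can answer with 408 instead:

    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    public class QueryGuard {

        // Assumed limit: at most four long-running queries serviced at once.
        private final ExecutorService workers = Executors.newFixedThreadPool(4);

        /** Returns the result, or null if the caller should send an HTTP 408 response. */
        public String run(Callable<String> query, long timeoutSeconds) throws Exception {
            Future<String> pending = workers.submit(query);
            try {
                return pending.get(timeoutSeconds, TimeUnit.SECONDS);   // e.g. 300 for 5 minutes
            } catch (TimeoutException tooSlow) {
                pending.cancel(true);   // give up on the long-running query
                return null;
            }
        }
    }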
Sorry for the long-winded response. In part I wanted to get these ideas out before the TAG meeting. At the meeting I'd be happy to present some of these ideas in more depth (with pretty diagrams) and perhaps stage a demonstration of a prototype provider (DiGIR2), a prototype index service, and a prototype application that uses them.
-Steve
Can we make a separation between the use of RDF/S for transfer, for query and for storage or are these things only in my mind?
Thanks for your input on this,
Roger
Steven Perry wrote:
Bob,
This is interesting stuff. I don't know what claims Oracle is making for its triple store, but there are many other database-backed triple stores out there and I've examined several of them in depth.
With most of the current crop of triple stores (including Jena's, which we use in DiGIR2), triples are stored in a single de-normalized table that has columns for subject, predicate, and object. This table is heavily indexed to allow quick lookups, for example, to find statements with a given subject and predicate. The difficult thing in measuring triple store performance is not raw throughput (how fast triples can be loaded or listed) but their performance on queries. With many of the triple stores I've examined, raw throughput is limited only by how fast the underlying database is at performing SQL inserts and selects. Granted, there is some overhead in the RDF framework that sits atop the database, but performance for insertions and basic retrievals is dominated by the underlying database.
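For anyone who hasn't looked inside one of these stores, here is the kind of schema I mean, sketched against an in-memory HSQLDB via JDBC (illustrative only; real stores add node and literal tables and more indexes, and the database choice and column sizes here are just assumptions):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class TripleTableSketch {
        public static void main(String[] args) throws Exception {
            // In-memory HSQLDB purely for illustration (needs the HSQLDB jar on the classpath).
            Connection con = DriverManager.getConnection("jdbc:hsqldb:mem:triples", "sa", "");
            try (Statement st = con.createStatement()) {
                // One wide, de-normalized statement table...
                st.executeUpdate("CREATE TABLE triples ("
                    + " subject   VARCHAR(500)  NOT NULL,"
                    + " predicate VARCHAR(500)  NOT NULL,"
                    + " object    VARCHAR(2000) NOT NULL)");
                // ...made usable by heavy indexing on the common access patterns.
                st.executeUpdate("CREATE INDEX sp_idx ON triples (subject, predicate)");
                st.executeUpdate("CREATE INDEX po_idx ON triples (predicate, object)");
            }
            con.close();
        }
    }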
With sophisticated queries the story is quite different. For a long time every triple store had its own query language. Now that the world is starting to standardize on SPARQL, I hope to see a standard set of SPARQL-based metrics that will allow query performance comparisons to be made across triple store implementations. SPARQL is very powerful and allows a large variety of useful queries. However, much of SPARQL cannot be pushed down into SQL queries. This puts any triple store built over a relational database at risk of having to load all triples into memory for examination by the RDF framework in order to answer sophisticated SPARQL queries. The simplest example of such a query is one that uses the filter(regex()) pattern, because most relational databases cannot evaluate the XPath fn:matches regular-expression function on which SPARQL's regex() is based.
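Here's the sort of query I mean, again with a made-up predicate (dc:title is just a stand-in). It parses fine, but a database-backed store generally has to pull the candidate triples up into the RDF framework to evaluate the regular expression:

    import org.apache.jena.query.Query;
    import org.apache.jena.query.QueryFactory;

    public class RegexFilterExample {
        public static void main(String[] args) {
            String sparql =
                "PREFIX dc: <http://purl.org/dc/elements/1.1/> " +
                "SELECT ?doc WHERE { " +
                "  ?doc dc:title ?title . " +
                "  FILTER regex(?title, \"bee\", \"i\") " +
                "}";
            // Most relational back ends cannot evaluate the regex for us,
            // so the filter runs in the RDF layer over the fetched triples.
            Query query = QueryFactory.create(sparql);
            System.out.println(query);
        }
    }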
I hope to have more information about Oracle's performance claims soon and I'll share them with the list when I get them.
-Steve
Bob Morris wrote:
http://www.franz.com/resources/educational_resources/white_papers/AllegroCac... is a rather interesting piece about RDF scalability. They claim to load 300,000 triples/sec from a triple store based on Allegro Common Lisp.
Allegro CL is also at the heart of Ora Lassila's Wilbur toolkit. OINK, a way cool new Wilbur application, is described at http://www.lassila.org/blog/archive/2006/03/oink.html. [Does Wilbur run on the free version of Allegro?] Wilbur loads 2,600 triples/sec. Lassila is generally regarded, along with Hendler and Berners-Lee, as one of the founders of the Semantic Web.
[Bill Campbell, a colleague of mine and author of UMB-Scheme distributed with RedHat Linux, once remarked "XML is just Lisp with pointy brackets". The above might support: "RDF is just CLOS with pointy brackets". Which, by the way, is positive.]
Does anyone know what triple retrieval claims Oracle is making for its triple store support?
There is a good current survey of RDF programming support at http://www.wiwiss.fu-berlin.de/suhl/bizer/toolkits/
--Bob
--
Roger Hyam
Technical Architect
Taxonomic Databases Working Group
http://www.tdwg.org roger@tdwg.org
+44 1578 722782