Hi Roger,
Thanks for the excellent reply. You raise some serious concerns and touch on some of the issues that Robert Gales, Dave Vieglais, and I have been investigating.
Roger Hyam wrote:
Steve,
I find the triple store debate interesting because the goal posts seem to shift.
Perhaps the goal posts seem to shift because I haven't clearly made my point. We've been envisioning many different services that might participate in a semantic network. These include but aren't limited to providers, indexers, aggregators/mirrors, LSID authorities, analysis services and translation services. In addition to these we see a bunch of web and desktop applications that use these services. Only some of these services and applications are good candidates for implementation over a triple store. I'll address that more fully in a minute. Despite the fact that not all services will be backed by them, we see triple stores playing an important role.
No one would claim to design a conventional relational structure (or perhaps generate one from an XML Schema?) that was guaranteed to perform equally quickly for any arbitrary query. All real-world relational schemas are optimized for a particular purpose. When people talk about triple stores they expect to be able to ask them *anything* and get a responsive answer - when that is never expected of a relational db.
I certainly never meant to imply that any arbitrary query could be guaranteed answerable in a reasonable amount of time. As you point out, it's well known that this is not true of either relational databases that use SQL or of triple stores that use SPARQL. However, since SPARQL query-enabled triple stores may play a significant role in some of the services I listed above, I'm interested in measuring relative performance of different triple store implementations. Where we decide to use them, we ought to use the fastest stores available.
As an example: Donald would be mad to build the GBIF data portal as a basic triple store because 90% of queries are currently going to be the same, i.e. by taxon name and geographical area. Even if there is a triple store out the back someplace, one is going to want to optimize for the most common queries. If you want to ask something weird, you are going to have to wait!
I agree that any hypothetical RDF-based GBIF portal is a bad candidate for implementation over a triple store. To build something like a GBIF portal, I'd use a design that is a combination of an indexer and an aggregator. In my mind, an aggregator is a piece of software that harvests RDF from providers and stores it in a persistent cache. An indexer either harvests RDF or scans an aggregator's cache in order to build indexes that can be used to rapidly answer simple queries. Index services don't provide a SPARQL query interface, so they don't need to be implemented over a triple store. Instead, an index service could be backed by a field-enabled on-disk inverted index. This is the same technology that backs search engines and *it is very different from a general-purpose database*. So an index service is a kind of search engine. A hypothetical GBIF index service could be designed to index only the scientific name and geography fields of the concise bounded descriptions (metadata objects) that represent specimens.
The query interface for an indexer is much like a search engine except that the query string can be a boolean expression of field/query-term pairs. For example, class:"DarwinCoreSpecimen" AND scientificName:"Anthophora linsleyi" AND country:"US"
Also like a search engine, the result of an index service query is a list of URIs. With an indexer these URIs are the LSIDs of the matching CBDs (metadata objects). If the client of a hypothetical GBIF index service wants RDF returned instead of a list of LSIDs, it can fetch the corresponding RDF chunks from the aggregator or resolve each LSID against its authority.
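To make that concrete, here's a rough sketch (in Java, with hypothetical class and field names; a real index service would sit on an on-disk engine such as Lucene rather than in-memory maps) of how the boolean field/term query above maps onto simple set intersections over LSIDs:

    import java.util.*;

    // Minimal sketch of a field-enabled inverted index: each (field, term) pair
    // maps to the set of LSIDs whose CBD contains that value.
    public class SpecimenIndex {

        private final Map<String, Map<String, Set<String>>> index = new HashMap<>();

        // Called while scanning the aggregator's cache of harvested RDF.
        public void add(String lsid, String field, String term) {
            index.computeIfAbsent(field, f -> new HashMap<>())
                 .computeIfAbsent(term.toLowerCase(), t -> new HashSet<>())
                 .add(lsid);
        }

        // Boolean AND of field/term pairs, e.g. class=DarwinCoreSpecimen,
        // scientificName=Anthophora linsleyi, country=US; returns matching LSIDs.
        public Set<String> search(Map<String, String> clauses) {
            Set<String> result = null;
            for (Map.Entry<String, String> clause : clauses.entrySet()) {
                Set<String> hits = index
                    .getOrDefault(clause.getKey(), Collections.emptyMap())
                    .getOrDefault(clause.getValue().toLowerCase(), Collections.emptySet());
                if (result == null) {
                    result = new HashSet<>(hits);
                } else {
                    result.retainAll(hits);   // intersect: AND semantics
                }
            }
            return result == null ? Collections.emptySet() : result;
        }
    }

Calling search() with the three field/term pairs from the example above would return the LSIDs of the matching specimen records, which the client can then resolve or hand to the aggregator.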
This is a scalable design built using well-understood search engine technology. It allows one to perform simple searches very rapidly, but it has a downside that I'll address below.
Enabling a client to ask an arbitrary query of a data store with no knowledge of the underlying structure (only the semantics) and guaranteeing response times seems, to me, to be a general problem of any system - whether we are in a triple based world or a XML Schema based one. It also seems to be one we don't need to answer.
I imagine that the people who are looking at optimizing triple stores are looking at using the queries to build 'clever' indexes that amount to separate tables for triple patterns that occur regularly a little like MS SQL Server does with regular indexes. But then this is just me speculating.
We have to accept that data publishers are only going to offer a limited range of queries of their data. Complex queries have to be answered by gathering a subset of data (probably from several publishers) locally or in a grid and then querying that in interesting ways. Triple stores would be great for this local cache as it will be smaller and can sit in memory etc. The way to get data into these local caches is by making sure the publishers supply it in RDF using common vocabularies - even if they don't care a fig about RDF and are just using an XML Schema which has an RDF mapping.
I understand where you're coming from here and I sympathize with your desire to keep the barrier to entry low for data providers. However, I hope we can aim higher than this, for several reasons.
The first is pragmatic: if data providers offer only limited query capabilities, they become more difficult to use. As an example, imagine that a data provider serves specimens and supports queries by taxonomic rank (genus, species, or subspecies) as well as by country. If I want all specimens collected in Kansas, I have to query for and download all specimens for the US and then filter out results from the other 49 states before I can do anything meaningful with the data. Likewise, if I want all specimens for a particular order like Hymenoptera, I'm forced to run several queries for genera that I think fall under it, then aggregate the results and filter out any false positives before I can actually use the data. To sum up this first point, providing only minimal query capabilities on providers can increase the number of queries needed to perform a search and can lead to excessive traffic, not to mention inconvenience for the users of providers. The same criticism applies to an index service: if a field you're interested in is not indexed, you can't query on it.
You might argue that no one would restrict searches to only those three ranks. That argument brings me to my second point. When you place restrictions on the *type* of queries supported by a provider it's quite easy to inadvertently prevent a large number of simple but useful queries. The two example queries above (state = Kansas, order = Hymenoptera) are both simple searches yet they were not directly allowed by the provider. If the reason behind restricting queries is to lighten the load on providers and enable rapid response times, then, in the examples above, the client was frustrated (because she had to find an alternative method to get the desired data) and the goal was not met because the provider incurred the expense of transmitting a huge amount of data in the first case and handling many separate queries in the second.
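To illustrate what I mean by letting clients pose the query they actually want, here's roughly what the Kansas example could look like as a SPARQL query run through Jena (the toolkit behind DiGIR2; package names below are those of current Apache Jena). The dwc: namespace and property names are placeholders rather than an agreed vocabulary, and the empty model merely stands in for a provider's store:

    import org.apache.jena.query.Query;
    import org.apache.jena.query.QueryExecution;
    import org.apache.jena.query.QueryExecutionFactory;
    import org.apache.jena.query.QueryFactory;
    import org.apache.jena.query.ResultSet;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;

    public class KansasQueryExample {
        public static void main(String[] args) {
            // Placeholder vocabulary -- not an agreed TDWG namespace.
            String sparql =
                "PREFIX dwc: <http://example.org/darwincore#> " +
                "SELECT ?specimen ?name WHERE { " +
                "  ?specimen dwc:stateProvince \"Kansas\" . " +
                "  ?specimen dwc:scientificName ?name . " +
                "}";

            Model model = ModelFactory.createDefaultModel();   // stands in for the provider's data
            Query query = QueryFactory.create(sparql);
            try (QueryExecution qe = QueryExecutionFactory.create(query, model)) {
                ResultSet results = qe.execSelect();
                while (results.hasNext()) {
                    System.out.println(results.next().get("specimen"));
                }
            }
        }
    }

The point is not the syntax but that the client, not the provider, decides which properties to constrain.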
To me, the major benefit of RDF is that it gives us a flexible set of data structures that can be extended and expanded over time. RDF makes it easier to interoperate with new data models within our domain, such as descriptive data and taxon concepts. It could also make it easier to integrate our data with data sets from other disciplines such as geology, physical oceanography, or climatology. One consequence of this is that we can't reliably know now what queries will be most beneficial in the future. Right now the low-hanging fruit is specimen queries by taxa and gross locality, but will this be true tomorrow? Building and deploying general-use data provider software is expensive, but if we restrict queries on providers then we end up needing many different domain-specific data provider packages (one for specimens, one for names, etc.). Are we going to disallow the same provider from serving both specimens and descriptive data at the same time? I'd hate to see us restrict queries to a limited set now only to find that we have to change this set in the future.
So how do we guarantee reasonable performance on providers? I don't think we need to. Performance is only one criterion for queries; the other two important ones are precision and recall. If users of a particular application think performance is more important than precision, then the application should be configured to use an index service. This is a great solution for portal-style browsing applications. If, on the other hand, the application is designed to collect data for an analysis such as niche modeling, then precision is much more important than performance and it should be allowed to pose complicated queries to providers.
I think what's important is making sure that providers do not suffer inadvertent denial-of-service attacks from overeager clients posing too many simultaneous long-running queries. The best way to prevent this is to let providers limit the number of query threads they will service and cap the time they will allocate to any single query (say, 5 minutes). If they can't respond in that time, they ought to send an HTTP 408 (Request Timeout) response. This is a moderate level of protection for providers that still allows us the flexibility to pose new queries over expanded data models in the future without rewriting or redeploying code.
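A minimal sketch of that protection, assuming a plain Java provider front end (all names here are hypothetical, not DiGIR2 code): a small fixed worker pool caps concurrency, and a timed Future.get() abandons queries that exceed their budget so the provider can answer with 408 instead:

    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    public class QueryGuard {

        // Assumed limit: at most four long-running queries serviced at once.
        private final ExecutorService workers = Executors.newFixedThreadPool(4);

        /** Returns the result, or null if the caller should send an HTTP 408 response. */
        public String run(Callable<String> query, long timeoutSeconds) throws Exception {
            Future<String> pending = workers.submit(query);
            try {
                return pending.get(timeoutSeconds, TimeUnit.SECONDS);   // e.g. 300 for 5 minutes
            } catch (TimeoutException tooSlow) {
                pending.cancel(true);   // give up on the long-running query
                return null;
            }
        }
    }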
Sorry for the long-winded response. In part I wanted to get these ideas out before the TAG meeting. At the meeting I'd be happy to present some of these ideas in more depth (with pretty diagrams) and perhaps stage a demonstration of a prototype provider (DiGIR2), a prototype index service, and a prototype application that uses them.
-Steve
Can we make a separation between the use of RDF/S for transfer, for query and for storage or are these things only in my mind?
Thanks for your input on this,
Roger
Steven Perry wrote:
Bob,
This is interesting stuff. I don't know what claims Oracle is making for its triple store, but there are many other database-backed triple stores out there and I've examined several of them in depth.
With most of the current crop of triple stores (including Jena's, which we use in DiGIR2), triples are stored in a single de-normalized table that has columns for subject, predicate, and object. This table is heavily indexed to allow quick lookups, for example, to find statements with a given subject and predicate. The difficult thing in measuring triple store performance is not raw throughput (how fast triples can be loaded or listed) but their performance on queries. With many of the triple stores I've examined, raw throughput is limited only by how fast the underlying database is at performing SQL inserts and selects. Granted, there is some overhead in the RDF framework that sits atop the database, but performance for insertions and basic retrievals is dominated by the underlying database.
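For anyone who hasn't looked inside one of these stores, here is the kind of schema I mean, sketched against an in-memory HSQLDB via JDBC (illustrative only; real stores add node and literal tables and more indexes, and the database choice and column sizes here are just assumptions):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class TripleTableSketch {
        public static void main(String[] args) throws Exception {
            // In-memory HSQLDB purely for illustration (needs the HSQLDB jar on the classpath).
            Connection con = DriverManager.getConnection("jdbc:hsqldb:mem:triples", "sa", "");
            try (Statement st = con.createStatement()) {
                // One wide, de-normalized statement table...
                st.executeUpdate("CREATE TABLE triples ("
                    + " subject   VARCHAR(500)  NOT NULL,"
                    + " predicate VARCHAR(500)  NOT NULL,"
                    + " object    VARCHAR(2000) NOT NULL)");
                // ...made usable by heavy indexing on the common access patterns.
                st.executeUpdate("CREATE INDEX sp_idx ON triples (subject, predicate)");
                st.executeUpdate("CREATE INDEX po_idx ON triples (predicate, object)");
            }
            con.close();
        }
    }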
With sophisticated queries the story is quite different. For a long time every triple store had its own query language. Now that the world is starting to standardize on SPARQL, I hope to see a standard set of SPARQL-based metrics that will allow query performance comparisons to be made across triple store implementations. SPARQL is very powerful and allows a large variety of useful queries. However, much of SPARQL cannot be pushed down into SQL queries. This puts any triple store built over a relational database at risk of having to load all triples into memory for examination by the RDF framework in order to answer sophisticated SPARQL queries. The simplest example of such a query is one that uses the filter(regex()) pattern, because most relational databases cannot evaluate the XPath fn:matches regular-expression function on which SPARQL's regex() is based.
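Here's the sort of query I mean, again with a made-up predicate (dc:title is just a stand-in). It parses fine, but a database-backed store generally has to pull the candidate triples up into the RDF framework to evaluate the regular expression:

    import org.apache.jena.query.Query;
    import org.apache.jena.query.QueryFactory;

    public class RegexFilterExample {
        public static void main(String[] args) {
            String sparql =
                "PREFIX dc: <http://purl.org/dc/elements/1.1/> " +
                "SELECT ?doc WHERE { " +
                "  ?doc dc:title ?title . " +
                "  FILTER regex(?title, \"bee\", \"i\") " +
                "}";
            // Most relational back ends cannot evaluate the regex for us,
            // so the filter runs in the RDF layer over the fetched triples.
            Query query = QueryFactory.create(sparql);
            System.out.println(query);
        }
    }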
I hope to have more information about Oracle's performance claims soon and I'll share them with the list when I get them.
-Steve
Bob Morris wrote:
http://www.franz.com/resources/educational_resources/white_papers/AllegroCac... is a rather interesting piece about RDF scalability. They claim to load 300,000 triples/sec from a triple store based on Allegro Common Lisp.
Allegro CL is also at the heart of Ora Lassila's Wilbur toolkit. OINK, a way cool new Wilbur application, is described at http://www.lassila.org/blog/archive/2006/03/oink.html. [Does Wilbur run on the free version of Allegro?] Wilbur loads 2,600 triples/sec. Lassila is generally regarded, along with Hendler and Berners-Lee, as one of the founders of the Semantic Web.
[Bill Campbell, a colleague of mine and author of UMB-Scheme distributed with RedHat Linux, once remarked "XML is just Lisp with pointy brackets". The above might support: "RDF is just CLOS with pointy brackets". Which, by the way, is positive.]
Does anyone know what triple retrieval claims Oracle is making for its triple store support?
There is a good current survey of RDF programming support at http://www.wiwiss.fu-berlin.de/suhl/bizer/toolkits/
--Bob
--
Roger Hyam
Technical Architect
Taxonomic Databases Working Group
http://www.tdwg.org roger@tdwg.org
+44 1578 722782