[Tdwg-tag] TCS in RDF for use in LSIDs and possible generic mechanism.
Steven Perry
smperry at ku.edu
Thu Mar 30 10:46:39 CEST 2006
Hi Roger,
Thanks for the excellent reply. You raise some serious concerns and
touch on some of the issues that Robert Gales, Dave Vieglais, and I have
been investigating.
Roger Hyam wrote:
>
> Steve,
>
> I find the triple store debate interesting because the goal posts seem
> to shift.
Perhaps the goal posts seem to shift because I haven't clearly made my
point. We've been envisioning many different services that might
participate in a semantic network. These include but aren't limited to
providers, indexers, aggregators/mirrors, LSID authorities, analysis
services and translation services. In addition to these we see a bunch
of web and desktop applications that use these services. Only some of
these services and applications are good candidates for implementation
over a triple store. I'll address that more fully in a minute. Despite
the fact that not all services will be backed by them, we see triple
stores playing an important role.
> No one would claim to design a conventional relational structure (or
> perhaps generate one from an XML Schema?) that was guaranteed to
> perform equally quickly for any arbitrary query. All real world
> relational schemas are optimized for a particular purpose. When people
> talk about triple stores they expect to be able to ask them *anything
> *and get a responsive answer - when that is never expected of
> relational db.
>
I certainly never meant to imply that any arbitrary query could be
guaranteed answerable in a reasonable amount of time. As you point out,
it's well known that this is not true of either relational databases
that use SQL or of triple stores that use SPARQL. However, since SPARQL
query-enabled triple stores may play a significant role in some of the
services I listed above, I'm interested in measuring relative
performance of different triple store implementations. Where we decide
to use them, we ought to use the fastest stores available.
> As an example: Donald would be mad to build the GBIF data portal as a
> basic triple store because 90% of queries are currently going to be
> the same i.e. by taxon name and geographical area. Even if there is a
> triple store out the back some place one is going to want to optimize
> for the most common queries. If you want to ask something weird you
> are going to have to wait!
I agree that any hypothetical RDF-based GBIF portal is a bad candidate
for implementation over a triple store. To build something like a GBIF
portal, I'd use a design that is a combination of an indexer and an
aggregator. In my mind, an aggregator is a piece of software that
harvests RDF from providers and stores it in a persistent cache. An
indexer either harvests RDF or scans an aggregator's cache in order to
build indexes that can be used to rapidly answer simple queries. Index
services don't provide a SPARQL query interface so they don't need to be
implemented over a triple store. Instead an index service could be
backed by a field-enabled on-disk inverted index. This is the same
technology that backs search engines and *it is very different from a
general purpose database*. So an index service is a kind of search
engine. A hypothetical GBIF index service could be designed to index
only the scientific name and geography fields of the concise bounded
descriptions (metadata objects) that represent specimens.
The query interface for an indexer is much like a search engine's,
except that the query string can be a boolean expression of
field/query-term pairs. For example:
class:"DarwinCoreSpecimen" AND scientificName:"Anthophora linsleyi" AND
country:"US"
Also like a search engine, the result of an index service query is a
list of URIs. With an indexer these URIs are the LSIDs of the matching
CBDs (metadata objects). If the client of a hypothetical GBIF index
service wants RDF returned instead of a list of LSIDs, it can fetch the
corresponding RDF chunks from the aggregator or resolve each LSID
against its authority.
This is a scalable design built using well-understood search engine
technology. It allows one to perform simple searches very rapidly, but
it has a downside that I'll address below.
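To make the indexer design above concrete, here is a minimal sketch of a field-aware inverted index answering the kind of boolean field/term query shown earlier. The record fields, LSIDs, and function names are hypothetical illustrations, not any real DiGIR2 or GBIF interface; a production index service would use an on-disk engine rather than in-memory dictionaries.

```python
from collections import defaultdict

# Minimal field-aware inverted index: maps (field, term) -> set of LSIDs.
index = defaultdict(set)

def add_record(lsid, fields):
    """Index each field/value pair of a metadata record (CBD)."""
    for field, value in fields.items():
        index[(field, value)].add(lsid)

def query(*pairs):
    """AND together (field, term) pairs; returns the matching LSIDs."""
    results = [index[pair] for pair in pairs]
    return set.intersection(*results) if results else set()

add_record("urn:lsid:example.org:specimen:1",
           {"class": "DarwinCoreSpecimen",
            "scientificName": "Anthophora linsleyi",
            "country": "US"})
add_record("urn:lsid:example.org:specimen:2",
           {"class": "DarwinCoreSpecimen",
            "scientificName": "Anthophora linsleyi",
            "country": "MX"})

# The example query from the text: all three terms must match.
hits = query(("class", "DarwinCoreSpecimen"),
             ("scientificName", "Anthophora linsleyi"),
             ("country", "US"))
print(sorted(hits))  # ['urn:lsid:example.org:specimen:1']
```

As in the text, the result is a list of LSIDs; a client wanting RDF would then fetch the corresponding chunks from the aggregator or resolve each LSID.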
>
> Enabling a client to ask an arbitrary query of a data store with no
> knowledge of the underlying structure (only the semantics) and
> guaranteeing response times seems, to me, to be a general problem of
> any system - whether we are in a triple based world or a XML Schema
> based one. It also seems to be one we don't need to answer.
> I imagine that the people who are looking at optimizing triple stores
> are looking at using the queries to build 'clever' indexes that amount
> to separate tables for triple patterns that occur regularly a little
> like MS SQL Server does with regular indexes. But then this is just me
> speculating.
> We have to accept that data publishers are only going to offer a
> limited range of queries of their data. Complex queries have to be
> answered by gathering a subset of data (probably from several
> publishers) locally or in a grid and then querying that in interesting
> ways. Triple stores would be great for this local cache as it will be
> smaller and can sit in memory etc. The way to get data into these
> local caches is by making sure the publishers supply it in RDF using
> common vocabularies - even if they don't care a fig about RDF and are
> just using an XML Schema which has an RDF mapping.
>
I understand where you're coming from here and I sympathize with your
design to keep the barrier-to-entry low for data providers. However, I
hope we can aim higher than this, for several reasons.
The first is pragmatic: if data providers offer only limited query
capabilities, then they become more difficult to use. As an example,
imagine that a data provider serves specimens and supports queries by
taxonomic rank (genus, species, or subspecies) as well as by
country. If I want to get all specimens collected in Kansas I have to
query for and download all specimens for the US and then filter out
results from the other 49 states before I can do anything meaningful
with the data. Likewise, if I want to get all specimens for a
particular order like Hymenoptera, then I'm forced to do several queries
for genera that I think are under it, then aggregate the result and
filter out any false positives before I can actually use the data. To
sum up this first point, providing only minimal query capabilities on
providers can increase the number of queries that must be made to
perform a search and can lead to excessive traffic, not to mention
inconvenience to the users of providers. This is one of the criticisms
that also applies to an index service; if a field that you're interested
in is not indexed, then you can't query on it.
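The over-fetching problem in the Kansas example can be sketched in a few lines. Everything here is hypothetical: fetch_specimens() stands in for a provider query interface that only accepts a country, and the records are invented.

```python
# Hypothetical client forced to over-fetch and filter locally because the
# provider supports queries by country but not by state/province.
def fetch_specimens(country):
    # Stand-in for a network call; returns every record for the country.
    return [
        {"lsid": "urn:lsid:example.org:specimen:1",
         "country": country, "stateProvince": "Kansas"},
        {"lsid": "urn:lsid:example.org:specimen:2",
         "country": country, "stateProvince": "Texas"},
    ]

# One huge query for all US specimens, then the other 49 states' worth
# of data is downloaded only to be thrown away on the client side.
kansas = [r for r in fetch_specimens("US")
          if r.get("stateProvince") == "Kansas"]
print(len(kansas))  # 1
```

The provider pays to transmit the whole country's records; the client pays to filter them. A provider that accepted a stateProvince term would avoid both costs.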
You might argue that no one would restrict searches to only those three
ranks. That argument brings me to my second point. When you place
restrictions on the *type* of queries supported by a provider it's quite
easy to inadvertently prevent a large number of simple but useful
queries. The two example queries above (state = Kansas, order =
Hymenoptera) are both simple searches yet they were not directly allowed
by the provider. If the reason behind restricting queries is to lighten
the load on providers and enable rapid response times, then both goals
failed in the examples above: the client was frustrated because she had
to find an alternative way to get the desired data, and the provider
still incurred the expense of transmitting a huge amount of data in the
first case and of handling many separate queries in the second.
To me, the major benefit of RDF is that it gives us a flexible set of
data structures that can be extended and expanded over time. RDF makes
it easier to interoperate with new data models within our domain such as
descriptive data and taxon concepts. It could also make it easier to
integrate our data with data sets from other disciplines such as
geology, physical oceanography, or climatology. One consequence of this
is that we can't reliably know now what queries will be most beneficial
in the future. Right now the low hanging fruit is specimen queries by
taxa and gross locality, but will this be true tomorrow? Building and
deploying general-use data provider software is expensive, but if we
restrict queries on providers then we have to have many different
domain-specific data provider packages (one for specimen, one for names,
etc.). Are we going to disallow the same provider from serving both
specimens and descriptive data at the same time? I'd hate to see us
restrict queries to a limited set now only to find that we have to
change this set in the future.
So how do we guarantee reasonable performance on providers? I don't
think we need to. Performance is only one criterion for queries. The
other two important ones are precision and recall. If users of a
particular application think performance is more important than
precision, then the application should be configured to use an index
service. This is a great solution for portal-style browsing
applications. If, on the other hand, the application is designed to
collect data for an analysis such as niche modeling, then precision is
much more important than performance and it should be allowed to pose
complicated queries to providers.
I think what's important is making sure that providers do not suffer
inadvertent denial of service attacks by overeager clients posing too
many simultaneous long-running queries. I think the best way to prevent
this is to allow providers to limit the number of threads they will
service and limit the time they will allocate to any single query (to
say 5 minutes). If they can't respond in that time, then they ought to
send the HTTP 408 timeout response. This is a moderate level of
protection for providers that also allows us the flexibility to pose new
queries over expanded data models in the future without rewriting or
redeploying code.
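The protection scheme described above can be sketched with a bounded worker pool and a per-query time budget. This is an assumption-laden illustration, not DiGIR2 code: MAX_THREADS, run_query(), and handle_request() are invented names, and a real provider would wire this into its HTTP stack.

```python
import concurrent.futures

# A provider services at most MAX_THREADS queries at once and abandons
# any query that exceeds its time budget, answering with HTTP 408.
MAX_THREADS = 4
TIMEOUT_SECONDS = 300  # the five-minute budget suggested above

pool = concurrent.futures.ThreadPoolExecutor(max_workers=MAX_THREADS)

def run_query(q):
    # Hypothetical stand-in for the provider's real query engine.
    return {"status": 200, "body": f"results for {q}"}

def handle_request(q):
    future = pool.submit(run_query, q)
    try:
        return future.result(timeout=TIMEOUT_SECONDS)
    except concurrent.futures.TimeoutError:
        future.cancel()  # stop waiting on the slow query
        return {"status": 408, "body": "Request Timeout"}

print(handle_request("country=US")["status"])  # 200
```

Queries beyond the thread limit simply queue, and a runaway query costs the provider at most one worker for five minutes, which is the moderate level of protection argued for above.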
Sorry for the long-winded response. In part I wanted to get these ideas
out before the TAG meeting. At the meeting I'd be happy to present some
of these ideas in more depth (with pretty diagrams) and perhaps stage a
demonstration of a prototype provider (DiGIR2), a prototype index
service, and a prototype application that uses them.
-Steve
> Can we make a separation between the use of RDF/S for transfer, for
> query and for storage or are these things only in my mind?
>
> Thanks for your input on this,
>
> Roger
>
>
> Steven Perry wrote:
>
>>Bob,
>>
>>This is interesting stuff. I don't know what claims Oracle is making
>>for its triple store, but there are many other database-backed triple
>>stores out there and I've examined several of them in depth.
>>
>>With most of the current crop of triple stores (including Jena's which
>>we use in DiGIR2), triples are stored in a single de-normalized table
>>that has columns for subject, predicate, and object. This table is
>>heavily indexed to allow quick lookups, for example, to find statements
>>with a given subject and predicate. The difficult thing in measuring
>>triple store performance is not raw throughput (how fast triples can be
>>loaded or listed), but is instead their performance on queries. With
>>many of the triple stores I've examined, raw throughput is limited only
>>by how fast the underlying database is at performing SQL inserts and
>>selects. Granted there is some overhead in the RDF framework that sits
>>atop the database, but performance for insertions and basic retrievals
>>is dominated by the underlying database.
>>
>>With sophisticated queries the story is quite different. For a long
>>time every triple store had its own query language. Now that the world
>>is starting to standardize on SPARQL I hope to see a standard set of
>>SPARQL-based metrics that will allow query performance comparisons to be
>>made across triple store implementations. SPARQL is very powerful and
>>allows a large variety of useful queries. However, much of SPARQL cannot
>>be pushed down into SQL. This puts any triple store designed to work
>>over a relational database at risk of having to load all triples into
>>memory for examination by the RDF framework in order to answer
>>sophisticated SPARQL queries. The simplest example of such a query is
>>one that uses the filter(regex()) pattern, because most relational
>>databases cannot evaluate the XPath matches() function on which
>>SPARQL's regex() is based.
>>
>>I hope to have more information about Oracle's performance claims soon
>>and I'll share them with the list when I get them.
>>
>>-Steve
>>
>>
>>
>>Bob Morris wrote:
>>
>>
>>
>>>http://www.franz.com/resources/educational_resources/white_papers/AllegroCache_RDF_Dobbs2006.pdf
>>>is a rather interesting piece about RDF scalability. They claim to load
>>>300,000 triples/sec from a triple store based on Allegro Common Lisp.
>>>
>>>Allegro CL is also at the heart of Ora Lassila's Wilbur toolkit. OINK, a
>>>way cool new Wilbur application is described at
>>>http://www.lassila.org/blog/archive/2006/03/oink.html. [Does Wilbur run
>>>on the free version of Allegro?]. Wilbur loads 2600 triples/sec. Lassila
>>>is generally regarded, with Hendler and Berners-Lee, as one of the
>>>founders of the Semantic Web.
>>>
>>>[Bill Campbell, a colleague of mine and author of UMB-Scheme distributed
>>>with RedHat Linux, once remarked "XML is just Lisp with pointy
>>>brackets". The above might support: "RDF is just CLOS with pointy
>>>brackets". Which, by the way, is positive.]
>>>
>>>Does anyone know what triple retrieval claims Oracle is making for its
>>>triple store support?
>>>
>>>There is a good current survey of RDF programming support at
>>>http://www.wiwiss.fu-berlin.de/suhl/bizer/toolkits/
>>>
>>>--Bob
>>>
>>>_______________________________________________
>>>Tdwg-tag mailing list
>>>Tdwg-tag at lists.tdwg.org
>>>http://lists.tdwg.org/mailman/listinfo/tdwg-tag_lists.tdwg.org
>>>
>>>
>>>
>>>
>>
>>
>>
>>
>>
>
>
>--
>
>-------------------------------------
> Roger Hyam
> Technical Architect
> Taxonomic Databases Working Group
>-------------------------------------
> http://www.tdwg.org
> roger at tdwg.org
> +44 1578 722782
>-------------------------------------
>
>