[Tdwg-tag] Scalability

Wed Apr 12 14:41:58 CEST 2006

The HP report mentions scalability as a deficiency of RDF. The latest I 
can find on the issue is the report of the 2003 SWAD meeting 
http://www.w3.org/2001/sw/Europe/reports/dev_workshop_report_4/ which 
puts the then state of the art at 40M triples, with typical stores 
supporting around 10M. This leads me to some questions:

1. There must have been substantial improvement in the intervening 2.5 
years. Can someone point me to the current state of large triple 
stores?

2. Are there known indexing techniques that would not require a store 
that holds all the triples germane to a project?

3. What are the estimates of the number of triples that would be needed 
to deal with the domains of interest to TDWG, and how should one try to 
make such estimates? For example, if the current GBIF specimen record service were 
implemented on a triple store, what would be its size and how does one 
estimate this? For descriptive data, one might reasonably expect an 
average of 100 property values per taxon (i.e. the state of 100 
characters), so does this mean 180M triples would be an adequate 
(required?) store for descriptions of 1.8M taxa?
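
To make the arithmetic in question 3 explicit, here is a rough sketch. It assumes one triple per property value, which is my simplification: typing statements, reification, or intermediate nodes would each add triples per taxon. The function name and the overhead parameter are hypothetical, for illustration only.

```python
def estimate_triples(n_subjects, values_per_subject, overhead_per_subject=0):
    """Back-of-envelope triple count.

    Assumes each property value costs exactly one triple, plus an
    optional fixed number of extra triples per subject (for example
    an rdf:type statement). Both are simplifying assumptions.
    """
    return n_subjects * (values_per_subject + overhead_per_subject)


TAXA = 1_800_000           # number of taxa, from the message
CHARACTERS_PER_TAXON = 100  # average property values per taxon, from the message
STATE_OF_ART_2003 = 40_000_000  # largest store cited in the 2003 workshop report

total = estimate_triples(TAXA, CHARACTERS_PER_TAXON)
print(total)                      # 180000000
print(total / STATE_OF_ART_2003)  # 4.5
```

Under these assumptions the descriptive data alone would be 4.5 times the 40M figure cited for 2003, and any per-taxon overhead pushes the total higher.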

If scalability is rational, then for SDD there is an irony that the 
advantages cited in the HP paper are a good match to the problems of 
descriptive data, while at the same time the disadvantages are 
debilitating. Those in Section 7.2 (OWL Expressivity limitations) are 
mostly quite important to some of the current things SDD expresses, but 
the workarounds suggested might not be onerous. The inability to 
describe continuous properties may be more problematic, but discretizing 
continuous properties---which is frequently done anyway---might be 
accepted by descriptive data users if the benefits were high. The 
inability to compare the size of butterfly wings to the size of raptor 
wings might bring limitations that would not interfere with 95% of the 
uses of descriptive data. What's unclear to me at the moment is whether 
working around the cross-slot constraint on OWL entails a lot of 
reification in order to talk about collections of properties, which is 
quite fundamental to descriptions of taxa, and, if so, whether this 
pushes an OWL DL ontology into OWL Full or worse, possibly removing the 
principal benefit---reasoning---from the matter.
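
As a small illustration of the discretizing workaround, the sketch below maps a continuous measurement onto named states. The boundaries, labels, and function name are invented for the example and are not taken from SDD or the HP report.

```python
from bisect import bisect_right


def discretize(value, boundaries, labels):
    """Map a continuous value to a discrete state.

    `boundaries` must be sorted ascending; a value equal to a boundary
    falls into the higher state. There must be exactly one more label
    than there are boundaries.
    """
    if len(labels) != len(boundaries) + 1:
        raise ValueError("need exactly len(boundaries) + 1 labels")
    return labels[bisect_right(boundaries, value)]


# Hypothetical wing-length states in millimetres, per taxonomic group.
BUTTERFLY_BOUNDS = [20.0, 60.0]
RAPTOR_BOUNDS = [300.0, 600.0]
STATES = ["small", "medium", "large"]

print(discretize(45.0, BUTTERFLY_BOUNDS, STATES))   # medium
print(discretize(450.0, RAPTOR_BOUNDS, STATES))     # medium
```

Both calls yield "medium" although the measurements differ tenfold, which is the butterfly versus raptor limitation in miniature: once the bins are group-specific, the states are no longer comparable across groups.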

Bob

Roger Hyam wrote:
> Bob (inspired by Damian) asked me to post this for him as he is away 
> from his email account just now.
> 
> http://www.hpl.hp.com/techreports/2005/HPL-2005-189.pdf
> is touted on some blogs as extremely balanced.
> 
> Looks good to me too - though not gone through it in detail yet.
> 
> Roger
> 
> 

-- 
Robert A. Morris
Professor of Computer Science
UMASS-Boston
http://www.cs.umb.edu/~ram
phone (+1)617 287 6466