Re: [Tdwg-tag] Why should data providers supply search and query services?

28 Mar 2006

I haven't examined Amazon's offerings, but we have been looking into two 
different approaches that also use Information Retrieval (search engine) 
technologies for serving and searching RDF.  The first is Google Base 
and the second is a custom server built on top of the Apache Lucene 
search engine.  Overall I think the Information Retrieval (IR) approach 
is very fast and elegant and provides an alternative to SPARQL queries 
for locating RDF-based biodiversity data. 

One difficulty, however, is integrating the IR approach into a 
web-service oriented architecture.  Another is that, while the IR 
approach yields search results much faster than SPARQL queries do, it 
introduces the classic IR problems of precision (are the results 
relevant?) and recall (are all relevant results retrieved?).
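
To make the two measures concrete, here is a minimal sketch of how 
precision and recall could be computed for a single query.  The LSIDs 
and the relevance judgements are made up for illustration; they are not 
taken from either of the systems mentioned above.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: identifiers the search engine returned
    relevant:  identifiers a human judged to be relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical identifiers, for illustration only.
retrieved = [
    "urn:lsid:example.org:specimen:1",
    "urn:lsid:example.org:specimen:2",
    "urn:lsid:example.org:specimen:3",
    "urn:lsid:example.org:specimen:4",
]
relevant = [
    "urn:lsid:example.org:specimen:1",
    "urn:lsid:example.org:specimen:2",
    "urn:lsid:example.org:specimen:5",
]

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Two of the four results are relevant (precision 0.5), and two of the 
three relevant records were found (recall about 0.67).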

-Steve

Roderic Page wrote:
...
A belated comment on Roger's question about search at the start of the  
month. I think we could look at OpenSearch (http://opensearch.a9.com/),  
which is a simple format for searching. It provides a standard way to  
describe a search engine, and tags to add to the results (which are  
formatted in RSS or Atom). If providers output RSS 1.0 containing RDF  
(which will be trivial to do if they've already got LSIDs working),  
then for minimal effort basic searching can be supported.
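
As a rough illustration of what such a response might look like, the 
sketch below builds a small result feed carrying the OpenSearch 1.1 
response elements (totalResults, startIndex, itemsPerPage).  It uses 
plain RSS 2.0 rather than RSS 1.0 with RDF to keep the example short, 
and the function name, titles, and example.org URLs are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace of the OpenSearch 1.1 response elements.
OPENSEARCH_NS = "http://a9.com/-/spec/opensearch/1.1/"
ET.register_namespace("opensearch", OPENSEARCH_NS)


def build_search_feed(query, hits, total, start_index=1):
    """Serialise (title, link) hits as an RSS 2.0 feed with OpenSearch tags."""
    rss = ET.Element("rss", {"version": "2.0"})
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = f"Search results for {query}"
    ET.SubElement(channel, "link").text = "http://example.org/search"
    ET.SubElement(channel, "description").text = "Hypothetical provider search"

    # OpenSearch paging information for the result set.
    ET.SubElement(channel, f"{{{OPENSEARCH_NS}}}totalResults").text = str(total)
    ET.SubElement(channel, f"{{{OPENSEARCH_NS}}}startIndex").text = str(start_index)
    ET.SubElement(channel, f"{{{OPENSEARCH_NS}}}itemsPerPage").text = str(len(hits))

    for title, link in hits:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "link").text = link
    return ET.tostring(rss, encoding="unicode")


# Hypothetical hits, for illustration only.
hits = [
    ("Puma concolor specimen 1", "http://example.org/specimen/1"),
    ("Puma concolor specimen 2", "http://example.org/specimen/2"),
]
feed = build_search_feed("Puma concolor", hits, total=2)
print(feed)
```

Because the output is an ordinary feed, a news reader can display it 
without knowing anything about the OpenSearch elements.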
Long term, more specialised searches would be highly desirable, but  
this is a quick way to get stuff up and running that is also  
discoverable and usable by others (e.g., Amazon's A9 search engine).  
OpenSearch seems to be gaining momentum; Microsoft's IE 7 supports it,  
for example (see the link on my blog  
http://iphylo.blogspot.com/2006/03/opensearch-and-ie7.html). Given that  
search results are RSS, people can also view search results using news  
feed reading software, hence in effect make their own biodiversity  
information aggregators.
Last year I played with an early version of OpenSearch and used it to  
wrap the Taxonomic Search Engine, an image database we're working on,  
and the LSUMZ's mammal collection (these no longer work as we've not  
updated the search description to OpenSearch 1.1).
I think this is a very easy way for providers to make their data  
available with minimal effort, and potentially lots of benefits. Again,  
I'd stress that we need to be more aware of what is going on in the  
outside world, rather than focussing on solutions specific to our  
problems.
Regards
Rod
On 1 Mar 2006, at 14:43, Roger Hyam wrote:
...
This is a slightly more controversial question that has been  
suggested:
"Why should data providers supply search and query services?"
• We have many potential data providers (potentially every  
collection and institution).
• We have many potential data consumers (potentially every  
researcher with a laptop).
• We have a few potential data indexers (GBIF, ORBIS, etc. + others  
to come).
The implementation burden should therefore be:
• Light for the providers - whose role is to conserve data and  
physical objects.
• Light for the consumer - whose role is to do research, not mess  
with data handling.
• Heavy for the indexers - whose core business is making the data  
accessible.
Data providers should give the objects they curate GUIDs. This is  
important because it stamps their ownership (and responsibility) on  
that piece of data. They then need to run an LSID service that serves  
the (meta)data for the objects they own. Their work should stop at  
this point! They should not have to implement search and query  
services. They should not anticipate what people will require by way  
of data access - that is a separate function.
Data consumers should be able to access indexing services that pool  
information from multiple data providers. They should not have to run  
federated queries across multiple data providers or have to discover  
providers as this is complex and difficult (though they may want to  
browse round data providers like they would browse links on web  
pages). Once they have retrieved the GUIDs of the objects they are  
interested in from the indexers they may want to call the data  
providers for more detailed information.
Data indexers should crawl the data exposed by the providers and  
index them in thematic ways, e.g. by providing geographic or  
taxon-focused services. This is a complex job as it involves doing  
clever, innovative things with data and optimization of searches etc.
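
A minimal sketch of the indexer side of this split, assuming the  
indexer has already harvested (meta)data records from several  
providers: it builds a simple taxon index that maps a name to the  
LSIDs of matching objects, which a consumer could then resolve at the  
owning provider. The record fields ("lsid", "taxon", "provider") and  
the LSIDs are hypothetical.

```python
from collections import defaultdict


def build_taxon_index(harvested):
    """Map a lower-cased taxon name to the set of LSIDs carrying that name."""
    index = defaultdict(set)
    for record in harvested:
        index[record["taxon"].lower()].add(record["lsid"])
    return index


# Hypothetical records pooled from two providers.
harvested = [
    {"lsid": "urn:lsid:a.example.org:specimen:1",
     "taxon": "Puma concolor", "provider": "A"},
    {"lsid": "urn:lsid:b.example.org:specimen:7",
     "taxon": "Puma concolor", "provider": "B"},
    {"lsid": "urn:lsid:b.example.org:specimen:8",
     "taxon": "Lynx rufus", "provider": "B"},
]

index = build_taxon_index(harvested)

# A consumer asks the indexer once and gets GUIDs from both providers.
puma_lsids = sorted(index["puma concolor"])
print(puma_lsids)
```

A real indexer would need far more than this (name reconciliation,  
geographic indexing, ranking), which is why the burden sits with the  
indexers rather than the providers.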
Currently we are trying to make every data provider support searching  
and querying when the consumers aren't really interested in querying  
or searching individual providers - they want to search thematically  
across providers.
If a big data provider wants to provide search and query then they  
can set themselves up as both a provider and an indexer - which is  
more or less what everyone is forced to do now - but the functions are  
separate.
Data providers would have to implement a little more than just an  
LSID resolver service for this to work. They would need to provide a  
single web service method (URL call) that allowed indexers to get  
lists of LSIDs they hold that have had their (meta)data modified since  
a certain date, but this would be a relatively simple thing compared  
with providing arbitrary query facilities.
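
The logic behind such a method could be as small as the sketch below.  
It assumes the provider keeps a last-modified date for each LSID; the  
function name, the in-memory dictionary, and the LSIDs are  
hypothetical, and the web-service wrapper around the function is left  
out. Treating "since" as inclusive is also just one possible choice.

```python
from datetime import date


def lsids_modified_since(records, since):
    """Return the LSIDs whose (meta)data changed on or after `since`."""
    return sorted(lsid for lsid, modified in records.items()
                  if modified >= since)


# Hypothetical provider holdings: LSID -> date of last modification.
records = {
    "urn:lsid:example.org:specimen:1": date(2006, 1, 15),
    "urn:lsid:example.org:specimen:2": date(2006, 3, 2),
    "urn:lsid:example.org:specimen:3": date(2006, 3, 20),
}

# An indexer that last crawled on 1 March 2006 asks what has changed.
changed = lsids_modified_since(records, date(2006, 3, 1))
print(changed)
```

The indexer would then resolve only the returned LSIDs instead of  
re-crawling everything the provider holds.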
I believe (though I haven't done a thorough analysis of log data)  
that this is more or less the situation now. Data providers implement  
complete DiGIR or BioCASE protocols but are only queried in a limited  
way by portal engines. Consumers go directly to portals for their data  
discovery. So why implement full search and query at the data provider  
nodes of the network (possibly the hardest thing we have to do) when  
it may not be used?
This may be controversial. What do you think?
Roger
--
-------------------------------------
Roger Hyam
Technical Architect
Taxonomic Databases Working Group
-------------------------------------
http://www.tdwg.org
roger@tdwg.org
+44 1578 722782
-------------------------------------
_______________________________________________
Tdwg-tag mailing list
Tdwg-tag@lists.tdwg.org
http://lists.tdwg.org/mailman/listinfo/tdwg-tag_lists.tdwg.org
------------------------------------------------------------------------ 
----------------------------------------
Professor Roderic D. M. Page
Editor, Systematic Biology
DEEB, IBLS
Graham Kerr Building
University of Glasgow
Glasgow G12 8QP
United Kingdom
Phone:    +44 141 330 4778
Fax:      +44 141 330 2792
email:    r.page@bio.gla.ac.uk
web:      http://taxonomy.zoology.gla.ac.uk/rod/rod.html
reprints: http://taxonomy.zoology.gla.ac.uk/rod/pubs.html
Subscribe to Systematic Biology through the Society of Systematic
Biologists Website:  http://systematicbiology.org
Search for taxon names: http://darwin.zoology.gla.ac.uk/~rpage/portal/
Find out what we know about a species: http://ispecies.org
Rod's rants on phyloinformatics: http://iphylo.blogspot.com

Steven Perry