[Tdwg-tag] Why should data providers supply search and query services?

1 Mar 2006

Here is a somewhat more controversial question that has been suggested:

"Why should data providers supply search and query services?"

    * We have many potential data providers (potentially every
      collection and institution).
    * We have many potential data consumers (potentially every
      researcher with a laptop).
    * We have a few potential data indexers (GBIF, ORBIS, etc., plus
      others to come).

The implementation burden should therefore be:

    * Light for the providers - whose role is to conserve data and
      physical objects.
    * Light for the consumer - whose role is to do research, not mess
      with data handling.
    * Heavy for the indexers - whose core business is making the data
      accessible.

Data providers should give the objects they curate GUIDs. This is 
important because it stamps their ownership (and responsibility) on that 
piece of data. They then need to run an LSID service that serves the 
(meta)data for the objects they own. *Their work should stop at this 
point!* They should not have to implement search and query services. 
They should not anticipate what people will require by way of data 
access - that is a separate function.

Data consumers should be able to access indexing services that pool 
information from multiple data providers. They should not have to run 
federated queries across multiple data providers or have to discover 
providers as this is complex and difficult (though they may want to 
browse round data providers like they would browse links on web pages). 
Once they have retrieved the GUIDs of the objects they are interested in 
from the indexers they may want to call the data providers for more 
detailed information.

Data indexers should crawl the data exposed by the providers and index 
it in thematic ways, e.g. by providing geographic or taxon-focused 
services. This is a complex job as it involves doing clever, innovative 
things with data, optimizing searches, etc.
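
As a rough illustration of the indexer's job described above, here is a 
minimal sketch in Python of pooling records from several providers into 
a taxon-focused index. Everything in it is an assumption made for 
illustration: the function name, the "lsid" and "taxon" field names, the 
example LSIDs and the in-memory lists standing in for data that a real 
indexer would fetch from each provider's LSID service.

```python
from collections import defaultdict


def build_taxon_index(providers):
    """Pool records from several providers into a taxon -> LSIDs index.

    `providers` maps a (hypothetical) provider name to a list of metadata
    dicts, standing in for whatever each provider's LSID service returns.
    """
    index = defaultdict(set)
    for records in providers.values():
        for record in records:
            # Normalise case so the same name from two providers matches.
            index[record["taxon"].lower()].add(record["lsid"])
    return index


# Made-up example data from two imaginary providers.
providers = {
    "herbarium-a": [
        {"lsid": "urn:lsid:a.example.org:specimens:1", "taxon": "Quercus robur"},
        {"lsid": "urn:lsid:a.example.org:specimens:2", "taxon": "Fagus sylvatica"},
    ],
    "museum-b": [
        {"lsid": "urn:lsid:b.example.org:specimens:9", "taxon": "quercus robur"},
    ],
}

index = build_taxon_index(providers)
```

A consumer would then ask the index for "quercus robur", get back LSIDs 
from both providers, and only contact the providers themselves if more 
detail on a particular object were needed.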

Currently we are trying to make every data provider support searching 
and querying when the consumers aren't really interested in querying or 
searching individual providers - they want to search thematically across 
providers.

If a big data provider wants to provide search and query then they can 
set themselves up as both a provider and an indexer - which is more or 
less what everyone is forced to do now - but the functions are separate.

Data providers would have to implement a little more than just an LSID 
resolver service for this to work. They would need to provide a single 
web service method (URL call) that allowed indexers to get lists of the 
LSIDs they hold whose (meta)data has been modified since a certain 
date, but this would be a relatively simple thing compared with 
providing arbitrary query facilities.
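
To give a feel for how small that extra method could be, here is a 
minimal sketch in Python of the logic behind such a call. It is only an 
illustration under assumptions: the function name, the in-memory 
dictionary of LSIDs and timestamps, and the example LSIDs are all made 
up, and treating "since" as inclusive is a choice that a real 
specification would need to settle.

```python
from datetime import datetime, timezone

# Hypothetical store: LSID -> time its (meta)data was last modified.
RECORDS = {
    "urn:lsid:example.org:specimens:1": datetime(2006, 1, 10, tzinfo=timezone.utc),
    "urn:lsid:example.org:specimens:2": datetime(2006, 2, 20, tzinfo=timezone.utc),
    "urn:lsid:example.org:specimens:3": datetime(2006, 2, 28, tzinfo=timezone.utc),
}


def lsids_modified_since(records, since):
    """Return, in sorted order, the LSIDs modified at or after `since`."""
    return sorted(lsid for lsid, modified in records.items() if modified >= since)


# An indexer that last crawled on 1 Feb 2006 would ask for changes since then.
changed = lsids_modified_since(RECORDS, datetime(2006, 2, 1, tzinfo=timezone.utc))
```

In a real provider the dictionary would be a query against the 
collection database and the function would sit behind a single URL, but 
the provider would still not need to support arbitrary queries.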

I believe (though I haven't done a thorough analysis of log data) that 
this is more or less the situation now. Data providers implement 
complete DiGIR or BioCASE protocols but are only queried in a limited 
way by portal engines. Consumers go directly to portals for their data 
discovery. So why implement full search and query at the data provider 
nodes of the network (possibly the hardest thing we have to do) when it 
may not be used?

This may be controversial. What do you think?

Roger

-- 

-------------------------------------
 Roger Hyam
 Technical Architect
 Taxonomic Databases Working Group
-------------------------------------
 http://www.tdwg.org
 roger@tdwg.org
 +44 1578 722782
-------------------------------------