Re: [Tdwg-tag] Why should data providers supply search and query services?

3 Mar 2006


      Bob Morris wrote:
...
Umm...there is a distinguishable class of data consumers, namely 
applications, and so a distinguishable constituency whose burden is 
relevant, namely application writers. Some applications may well be 
motivated to query providers directly for a number of reasons, including:
* the data indexers' currency policies may be unsuitable
This applies equally to data providers. They may not index data in the way 
the consumer requires, and their index may lag behind their own live data set.
...
* the data indexers may aggregate in undesirable ways [the present
      model seems to be that indexer==portal, but I doubt that is general]
Ditto for the point above. Data suppliers may index in undesirable ways, 
and they might index heterogeneously - each supplier may be undesirable 
in a different way - which would be a really big headache. Is there 
anything to say this will cause less of a burden when spread across many 
providers rather than a few indexers? If a thematic indexer doesn't do 
what is required then it may be possible to get something changed. If 50 
suppliers don't index something correctly then it will no doubt take 
years to get any changes effected - especially if they are all doing it 
wrong differently.
...
* the data indexers may index too promiscuously or not
      promiscuously enough for the application's taste [this might be
      a non-issue if there were a way for a machine to understand what
      exactly the indexing strategy is and perhaps how to induce the
      indexer to alter it, but that sounds hard]
Again ditto. If providers are also indexers then any criticism of 
indexing applies equally to the suppliers, but is magnified by 
the number of suppliers.
...
* portals, and maybe indexers---indeed, any processor of the
      data---can intentionally or inadvertently hide assumptions about
      how the data will be used, making it unsuited for uses that
      don't meet these assumptions. Put another way, it is probably
      difficult to ensure that a machine-enforceable contract is
      possible between aggregators and applications that assures the
      application that records obtained from the aggregator are
      identical to those available from the provider. I think it is
      even a deep problem to have machine-understandable "fitness for
      use" metadata that would allow a machine to understand what
      fitness contract the aggregator is actually offering.
I would assume that the aggregator is assembling metadata (in the sense 
of things that can be searched on) rather than actual data. The 
aggregator/indexer is really only providing a GUID discovery service. 
The consumer can always retrieve the original objects from the data 
supplier. The aggregator/indexer is only providing a matchmaking service.
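
To make the matchmaking idea concrete, here is a minimal sketch. It assumes the indexer holds only searchable metadata mapped to GUIDs and that the consumer goes back to the supplier for the original object. The LSIDs, field names and in-memory dictionaries are invented for illustration; a real supplier would be reached through its LSID service.

```python
# Hypothetical authoritative records held by the data supplier.
SUPPLIER = {
    "urn:lsid:example.org:specimens:1": {
        "taxon": "Quercus robur", "collector": "A. Smith", "notes": "bark sample"},
    "urn:lsid:example.org:specimens:2": {
        "taxon": "Quercus robur", "collector": "B. Jones", "notes": "leaf sample"},
    "urn:lsid:example.org:specimens:3": {
        "taxon": "Fagus sylvatica", "collector": "A. Smith", "notes": "seedling"},
}

# Hypothetical indexer: (field, value) -> LSIDs. It stores no full records.
INDEX = {
    ("taxon", "Quercus robur"): ["urn:lsid:example.org:specimens:1",
                                 "urn:lsid:example.org:specimens:2"],
    ("taxon", "Fagus sylvatica"): ["urn:lsid:example.org:specimens:3"],
}

def discover(field, value):
    """Indexer side: return the GUIDs that match, nothing more."""
    return list(INDEX.get((field, value), []))

def resolve(lsid):
    """Supplier side: return the original object for a GUID."""
    return SUPPLIER[lsid]

def find_records(field, value):
    """Consumer side: discover GUIDs at the indexer, then fetch from the supplier."""
    return [resolve(lsid) for lsid in discover(field, value)]
```

Because the consumer always ends up with the supplier's own record, whatever the indexer did to its copy of the metadata affects only what can be found, not what is finally retrieved.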
...
In general it should never be harder to query providers than 
aggregators, especially if it is difficult for a machine to understand 
what, if any, point of view the aggregator has imposed on the view 
they offer of the aggregated data.
I don't believe this follows from your points above:

I frequently go to websites and can't find what I want so I go to Google 
and do a search restricting its scope to just that site. Indeed Google 
provide this as a service - just embed a search box on your site that 
passes the right parameters. In this situation it is definitely easier 
to query the aggregator than the supplier. Indeed many sites don't 
bother with providing search services other than Google's (which is 
precisely my point). The alternative is that every tin-pot website 
has to have an implementation of the Google search algorithm and indexes 
within it. (I appreciate that this is a human example but it translates 
to a machine world. A data provider's metadata could easily provide the 
location of web services to query it that are not actually part of the 
provider itself. Indeed it could offer a list of services. A neat place 
to do this would be in the WSDL returned by an LSID Authority.)
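
As a rough illustration of a provider advertising query services that it does not host itself, the sketch below uses a plain Python dictionary as a stand-in for the provider's metadata. The structure, key names and URLs are all assumptions made for the example; this is not the actual WSDL format returned by an LSID Authority.

```python
# Hypothetical provider metadata. In practice this information might be
# carried in the WSDL; a dict is used here only to show the idea.
PROVIDER_METADATA = {
    "authority": "example.org",
    "query_services": [
        {"theme": "taxon", "url": "http://indexer-a.example.net/taxon-search"},
        {"theme": "geography", "url": "http://indexer-b.example.net/geo-search"},
        {"theme": "taxon", "url": "http://indexer-c.example.net/names"},
    ],
}

def services_for(metadata, theme):
    """Return the URLs of the advertised services that cover a given theme."""
    return [service["url"]
            for service in metadata.get("query_services", [])
            if service["theme"] == theme]
```

A client that wants to search this provider's data by taxon would call services_for(PROVIDER_METADATA, "taxon") and query one of the returned indexers instead of the provider.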
...
People are no doubt tired of hearing this from me, but my position is 
always that modeling data consumers as humans is dangerously 
constricting. Humans are too smart and readily deal with lots of 
violations of the principle of least amazement, whereas machines 
don't. In point of fact, except for those on paper, stone, clay 
tablets and the like, there is no such thing as a database accessed by 
a human. They all have software between the human and the data 
provision service. From this I conclude that in your trinity below, 
reduction of the burden on humans actually falls to the applications, 
and so I think TAG's requirement is to reduce the burden on 
application writers (including those of TDWG itself, but also all 
others in the world) in their quest to reduce the burden on human data 
consumers. My intuition is that this will lead to a different analysis 
than thinking about humans as consumers, but at the moment I have no 
specific examples to offer.
I think this is a really good point and will take it forward. I hope to 
start the TAG meeting with a discussion of Actors within our domain and 
will attempt to differentiate client-human from client-machine within this.
...
A little more is interspersed below.
On 3/1/06, Roger Hyam <roger@tdwg.org> wrote:
This is a little more of a controversial question that has been suggested:
"Why should data providers supply search and query services?"
* We have many potential data providers (potentially every
          collection and institution).
        * We have many potential data consumers (potentially every
          researcher with a laptop).
        * We have a few potential data indexers (GBIF, ORBIS, etc. +
          others to come).
The implementation burden should therefore be:
* Light for the providers - whose role is to conserve data and
          physical objects.
        * Light for the consumer - whose role is to do research not
          mess with data handling.
        * Heavy for the indexers - whose core business is making the
          data accessible.
Data providers should give the objects they curate GUIDs. This is
    important because it stamps their ownership (and responsibility)
    on that piece of data. They then need to run an LSID service that
    serves the (meta)data for the objects they own. *Their work should
    stop at this point!* They should not have to implement search and
    query services. They should not anticipate what people will
    require by way of data access - that is a separate function.
Data consumers should be able to access indexing services that
    pool information from multiple data providers. They should not
    have to run federated queries across multiple data providers or
    have to discover providers as this is complex and difficult
    (though they may want to browse round data providers like they
    would browse links on web pages). Once they have retrieved the
    GUIDs of the objects they are interested in from the indexers they
    may want to call the data providers for more detailed information.
Data indexers should crawl the data exposed by the providers and
    index them in thematic ways, e.g. by providing geographic or
    taxon-focused services. This is a complex job as it involves doing
    clever, innovative things with data and optimization of searches etc.
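
A minimal sketch of such a crawl follows, assuming each provider can list the records changed after a given date and return the metadata for one LSID. The FakeProvider class, its method names and the sample records are hypothetical stand-ins for remote services; the "clever" parts (ranking, spatial search, optimization) are left out entirely.

```python
from collections import defaultdict
from datetime import date

class FakeProvider:
    """Stand-in for a remote provider; a real one would be called over the network."""

    def __init__(self, metadata):
        # LSID -> {"taxon": ..., "country": ..., "modified": date}
        self.metadata = metadata

    def changed_since(self, since):
        """LSIDs whose metadata changed after `since` (assumed capability)."""
        return [lsid for lsid, meta in self.metadata.items()
                if meta["modified"] > since]

    def get_metadata(self, lsid):
        return self.metadata[lsid]

def crawl(providers, last_crawl, themes=("taxon", "country")):
    """Build one index per theme, each mapping a value to a set of LSIDs."""
    indexes = {theme: defaultdict(set) for theme in themes}
    for provider in providers:
        for lsid in provider.changed_since(last_crawl):
            meta = provider.get_metadata(lsid)
            for theme in themes:
                if theme in meta:
                    indexes[theme][meta[theme]].add(lsid)
    return indexes

# Invented sample data for two providers.
HERBARIUM = FakeProvider({
    "urn:lsid:herb.example.org:sheet:1": {
        "taxon": "Quercus robur", "country": "GB", "modified": date(2006, 2, 1)},
    "urn:lsid:herb.example.org:sheet:2": {
        "taxon": "Fagus sylvatica", "country": "FR", "modified": date(2005, 6, 1)},
})
MUSEUM = FakeProvider({
    "urn:lsid:museum.example.net:spec:9": {
        "taxon": "Quercus robur", "country": "FR", "modified": date(2006, 1, 15)},
})
```

Calling crawl([HERBARIUM, MUSEUM], date(2006, 1, 1)) picks up only the two records changed in 2006 and files each under both its taxon and its country, so one crawl feeds a taxon-focused and a geographic service.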
Currently we are trying to make every data provider support
    searching and querying when the consumers aren't really interested
    in querying or searching individual providers - they want to
    search thematically across providers.
Restated, this sentence may fall in my class of questions forbidden to 
software architects, namely that class of questions that begin with 
the words "Why would anybody ever want to ..."
I should restate it as "What is the use case that indicates the system 
should support this behavior?"
If a big data provider wants to provide search and query then they
    can set themselves up as both a provider and an indexer - which is
    more or less what everyone is forced to do now - but the functions
    are separate.
Data providers would have to implement a little more than just an
    LSID resolver service for this to work. They would need to
    provide a single web service method (URL call) that allowed
    indexers to get lists of the LSIDs they hold that have had their
    (meta)data modified since a certain date, but this would be a
    relatively simple thing compared with providing arbitrary query
    facilities.
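
The provider-side method described above could be as small as the following sketch. The holdings, the date format of the request and the function names are assumptions for illustration, and "since" is taken to mean "on or after" the given date; a real service would read modification times from the provider's own database and sit behind a URL.

```python
from datetime import datetime, timezone

# Hypothetical holdings: LSID -> time its (meta)data last changed.
HOLDINGS = {
    "urn:lsid:example.org:specimens:1": datetime(2006, 1, 10, tzinfo=timezone.utc),
    "urn:lsid:example.org:specimens:2": datetime(2006, 2, 20, tzinfo=timezone.utc),
    "urn:lsid:example.org:specimens:3": datetime(2006, 3, 1, tzinfo=timezone.utc),
}

def lsids_modified_since(holdings, since):
    """Return, sorted, the LSIDs whose (meta)data changed on or after `since`."""
    return sorted(lsid for lsid, modified in holdings.items() if modified >= since)

def handle_request(query_date):
    """Stand-in for the single URL call; expects a date such as '2006-02-01'."""
    since = datetime.strptime(query_date, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return lsids_modified_since(HOLDINGS, since)
```

The whole interface is one parameter in and a list of identifiers out, which is the contrast being drawn with arbitrary query facilities.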
I believe (though I haven't done a thorough analysis of log data)
    that this is more or less the situation now. Data providers
    implement complete DiGIR or BioCASE protocols but are only queried
    in a limited way by portal engines. Consumers go directly to
    portals for their data discovery. So why implement full search and
    query at the data provider nodes of the network (possibly the
    hardest thing we have to do) when it may not be used?
This may be controversial. What do you think?
I'm not sure about controversial, but I am pretty sure that what you 
are pointing at is a warehouse model. I don't know if I am prepared 
to agree that all possible present and future concerns of TDWG can 
be answered by data warehouses. In particular, if you analyse log 
data of a warehouse, it won't be too surprising if the conclusion is 
that users are behaving as though they mainly need a warehouse. [To 
data consumers a warehouse and a portal are indistinguishable. I think.]
This is why I use the term 'indexer' rather than aggregator. The analogy 
with web search engines is a good one. Basically we have to implement 
aggregated indexes for key data (although federated searching by 
crawling all the providers is theoretically possible if you are not in a 
hurry). The question I raise is whether we also need to implement 
querying in every provider.
...
Bob Morris
Roger
--
-------------------------------------
     Roger Hyam
     Technical Architect
     Taxonomic Databases Working Group
    -------------------------------------
http://www.tdwg.org
     roger@tdwg.org
     +44 1578 722782
    -------------------------------------