Re: [Tdwg-tag] Why should data providers supply search and query services?

3 Mar 2006

Umm... there is a distinguishable class of data consumers, namely
applications, and so a distinguishable constituency whose burden is
relevant, namely application writers. Some applications may well be
motivated to query providers directly for a number of reasons, including:

   - the data indexers' currency policies may be unsuitable
   - the data indexers may aggregate in undesirable ways [the present
   model seems to be that indexer==portal, but I doubt that is general]
   - the data indexers may index too promiscuously or not promiscuously
   enough for the application's taste [this might be a non-issue if there were
   a way for a machine to understand what exactly the indexing strategy is and
   perhaps how to induce the indexer to alter it, but that sounds hard]
   - portals, and maybe indexers (indeed, any processor of the
   data) can intentionally or inadvertently hide assumptions about how the
   data will be used, making it unsuited for uses that don't meet these
   assumptions. Put another way, it is probably difficult to ensure that a
   machine-enforceable contract is possible between aggregators and
   applications that assures the application that records obtained from the
   aggregator are identical to those available from the provider. I think it is
   even a deep problem to have machine-understandable "fitness for use"
   metadata that would allow a machine to understand what fitness contract the
   aggregator is actually offering.
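
To make the "identical records" half of that contract concrete, here is a
minimal sketch of the check an application might run. Everything in it is an
assumption for illustration: the record fields, the example LSID, and the
function names are hypothetical, and it presumes a record can be reduced to
plain key/value pairs. It says nothing about the harder "fitness for use"
question.

```python
import hashlib
import json


def fingerprint(record: dict) -> str:
    """Hash a record in a canonical form (sorted keys, fixed separators)."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def same_record(provider_copy: dict, aggregator_copy: dict) -> bool:
    """True when both copies carry exactly the same fields and values."""
    return fingerprint(provider_copy) == fingerprint(aggregator_copy)


# Hypothetical record as served by the provider.
provider_copy = {
    "lsid": "urn:lsid:example.org:specimens:1",
    "taxon": "Quercus robur",
    "country": "GB",
}
# Same content, different key order: should still count as identical.
aggregator_copy = {
    "country": "GB",
    "taxon": "Quercus robur",
    "lsid": "urn:lsid:example.org:specimens:1",
}
# A copy the aggregator has silently generalised to genus level.
altered_copy = dict(provider_copy, taxon="Quercus")

print(same_record(provider_copy, aggregator_copy))
print(same_record(provider_copy, altered_copy))
```

Note that the application can only run this check if it is able to query the
provider directly in the first place.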

In general it should never be harder to query providers than aggregators,
especially if it is difficult for a machine to understand what, if any,
point of view the aggregator has imposed on the view they offer of the
aggregated data.

People are no doubt tired of hearing this from me, but my position is always
that modeling data consumers as humans is dangerously constricting. Humans
are too smart and readily deal with lots of violations of the principle of
least amazement, whereas machines don't. In point of fact, except for those
on paper, stone, clay tablets and the like, there is no such thing as a
database accessed by a human. They all have software between the human and
the data provision service. From this I conclude that in your trinity
below, reduction of the burden on humans actually falls to the applications,
and so I think TAG's requirement is to reduce the burden on application
writers (including those of TDWG itself, but also all others in the world)
in their quest to reduce the burden on human data consumers. My intuition is
that this will lead to a different analysis than thinking about humans as
consumers, but at the moment I have no specific examples to offer.

A little more is interspersed below.

On 3/1/06, Roger Hyam <roger@tdwg.org> wrote:
...
This is a little more of a controversial question that has been suggested:
"Why should data providers supply search and query services?"
   - We have many potential data providers (potentially every
   collection and institution).
   - We have many potential data consumers (potentially every
   researcher with a laptop).
   - We have a few potential data indexers (GBIF, ORBIS, etc + others
   to come).
The implementation burden should therefore be:
   - Light for the providers - whose role is to conserve data and
   physical objects.
   - Light for the consumer - whose role is to do research not mess
   with data handling.
   - Heavy for the indexers - whose core business is making the data
   accessible.
Data providers should give the objects they curate GUIDs. This is
important because it stamps their ownership (and responsibility) on that
piece of data. They then need to run an LSID service that serves the
(meta)data for the objects they own. *Their work should stop at this
point!* They should not have to implement search and query services. They
should not anticipate what people will require by way of data access - that
is a separate function.
Data consumers should be able to access indexing services that pool
information from multiple data providers. They should not have to run
federated queries across multiple data providers or have to discover
providers as this is complex and difficult (though they may want to browse
round data providers like they would browse links on web pages). Once they
have retrieved the GUIDs of the objects they are interested in from the
indexers they may want to call the data providers for more detailed
information.
Data indexers should crawl the data exposed by the providers and index
them in thematic ways, e.g. provide geographic or taxon-focused services.
This is a complex job as it involves doing clever, innovative things with
data and optimization of searches etc.
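
The division of labour described above can be sketched in a few lines. This
is only an illustration under stated assumptions: the provider names, LSIDs,
and metadata fields are made up, `resolve` stands in for a real LSID
resolution call, and the "crawl" is a loop over in-memory dictionaries rather
than network requests.

```python
from collections import defaultdict

# Hypothetical providers: each maps the LSIDs it owns to their (meta)data.
PROVIDERS = {
    "herbarium-a": {
        "urn:lsid:a.example.org:specimen:1": {"taxon": "Quercus robur", "country": "GB"},
        "urn:lsid:a.example.org:specimen:2": {"taxon": "Fagus sylvatica", "country": "FR"},
    },
    "museum-b": {
        "urn:lsid:b.example.org:specimen:7": {"taxon": "Quercus robur", "country": "DE"},
    },
}


def resolve(lsid):
    """Stand-in for an LSID resolution call to the owning provider."""
    for records in PROVIDERS.values():
        if lsid in records:
            return records[lsid]
    raise KeyError(lsid)


def build_index(providers, theme):
    """Indexer: crawl every provider and group LSIDs by one thematic field."""
    index = defaultdict(set)
    for records in providers.values():
        for lsid, metadata in records.items():
            index[metadata[theme]].add(lsid)
    return index


# Consumer: ask the indexer for GUIDs, then call providers for the details.
taxon_index = build_index(PROVIDERS, "taxon")
oak_lsids = sorted(taxon_index["Quercus robur"])
oak_countries = [resolve(lsid)["country"] for lsid in oak_lsids]
print(oak_lsids, oak_countries)
```
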
Currently we are trying to make every data provider support searching and
querying when the consumers aren't really interested in querying or
searching individual providers - they want to search thematically across
providers.

Restated, this sentence may fall in my class of questions forbidden to
software architects, namely that class of questions that begin with the
words "Why would anybody ever want to ..."

If a big data provider wants to provide search and query then they can set
themselves up as both a provider and an indexer - which is more or less what
everyone is forced to do now - but the functions are separate.
Data providers would have to implement a little more than just an LSID
resolver service for this to work. They would need to provide a single web
service method (URL call) that allowed indexers to get lists of LSIDs they
hold that have had their (meta)data modified since a certain date but this
would be a relatively simple thing compared with providing arbitrary query
facilities.
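
A minimal sketch of what sits behind such a "modified since" method, to show
how little logic it needs compared with arbitrary query. The holdings, the
LSIDs, and the function name are hypothetical; the sketch also assumes "since"
means on or after the given date, and it leaves out the web service wrapper
entirely.

```python
from datetime import date

# Hypothetical provider holdings: LSID -> date its (meta)data last changed.
HOLDINGS = {
    "urn:lsid:example.org:specimen:1": date(2005, 11, 2),
    "urn:lsid:example.org:specimen:2": date(2006, 2, 14),
    "urn:lsid:example.org:specimen:3": date(2006, 3, 1),
}


def lsids_modified_since(holdings, since):
    """Return, sorted, the LSIDs whose (meta)data changed on or after `since`."""
    return sorted(lsid for lsid, modified in holdings.items() if modified >= since)


# An indexer that last crawled at the end of 2005 would ask for:
changed = lsids_modified_since(HOLDINGS, date(2006, 1, 1))
print(changed)
```

The indexer would then resolve each returned LSID to refresh its own copy.
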
I believe (though I haven't done a thorough analysis of log data) that
this is more or less the situation now. Data providers implement complete
DiGIR or BioCASE protocols but are only queried in a limited way by portal
engines. Consumers go directly to portals for their data discovery. So why
implement full search and query at the data provider nodes of the network
(possibly the hardest thing we have to do) when it may not be used?
This may be controversial. What do you think?

I'm not sure about controversial, but I am pretty sure that what you are
pointing at is a warehouse model. I don't know if I am prepared to agree
that all possible present and future concerns of TDWG can be answered by
data warehouses. In particular, if you analyse log data of a warehouse, it
won't be too surprising if the conclusion is that users are behaving as
though they mainly need a warehouse. [To data consumers a warehouse and a
portal are indistinguishable. I think.]

Bob Morris

Roger

--

-------------------------------------
 Roger Hyam
 Technical Architect
 Taxonomic Databases Working Group
-------------------------------------

http://www.tdwg.org
 roger@tdwg.org
 +44 1578 722782
-------------------------------------

_______________________________________________
Tdwg-tag mailing list
Tdwg-tag@lists.tdwg.org
http://lists.tdwg.org/mailman/listinfo/tdwg-tag_lists.tdwg.org
