[Tdwg-tag] Why should data providers supply search and query services?

Fri Mar 3 14:51:43 CET 2006

On 3/3/06, Roger Hyam <roger at tdwg.org> wrote:
>
> Bob Morris wrote:
>
> Umm...there is a distinguishable class of data consumers, namely
> applications, and so a distinguishable constituency whose burden is
> relevant, namely application writers. Some applications may well be
> motivated to query providers directly for a number of reasons, including:
>
>
> the data indexers' currency policies may be unsuitable
> This equally applies to data providers. They may not index data in a way
> the consumer requires. It may lag behind their own live data set etc.
>

I agree completely on this and your other dittos. It's typically hard to
figure out whether something is an aggregator or an originator. This is the
oft-discussed issue of "data provenance" which is quite difficult to
establish on a per-record basis. In the (defunct?) UBIF schema there is a
weak attempt to record how, or at least if, a record evolved from its
originator. Furthermore, the history of that evolution, were it understood
(by a machine!) could prove quite useful to an application, which may well
find it interesting to incorporate the wisdom of intermediaries and find
some of them provide a better view of a given record than do others,
possibly even including the originator. As a simple example, it could be
quite convenient if an intermediary could, by some clever processing,
establish that some datum in a record is inconsistent with some other in the
same record, and could record that fact in its forwarding metadata. Really,
my vision here is machines as scholars. I don't suggest TDWG should attempt
to accomplish that. I merely say that if that is one's vision, then one
buries fewer difficult-to-extract assumptions in the modeling. I think
this is the real point of my arguments: how to recognize all the "gotchas"
in one's models and make sure they are acknowledged enough that others can
deal with them. ["Gotcha" is an Americanism(?) contracted from "I got you!"
typically uttered to the victim of a practical joke who has been
successfully blind-sided].
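
To make the forwarding-metadata idea concrete, here is a minimal Python
sketch. It is not anything UBIF or TDWG defines. Every name in it (Record,
ProvenanceStep, forward, elevation_check, the elevation field names, the
example.org identifiers) is hypothetical and chosen only for illustration.
Each actor that passes a record on appends one step stating its role and
anything it noticed about the record:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ProvenanceStep:
    actor: str   # hypothetical identifier of whoever handled the record
    role: str    # e.g. "originator", "aggregator", "transformer"
    notes: List[str] = field(default_factory=list)


@dataclass
class Record:
    guid: str
    data: Dict[str, str]
    trail: List[ProvenanceStep] = field(default_factory=list)


def elevation_check(data: Dict[str, str]) -> List[str]:
    """Flag a record whose minimum elevation exceeds its maximum."""
    lo, hi = data.get("min_elevation_m"), data.get("max_elevation_m")
    if lo is not None and hi is not None and float(lo) > float(hi):
        return ["min_elevation_m exceeds max_elevation_m"]
    return []


def forward(record: Record, actor: str, role: str,
            checks: List[Callable[[Dict[str, str]], List[str]]]) -> Record:
    """Return a copy of the record with one more provenance step appended."""
    notes = [msg for check in checks for msg in check(record.data)]
    step = ProvenanceStep(actor=actor, role=role, notes=notes)
    return Record(record.guid, dict(record.data), record.trail + [step])


original = Record(
    "urn:lsid:example.org:specimen:1",
    {"min_elevation_m": "900", "max_elevation_m": "300"},
    [ProvenanceStep("example.org", "originator")],
)
forwarded = forward(original, "aggregator.example.net", "aggregator",
                    [elevation_check])
```

An application reading forwarded.trail can see that the record passed
through one intermediary, what role that intermediary claims, and that it
found the two elevation values inconsistent. The data itself is left
untouched, so the application can still decide whose view to trust.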

[As an aside, I note that the much vaunted data-information-knowledge
pyramid is actually cited as data-information-knowledge-wisdom by some
authors. Scientists too often stop at "knowledge" because "wisdom" seems too
hard to define and perhaps a little too uncomfortable to assert about
oneself. ]

> the data indexers may aggregate in undesirable ways [the present model seems
> to be that indexer==portal, but I doubt that is general]
> Ditto from point above. Data suppliers may index in undesirable ways plus
> they might index heterogeneously - each supplier may be undesirable in
> different ways - which would be a really big headache. Is there anything to
> say this will cause less of a burden when spread across many providers
> rather than few indexers? If a thematic indexer doesn't do what is required
> then it may be possible to get something changed. If 50 suppliers don't
> index something correctly then it will no doubt take years to get any
> changes effected - especially if they are all doing it wrong differently.
>

This might also be addressed by good provenance trails in the data. [Iterate
this sentiment for all your dittos...]

> the data indexers may index too promiscuously or not promiscuously enough
> for the application's taste [this might be a non-issue if there were a way
> for a machine to understand what exactly the indexing strategy is and
> perhaps how to induce the indexer to alter it, but that sounds hard]
> Again ditto. If providers are also indexers then any criticism of problems
> with indexing has to apply to the suppliers but is magnified by the number
> of suppliers.
>
>
>
> portals, and maybe indexers---indeed, any processor of the data---can
> intentionally or inadvertently hide assumptions about how the data will be
> used, making it unsuited for uses that don't meet these assumptions. Put
> another way, it is probably difficult to ensure that a machine-enforceable
> contract is possible between aggregators and applications that assures the
> application that records obtained from the aggregator are identical to those
> available from the provider. I think it is even a deep problem to have
> machine-understandable "fitness for use" metadata that would allow a machine
> to understand what fitness contract the aggregator is actually offering.
> I would assume that the aggregator is assembling metadata (in the sense of
> things that can be searched on) rather than actual data. The
> aggregator/indexer is really only providing a GUID discovery service. The
> consumer can always retrieve the original objects from the data supplier.
> The aggregator/indexer is only providing a match making service.
>

As to "only",  I agree for indexers but doubt it for aggregators. Sometimes.
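
A toy Python sketch of the distinction I have in mind. All the data and
names (provider_records, taxon_index, aggregator_copy, via_indexer,
via_aggregator, the example.org LSID) are made up for illustration. An
indexer hands back GUIDs and the consumer fetches the originals; an
aggregator answers from its own copies, which may have drifted from the
provider's:

```python
from typing import Dict, List

# Hypothetical live holdings at the data provider, keyed by GUID.
provider_records: Dict[str, Dict[str, str]] = {
    "urn:lsid:example.org:specimen:7": {"taxon": "Quercus robur",
                                        "country": "France"},
}

# An indexer keeps search terms and GUIDs only: a matchmaking service.
taxon_index: Dict[str, List[str]] = {
    "Quercus robur": ["urn:lsid:example.org:specimen:7"],
}

# An aggregator keeps its own copy of each record. This copy is assumed
# to predate a correction the provider made to the country field.
aggregator_copy: Dict[str, Dict[str, str]] = {
    "urn:lsid:example.org:specimen:7": {"taxon": "Quercus robur",
                                        "country": "Spain"},
}


def via_indexer(taxon: str) -> List[Dict[str, str]]:
    """Discover GUIDs at the indexer, then fetch originals from the provider."""
    return [provider_records[guid] for guid in taxon_index.get(taxon, [])]


def via_aggregator(taxon: str) -> List[Dict[str, str]]:
    """Answer from the aggregator's copies; the provider is never consulted."""
    return [rec for rec in aggregator_copy.values() if rec["taxon"] == taxon]


live = via_indexer("Quercus robur")
cached = via_aggregator("Quercus robur")
```

Here live and cached disagree about the country, and nothing in the
aggregator's answer tells a machine that this could happen.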

> In general it should never be harder to query providers than aggregators,
> especially if it is difficult for a machine to understand what, if any,
> point of view the aggregator has imposed on the view they offer of the
> aggregated data.
>
>  I don't believe this follows from your points above:
>
> I frequently go to websites and can't find what I want so I go to Google
> and do a search restricting its scope to just that site. Indeed Google
> provide this as a service - just embed a search box on your site that passes
> the right parameters. In this situation it is definitely easier to query the
> aggregator than the supplier. Indeed many sites don't bother with providing
> search services other than Google (which is the point I make precisely). The
> alternative is that every tin-pot website has to have an implementation of
> the Google search algorithm and indexes within it. (I appreciate that this
> is a human example but it translates to a machine world. A data provider's
> metadata could easily provide the location of web services to query it that
> are not actually part of the provider itself. Indeed it could offer a list
> of services. A neat place to do this would be in the WSDL returned by an LSID
> Authority.)
>

Good point. Google deserves thought.  If it is an aggregator other than
trivially, it is certainly one with a point of view, a hint of which can be
seen in their cached pages, where they helpfully add to the data by
highlighting the search terms. Who asked for that? Not me. But I don't seem
to be offered a choice about it. Conversely, someone who desires to take
advantage of Google's wisdom in this regard may actually find their view
more useful than the originator's. Indeed, for me it is frequent that I go to
the original page and then am frustrated by the weak Firefox search facility
when I try to figure out where in the original I should be looking. But if I
use the Google cache, I may be at the mercy of their currency policies. This
frequently makes it not so useful in searching for things in poorly
threaded archives such as email archives---if the discussion is so old that
the Google cache is complete it is sometimes the case that the answer is in
the originator but hard to find, yet not in the cache where it would be easy
to find.

> People are no doubt tired of hearing this from me, but my position is
> always that modeling data consumers as humans is dangerously constricting.
> Humans are too smart and readily deal with lots of violations of the
> principle of least amazement, whereas machines don't. In point of fact,
> except for those on paper, stone, clay tablets and the like, there is no
> such thing as a database accessed by a human. They all have software between
> the human and the data provision service.  From this I conclude that in your
> trinity below, reduction of the burden on humans actually falls to the
> applications, and so I think TAG's requirement is to reduce the burden on
> application writers (including those of TDWG itself, but also all others in
> the world) in their quest to reduce the burden on human data consumers. My
> intuition is that this will lead to a different analysis than thinking about
> humans as consumers, but at the moment I have no specific examples to offer.
>
>
>  I think this is a really good point and will take it forward. I hope to
> start the TAG meeting with a discussion of Actors within our domain and will
> attempt to differentiate client-human from client-machine within this.
>

I often muse upon the fact that the UML Actor symbol doesn't distinguish
human from non-human actors. There are good and bad aspects of that. Good
when you are modeling a software system. Bad when there are actually humans
who can push the buttons. [Or maybe it's really good if you are constantly
aware that humans behave unexpectedly. Keeping that in mind is the real
point about my "forbidden questions"].

A little more is interspersed below.
>
>
> On 3/1/06, Roger Hyam <roger at tdwg.org> wrote:
> >
> >  This is a little more of a controversial question that has been suggested:
> >
> > "Why should data providers supply search and query services?"
> >
> >
> >    - We have many potential data providers (potentially every
> >    collection and institution).
> >    - We have many potential data consumers (potentially every
> >    researcher with a laptop).
> >    - We have a few potential data indexers (GBIF, ORBIS , etc +
> >    others to come).
> >
> > The implementation burden should therefore be:
> >
> >
> > >    - Light for the providers - whose role is to conserve data and
> > >    physical objects.
> > >    - Light for the consumer - whose role is to do research not mess
> > >    with data handling.
> > >    - Heavy for the indexers - whose core business is making the data
> > >    accessible.
> >
> > Data providers should give the objects they curate GUIDs. This is
> > important because it stamps their ownership (and responsibility) on that
> > piece of data. They then need to run an LSID service that serves the
> > (meta)data for the objects they own. *Their work should stop at this
> > point!* They should not have to implement search and query services.
> > They should not anticipate what people will require by way of data access -
> > that is a separate function.
> >
> > Data consumers should be able to access indexing services that pool
> > information from multiple data providers. They should not have to run
> > federated queries across multiple data providers or have to discover
> > providers as this is complex and difficult (though they may want to browse
> > round data providers like they would browse links on web pages). Once they
> > have retrieved the GUIDs of the objects they are interested in from the
> > indexers they may want to call the data providers for more detailed
> > information.
> >
> > Data indexers should crawl the data exposed by the providers and index
> > them in thematic ways. e.g. provide geographic or taxon focused
> > services. This is a complex job as it involves doing clever, innovative
> > things with data and optimization of searches etc.
> >
> > Currently we are trying to make every data provider support searching
> > and querying when the consumers aren't really interested in querying or
> > searching individual providers - they want to search thematically across
> > providers.
> >
>
> Restated, this sentence may fall in my class of questions forbidden to
> software architects, namely  that class of questions that begin with the
> words "Why would anybody ever want to ..."
>  I should restate it "What is the use case that indicates the system
> should support this behavior?"
>
>
>  If a big data provider wants to provide search and query then they can
> > set themselves up as both a provider and an indexer - which is more or less
> > what everyone is forced to do now - but the functions are separate.
> >
> > Data providers would have to implement a little more than just an LSID
> > resolver service for this to work. They would need to provide a single web
> > service method (URL call) that allowed indexers to get lists of LSIDs they
> > hold that have had their (meta)data modified since a certain date but this
> > would be a relatively simple thing compared with providing arbitrary query
> > facilities.
> >
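
A minimal Python sketch of that single listing method and of an indexer
using it. Nothing here is a DiGIR, BioCASE or LSID API. The names (Holding,
lsids_modified_since, harvest) and the in-memory dictionary standing in for
the provider's web service are all assumptions made for illustration:

```python
from datetime import date
from typing import Dict, List, NamedTuple


class Holding(NamedTuple):
    metadata: Dict[str, str]
    modified: date   # date the (meta)data last changed


def lsids_modified_since(holdings: Dict[str, Holding],
                         since: date) -> List[str]:
    """Provider side: the one listing method, standing in for the URL call.

    The comparison is inclusive so a change made on the cut-off date is
    not missed; re-fetching an unchanged record is harmless.
    """
    return sorted(lsid for lsid, h in holdings.items() if h.modified >= since)


def harvest(holdings: Dict[str, Holding], index: Dict[str, Dict[str, str]],
            last_harvest: date) -> List[str]:
    """Indexer side: refresh only the records that changed."""
    changed = lsids_modified_since(holdings, last_harvest)
    for lsid in changed:
        # Stands in for resolving the LSID to its metadata at the provider.
        index[lsid] = dict(holdings[lsid].metadata)
    return changed


holdings = {
    "urn:lsid:example.org:specimen:1":
        Holding({"taxon": "Quercus robur"}, date(2006, 1, 10)),
    "urn:lsid:example.org:specimen:2":
        Holding({"taxon": "Quercus alba"}, date(2006, 3, 1)),
}
index: Dict[str, Dict[str, str]] = {}
changed = harvest(holdings, index, date(2006, 2, 1))
```

With a last harvest date of 1 February 2006, only the second specimen is
listed and fetched. The provider never has to interpret a query beyond
"what changed since this date".
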
> > I believe (though I haven't done a thorough analysis of log data) that
> > this is more or less the situation now. Data providers implement complete
> > DiGIR or BioCASE protocols but are only queried in a limited way by portal
> > engines. Consumers go directly to portals for their data discovery. So why
> > implement full search and query at the data provider nodes of the network
> > (possibly the hardest thing we have to do) when it may not be used?
> >
> > This may be controversial. What do you think?
> >
>
>
> I'm not sure about controversial, but I am pretty sure that what you are
> pointing at is a warehouse model. I don't know if I am prepared to agree
> that all possible present and future concerns of TDWG can be answered by
> data warehouses. In particular, if you analyse log data of a warehouse, it
> won't be too surprising if the conclusion is that users are behaving as
> though they mainly need a warehouse. [To data consumers a warehouse and a
> portal are indistinguishable. I think.]
>
>   This is why I use the term 'indexer' rather than aggregator. The analogy
> with web search engines is a good one. Basically we have to implement
> aggregated-indexes for key data (although federated searching by crawling
> all the providers is theoretically possible if you are not in a hurry) the
> question I raise is whether we also need to implement querying in every
> provider.
>

Maybe not. What would alarm me, though, is if we do something that
precludes it or even makes it hard. I could grudgingly live with a position
that TDWG's service function definitions are all about aggregation. But the data
exchange standards had better not distinguish aggregators from originators
from transformers except for providing those actors with the ability to
identify their role and point of view.

Bob Morris
>
>  Roger
> >
> > --
> >
> > -------------------------------------
> >  Roger Hyam
> >  Technical Architect
> >  Taxonomic Databases Working Group
> > -------------------------------------
> >
> > http://www.tdwg.org
> >  roger at tdwg.org
> >  +44 1578 722782
> > -------------------------------------
> >
> >
> > _______________________________________________
> > Tdwg-tag mailing list
> > Tdwg-tag at lists.tdwg.org
> > http://lists.tdwg.org/mailman/listinfo/tdwg-tag_lists.tdwg.org
> >
> >
> >
>
>