<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">

</head>

<body bgcolor="#ffffff" text="#000000">

<br>

Bob,<br>

<br>

Great points on provenance and maintaining standards whether they are

moving between provider, aggregator, indexer, client or whatever. GUIDs

should be a help in tracking provenance but we do need policies on what

aggregators can do with objects...<br>

<br>

I don't think my suggestions preclude anything. Perhaps they can be

summed up as a suggestion that providers, indexers and search and query

services should be modeled as separate actors within the architecture.

Some software instances may play the roles of more than one of these

actors, but this is not compulsory.<br>

<br>

Roger<br>

<br>

<br>

Bob Morris wrote:

<blockquote

 cite="mid16957b040603030551k603d64a5o637fea265ce71d9c@mail.gmail.com"

 type="cite"><br>

  <br>

  <div><span class="gmail_quote">On 3/3/06, <b class="gmail_sendername">Roger

Hyam</b> &lt;<a href="mailto:roger@tdwg.org">roger@tdwg.org</a>&gt;

wrote:</span>

  <blockquote class="gmail_quote"

 style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

    <div style="direction: ltr; color: rgb(0, 153, 0);"> Bob Morris

wrote:

    </div>

    <div style="direction: ltr;"><span class="q"><span

 style="color: rgb(0, 153, 0);">Umm...there is a distinguishable class

of data consumers,

namely </span><span style="font-style: italic; color: rgb(0, 153, 0);">applications</span><span

 style="color: rgb(0, 153, 0);">, and so a

distinguishable constituency whose burden is relevant, namely </span><span

 style="font-style: italic; color: rgb(0, 153, 0);">

application writers.</span><span style="color: rgb(0, 153, 0);"> Some

applications may well be motivated to

query providers directly for a number of reasons, including:</span><br>

    </span></div>

    <div style="direction: ltr;"><span class="q">

    <li>the data indexers' currency policies may be unsuitable</li>

    </span></div>

    <div style="direction: ltr;"> This equally applies to data

providers. They may not index data in a

way the consumer requires. It may lag behind their own live data set

etc. </div>

  </blockquote>

  <div><br>

  <span style="color: rgb(0, 153, 0);">I agree completely on this and

your other dittos. It's typically hard to figure out whether something

is an aggregator or an originator. This is the oft-discussed issue of

"data provenance", which is quite difficult to establish on a per-record

basis. In the (defunct?) UBIF schema there is a weak attempt to record

how, or at least if, a record evolved from its originator. Furthermore,

the history of that evolution, were it understood (by a machine!) could

prove quite useful to an application, which may well find it

interesting to incorporate the wisdom of intermediaries and find some

of them provide a better view of a given record than do others,

possibly even including the originator. As a simple example, it could

be quite convenient if an intermediary, by some clever processing,

could establish that some datum in a record is inconsistent with some

other in the same record and could record that fact in its forwarding

metadata. Really, my vision here is machines as scholars. I don't

suggest TDWG should attempt to accomplish that. I merely say that if

that is one's vision, then one buries fewer difficult-to-extract

assumptions in the modeling.&nbsp; I think this is the real point of my

arguments: how to recognize all the "gotchas" in one's models and make

sure they are acknowledged enough that others can deal with them.

["Gotcha" is an Americanism(?) contracted from "I got you!" typically

uttered to the victim of a practical joke who has been successfully

blind-sided].

  <br>

  <br>

[As an aside, I note that the much vaunted data-information-knowledge

pyramid is actually cited as data-information-knowledge-wisdom by some

authors. Scientists too often stop at "knowledge" because "wisdom"

seems too hard to define and perhaps a little too uncomfortable to

assert about oneself. ]

  <br>

  </span></div>

  <br>

  <blockquote class="gmail_quote"

 style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

    <div style="direction: ltr;"><span class="q">

    <li>the data indexers may aggregate in undesirable ways [the

present model seems to be that indexer==portal, but I doubt that is

general]</li>

    </span></div>

    <div style="direction: ltr;"> Ditto from point above. Data

suppliers may index in undesirable ways

plus they might index heterogeneously - each supplier may be

undesirable in different ways - which would be a really big headache.

Is there anything to say this will cause

less of a burden when spread across many providers rather than few

indexers? If a thematic indexer doesn't do what is required then it may

be possible to get something changed. If 50 suppliers don't index

something correctly then it will no doubt take years to get any changes

effected - especially if they are all doing it wrong differently.</div>

  </blockquote>

  <div><br style="color: rgb(51, 153, 153);">

  <span style="color: rgb(0, 153, 0);">This might also be addressed by

good provenance trails in the data. [Iterate this sentiment for all

your dittos...]

  </span><span style="color: rgb(51, 102, 102);"></span><br>

  </div>

  <br>

  <blockquote class="gmail_quote"

 style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

    <div style="direction: ltr;"><span class="q">

    <li>the data indexers may index too promiscuously or not

promiscuously enough for the application's taste [this might be a

non-issue if there were a way for a machine to understand what exactly

the indexing strategy is and perhaps how to induce the indexer to alter

it, but that sounds hard]</li>

    </span></div>

    <div style="direction: ltr;"> Again ditto. If providers are also

indexers then any criticism of

problems with indexing has to apply to the suppliers but is magnified

by the number of suppliers.<br>

    <blockquote

 cite="http://mid16957b040603021857p56e9075fvc0fbb169b669c8fd@mail.gmail.com"

 type="cite"> </blockquote>

    </div>

    <div style="direction: ltr;"><span class="q">

    <li>portals, and maybe indexers---indeed, <span

 style="font-style: italic;">any</span> processor of the data---can

intentionally or inadvertently hide assumptions about how the data will

be used, making it unsuited for uses that don't meet these assumptions.

Put another way, it is probably difficult to ensure that a

machine-enforceable contract is possible between aggregators and

applications that assures the application that records obtained from

the aggregator are identical to those available from the provider. I

think it is even a deep problem to have machine-understandable

"fitness for use" metadata that would allow a machine to understand

what fitness contract the aggregator is actually offering. </li>

    </span></div>

    <div style="direction: ltr;"> I would assume that the aggregator is

assembling metadata (in the sense

of things that can be searched on) rather than actual data. The

aggregator/indexer is really only providing a GUID discovery service.

The consumer can always retrieve the original objects from the data

supplier. The aggregator/indexer is only providing a match making

service.</div>

  </blockquote>

  <div><br>

  <span style="color: rgb(0, 153, 0);">As to "only",&nbsp; I agree for

indexers but doubt it for aggregators. Sometimes.</span><br>

  </div>

  <br>

  <blockquote class="gmail_quote"

 style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

    <div style="direction: ltr;"><span class="q">

    <blockquote

 cite="http://mid16957b040603021857p56e9075fvc0fbb169b669c8fd@mail.gmail.com"

 type="cite">In general it should never be <span

 style="font-style: italic;">harder</span> to query providers than

aggregators, especially if it is difficult for a machine to understand

what, if any, point of view the aggregator has imposed on the view they

offer of the aggregated data. <br>

      <br>

    </blockquote>

    </span></div>

    <div style="direction: ltr;">I don't believe this follows from your

points above:<br>

    <br>

I frequently go to websites and can't find what I want so I go to

Google and do a search restricting its scope to just that site. Indeed

Google provide this as a service - just embed a search box on your site

that passes the right parameters. In this situation it is definitely

easier to query the aggregator than the supplier. Indeed many sites

don't bother with providing search services other than Google (which is

precisely the point I am making). The alternative is that every tin-pot

website has to have an implementation of the Google search algorithm

and indexes within it. (I appreciate that this is a human example but

it translates to a machine world. A data provider's metadata could

easily provide the location of web services to query it that are not

actually part of the provider itself. Indeed it could offer a list of

services. A neat place to do this would be in the WSDL returned by a

LSID Authority.) </div>

  </blockquote>

  <div><br>

  <span style="color: rgb(0, 153, 0);">Good point. Google deserves

thought.&nbsp; If it is an aggregator other than trivially, it is certainly

one with a point of view, a hint of which can be seen in their <span

 style="font-style: italic;">cached</span> pages, where they helpfully <span

 style="font-style: italic;">add</span> to the data by highlighting the

search terms. Who asked for that? Not me. But I don't seem to be

offered a choice about it. Conversely, someone who desires to take

advantage of Google's wisdom in this regard may actually find their

view <span style="font-style: italic;">more</span> useful than the

originator's. Indeed, I frequently go to the original page

and then am frustrated by the weak Firefox search facility when I try

to figure out where in the original I should be looking. But if I use

the Google cache, I may be at the mercy of their currency policies.

This frequently makes it not so useful in searching for things in

poorly threaded archives such as email archives---if the

discussion is so old that the Google cache is complete it is sometimes

the case that the answer is in the originator but hard to find, yet not

in the cache where it would be easy to find.

  </span><br>

  </div>

  <br>

  <blockquote class="gmail_quote"

 style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

    <div style="direction: ltr;"><span style="color: rgb(0, 153, 0);"></span><br>

    <span class="q">

    <blockquote

 cite="http://mid16957b040603021857p56e9075fvc0fbb169b669c8fd@mail.gmail.com"

 type="cite">People are no doubt tired of hearing this from me, but my

position is always that modeling data consumers as humans is

dangerously constricting. Humans are too smart and readily deal with

lots of violations of the principle of least amazement, whereas

machines don't. In point of fact, except for those on paper, stone,

clay tablets and the like, there is no such thing as a database

accessed by a human. They all have software between the human and the

data provision service.&nbsp; From this I conclude that in your trinity

below, reduction of the burden on humans actually falls to the

applications, and so I think TAG's requirement is to reduce the burden

on application writers&nbsp; (including those of TDWG itself, but also all

others in the world) in <span style="font-style: italic;">their</span>

quest to reduce the burden on human data consumers. My intuition is

that this will lead to a different analysis than thinking about humans

as consumers, but at the moment I have no specific examples to offer. <br>

      <br>

    </blockquote>

    </span></div>

    <div style="direction: ltr;">I think this is a really good point

and will take it forward. I hope to

start the TAG meeting with a discussion of Actors within our domain and

will attempt to differentiate client-human from client-machine within

this.</div>

  </blockquote>

  <div><br>

  <span style="color: rgb(0, 153, 0);"> I often muse upon the fact that

the UML Actor symbol doesn't distinguish human from non-human actors.

There are good and bad aspects of that. Good when you are modeling a

software system. Bad when there are actually humans who can push the

buttons. [Or maybe it's really good <span style="font-style: italic;">if</span>

you are constantly aware that humans behave unexpectedly. Keeping that

in mind is the real point about my "forbidden questions"]. <br>

  </span><span style="color: rgb(0, 153, 0);"></span><br>

  </div>

  <blockquote class="gmail_quote"

 style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

    <div style="direction: ltr;"><span class="q">A little more is

interspersed below.

    <br>

    <br>

    <br>

    </span></div>

    <div style="direction: ltr;">

    <div style="direction: ltr;"><span class="q"><span

 class="gmail_quote">On 3/1/06, <b class="gmail_sendername">Roger

Hyam</b> &lt;<a href="mailto:roger@tdwg.org" target="_blank">roger@tdwg.org</a>&gt;

wrote:</span>

    <blockquote class="gmail_quote"

 style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

      <div style="direction: ltr;">

      <pre>This is a little more of a controversial question that has been suggested:

"Why should data providers supply search and query services?"

    </pre>

      <ul>

        <li>We have many potential data providers (potentially every

collection and institution).</li>

        <li>We have many potential data consumers (potentially every

researcher with a laptop).</li>

        <li>We have a few potential data indexers (GBIF, ORBIS, etc. +

others

to come).</li>

      </ul>

      <pre>The implementation burden should therefore be:

    </pre>

      <ul>

        <li>Light for the providers - whose role is to conserve data

and

physical objects.</li>

        <li>Light for the consumer - whose role is to do research not

mess

with data handling.<br>

        </li>

        <li>Heavy for the indexers - whose core business is making the

data

accessible.</li>

      </ul>

Data providers should give the objects they curate GUIDs. This is

important because it stamps their ownership (and responsibility) on

that piece of data. They then need to run an LSID service that serves

the

(meta)data for the objects they own. <b>Their work should stop at this

point!</b>

They should not have to implement search and query services. They

should not anticipate what people will require by way of data access -

that is a separate function.<br>

      <br>

Data consumers should be able to access indexing services that pool

information from multiple data providers. They should not have to run

federated queries across multiple data providers or have to discover

providers as this is complex and

difficult (though they may want to browse round data providers like

they would browse links on web pages). Once they have retrieved the

GUIDs of the objects they are interested

in from the indexers they may want to call the data providers for more

detailed information.<br>

      <br>

Data indexers should crawl the data exposed by the providers and index

them in thematic ways. e.g. provide geographic or taxon focused

services. This is a complex job as it involves doing clever, innovative

things with data and optimization of searches etc.<br>
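<br>
As a rough sketch of that division of labour, an indexer could crawl the providers and build a taxon-keyed index of GUIDs, and a consumer could then take those GUIDs back to the providers for the detailed records. Everything named here is an assumption made for illustration: the provider holdings, the example.org LSIDs, the "taxon" and "country" fields and the two function names are all hypothetical and do not come from any existing TDWG service:<br>
<pre>
```python
from collections import defaultdict

# Hypothetical provider holdings: GUID -> (meta)data. All values are made up.
PROVIDER_A = {
    "urn:lsid:a.example.org:specimens:1": {"taxon": "Quercus robur", "country": "GB"},
    "urn:lsid:a.example.org:specimens:2": {"taxon": "Fagus sylvatica", "country": "GB"},
}
PROVIDER_B = {
    "urn:lsid:b.example.org:specimens:7": {"taxon": "Quercus robur", "country": "FR"},
}


def build_taxon_index(providers):
    """Crawl every provider and map each taxon name to the GUIDs that carry it."""
    index = defaultdict(list)
    for holdings in providers:
        for guid, metadata in holdings.items():
            index[metadata["taxon"]].append(guid)
    return index


def resolve(guid, providers):
    """Fetch the full record for a GUID from whichever provider owns it."""
    for holdings in providers:
        if guid in holdings:
            return holdings[guid]
    raise KeyError(guid)


providers = [PROVIDER_A, PROVIDER_B]
index = build_taxon_index(providers)

# The consumer asks the indexer for GUIDs, then goes back to the providers for detail.
for guid in index["Quercus robur"]:
    print(guid, resolve(guid, providers)["country"])
```
</pre>
The indexer holds only what is needed to find objects; the full record always comes from the provider that owns the GUID.<br>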

      <br>

Currently we are trying to make every data provider support searching

and querying when the consumers aren't really interested in querying or

searching individual providers - they want to search thematically

across

providers.</div>

    </blockquote>

    <div><br>

Restated, this sentence may fall in my class of questions forbidden to

software architects, namely&nbsp; that class of questions that begin with

the words "Why would anybody ever want to ..." <br>

    </div>

    </span></div>

    <div style="direction: ltr;"> </div>

I should restate it "What is the use case that indicates the system

should support this behavior?"<br>

    <blockquote

 cite="http://mid16957b040603021857p56e9075fvc0fbb169b669c8fd@mail.gmail.com"

 type="cite">

      <div style="direction: ltr;"><span class="q"><br>

      <blockquote class="gmail_quote"

 style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

        <div style="direction: ltr;">If a big data provider wants to

provide search and query then

they can set themselves up as both a provider and an

indexer - which is more or less what everyone is forced to do now - but

the functions are separate.<br>

        <br>

Data providers would have to implement a little more than just an LSID

resolver service for this to work. They would need to provide a single

web service

method (URL call) that allowed indexers to get lists of LSIDs they hold

that have had their (meta)data modified since a certain date but this

would be a relatively simple thing compared with providing arbitrary

query facilities.<br>
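<br>
A minimal sketch of such a "modified since" method might look like the following. It assumes a hypothetical in-memory store of LSIDs and modification dates, and it treats "since" as "on or after"; the function name lsids_modified_since and the example.org identifiers are invented for illustration and are not taken from any LSID or TDWG specification:<br>
<pre>
```python
from datetime import date

# Hypothetical in-memory store: LSID -> date its (meta)data last changed.
# The LSIDs and dates below are made up for illustration.
RECORDS = {
    "urn:lsid:example.org:specimens:1": date(2006, 1, 15),
    "urn:lsid:example.org:specimens:2": date(2006, 2, 20),
    "urn:lsid:example.org:specimens:3": date(2006, 3, 1),
}


def lsids_modified_since(since, records=RECORDS):
    """Return, sorted, the LSIDs whose (meta)data changed on or after `since`."""
    return sorted(lsid for lsid, modified in records.items() if modified >= since)


# An indexer that last crawled on 1 February 2006 asks only for newer records.
changed = lsids_modified_since(date(2006, 2, 1))
print(changed)
```
</pre>
Because the indexer only ever asks what has changed since its last visit, the provider would need no general query engine behind this call.<br>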

        <br>

I believe (though I haven't done a thorough analysis of log data) that

this is more or less the situation now. Data providers implement

complete DiGIR or BioCASE protocols but are only queried in a limited

way by portal engines. Consumers go directly to portals for their data

discovery. So why implement full search and query at the data provider

nodes of the network (possibly the hardest thing we have to do) when it

may not be used?<br>

        <br>

This may be controversial. What do you think?</div>

      </blockquote>

      <div><br>

      <br>

I'm not sure about controversial, but I am pretty sure that what you

are pointing at is a warehouse model. I don't know if I am&nbsp; prepared to

agree that&nbsp; all possible present and future concerns&nbsp; of TDWG&nbsp; can be

answered by data warehouses.&nbsp; In particular, if you analyse log data of

a warehouse, it won't be too surprising if the conclusion is that users

are behaving as though they mainly need a warehouse. [To data consumers

a warehouse and a portal are indistinguishable. I think.] <br>

      <br>

      </div>

      </span></div>

      <div style="direction: ltr;"> </div>

    </blockquote>

This is why I use the term 'indexer' rather than aggregator. The

analogy with web search engines is a good one. Basically we have to

implement aggregated indexes for key data (although federated searching

by crawling all the providers is theoretically possible if you are not

in a hurry). The question I raise is whether we also need to implement

querying in every provider.</div>

  </blockquote>

  <div><br>

  <br>

  <span style="color: rgb(0, 153, 0);">Maybe not.&nbsp; What would alarm me

though, is if we do something that <span style="font-style: italic;">precludes</span>

it or even makes it hard. I could grudgingly live with a position that

TDWG's service function definitions are all about aggregation. But the <span

 style="font-style: italic;"><span style="font-style: italic;">data

exchange standards</span></span> had better not distinguish aggregators

from originators from transformers except for providing those actors

with the ability to identify their role and point of view.

  <br>

  <br>

  </span></div>

  <br>

  <blockquote class="gmail_quote"

 style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

    <div style="direction: ltr;"><span class="e"

 id="q_109c02eef1138b4f_21">

    <blockquote

 cite="http://mid16957b040603021857p56e9075fvc0fbb169b669c8fd@mail.gmail.com"

 type="cite">

      <div>

      <div>Bob Morris<br>

      </div>

      <br>

      <blockquote class="gmail_quote"

 style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

        <div style="direction: ltr;">Roger<br>

        <br>

        <pre cols="72">-- 

-------------------------------------

 Roger Hyam

 Technical Architect

 Taxonomic Databases Working Group

-------------------------------------

 <a href="http://www.tdwg.org" target="_blank">http://www.tdwg.org</a>

 <a href="mailto:roger@tdwg.org" target="_blank">roger@tdwg.org</a>

 +44 1578 722782

-------------------------------------

    </pre>

        </div>

        <br>

_______________________________________________<br>

Tdwg-tag mailing list<br>

        <a href="mailto:Tdwg-tag@lists.tdwg.org" target="_blank">Tdwg-tag@lists.tdwg.org</a><br>

        <a href="http://lists.tdwg.org/mailman/listinfo/tdwg-tag_lists.tdwg.org" target="_blank">http://lists.tdwg.org/mailman/listinfo/tdwg-tag_lists.tdwg.org</a><br>

        <br>

        <br>

      </blockquote>

      </div>

      <br>

    </blockquote>

    <br>

    </span></div>

  </blockquote>

  </div>

  <br>

</blockquote>

<br>

<br>

<pre class="moz-signature" cols="72">-- 

-------------------------------------

 Roger Hyam

 Technical Architect

 Taxonomic Databases Working Group

-------------------------------------

 <a class="moz-txt-link-freetext" href="http://www.tdwg.org">http://www.tdwg.org</a>

 <a class="moz-txt-link-abbreviated" href="mailto:roger@tdwg.org">roger@tdwg.org</a>

 +44 1578 722782

-------------------------------------

</pre>

</body>

</html>