Re: [Tdwg-tag] Why should data providers supply search and query services?
Roger,
I really think it's time to discuss the issue of indexers and searchable providers. And I have to say I mostly agree with your analysis, although taken to its conclusion it would mean abandoning our DiGIR, BioCASE and TAPIR protocols. Instead we could go for OAI or a similar standard. Or we could create our own minimal TDWG sync protocol to retrieve lists of changed records.
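To make the sync idea concrete: the protocol could be as small as OAI-PMH's ListIdentifiers verb. Here is a minimal harvesting sketch in Python - the provider URL is hypothetical, and the sketch illustrates the idea rather than any agreed TDWG protocol:

```python
# Minimal changed-records harvest, modelled on OAI-PMH ListIdentifiers.
from urllib.request import urlopen
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
BASE_URL = "http://provider.example.org/oai"  # hypothetical endpoint

def changed_records(since):
    """Yield (identifier, datestamp) for records modified since a date."""
    params = {"verb": "ListIdentifiers", "metadataPrefix": "oai_dc",
              "from": since}
    while True:
        with urlopen(BASE_URL + "?" + urlencode(params)) as response:
            tree = ET.parse(response)
        for header in tree.iter(OAI + "header"):
            yield (header.findtext(OAI + "identifier"),
                   header.findtext(OAI + "datestamp"))
        # OAI-PMH pages large result sets with a resumption token
        token = tree.findtext(".//" + OAI + "resumptionToken")
        if not token:
            break
        params = {"verb": "ListIdentifiers", "resumptionToken": token}

for guid, stamp in changed_records("2006-01-01"):
    print(guid, stamp)
```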
Historically we created search protocols to avoid central caches; mainly so as not to scare providers, and to convince them to publish their data while keeping control of it. But that argument is probably gone by now.
Although there is a need to discuss these things in the architecture group, we should be careful that TDWG does not become a synonym for GBIF. How people use TDWG standards should be left open to some degree. Someone might want to set up searchable providers, others not. My feeling is that this might actually differ depending on the kind of data (objects) being exchanged.
As it looks to me, we currently cannot create a reliable and fast fully distributed system, so the initial DiGIR-and-co dream has somehow failed. For a small number of providers with a good server infrastructure and relatively small data it's a different matter. But in general, for many if not all applications, we will have to have some kind of local cache.
A huge warehouse keeping all our data is quite challenging though. I doubt we could fill a single system with all our thousands of concepts and 100 million records. In RDF we would end up with several billion triples! But as soon as we start selecting subsets of relevant attributes (concepts), interpreting incoming data to harmonize it, or removing "obvious" errors, we render the cache useless for other applications. Update intervals and reliability are also issues that freak out clients. So in the worst case we might end up having a separate cache for every different application, and the heavy burden of indexing becomes a problem for most of our clients! I am not suggesting that this is wrong, but I don't think the indexing problem will touch only a few.
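(The rough arithmetic behind that triple count, sketched with the rdflib library; the concept namespace and LSID below are stand-ins, and ~30 mapped concepts per record is only an assumed average:)

```python
# One specimen record fans out into roughly one triple per mapped
# concept, so 100 million records x ~30 concepts is on the order of
# billions of triples. Namespace and LSID are illustrative only.
from rdflib import Graph, Literal, Namespace, URIRef

DWC = Namespace("http://example.org/dwc/")  # stand-in concept namespace
record = URIRef("urn:lsid:example.org:specimens:12345")

g = Graph()
g.add((record, DWC.scientificName, Literal("Abies alba Mill.")))
g.add((record, DWC.country, Literal("Germany")))
g.add((record, DWC.collector, Literal("A. Collector")))
# ... and so on, ~30 triples per record in practice

print(g.serialize(format="turtle"))
print(f"~{100_000_000 * 30:,} triples for the whole network")
```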
Searchable providers allow us to do tests in advance and create ad-hoc networks that don't need central infrastructure. Many providers are currently being convinced to publish their data just because we bundle local portals with the software. They immediately have a search interface on the web, which is also great for local networks, especially in smaller institutions with few IT resources. This is only possible via the generic search interface we provide with DiGIR, BioCASE or TAPIR.
From a purely technical point of view I would probably argue for LSIDs, RDF and OAI. But if you think about the consequences, it would mean starting from scratch. It's a huge change and I can't think of much that would stay the same. I am afraid that we simply can't cope with that, and by the time we are close to having a productive system it may turn out there are better ways of doing this. Maybe distributed XQuery will be simple to use by then? Or a SPARQL server in front of our providers will be easy to set up and fast to use? Or GRID will finally take over all of us! Who knows.
So for sure we have to address all these issues with a wider audience. Many people outside the TAG or GUID group are not aware of the proposed changes and would be really surprised to learn of the consequences. Still, I hope we can make a transition to RDF at some point. Maybe the solution lies in a combination of both technologies for some time? Integrating TAPIR, LSID resolution and OAI(?) into DiGIR2, PyWrapper and the GBIF indexing would smooth the transition.
I am sure there's more to say, but the weekend is approaching and I think I have consumed too much of your attention already.
Markus
-----Original Message----- From: Tdwg-tag-bounces@lists.tdwg.org [mailto:Tdwg-tag-bounces@lists.tdwg.org] On Behalf Of Roger Hyam Sent: Friday, 3 March 2006 15:44 To: Bob Morris Cc: Tdwg-tag@lists.tdwg.org Subject: Re: [Tdwg-tag] Why should data providers supply search and query services?
Bob,
Great points on provenance and maintaining standards whether they are moving between provider, aggregator, indexer, client or whatever. GUIDs should be a help in tracking provenance but we do need policies on what aggregators can do with objects...
I don't think my suggestions preclude anything. Perhaps they can be summed up as a suggestion that providers, indexers, and search and query services should be modeled as separate actors within the architecture. Some software instances may play the roles of more than one of these actors, but this is not compulsory.
Roger
Bob Morris wrote:
On 3/3/06, Roger Hyam roger@tdwg.org wrote:
Bob Morris wrote:

Umm... there is a distinguishable class of data consumers, namely applications, and so a distinguishable constituency whose burden is relevant, namely application writers. Some applications may well be motivated to query providers directly for a number of reasons, including:

* the data indexer's currency policies may be unsuitable

This applies equally to data providers. They may not index data in a way the consumer requires, it may lag behind their own live data set, etc.
I agree completely on this and your other dittos. It's typically hard to figure out whether something is an aggregator or an originator. This is the oft-discussed issue of "data provenance", which is quite difficult to establish on a per-record basis. In the (defunct?) UBIF schema there is a weak attempt to record how, or at least whether, a record evolved from its originator. Furthermore, the history of that evolution, were it understood (by a machine!), could prove quite useful to an application, which may well find it interesting to incorporate the wisdom of intermediaries and find that some of them provide a better view of a given record than others do, possibly even including the originator. As a simple example, it could be quite convenient if an intermediary could, by some clever processing, establish that some datum in a record is inconsistent with some other in the same record, and could record that fact in its forwarding metadata. Really, my vision here is machines as scholars. I don't suggest TDWG should attempt to accomplish that. I merely say that if that is one's vision, then one buries fewer difficult-to-extract assumptions in the modeling. I think this is the real point of my arguments: how to recognize all the "gotchas" in one's models and make sure they are acknowledged enough that others can deal with them. ["Gotcha" is an Americanism(?) contracted from "I got you!", typically uttered to the victim of a practical joke who has been successfully blind-sided.] [As an aside, I note that the much-vaunted data-information-knowledge pyramid is actually cited as data-information-knowledge-wisdom by some authors. Scientists too often stop at "knowledge" because "wisdom" seems too hard to define and perhaps a little too uncomfortable to assert about oneself.]
* the data indexers may aggregate in undesirable ways [the present model seems to be that indexer == portal, but I doubt that is general]

Ditto from the point above. Data suppliers may index in undesirable ways, plus they might index heterogeneously - each supplier may be undesirable in different ways - which would be a really big headache. Is there anything to say this will cause less of a burden when spread across many providers rather than a few indexers? If a thematic indexer doesn't do what is required then it may be possible to get something changed. If 50 suppliers don't index something correctly then it will no doubt take years to get any changes effected - especially if they are all doing it wrong differently.
This might also be addressed by good provenance trails in the data. [Iterate this sentiment for all your dittos...]
* the data indexers may index too promiscuously or not promiscuously enough for the application's taste [this might be a non-issue if there were a way for a machine to understand what exactly the indexing strategy is, and perhaps how to induce the indexer to alter it, but that sounds hard]

Again ditto. If providers are also indexers then any criticism of problems with indexing has to apply to the suppliers, but is magnified by the number of suppliers.
* portals, and maybe indexers---indeed, any processor of the data---can intentionally or inadvertently hide assumptions about how the data will be used, making it unsuited for uses that don't meet those assumptions. Put another way, it is probably difficult to ensure that a machine-enforceable contract is possible between aggregators and applications that assures the application that records obtained from the aggregator are identical to those available from the provider. I think it is even a deep problem to have machine-understandable "fitness for use" metadata that would allow a machine to understand what fitness contract the aggregator is actually offering.

I would assume that the aggregator is assembling metadata (in the sense of things that can be searched on) rather than actual data. The aggregator/indexer is really only providing a GUID discovery service. The consumer can always retrieve the original objects from the data supplier. The aggregator/indexer is only providing a match-making service.
As to "only", I agree for indexers but doubt it for aggregators. Sometimes.
In general it should never be harder to query providers than aggregators, especially if it is difficult for a machine to understand what, if any, point of view the aggregator has imposed on the view they offer of the aggregated data.
I don't believe this follows from your points above: I frequently go to websites and can't find what I want, so I go to Google and do a search restricting its scope to just that site. Indeed Google provides this as a service - just embed a search box on your site that passes the right parameters. In this situation it is definitely easier to query the aggregator than the supplier. Indeed many sites don't bother providing search services other than Google (which is precisely the point I am making). The alternative is that every tin-pot website has to have an implementation of the Google search algorithm and indexes within it. (I appreciate that this is a human example but it translates to a machine world. A data provider's metadata could easily give the location of web services that query it but are not actually part of the provider itself. Indeed it could offer a list of services. A neat place to do this would be in the WSDL returned by an LSID authority.)
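(The "right parameters" boil down to Google's site: operator in the q parameter - a one-function sketch:)

```python
from urllib.parse import urlencode

def site_search_url(site, terms):
    # Google's site: operator restricts results to a single host
    return "https://www.google.com/search?" + urlencode(
        {"q": f"site:{site} {terms}"})

print(site_search_url("lists.tdwg.org", "LSID resolver"))
```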
Good point. Google deserves thought. If it is an aggregator other than trivially, it is certainly one with a point of view, a hint of which can be seen in their cached pages, where they helpfully add to the data by highlighting the search terms. Who asked for that? Not me. But I don't seem to be offered a choice about it. Conversely, someone who wants to take advantage of Google's wisdom in this regard may actually find their view more useful than the originator's. Indeed, I frequently go to the original page and am then frustrated by the weak Firefox search facility when I try to figure out where in the original I should be looking. But if I use the Google cache, I may be at the mercy of their currency policies. This often makes it not so useful when searching poorly threaded archives such as email archives - if the discussion is old, it is sometimes the case that the answer is in the originator but hard to find, yet not in the cache where it would be easy to find.
People are no doubt tired of hearing this from me, but my position is always that modeling data consumers as humans is dangerously constricting. Humans are too smart and readily deal with lots of violations of the principle of least amazement, whereas machines don't. In point of fact, except for those on paper, stone, clay tablets and the like, there is no such thing as a database accessed by a human: they all have software between the human and the data provision service. From this I conclude that, in your trinity below, reduction of the burden on humans actually falls to the applications, and so I think the TAG's requirement is to reduce the burden on application writers (including those of TDWG itself, but also all others in the world) in their quest to reduce the burden on human data consumers. My intuition is that this will lead to a different analysis than thinking about humans as consumers, but at the moment I have no specific examples to offer.
I think this is a really good point and will take it forward. I hope to start the TAG meeting with a discussion of Actors within our domain and will attempt to differentiate client-human from client-machine within this.
I often muse upon the fact that the UML Actor symbol doesn't distinguish human from non-human actors. There are good and bad aspects of that. Good when you are modeling a software system. Bad when there are actually humans who can push the buttons. [Or maybe it's really good if you are constantly aware that humans behave unexpectedly. Keeping that in mind is the real point about my "forbidden questions"].
A little more is interspersed below. On 3/1/06, Roger Hyam roger@tdwg.org wrote:
This is a slightly more controversial question that has been suggested: "Why should data providers supply search and query services?"
* We have many potential data providers (potentially every collection and institution).
* We have many potential data consumers (potentially every researcher with a laptop).
* We have a few potential data indexers (GBIF, ORBIS, etc. + others to come).
The implementation burden should therefore be:
* Light for the providers - whose role is to conserve data and physical objects.
* Light for the consumers - whose role is to do research, not mess with data handling.
* Heavy for the indexers - whose core business is making the data accessible.
Data providers should give the objects they curate GUIDs. This is important because it stamps their ownership of (and responsibility for) that piece of data. They then need to run an LSID service that serves the (meta)data for the objects they own. Their work should stop at this point! They should not have to implement search and query services. They should not anticipate what people will require by way of data access - that is a separate function.

Data consumers should be able to access indexing services that pool information from multiple data providers. They should not have to run federated queries across multiple data providers or have to discover providers, as this is complex and difficult (though they may want to browse round data providers as they would browse links on web pages). Once they have retrieved the GUIDs of the objects they are interested in from the indexers, they may want to call the data providers for more detailed information.

Data indexers should crawl the data exposed by the providers and index it in thematic ways, e.g. provide geographically or taxonomically focused services. This is a complex job, as it involves doing clever, innovative things with data, optimisation of searches, etc. Currently we are trying to make every data provider support searching and querying when the consumers aren't really interested in querying or searching individual providers - they want to search thematically across providers.
Restated, the last sentence above may fall in my class of questions forbidden to software architects, namely the class of questions that begin with the words "Why would anybody ever want to ...". I should restate it as "What is the use case that indicates the system should support this behavior?"
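(An aside on the LSIDs Roger proposes above: the syntax is urn:lsid:authority:namespace:object, with an optional trailing revision, so the authority field is exactly where ownership gets stamped. A tiny parsing sketch - the example LSID itself is made up:)

```python
def parse_lsid(lsid: str) -> dict:
    """Split an LSID into the fields an indexer would care about."""
    parts = lsid.split(":")
    if len(parts) not in (5, 6) or [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
        raise ValueError(f"not an LSID: {lsid!r}")
    fields = dict(zip(("authority", "namespace", "object"), parts[2:5]))
    fields["revision"] = parts[5] if len(parts) == 6 else None
    return fields

print(parse_lsid("urn:lsid:nhm.example.org:specimens:B1000234"))
# -> {'authority': 'nhm.example.org', 'namespace': 'specimens',
#     'object': 'B1000234', 'revision': None}
```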
If a big data provider wants to provide search and query then they can set themselves up as both a provider and an indexer - which is more or less what everyone is forced to do now - but the functions are separate.

Data providers would have to implement a little more than just an LSID resolver service for this to work. They would need to provide a single web service method (URL call) that allowed indexers to get lists of the LSIDs they hold whose (meta)data has been modified since a certain date, but this would be relatively simple compared with providing arbitrary query facilities.

I believe (though I haven't done a thorough analysis of log data) that this is more or less the situation now. Data providers implement complete DiGIR or BioCASE protocols but are only queried in a limited way by portal engines. Consumers go directly to portals for their data discovery. So why implement full search and query at the data provider nodes of the network (possibly the hardest thing we have to do) when it may not be used? This may be controversial. What do you think?
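The "single web service method" could be as small as the following sketch, using only Python's standard library; the path, parameter name and in-memory record store are all hypothetical:

```python
# One URL that returns the LSIDs whose (meta)data changed since a date,
# e.g. GET http://localhost:8080/changed?since=2006-02-15
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

RECORDS = {  # stand-in for the provider's real database
    "urn:lsid:example.org:specimens:1": "2006-02-11",
    "urn:lsid:example.org:specimens:2": "2006-03-01",
}

class ChangedSinceHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        since = parse_qs(urlparse(self.path).query).get("since", ["0000-00-00"])[0]
        # ISO dates compare correctly as plain strings
        changed = [lsid for lsid, stamp in RECORDS.items() if stamp >= since]
        body = "\n".join(changed).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8080), ChangedSinceHandler).serve_forever()
```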
I'm not sure about controversial, but I am pretty sure that what you are pointing at is a warehouse model. I don't know if I am prepared to agree that all possible present and future concerns of TDWG can be answered by data warehouses. In particular, if you analyse log data of a warehouse, it won't be too surprising if the conclusion is that users are behaving as though they mainly need a warehouse. [To data consumers a warehouse and a portal are indistinguishable. I think.]
This is why I use the term 'indexer' rather than 'aggregator'. The analogy with web search engines is a good one. Basically we have to implement aggregated indexes for key data (although federated searching by crawling all the providers is theoretically possible if you are not in a hurry). The question I raise is whether we also need to implement querying in every provider.
Maybe not. What would alarm me though, is if we do something that precludes it or even makes it hard. I could grudgingly live with a position that TDWG's service function definitions are all about aggregation. But the data exchange standards had better not distinguish aggregators from originators from transformers except for providing those actors with the ability to identify their role and point of view.
Bob Morris
Roger -- ------------------------------------- Roger Hyam Technical Architect Taxonomic Databases Working Group ------------------------------------- http://www.tdwg.org roger@tdwg.org +44 1578 722782 -------------------------------------
Hi,
There are many reasons why I think data providers should be more capable than just simple interfaces to indexers. Some of them have already been pointed out by Markus and Bob, but I would like to add a non-technical reason, one very much related to what TDWG is good for.
While making the biological collection databases available for GBIF indexing, I think we are also helping them in their daily work. Some databases are setting up their own web interface based on BioCASe, there are projects to help them georeference their specimens based on the provider software, they have the possibility to export their collection database and import it into another collection management system, and many other useful possibilities will hopefully appear in the next years. These solutions are possible because the software installed on their servers is capable of doing searches and queries. So my argument here is that by setting up a query level and installing capable software at the providers, we are directly improving these databases.
I have used this argument for a while already when convincing data providers: by joining GBIF they are not only making their data available to the community but will also benefit from the tools that are appearing for them based on TDWG standards. I think it is a good deal: make your data available and we will help you improve it with standard tools from the community at no cost.
There are also many people who do not want to share their data, especially researchers, who can also benefit from our software and standards without having to participate in any network. If we create good and useful software they might consider using it to handle their data, and at some point maybe open it to the public.
This is also somehow related to what I call the OAI "model" versus the OGC "model". OAI is helping and promoting access to data in distributed databases, while OGC is an organisation promoting the interoperability of applications. While OAI is focused on making access to the data as good as possible (to set up value-added services on top of cached databases), the OGC community is working on making software interoperable, extensible and open to new uses that they might not know of now.
I like to think that GBIF is like OAI, and that instead of creating its own technology to achieve its goals it is using TDWG's. I tend to think of TDWG, on the other hand, more like OGC. So GBIF is just one user of TDWG's work.
So my vote goes for more sophisticated data providers that allow us to construct more things on top of them without having to consider GBIF at all. TAPIR looks fine to me for this task, even more so if complemented with the TAPIR "Lite" idea for providers that just want to contribute to GBIF.
Best regards,
Javier.
I agree that there are many good reasons for protocols such as TAPIR that go beyond the need for harvesting information. Using DiGIR/BioCASe/TAPIR to map a relational database to a common form such as DwC or ABCD can serve as the basis for simple mapping of these same data for BioMOBY, WFS and other web services and search interfaces. This is the basic approach that IPGRI is following to bridge between their own network of data resources and toolkits for molecular analysis, etc. TAPIR providers can serve as general-purpose query tools which can be used to standardise underlying data models to a common form. It is then simple to map other services against the common form rather than against all the varying underlying data models (see the sketch below).
Donald
--------------------------------------------------------------- Donald Hobern (dhobern@gbif.org) Programme Officer for Data Access and Database Interoperability Global Biodiversity Information Facility Secretariat Universitetsparken 15, DK-2100 Copenhagen, Denmark Tel: +45-35321483 Mobile: +45-28751483 Fax: +45-35321480 ---------------------------------------------------------------
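A sketch of the mapping idea Donald describes: translate the local schema to the common form once, then write every downstream service against that form. The column names, concept names and sample row are illustrative, not part of any actual TAPIR configuration:

```python
# Map local database columns to shared (Darwin Core-like) concepts once.
COLUMN_TO_CONCEPT = {
    "sci_name": "ScientificName",
    "coll_date": "EarliestDateCollected",
    "country_code": "Country",
}

def to_common_form(row: dict) -> dict:
    """Translate one local database row into the shared vocabulary."""
    return {concept: row[column]
            for column, concept in COLUMN_TO_CONCEPT.items()
            if column in row}

local_row = {"sci_name": "Abies alba Mill.",
             "coll_date": "1998-07-21",
             "country_code": "DE"}

# Downstream services (BioMOBY, WFS, a local search page...) are written
# against the common form, never against the provider's own column names.
print(to_common_form(local_row))
```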
Hi Javier,
Thanks for your points. My responses below. Including a bit about GML that should really be in a different thread!
Javier de la Torre wrote:
Hi,
There are many reasons why I think data providers should be more capable than just simple interfaces to indexers. Some of them have already been pointed out by Markus and Bob, but I would like to add a non-technical reason, one very much related to what TDWG is good for.
While making the biological collection databases available for GBIF indexing, I think we are also helping them in their daily work. Some databases are setting up their own web interface based on BioCASe, there are projects to help them georeference their specimens based on the provider software, they have the possibility to export their collection database and import it into another collection management system, and many other useful possibilities will hopefully appear in the next years. These solutions are possible because the software installed on their servers is capable of doing searches and queries. So my argument here is that by setting up a query level and installing capable software at the providers, we are directly improving these databases.
But surely these people can search their own database already. If you are providing a cheap and easy web interface for them then that is a tangible benefit, but it could equally be done centrally with branding for the institution. It doesn't physically have to reside with them and be maintained by them - though that may be best for many organisations.
I have used this argument for a while already when convincing data providers: by joining GBIF they are not only making their data available to the community but will also benefit from the tools that are appearing for them based on TDWG standards. I think it is a good deal: make your data available and we will help you improve it with standard tools from the community at no cost.
This is just as available if the indexing is outsourced by the data owner, I believe, but it is difficult to discuss abstractly here.
There are also many people who do not want to share their data, especially researchers, who can also benefit from our software and standards without having to participate in any network. If we create good and useful software they might consider using it to handle their data, and at some point maybe open it to the public.
That might be a really nice side effect of our activities, but as it is based on serendipity (or good karma perhaps!) it is not something that we can plan for.
This is also somehow related to what I call the OAI "model" versus the OGC "model". OAI is helping and promoting access to data in distributed databases, while OGC is an organisation promoting the interoperability of applications. While OAI is focused on making access to the data as good as possible (to set up value-added services on top of cached databases), the OGC community is working on making software interoperable, extensible and open to new uses that they might not know of now.
I am not sure that either approach works in isolation. If I want to define a GML application schema that contains a definition of a bridge, for example, I am not sure how I relate this to all the other people who have defined (or may define in the future) bridge-like things. How do I write an application that will 'understand' not only my bridge feature but also any other features out there that are bridge-like but that I am not aware of just now? I can see how we get interoperability if we all agree to use the same application schema, and I can see that we can use the same software for multiple application schemas, but we want *data* interoperability, not just *software* interoperability.
Here is an example: the British Ordnance Survey have their own GML Application Schema and it defines a "FerryLink" that extends from their own abstract feature type and back to GML feature. So I can treat it like a GML feature in an application - very useful - this is software interoperability. The trouble is I have no way of knowing that it has to do with ferries and water, or anything else useful about it, unless I can read English - which machines can't. (Incidentally there is no documentation in the schema, so I can't retrieve it and display it to the user automatically either.) Presumably other mapping agencies are also encoding ferry links. In fact the instances of ferry routes that are encoded using this schema may join the UK to countries that use different GML Application Schemas to define the *same* physical ferry links!
My understanding is that the GML model does not give me a way to discover this or express it once I know that there are two ways of talking about the same physical objects.
You can get the OS application schemas here:
http://www.ordnancesurvey.co.uk/oswebsite/xml/schema/index.html
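Roger's point in miniature: run the following over a simplified, made-up schema fragment in the Ordnance Survey style, and the only machine-readable fact that comes out is the type hierarchy - nothing says "this is about ferries":

```python
# All a machine can extract from an application schema like this
# (simplified and made up) is the type hierarchy.
import xml.etree.ElementTree as ET

SCHEMA_FRAGMENT = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:gml="http://www.opengis.net/gml">
  <xs:element name="FerryLink" type="osgb:FerryLinkType"
              substitutionGroup="gml:_Feature"
              xmlns:osgb="http://example.org/osgb"/>
</xs:schema>
"""

XS = "{http://www.w3.org/2001/XMLSchema}"
root = ET.fromstring(SCHEMA_FRAGMENT)
for element in root.iter(XS + "element"):
    print(element.get("name"), "is a", element.get("substitutionGroup"))
# prints: FerryLink is a gml:_Feature - and that is all a machine knows
```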
My knowledge of OGC standards is limited and I may be wrong on this, so I stand to be corrected.
We are building a global system so we have to be able to reconcile different encodings of the same object types. GML does not solve this problem but might be useful in other ways. The OAI standards may be useful for finding stuff.
I like to think that GBIF is like OAI, and that instead of creating its own technology to achieve its goals it is using TDWG's. I tend to think of TDWG, on the other hand, more like OGC. So GBIF is just one user of TDWG's work.
So my vote goes for more sophisticated data providers that allow us to construct more things on top of them without having to consider GBIF at all. TAPIR looks fine to me for this task, even more so if complemented with the TAPIR "Lite" idea for providers that just want to contribute to GBIF.
I am sure we need both fat and thin providers but I also think we need to define the roles played by different actors within the network more formally - which I think we will in the near future.
Some good points,
Roger
Best regards,
Javier.
Roger wrote:
[...] We are building a global system so we have to be able to reconcile different encodings of the same object types.
Bob Morris replies:
I don't see anything in the TDWG Constitution that calls for a "global system". Any discussion of system building surely represents an interpretation of Article 1, in which the only explicit activity mentioned is that TDWG "develops, adopts and promotes standards and guidelines for the recording and exchange of data about organisms". Whether building systems at all is within the purview of TDWG is probably beyond the mandate of the Secretariat to determine. If standards building is the focus instead of systems building, it does not follow logically that encodings have to be reconciled. I don't think standards bodies are obliged to reconcile their standards with other people's standards. Doing so could only fall within Article 1.b, in which TDWG "promotes [the standards'] use through the most appropriate and effective means." That is so vague as to not constitute a requirement, and so addressing it with architecture probably needs more agreement in the organization about what it really means.
I share with Javier a concern that the architecture discussions may be conflating TDWG with GBIF. I am not necessarily opposed to this, and I even suspect Article 1 may in fact need revision. I just doubt that, if conflation with the goals of GBIF is inadvertently happening, it is happening without consent of the membership.
Bob
I would also like to emphasise that I would be unhappy to see the TDWG architecture restricted to what GBIF needs (or perhaps what GBIF knows it needs today). I would far rather see TDWG as the developer of a robust information architecture which can support a wide range of applications now and does not depend on any particular piece of infrastructure being provided indefinitely by GBIF or any other party.
Donald
--------------------------------------------------------------- Donald Hobern (dhobern@gbif.org) Programme Officer for Data Access and Database Interoperability Global Biodiversity Information Facility Secretariat Universitetsparken 15, DK-2100 Copenhagen, Denmark Tel: +45-35321483 Mobile: +45-28751483 Fax: +45-35321480 ---------------------------------------------------------------
Hi Roger,
Thanks for your points. My responses below. Including a bit about GML that should really be in a different thread!
Yes, I also think the GML discussion should be moved elsewhere. I actually did not want to discuss GML but rather OGC standards in general. I know WFS is totally based on GML (for the moment), but perhaps the most successful OGC standard, WMS, has nothing to do with GML.
So my argument here is that by setting up a query level and installing capable software at the providers, we are directly improving these databases.
But surely these people can search their own database already. If you are providing a cheap and easy web interface for them then that is a tangible benefit, but it could equally be done centrally with branding for the institution. It doesn't physically have to reside with them and be maintained by them - though that may be best for many organisations.
Yes. I think most users would like to have the interface on their own servers; I don't think centralised solutions will work right now. Another reason to create these local interfaces is to let providers see their data the way it is going to be published on external sites. It would be great if the provider software could include a template for this QueryTool that looks exactly the way GBIF does. I remember trying that, but GBIF is not using XSLTs at the moment.
I have used this argument for a while already when convincing data providers: by joining GBIF they are not only making their data available to the community but will also benefit from the tools that are appearing for them based on TDWG standards. I think it is a good deal: make your data available and we will help you improve it with standard tools from the community at no cost.
This is just as available if the indexing is outsourced from the data owner I believe but is difficult to discuss abstractly here.
But remember also that some users would like to use this without having to share their data first.
Regarding the comments on GML application schemas: again, I did not want to start the discussion on this. I was talking more about the WMS, WFS, Catalogue, WPS, etc. standards. They are not dealing with the semantic or modelling problems; they are dealing with the software interoperability problem. I think this is also part of TDWG's business.
I am not enough of an expert on GML to answer some of your questions, but we should maybe try to ask someone who is involved. I know GML is not restricted to geographic features and that there is some movement in the direction of GML profiling. But again, this fits very well in another post.
Javier.
Hi Markus,
I'll try and catch up on this thread...
Döring, Markus wrote:
Roger,
I really think it's time to discuss the issue of indexers and searchable providers. And I have to say I mostly agree with your analysis, although taken to its conclusion it would mean abandoning our DiGIR, BioCASE and TAPIR protocols. Instead we could go for OAI or a similar standard. Or we could create our own minimal TDWG sync protocol to retrieve lists of changed records.
I am glad you understand what I was driving at.
Historically we created search protocols to avoid central caches; mainly so as not to scare providers, and to convince them to publish their data while keeping control of it. But that argument is probably gone by now.
I think that argument still exists. The motivation to demonstrate data ownership remains and mustn't be forgotten.
Although there is a need to discuss these things in the architecture group, we should be careful that TDWG does not become a synonym for GBIF. How people use TDWG standards should be left open to some degree.
I see the GBIF data portal as potentially ephemeral (sorry Donald). It is also a very general indexer that will never meet all needs. The question is: why are there not more data portals? I would argue that we need to make it as easy as possible for people to set up 'competitors' to GBIF. This implies an easy ability to crawl providers in some way.
Someone might want to set up searchable providers, others not. My feeling is that this might actually differ depending on the kind of data (objects) being exchanged.
Yes this is my point. We should separate the notion of publishing your data from providing associated search/query services for your data. A data owner may choose to provide both but they may also not.
As it looks to me, we currently cannot create a reliable and fast fully distributed system, so the initial DiGIR-and-co dream has somehow failed. For a small number of providers with a good server infrastructure and relatively small data it's a different matter. But in general, for many if not all applications, we will have to have some kind of local cache.
I take it you mean a system based on federated searches. This will no doubt remain problematic - indexes (not complete data caches) will be needed, even if they are hidden from consuming applications.
A huge warehouse keeping all our data is quite challenging though. I doubt we could fill a single system with all our thousands of concepts and 100 million records. In RDF we would end up with several billion triples! But as soon as we start selecting subsets of relevant attributes (concepts), interpreting incoming data to harmonize it, or removing "obvious" errors, we render the cache useless for other applications. Update intervals and reliability are also issues that freak out clients. So in the worst case we might end up having a separate cache for every different application, and the heavy burden of indexing becomes a problem for most of our clients! I am not suggesting that this is wrong, but I don't think the indexing problem will touch only a few.
Indexes are really quite different from data warehouses. Indexes contain ephemeral copies of (meta)data arranged in convenient ways. Warehouses store data for the long term in ways that are efficient for storage/retrieval rather than searching. You warehouse data when you think you might need it in the future but it isn't currently operationally important. My bank probably has my statements from 10 years ago in a data warehouse somewhere, but they are not easily accessible to me or to bank employees. My statement from last month is indexed and available within a fraction of a second though.
We should try and differentiate between indexing services and warehousing services, both of which might be important.
Searchable providers allow us to do tests in advance and create ad-hoc networks that don't need central infrastructure.
Ad-hoc networks could be provided by setting up indexers that just crawl a small number of suppliers, but I appreciate your point.
Many providers are currently being convinced to publish their data just because we bundle local portals with the software. They immediately have a search interface on the web, which is also great for local networks, especially in smaller institutions with few IT resources. This is only possible via the generic search interface we provide with DiGIR, BioCASE or TAPIR.
This is a very good point.
From a purely technical point of view I would probably argue for LSIDs, RDF and OAI. But if you think about the consequences, it would mean starting from scratch. It's a huge change and I can't think of much that would stay the same. I am afraid that we simply can't cope with that, and by the time we are close to having a productive system it may turn out there are better ways of doing this. Maybe distributed XQuery will be simple to use by then? Or a SPARQL server in front of our providers will be easy to set up and fast to use? Or GRID will finally take over all of us! Who knows.
This is what the TAG is for. We need to look ahead and steer. I would not advocate scrapping things that are useful.
So for sure we have to address all these issues with a wider audience. Many people outside the TAG or GUID group are not aware of the proposed changes and would be really surprised to learn of the consequences. Still, I hope we can make a transition to RDF at some point. Maybe the solution lies in a combination of both technologies for some time? Integrating TAPIR, LSID resolution and OAI(?) into DiGIR2, PyWrapper and the GBIF indexing would smooth the transition.
There are currently no proposed changes, though most people are joining the dots and reaching the same conclusions about where we might need to head. The role of the TAG meeting is to kick off the process of making and justifying some proposals.
Roger Hyam wrote:
[...] The question is: why are there not more data portals?
I would say that it is because, so far, mainly low-hanging fruit has been picked by the effort of the TDWG community. The number of relatively large specimen and observation record providers is pretty small. The providers are relatively large, have IT professionals on their staff, and have the dissemination of their data as part of their primary focus. Also, specimen records are mainly of interest to systematists, who are a relatively small(?) fraction of practicing biologists. If I had to make a (somewhat self-serving) guess, it would be that the largest number of electronic providers of data about species---including static web pages---are offering descriptive data, including images. Queries like "What is this?" and "What is its role in the ecosystem?" are probably asked by astronomically more consumers of biodiversity data than "Where is the type specimen?". Observation records may be some "mid-level hanging fruit" whose major providers share the IT sophistication of specimen record providers. To the best of my knowledge, among the big ones only NatureServe and the Cornell Lab of Ornithology are TDWG participants. Is the TAG listening to them?
In summary, I am slightly concerned that the experience with collection data may have the power to cloud our minds about what biodiversity data and its needed infrastructure are.
Bob