Hi Markus,
I'll try and catch up on this thread...
Döring, Markus wrote:
Roger,
I really think it's time to discuss the issue of indexers and searchable providers. And I have to say I mostly agree with your analysis, although it would consequently mean abandoning our DiGIR, BioCASE and TAPIR protocols. Instead we could go for OAI or some similar standard, or we could create our own minimal TDWG sync protocol to retrieve lists of changed records.
I am glad you understand what I was driving at.
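For illustration only, harvesting changed records with something like OAI-PMH might look like the sketch below. The endpoint URL is made up, and none of our wrappers offer such an interface today:

    # Minimal sketch of incremental harvesting over OAI-PMH. The endpoint
    # URL is hypothetical; no current provider exposes this interface.
    import urllib.request
    import xml.etree.ElementTree as ET

    OAI = '{http://www.openarchives.org/OAI/2.0/}'

    def harvest_changed(base, since):
        """Yield (identifier, datestamp, deleted) for records changed since a date."""
        url = base + '?verb=ListRecords&metadataPrefix=oai_dc&from=' + since
        while url:
            tree = ET.parse(urllib.request.urlopen(url))
            for header in tree.iter(OAI + 'header'):
                yield (header.findtext(OAI + 'identifier'),
                       header.findtext(OAI + 'datestamp'),
                       header.get('status') == 'deleted')
            # Large result sets are paged via a resumption token; an empty
            # or missing token element means we are done.
            token = tree.findtext('.//' + OAI + 'resumptionToken')
            url = base + '?verb=ListRecords&resumptionToken=' + token if token else None

    for ident, stamp, deleted in harvest_changed('http://example.org/oai', '2007-01-01'):
        print(ident, stamp, 'deleted' if deleted else 'changed')

The point is how little a consumer needs to know: one URL, one date, and a way to page through the answer.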
Historically we created search protocols to avoid central caches: mainly so as not to scare providers, and to convince them to publish their data while keeping control of it. But that argument has probably faded by now.
I think that argument still holds. The motivation to demonstrate data ownership remains and mustn't be forgotten.
Although there is a need to discuss these things in the architecture group, we should be careful that TDWG does not become a synonym for GBIF. How people use TDWG standards should be left open to some degree.
I see the GBIF data portal as potentially ephemeral (sorry Donald). It is also a very general indexer that will never meet all needs. The question is: why are there not more data portals? I would argue that we need to make it as easy as possible for people to set up 'competitors' to GBIF. This implies an easy way to crawl providers.
Some might want to set up searchable providers, others not. My feeling is that this might actually differ depending on the kind of data (objects) being exchanged.
Yes, this is my point. We should separate the notion of publishing your data from providing associated search/query services for it. A data owner may choose to provide both, but need not.
As it looks to me, we currently cannot create a reliable and fast fully distributed system, so the initial DiGIR-and-co dream has in a sense failed. For a small number of providers with good server infrastructure and relatively small data it is a different matter, but in general many if not all applications will need some kind of local cache.
I take it you mean a system based on federated searches. This will no doubt remain problematic: indexes (not complete data caches) will be needed, even if they are hidden from consuming applications.
A huge warehouse keeping all our data is quite challenging though. I doubt we could fill a single system with all our thousands of concepts and 100 million records. In RDF we will end up with several billion triples: 100 million records with, say, 30 populated concepts each is already 3 billion. But as soon as we start selecting subsets of relevant attributes (concepts), interpreting incoming data to harmonize it, or removing "obvious" errors, we render the cache useless for other applications. Update intervals and reliability are also issues that scare off clients. So in the worst case we might end up with a separate cache for every different application, and the heavy burden of indexing becomes a problem for most of our clients! I am not suggesting that this is wrong, but I don't think the indexing problem will touch only a few.
Indexes are really quite different from data warehouses. Indexes contain ephemeral copies of (meta)data arranged in convenient ways. Warehouses store data for the long term in ways that are efficient for storage and retrieval rather than searching. You warehouse data when you think you might need it in the future but it isn't currently operationally important. My bank probably has my statements from 10 years ago in a data warehouse somewhere, but they are not easily accessible to me or to bank employees. My statement from last month is indexed and available within a fraction of a second, though.
We should try to differentiate between indexing services and warehousing services, both of which might be important.
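As a toy illustration of the difference (not a design proposal; the data and file names are made up), the same records might feed both:

    # Toy contrast: an ephemeral search index versus a long-term warehouse.
    import gzip
    import json
    from collections import defaultdict

    records = [{'id': 'u1', 'name': 'Abies alba'},
               {'id': 'u2', 'name': 'Abies grandis'}]   # made-up data

    # Index: rebuilt at will, arranged for fast lookup (term -> record ids).
    index = defaultdict(set)
    for rec in records:
        for term in rec['name'].lower().split():
            index[term].add(rec['id'])
    print(sorted(index['abies']))                       # fast search: ['u1', 'u2']

    # Warehouse: append-only and compact, optimised for keeping, not searching.
    with gzip.open('warehouse.jsonl.gz', 'at') as archive:
        for rec in records:
            archive.write(json.dumps(rec) + '\n')

Throwing the index away costs nothing but a rebuild; throwing the warehouse away loses data.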
Searchable providers allow us to run tests in advance and create ad-hoc networks that don't need central infrastructure.
Ad-hoc networks could be provided by setting up indexers that crawl just a small number of suppliers, but I appreciate your point.
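To make that concrete, an ad-hoc indexer could be as little as a script that polls a handful of providers into a local database. A sketch, reusing the hypothetical harvest_changed() from earlier in this mail (the endpoints are invented):

    # Sketch of an ad-hoc indexer: poll a few providers and keep a local,
    # searchable copy of the record headers. Endpoints are hypothetical;
    # harvest_changed() is the OAI-PMH sketch above.
    import sqlite3

    PROVIDERS = ['http://example.org/herbarium/oai',
                 'http://example.net/zoology/oai']

    db = sqlite3.connect('adhoc_index.db')
    db.execute('CREATE TABLE IF NOT EXISTS records '
               '(provider TEXT, id TEXT, datestamp TEXT, PRIMARY KEY (provider, id))')
    for base in PROVIDERS:
        for ident, stamp, deleted in harvest_changed(base, '2007-01-01'):
            if deleted:
                db.execute('DELETE FROM records WHERE provider=? AND id=?', (base, ident))
            else:
                db.execute('INSERT OR REPLACE INTO records VALUES (?, ?, ?)',
                           (base, ident, stamp))
    db.commit()

Anyone who can run a script like that can stand up a small 'competitor' portal.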
Many providers are currently being convinced to publish their data just because we bundle local portals with the software. They immediately get a search interface on the web, which is also great for local networks, especially in smaller institutions with few IT resources. This is only possible via the generic search interface we provide with DiGIR, BioCASE or TAPIR.
This is a very good point.
From a purely technical point of view I would probably argue for LSIDs, RDF and OAI. But if you think about the consequences, it would mean starting from scratch. It's a huge change, and I can't think of much that would stay the same. I am afraid that we simply can't cope with that, and by the time we are close to having a productive system it may turn out there are better ways of doing it. Maybe distributed XQuery will be simple to use by then? Or a SPARQL server in front of our providers will be easy to set up and fast to use? Or GRID will finally have taken over all of us! Who knows.
This is what the TAG is for. We need to look ahead and steer. I would not advocate scrapping things that are useful.
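For concreteness, your SPARQL-server-in-front-of-a-provider idea might be queried something like this. Purely a sketch: the endpoint URL and the Darwin Core-style terms are assumptions, not an agreed vocabulary:

    # Hypothetical sketch of querying a SPARQL endpoint placed in front of
    # a provider database. Endpoint URL and vocabulary are assumptions.
    import json
    import urllib.parse
    import urllib.request

    QUERY = '''
    PREFIX dwc: <http://rs.tdwg.org/dwc/terms/>
    SELECT ?unit ?name WHERE {
      ?unit dwc:scientificName ?name ;
            dwc:country "Germany" .
    } LIMIT 25
    '''

    url = ('http://example.org/provider/sparql?' +
           urllib.parse.urlencode({'query': QUERY}))
    req = urllib.request.Request(url,
                                 headers={'Accept': 'application/sparql-results+json'})
    results = json.load(urllib.request.urlopen(req))
    for row in results['results']['bindings']:
        print(row['unit']['value'], row['name']['value'])

Whether providers could answer such queries quickly enough is exactly the open question.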
So for sure we have to address all these issues with a wider audience. Many people outside the TAG or GUID group are not aware of the proposed changes and would be really surprised to learn of the consequences. Still, I hope we can make a transition to RDF at some point. Maybe the solution lies in a combination of both technologies for some time? Integrating TAPIR, LSID resolution and OAI(?) into DiGIR2, PyWrapper and the GBIF indexing would smooth the transition.
There are currently no proposed changes, though most people are joining the dots and reaching the same conclusions about where we might need to head. The role of the TAG meeting is to kick off the process of making and justifying some proposals.