Hi Peter,
peter.hollas@thomson.com wrote:
- Is there any standard scheme for LSID discovery? i.e. Would it be a
good/bad idea to extend the LSID service to allow machine queries of LSIDs by taxon name rather than discovering them through the web interface?
Any comments and suggestions are very welcome!
Regards, Peter.
In my view, LSID is primarily a naming scheme. While it does allow for the resolution of data objects, it was not designed to support other common data access tasks such as discovery, search, query, or harvest. To support these additional data access tasks, I feel we ought to look to other standard or well-established protocols. Some protocols, like the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), are well suited to harvesting. Others, like the W3C's SPARQL protocol, are well suited to search and query. Unfortunately, there's no single protocol for working with RDF data that does everything we need. The problem for data providers is that each additional protocol they are asked to support places a further burden on them. At the same time, the easier a provider makes it to access their data, the more their data will be used. However, many data providing organizations wish to devote their resources to the creation and curation of data, rather than to the implementation of data access protocols.
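To make the taxon-name question concrete, here is a rough sketch of what a machine query for LSIDs by name could look like over the SPARQL protocol, which passes the query text in a `query` parameter on an HTTP GET. The `ex:taxonName` predicate, the `http://example.org/...` addresses, and both function names are placeholders invented for illustration; a real provider would use whatever vocabulary its RDF actually exposes.

```python
from urllib.parse import urlencode


def taxon_name_query(name):
    """Return a SPARQL SELECT that looks up subjects by an exact taxon name.

    The ex:taxonName predicate is a hypothetical placeholder, not a real
    vocabulary term.
    """
    # Escape backslashes and double quotes so the name is a valid literal.
    escaped = name.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX ex: <http://example.org/terms#>\n"
        "SELECT ?lsid WHERE {\n"
        f'  ?lsid ex:taxonName "{escaped}" .\n'
        "}"
    )


def sparql_get_url(endpoint, query):
    """Encode a query as a SPARQL protocol GET request URL."""
    return endpoint + "?" + urlencode({"query": query})


if __name__ == "__main__":
    # Hypothetical endpoint; substitute a real SPARQL service.
    print(sparql_get_url("http://example.org/sparql",
                         taxon_name_query("Puma concolor")))
```

An exact-match literal is the simplest case; partial or case-insensitive matching would need a FILTER, which tends to be slower on large stores.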
We're hoping to provide an end-to-end solution for this problem with the wasabi project (formerly known as DiGIR2). While the rest of this message is a bit of a plug for wasabi, it also illustrates the fact that different stakeholders in a data network have different requirements for data access.
The central idea behind wasabi is that data providing organizations generate RDF data objects, assign LSIDs to them, and drop them into a locally installed wasabi server. The wasabi server makes data objects available through a variety of standard data access protocols. It supports LSID-to-metadata resolution through a simple HTTP GET protocol and through a plugin to the IBM LSID resolver. The wasabi server also allows data aggregators to use OAI-PMH to efficiently fetch data objects in bulk so that they can be indexed for search. One benefit of the OAI protocol, especially the implementation in the wasabi server, is that it allows for incremental harvesting. Because it only sends the data objects that have changed since the last harvest, OAI-PMH decreases the load on a provider's server and allows for fast indexing of their data. Finally, the wasabi server provides direct SPARQL query access to data objects. Any time a wasabi server is queried (through OAI, SPARQL, or LSID resolution), the access is logged. This helps data providers keep track of who is using their data.
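As an illustration of incremental harvesting, the sketch below builds an OAI-PMH `ListRecords` request. The `verb`, `metadataPrefix`, and `from` arguments come from the OAI-PMH specification; the endpoint address and the function name are made up for the example, and `oai_dc` is used only because every OAI-PMH repository must support it (the prefix a wasabi server would use for its RDF is not stated here).

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix, last_harvest=None):
    """Build an OAI-PMH ListRecords request URL.

    If last_harvest (a UTC datestamp such as "2006-09-01") is given, it is
    sent as the "from" argument so the repository returns only records
    created, changed, or deleted since that date.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if last_harvest is not None:
        params["from"] = last_harvest
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical endpoint; a real harvester would read this from config.
    endpoint = "http://example.org/wasabi/oai"
    # First harvest: no datestamp, so everything is requested.
    print(list_records_url(endpoint, "oai_dc"))
    # Later harvests: only what changed since the stored datestamp.
    print(list_records_url(endpoint, "oai_dc", "2006-09-01"))
```

A complete harvester would also follow the `resumptionToken` that the repository returns when a result list is split across several responses, and would record the response date to use as the next `from` value.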
If a provider doesn't want to write a custom program that generates RDF/XML, they can use the wasabi server's synchronizer program, along with a concept mapping configuration file, to periodically connect to their database, transform its contents into RDF data objects, and load them into the wasabi server. The wasabi server will keep track of which objects are newly added, deleted, or updated.
The RDF server is only one component of the wasabi project. It also includes a library that implements the client side of the supported data access protocols. This makes it easy for people to write custom software to grab data from wasabi servers. Researchers who want to gather large amounts of data for analysis can use the client library to simplify the task.
Another important part of the wasabi project is the indexer. The indexer supports harvesting from multiple distributed wasabi servers. Harvested data are then pushed into the indexer, which can generate indices that are designed to support various types of queries. For example, the indexer can use a Google-style inverted index that is well suited to full-text queries, a database with geospatial extensions to support geographical queries, or even a triple store. The wasabi indexer could be used by large data aggregators like GBIF or by custom software developers. One example of the latter is developers of collections management software, who might want to index and cache data from TCS providers so that their users can associate specimens with taxon concepts through an easy-to-use desktop software package.
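For readers unfamiliar with the term, an inverted index maps each word to the set of records that contain it, so a full-text query becomes a set intersection. The toy version below is only meant to show the idea and is not how the wasabi indexer is implemented; the LSIDs and record text are invented sample data.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercased token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Copy the first posting set so the index itself is never modified.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


if __name__ == "__main__":
    # Invented sample records keyed by made-up LSIDs.
    records = {
        "urn:lsid:example.org:names:1": "Puma concolor Linnaeus 1771",
        "urn:lsid:example.org:names:2": "Puma yagouaroundi",
        "urn:lsid:example.org:names:3": "Felis concolor",
    }
    idx = build_inverted_index(records)
    print(search(idx, "puma concolor"))
```

A production index would add proper tokenization, stemming, and ranking, but the lookup structure is the same.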
Both the client library and the indexer allow computers to access and work with data. The wasabi project also provides an extensible web portal. Built over the client library and an index of data objects harvested from wasabi servers, the portal component allows data to be accessed by people through a customizable interface. It supports browsing, searching, and downloading of data objects. One use for the portal component might be as the public face of a thematic data network like FishNet2, MaNIS, OrNIS, or HerpNet.
The wasabi project is free and open source. It is implemented in Java. The server, client library, and indexer portions of the project are now in beta and we plan to release them by the end of the year. The portal is still under active development.
I'll be presenting wasabi at the TDWG meeting in St. Louis next month.
-Steve