= BioGeoSDI workshop - GeoInteroperability Testbed Pilot Project =

**Version:** 0.1
**Date:** %%mtime(%d %B %Y)

[[PageOutline()]]

== Documentation Editors ==

 * Javier de la Torre (jatorre [at] imaste-ips [dot] com)
 * Tim Sutton (tim [at] linfiniti [dot] com)
 * Bart Meganck (bart.meganck [at] africamuseum [dot] be)
 * Dave Vieglais (vieglais [at] ku [dot] edu)
 * Aimee Stewart (astewart [at] ku [dot] edu)
 * Peter Brewer (p.w.brewer [at] reading [dot] ac [dot] uk)

== Document History ==

 * Initial template version: April 1st, 2007 (during the Campinas meeting)

== Copyright Notice ==

{{{
Permission to use, copy and distribute this document in any medium for any
purpose and without fee or royalty is hereby granted, provided that you include
attribution to the Taxonomic Database Working Group and the authors listed
above. We request that the authorship attribution be included in any software,
documents or other items or products that you create related to the
implementation of the contents of this document. This document is provided "as
is" and the copyright holders make no representations or warranties, express or
implied, including, but not limited to, warranties of merchantability, fitness
for any particular purpose, non-infringement, or title; that the contents of
the document are suitable for any purpose; nor that the implementation of such
contents will not infringe any third party patents, copyrights, trademarks, or
any other rights. TDWG and the authors will not be liable for any direct,
indirect, special or consequential damages arising out of any use of the
document or the performance or implementation of the contents thereof.
}}}

== Abstract ==

A week-long workshop was held in Campinas, Brazil during the first week of April 2007. The focus of the workshop was to develop a testbed web application demonstrating the interoperability of digital data and services using open standards, with particular emphasis on geospatial, taxonomic and occurrence biodiversity data. Two prototype web applications were developed, in PHP and Flex. The wizard-style application leads the user through a defined sequence of steps in order to acquire sufficient data to create a niche model. The process includes taxonomic validation using the Catalogue of Life, search and retrieval of occurrence data using services such as the GBIF portal or WFS, selection of raster layers representing environmental data needed in the modelling process, and modelling these data using the openModeller Web Service in order to create a probability surface representing areas where a species is likely to be able to survive. The workshop highlighted just how easy it is to rapidly create such a feature-rich application using open access to data, free software and open standards. It also highlighted some areas where further work is needed before these kinds of services can truly be blended into a cohesive computing platform.

= Introduction =

Biodiversity informatics is a rapidly growing field in which digital data is increasingly available. Taxonomic data, species occurrence records, ecological data, and environmental data are all available online through various services provided by scientists worldwide. This data is published in a variety of formats and protocols, which makes it difficult to combine into a single service to carry out real work.
== Background ==

At the heart of biodiversity informatics there is the dream of a unified infrastructure in which data and analysis services from all around the world can be used seamlessly. A common goal of TDWG participants is an environment where different TDWG initiatives interoperate to provide rich mechanisms for biodiversity knowledge exploration, analysis, and discovery. However, though there is a general awareness of the relationships between TDWG standards and groups, there are few examples of practical inter-standard applications.

At the same time, most biodiversity research has to deal with geographic information: biodiversity happens somewhere. In the geospatial world there is an organisation equivalent to TDWG, the Open Geospatial Consortium (OGC), which promotes interoperability among geospatial services. It accomplishes this by creating a set of open standards that different vendors and projects use to implement their geospatial infrastructure.

We examined, by way of example software applications, the degree of interoperability that can be achieved by TDWG, OGC, and related initiatives (proposed and existing) such as the GBIF REST occurrence service, the openModeller Web Service (OMWS), TAPIR and LSIDs. To achieve these objectives, a small group of developers with expertise in the relevant domains held a workshop to develop an example web application. This application retrieves data from specimen and observation sources (GBIF cache, TAPIR or WFS) based on scientific names, with synonym resolution. Using environmental data accessible through OGC services, the application generates ecological niche models through OMWS. The result can in turn be made accessible as OGC WMS and WCS layers.

While this is not a novel application, previous experience indicated that there are many issues with data access, quality, processing, and standards interoperability that limit the generalised implementation of such an analysis pipeline. The outcome of the workshop is an example application that binds these core standards and data sources. More important, though, is the identification of problem areas within the existing standards, and the recommendations on suggested improvements to the respective TDWG or other groups. Though TDWG has developed several key interoperability standards, it has not invested sufficiently in the practical application of these standards.

== Architecture Overview ==

The architecture was designed to meet the following objectives:

 1. rapid application development
 2. clean separation of presentation and logic layers
 3. easy addition of new services to the framework

To achieve this, a tiered architecture was created consisting of Frontend and Backend code, with the backend code arranged into Data Models and Services:

{{{
+------------------------------------------------+
|                 Frontend code                  |  Frontend
+------------------------------------------------+
|      Data Models       |       Services        |  Backend
+------------------------------------------------+
}}}

Two user interfaces were developed simultaneously: a 'low entry bar' PHP/HTML interface and a 'high entry bar' PHP/Flex interface. The Flex interface requires Adobe Flash 9 to be available in the client browser. The results of the computations from the various services are stored in data models and passed back to the display layer, which renders them to the screen.

{{{
+------------------------------------------------+
|    Flex App      |        PHP/HTML app         |  Frontend
+------------------------------------------------+
|                 Backend (PHP)                  |
+------------------------------------------------+
}}}

The PHP/HTML Frontend code was separated into a Controller and various Display classes:

|| Controller || || ||
|| OccurrencesDisplay || NameSearchDisplay || etc. ||
|| OccurrencesService || NameSearchService || etc. ||
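The sketch below illustrates this Controller / Display / Service separation in PHP. The class names follow the table above, but the method names and bodies are illustrative only, not the actual prototype code:

{{{
#!php
<?php
// Hypothetical sketch of the backend wiring: the Controller routes a wizard
// step to its Service (which builds a data model) and then to its Display
// (which renders that model for the frontend).

class NameSearchService {
    // Query the name resolution backend; stubbed out here.
    public function execute($params) {
        return array('query' => $params['name'], 'matches' => array());
    }
}

class NameSearchDisplay {
    // Turn the data model produced by the service into HTML.
    public function render($model) {
        return '<p>' . count($model['matches']) . ' matches for '
             . htmlspecialchars($model['query']) . '</p>';
    }
}

class Controller {
    public function handle($step, $params) {
        switch ($step) {
            case 'namesearch':
                $service = new NameSearchService();
                $display = new NameSearchDisplay();
                return $display->render($service->execute($params));
            // Additional steps (occurrences, layers, model) plug in the
            // same way, which is what makes adding new services easy.
            default:
                return '<p>Unknown step</p>';
        }
    }
}

$controller = new Controller();
echo $controller->handle('namesearch', array('name' => 'Puma concolor'));
?>
}}}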
== Implementation ==

=== Name Search service ===

The name resolution service takes as input a (possibly partial) scientific name and returns a list of matching names. Each name returned by the service is accompanied by the corresponding accepted name. Ultimately this service forces the user to choose a single accepted name, either directly or through one of its synonyms.

=== Occurrence Search service ===

The Occurrence Search service takes the scientific name returned from the name resolution service and returns a list of species occurrences (specimens or observations) with latitude and longitude. The source of the occurrences is selected by the user in the interface. Three different occurrence provider technologies were implemented:

 * the GBIF REST occurrence service. Not a real standard, but a popular source of occurrences.
 * a WFS service implementing a GML application schema that the prototype could understand.
 * a TAPIR provider whose data has been configured to allow the use of DarwinCore 1.4 or ABCD v2.06.

The biggest challenge here was not the use of the different transport protocols, but the semantic mediation needed to use data of the same type - occurrences - described in too many different ways: our own OGC GML application schema, the GBIF XML format, ABCD and Darwin Core. Especially problematic is the use of WFS to distribute occurrence data. There is no official or unofficial GML application schema for distributing specimen occurrence data, so every service could end up creating its own, as we did for this experiment, and interoperability would not be possible.

Technically it was easier to consume data from the GBIF and TAPIR providers, as it is possible to access them directly using simple REST queries (URL + parameters). In the case of WFS the queries were also made as REST queries, but the filter was a URL-encoded XML fragment, which makes for very ugly URLs that are difficult to debug in a simple web browser, unlike the others.

Behind the scenes, the prototype does several things after retrieving data from these different sources:

 1. gather the data
 2. send it to a data processing service, which:
   a. inserts the data into a new table in a PostGIS database
   b. registers the newly created dataset in Geoserver as a WFS FeatureType and a WMS layer

Once the data has been retrieved and cached in the PostGIS database, it becomes available locally as WMS, WFS and WCS services, and in KML, PDF and other formats, thanks to Geoserver. A minimal client call against the GBIF service is sketched below.
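As an illustration of the "URL + parameters" style of access, here is a minimal sketch of an occurrence query against the GBIF REST service. The endpoint and parameter names reflect the 2007-era service and should be verified against the GBIF documentation; the XPath expressions deliberately ignore namespaces for brevity:

{{{
#!php
<?php
// Build a simple REST query: base URL plus key-value parameters.
$base   = 'http://data.gbif.org/ws/rest/occurrence/list';
$params = array('scientificname' => 'Puma concolor', 'maxresults' => 100);
$url    = $base . '?' . http_build_query($params);

// The service answers with an XML document of TaxonOccurrence records.
$xml = simplexml_load_file($url);
if ($xml === false) {
    die('Could not retrieve occurrences');
}

// Collect latitude/longitude pairs; records without coordinates are skipped.
$points = array();
foreach ($xml->xpath('//*[local-name()="TaxonOccurrence"]') as $occ) {
    $lats = $occ->xpath('.//*[local-name()="decimalLatitude"]');
    $lons = $occ->xpath('.//*[local-name()="decimalLongitude"]');
    if ($lats && $lons) {
        $points[] = array((float) $lats[0], (float) $lons[0]);
    }
}
echo count($points) . " georeferenced occurrences found\n";
?>
}}}

The points gathered this way are what the prototype hands on to the data processing service for insertion into PostGIS.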
=== Environmental Data Layer Selection ===

This service provides the user with a choice of available environmental data layers for use in the modelling calculations. Two different sources can provide these layers:

 1. The openModeller cache of environmental layers. As part of the openModeller Web Service, the getLayers call returns a simple XML collection of the layers available on the modelling server.
 2. An OGC WCS server. The prototype managed to get a list of available layers from a given WCS server, but failed to retrieve them due to lack of time. This process is more complicated in the prototype: after retrieving a layer from WCS, it has to be pushed to the modelling service to make it available for later processing. It was not possible at this time for openModeller to use remote layers on WCS servers directly, but this could become easier now that the GDAL library, which openModeller uses, has support for WCS.

The use of WCS services was therefore only partly implemented, as we had problems with the WCS server we were using for testing. Completing WCS support in the prototype does not look difficult.

=== Niche Modelling service ===

To provide niche modelling capability in the prototype, the openModeller Web Service (OMWS) interface was used. openModeller is a generic framework for carrying out fundamental niche modelling experiments - typically used to predict species distribution given a set of environmental raster layers and a set of known occurrence points. The OMWS interface provides access to this library using SOAP (Document/literal style). Besides calls directly related to modelling, OMWS provides additional functions for retrieving metadata and other information - see for example the getLayers call used in the Environmental Data Layer Selection tool.

Before modelling, the prototype shows a list of the algorithms available in the OMWS service for the user to choose from. This is done with a method called getAlgorithms. The niche modelling step takes the environmental data layer(s) and the occurrence data, runs a niche modelling experiment, and returns a model and a raster layer representing the probability surface for the model. OMWS saves the resulting raster layer and creates a WCS service to return the data to the calling server.

OMWS provides eight algorithms, which are normally configurable by the user. For the prototype, however, default parameters are pre-set by the application, so the user does not need to make any decision other than selecting an algorithm.
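As a sketch of what a Document/literal SOAP client for OMWS looks like in PHP, the fragment below uses the built-in SoapClient. The WSDL URL is hypothetical, and while getLayers and getAlgorithms are the operations described above, their exact signatures should be taken from the actual OMWS WSDL:

{{{
#!php
<?php
// Hypothetical OMWS endpoint; replace with a real modelling server's WSDL.
$wsdl = 'http://modeller.example.org/omws/omws.wsdl';

try {
    // 'trace' keeps the raw XML around, which helps when debugging
    // Document/literal request problems.
    $client = new SoapClient($wsdl, array('trace' => 1));

    // Environmental layers cached on the modelling server.
    $layers = $client->getLayers();

    // Algorithms the server offers, for the user to choose from.
    $algorithms = $client->getAlgorithms();

    print_r($algorithms);
} catch (SoapFault $fault) {
    echo 'SOAP fault: ' . $fault->getMessage() . "\n";
    if (isset($client)) {
        echo $client->__getLastRequest();   // inspect the failed request
    }
}
?>
}}}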
== Conclusions ==

The exercise allowed us to divide the existing standards into three categories, based on how easy they were to implement and use in our demonstrator application:

 1. standards where modification or extension would be welcome
 2. standards that the community has supported in theory, but that have proven problematic because of a lack of existing implementations. The lack of implementations probably hints at underlying problems with their ease of use and with the true commitment of the community.
 3. standards that excel in terms of community support, existing implementations, and ease of use

This is, of course, a crude simplification; none of the standards fits neatly into a single one of these groups. But it was useful for focusing the mind, so that an overview could be drafted of the priorities our community should consider when supporting a standard. With this in mind, we describe each of the standards used in a bit more detail below.

=== Life Science IDs (LSID) ===

Though LSIDs have been adopted by TDWG as a community standard, few projects have implemented them. It would greatly improve interoperability to have data objects tagged with LSIDs, but their implementation is difficult for many institutions. In any case, we could not find any service offering LSIDs relevant to the experiment; combined with the lack of LSID expertise within the group, this meant that LSIDs were not used.

=== OGC Web Services ===

We achieved significant success using the Open Geospatial Consortium (OGC) standards WMS, WFS, and WCS. All of these OGC web services specify a GetCapabilities operation, which returns a description of the data available from an OGC service provider. This is very useful for getting a quick overview (e.g. to list the available layers for the user to choose from). The specifications are also very clear on the syntax of the key-value pairs to be encoded in the URL GET string for accessing the data. Accessing the specification documents was easy, but there is a lack of simple examples of how to use the different standards: the specifications are too long and detailed for prospective implementers to get into them.

WMS (Web Map Service) is a simple web service that returns a map view of spatial information as an image. The BioGeoSDI experiment uses the WMS standard to display occurrence data, environmental data, and completed niche models.

WFS (Web Feature Service) is a service that returns vector data in XML format. WFS requires data providers to encode their data in a domain-specific 'application schema' which references GML (Geography Markup Language) for primitive geometry object types. The BioGeoSDI experiment used WFS to obtain specimen occurrence data using a specific GML application schema.

WCS (Web Coverage Service) is a service that specifies a GetCoverage operation returning raster data. The BioGeoSDI experiment used WCS providers to obtain environmental data layers. A sketch of the GetCapabilities pattern shared by these services follows below.
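The GetCapabilities operation praised above is the same across WMS, WFS and WCS, which makes a generic "what do you offer?" query very easy. Here is a minimal sketch against a hypothetical WMS server (WMS 1.1.1 capabilities documents are not namespaced, so plain XPath works):

{{{
#!php
<?php
// Standard key-value-pair GetCapabilities request.
$url = 'http://maps.example.org/wms'
     . '?SERVICE=WMS&VERSION=1.1.1&REQUEST=GetCapabilities';

$caps = simplexml_load_file($url);
if ($caps === false) {
    die('GetCapabilities request failed');
}

// Layers may be nested arbitrarily deep, so search the whole tree for
// named layers and print them for the user to choose from.
foreach ($caps->xpath('//Layer/Name') as $name) {
    echo (string) $name . "\n";
}
?>
}}}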
=== XML application schemas ===

An application schema is a standard for exchanging data within a community. We could not find any GML application schema that could be used for sharing occurrence data, so we created our own GML application schema and set up a sample WFS server. We knew that a GML application schema implementing the Darwin Core elements was forthcoming, but at the time it was not ready. In any case, we found it far from easy to make use of an arbitrary schema, due to the lack of support for complex GML schemas in open source WFS implementations. In the end, even though we built GML support into the prototype, it is not very useful, as there is no provider implementing the GML application schema we created. This is definitely a point where interoperability is possible at the interface level but not at the semantic level. Furthermore, we could not find any central registry or catalogue in which to find WFS services with occurrence data.

=== Communication/integration protocols ===

==== TAPIR ====

(From the TAPIR protocol specification website:) TAPIR is an acronym for TDWG Access Protocol for Information Retrieval. The DiGIR (mainly American) and BioCASe (mainly European) protocols have many similarities but are not interchangeable, and this clearly burdens global interoperability. TAPIR was envisaged as a protocol for unifying the existing biodiversity data sharing networks based on DiGIR and BioCASe. The protocol compares well to WFS, but it has no binding to any particular XML schema (as WFS has to GML) and can be used with any XML schema.

The new GBIF data portal can index data provided using TAPIR with the DarwinCore schema and its curatorial and geospatial extensions. Our experiment successfully implements a TAPIR client for accessing specimen occurrences. Of course, as with WFS, we could only make use of TAPIR servers that have their data mapped to Darwin Core or ABCD 2.06 concepts.

Thanks to the popularity of the TAPIR protocol within the biodiversity community, it is easier to find services that the prototype can access, although in our case this was due to knowledge within the group rather than to information available on the web. Because the protocol is used only within the biodiversity domain, almost all services have their data mapped to ABCD or Darwin Core concepts, and it is therefore easy to solve the semantic problem. An illustrative client request is sketched below.
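For comparison with the OGC examples, the sketch below shows what a TAPIR key-value-pair search request looks like from the client side. The endpoint is hypothetical, and the parameter names and filter syntax are illustrative only; the authoritative forms are given in the TAPIR specification and in each provider's capabilities response:

{{{
#!php
<?php
// Hypothetical TAPIR provider endpoint.
$endpoint = 'http://tapir.example.org/provider';

// Illustrative search parameters: ask for records matching a scientific
// name, structured according to a Darwin Core output model.
$params = array(
    'op'          => 'search',
    'outputModel' => 'http://example.org/models/dwc_occurrence.xml',
    'filter'      => 'ScientificName equals "Puma concolor"',
    'limit'       => 100,
);

$response = file_get_contents($endpoint . '?' . http_build_query($params));
// The response is an XML document shaped by the requested output model,
// and can be parsed with SimpleXML as in the earlier examples.
echo $response;
?>
}}}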
==== SPICE ====

The SPICE protocol was one of two Catalogue of Life products used for taxonomic verification purposes within the BioGeoSDI experiment. Both systems enable the user to provide a potentially ambiguous taxon string and verify its taxonomic status and currently accepted name.

The SPICE (Species 2000 Interoperability Coordination Environment) protocol was developed by researchers at the Universities of Cardiff, Reading and Southampton for use by the SPICE software to query distributed global species databases as part of the Catalogue of Life Dynamic Checklist. The SPICE protocol was therefore originally designed for communications between global species databases and the central SPICE server, rather than as a publicly available web service. The BioGeoSDI experiment makes use of the SPICE wrapper to the Catalogue of Life Annual Checklist. The Annual Checklist edition of the Catalogue of Life is manually compiled from database exports provided by partner database custodians. The SPICE wrapper to the Annual Checklist was written by the Species 2000 Secretariat as a step towards unifying the Dynamic and Annual Checklists into a single system.

A system for taxonomic verification was successfully implemented in the BioGeoSDI experiment using the SPICE protocol. Users can use SPICE to obtain confirmation of the currently accepted taxon name before continuing further with their modelling experiment. The SPICE protocol is capable of providing full synonymy for a taxon (i.e. all known synonyms for a species), which could potentially be used to query data stored under any known name from other databases. Unfortunately, time pressures during the BioGeoSDI workshop meant that this functionality could not be explored. Instead, the taxonomic verification step provided by SPICE confirms the status of the name provided and returns the currently accepted name if the species is a synonym.

Whilst the SPICE protocol includes all the data required for taxonomic verification, the XML schema was a little cumbersome to use. The nested nature of the infraspecific taxon information, and the difference in tag names for homologous data in synonyms and accepted names, led to code that was more complicated than seemed necessary. From a programmer's viewpoint, the use of capitalised tag names throughout the schema made for uncomfortable programming, and the switch to CamelCase in a number of tags is frustrating.

One data problem that arose through using SPICE was the lack of provenance data associated with the taxonomic records. As the SPICE protocol was designed for communications between a specific global species database and a centralised SPICE server, database metadata is provided on a per-wrapper basis. This means that the correct level of citation of data, as required by the Catalogue of Life end user licence, cannot be provided.

==== Catalogue of Life Annual Checklist Web Service ====

The Catalogue of Life (CoL) project is a joint effort between ITIS and Species 2000. CoL provides an Annual Checklist of taxonomic names distributed on CD-ROM and available as a web service. The Catalogue of Life Annual Checklist Web Service was written in response to calls from users for programmatic access to the Catalogue of Life. It was released concurrently with the BioGeoSDI workshop, and support for this web service was therefore only completed after the workshop itself. This web service provides all the data required by the BioGeoSDI experiment for taxonomic verification purposes, including the provenance data missing from the SPICE protocol. The schema provides easier access to individual data elements than SPICE, requiring approximately one third of the lines of code to extract the same data.

=== Taxonomic Concept Schema ===

The Taxonomic Concept Schema (TCS) was not used in the BioGeoSDI experiment. Though this is a standard that was strongly supported after gruelling discussion within the community, few providers expose data in this format. Though digital taxonomic data is available in a variety of other formats, interpreting such data into concept data is a task more appropriate for taxonomists than programmers. TCS also suffers from many of the same shortcomings as the SPICE XML schema: the extensive nesting of objects within a concept makes programming against it unwieldy for programmers, and makes data interpretation onerous for data providers. The SEEK Taxonomic Object Service aims to expose community data with a public API that uses TCS for both input and output formats, but has so far been limited by the available digital concept data (and programming resources).

=== Contributing Community Projects ===

==== GBIF REST services ====

The Global Biodiversity Information Facility (GBIF) REST services provide access to occurrence records from the GBIF cache of community records from DiGIR/BioCASe providers. By caching the data from many different biodiversity providers, it is a very convenient way of abstracting away the complexity of consuming data from several providers at once. When the prototype was created the service was easy to use. Accessing the REST services was the easiest option, but the functionality and the quality of the service were a little problematic. Many occurrences exposed through the service have no coordinates - useless for this experiment - and there is no way to filter them out when accessing the service. Regarding quality, we found many occurrences in the service with latitude and longitude equal to zero, which are very probably incorrect. A client-side filter for these cases is sketched below.
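Since the service offers no server-side filtering, the prototype has to clean the data itself. A minimal sketch of such a client-side filter, dropping records with blank coordinates or the suspicious (0, 0) point:

{{{
#!php
<?php
// Keep only occurrences with usable coordinates.
function filterOccurrences($occurrences) {
    $clean = array();
    foreach ($occurrences as $occ) {
        if (!isset($occ['lat']) || !isset($occ['lon'])) {
            continue;  // blank coordinate fields
        }
        $lat = (float) $occ['lat'];
        $lon = (float) $occ['lon'];
        if ($lat == 0.0 && $lon == 0.0) {
            continue;  // lat/lon both zero: very probably a bad record
        }
        $clean[] = $occ;
    }
    return $clean;
}

$sample = array(
    array('lat' => '-22.9', 'lon' => '-47.06'),
    array('lat' => '0', 'lon' => '0'),       // dropped: zero coordinates
    array('lon' => '-47.06'),                // dropped: no latitude
);
print_r(filterOccurrences($sample));
?>
}}}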
==== openModeller Web Service (OMWS) ====

The openModeller project provides the openModeller Web Service (OMWS) for creating niche model experiments. The project is currently being developed by CRIA, Escola Politécnica da USP, and Instituto Nacional de Pesquisas Espaciais (INPE) as an open source initiative. It is funded by the Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP) and the Incofish project, and by individuals who have generously contributed their time. Previous collaborators include the BDWorld project, KU, and other individual participants.

We encountered real problems accessing the SOAP Document-style OMWS services; converting them into an OGC WPS (Web Processing Service) may be worth discussing.

== Recommendations ==

=== More clarity in OGC standards publications ===

The OGC WMS, WFS and WCS standards are well defined and easy to use, with very useful functionality (such as the GetCapabilities call). Sadly, the standards publications themselves are somewhat confusing and difficult to navigate. Their usability would be greatly improved if they were published:

 * in easily browsable web pages
 * with good search capabilities
 * with many real-life examples
 * with many hyperlinks

=== A warm call for better data quality ===

The amount of online data is stunning, spelling a great future for integrated standards-based web tools. But data quality is often poor. This is a serious issue, as online data is not just some "nice to have" publicity front on the web. More and more, it will become the raw data for serious scientific research (statistics, modelling), leading to publications. This cannot become reality unless the utmost care is given to filling in all data fields (coordinates!) and providing an accuracy estimation where appropriate. Some of the problems we encountered were:

 * many species are available from one data provider, but not from others
 * GBIF data does not reference the original data providers
 * occurrence data quality is often poor (many blank data fields, such as coordinates)
 * latitude/longitude values carry no accuracy indication, including a lack of SRS information, which is definitely a big source of inaccuracy

These issues will seriously hamper the development of integrated standards-based web tools such as our demo application.

==== DarwinCore ====

Currently, DiGIR providers frequently use DarwinCore, but this schema is being served in 24 unique variations, of which only five are defined as being of purely 'historical significance' (so all the others are still in active use). Only a single version (1.3) is officially accepted by TDWG, and only a single data provider (Cornell) uses it. In its basic form DarwinCore is quite minimalistic, requiring only a small set of (meta)data fields. For geospatial (GIS) applications, geographic extensions have been defined, but they are not official standards yet.

==== ABCD ====

The latest official schema, approved as a TDWG standard, is 2.06, and several providers are already using it. There are many different ways to represent geospatial information in ABCD, but most providers make use of coordinates. The schema is complicated to parse due to its principle of variable atomisation: the same information can be provided in different ways, and clients making use of it have to deal with merging data from different formats and levels of atomisation.

== Technologies used in this experiment ==

=== Programming languages ===

 * PHP
 * ActionScript
 * Bash shell scripting
 * Python

=== Open Source projects ===

 * Geoserver: WFS and WMS server; KML and PDF creation
 * PostGIS: temporary geospatial dataset store
 * MapServer: WCS server
 * openModeller: OMWS