-----Original Message----- From: TDWG - Structure of Descriptive Data [mailto:TDWG-SDD@USOBI.ORG]On Behalf Of Robert A. (Bob) Morris Sent: Tuesday, March 20, 2001 10:07 AM To: TDWG-SDD@USOBI.ORG Subject: Re: XML gateways
Jim Croft writes:
Date: Wed, 21 Mar 2001 00:25:09 +1100 From: Jim Croft jrc@anbg.gov.au To: TDWG-SDD@usobi.org Subject: Re: XML gateways
...
Sad, but true. We should kick around other ideas to solve this problem, though a conventional port may be the easiest for data sources to implement. How about:
- A convention whereby if http://<host>/<filePath>/<cgiquery>
yields HTML then http://<host>/<filePath>/xml/<cgiquery> yields XML
- A convention whereby people set up virtual hosts (pretty easy in most web servers), so that if http://<host>/<filePath>/<cgiquery> yields HTML then something like http://xml-<host>/<filePath>/<cgiquery> yields XML
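To make the two conventions concrete, here is a minimal sketch of how a client might derive the XML URL from the HTML one. The function names and the example.org URL are made up for illustration, and the sketch assumes a plain host name with no user:password@ prefix.

```python
from urllib.parse import urlsplit, urlunsplit


def xml_path_variant(url: str) -> str:
    """Path convention: insert an 'xml' segment before the final path segment."""
    parts = urlsplit(url)
    head, _, tail = parts.path.rpartition("/")
    new_path = f"{head}/xml/{tail}"
    return urlunsplit((parts.scheme, parts.netloc, new_path, parts.query, parts.fragment))


def xml_host_variant(url: str) -> str:
    """Virtual-host convention: prefix the host with 'xml-'.

    Assumes the network location is a bare host (optionally with a port).
    """
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, "xml-" + parts.netloc, parts.path, parts.query, parts.fragment))


# Hypothetical HTML-producing query URL.
html_url = "http://example.org/data/search.cgi?genus=Quercus"
print(xml_path_variant(html_url))  # http://example.org/data/xml/search.cgi?genus=Quercus
print(xml_host_variant(html_url))  # http://xml-example.org/data/search.cgi?genus=Quercus
```

Either way the client needs no directory lookup; it only has to know which convention a given data source follows.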
Hopefully all of this is temporary, since a well-crafted GBIF should provide for discovery of URL, query syntax, and return schema. My real point is that current versions of Oracle, SQL Server, FileMaker, and Access(?) can already emit XML without much(?) effort on the part of the data source operators, and doing so would let people proceed to build interesting distributed applications.
There are initiatives under way in several industry sectors to resolve each of the issues brought up here:
1. Service discovery is addressed by UDDI (Universal Description, Discovery and Integration).
2. Service description is addressed by WSDL (Web Services Description Language), an XML document format for describing the methods and data exposed by a web service.
3. IGIR (Interface for Generic Information Retrieval) provides a generic format for information retrieval via HTTP GET or POST requests that is easy to implement. The queries are sent as a structure easily translated into a number of common query syntaxes (SQL, for example). Results are contained within a consistently formatted document. IGIR is optionally stateful (it's up to the server to decide), so a query that generates a very large result set need not be sent back to the client all in one hit. IGIR also supports a SOAP interface (that's what it was originally designed for), so programmatic access to information resources becomes pretty simple.
A search and retrieval scenario would work something like this: A client queries the UDDI directory, looking for services that support IGIR and are "biologically relevant". A list of services supporting IGIR is returned, and the client broadcasts a query to them and waits for results to start coming back. Since the records are all formatted in a similar manner, it is a simple process for the client to merge the results and present a nicely formatted set to the user.
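The broadcast-and-merge step of that scenario might look roughly like the sketch below. It is only an illustration: each "service" is a stand-in Python callable rather than a real IGIR or UDDI client, and the names broadcast_query, herbarium, and museum are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, List

Record = Dict[str, str]
Service = Callable[[str], List[Record]]  # stand-in for a remote query call


def broadcast_query(services: Dict[str, Service], query: str) -> List[Record]:
    """Send the same query to every service and merge whatever comes back."""
    merged: List[Record] = []
    with ThreadPoolExecutor(max_workers=max(1, len(services))) as pool:
        futures = {pool.submit(fn, query): name for name, fn in services.items()}
        # Handle results in the order the services answer, not the order asked.
        for future in as_completed(futures):
            name = futures[future]
            try:
                records = future.result()
            except Exception:
                continue  # one unreachable provider should not sink the whole query
            # Tag each record with its provider so the merged set stays traceable.
            merged.extend({**rec, "source": name} for rec in records)
    return merged


# Hypothetical providers that would normally sit behind HTTP or SOAP.
def herbarium(query: str) -> List[Record]:
    return [{"name": f"{query} alba"}]


def museum(query: str) -> List[Record]:
    return [{"name": f"{query} rubra"}]


results = broadcast_query({"herbarium": herbarium, "museum": museum}, "Quercus")
```

The merge is trivial here only because both providers return records of the same shape, which is the point of agreeing on a record structure.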
Anyway, that's how things should work. To support a broadcast query, obviously the targets must understand the context of the query terms, so groups within a particular domain (such as biological collections, for example) will need a set of "well known" access points (searchable fields). Similarly, to simplify the merging of results that come back from such a query, it would be convenient if there were an agreed record structure shared between data providers. The XML output by most database vendors is a hack designed to allow easy transport of database records using HTTP. Things are getting better with XSD and so forth, but the structure of the resulting XML documents will still reflect the structure of the databases rather than any agreed-upon common format.
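As a rough illustration of the gap between database-shaped output and an agreed record structure, each provider could keep a small mapping from its own column names to the shared fields. The field names below are invented for the example and are not taken from DarwinCore or any other profile.

```python
from typing import Dict

# Hypothetical agreed record structure; a real profile would define these fields.
COMMON_FIELDS = ("scientific_name", "collector", "country")


def to_common_record(row: Dict[str, str], field_map: Dict[str, str]) -> Dict[str, str]:
    """Re-key one database row into the shared structure.

    field_map maps a common field name to this provider's column name.
    Fields the provider cannot supply are left empty rather than dropped,
    so every record has the same shape when results are merged.
    """
    record: Dict[str, str] = {}
    for field in COMMON_FIELDS:
        source_column = field_map.get(field)
        record[field] = row.get(source_column, "") if source_column else ""
    return record


# A made-up provider whose XML mirrors its own table columns.
provider_row = {"TAXON": "Quercus alba", "COLL_NAME": "J. Smith"}
provider_map = {"scientific_name": "TAXON", "collector": "COLL_NAME"}

print(to_common_record(provider_row, provider_map))
# {'scientific_name': 'Quercus alba', 'collector': 'J. Smith', 'country': ''}
```

The mapping is the only provider-specific part; the client-side merge never has to know how any one database is laid out.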
Which brings us back to the "biological collections profile" concept that was raised at the last TDWG meeting. Would anybody like to volunteer a set of access points (searchable fields) that are relevant for biological collections? Similarly for the record structure? The DarwinCore elements start to address these issues, but are somewhat deficient in several respects. They might make a reasonable starting place, though. The description, with a comment section (add your gripes etc.), is available at: http://tsadev.speciesanalyst.net/DarwinCore/Darwin_core.asp
Most of you know that the DarwinCore was originally developed as a Z39.50 profile, hence some of the language may seem a little odd. But the main thing to take a look at is the list of access points: which are useful from a biological collections point of view, and what's missing?
There are obviously other domains of biological databases besides collection databases. Any suggestions for commonality between them all?
Cheers, Dave V.
========================= David A. Vieglais University of Kansas Natural History Museum & Biodiversity Research Center http://www.nhm.ukans.edu