XML gateways

Tue Mar 20 11:49:41 CET 2001

>-----Original Message-----
>From: TDWG - Structure of Descriptive Data [mailto:TDWG-SDD at USOBI.ORG]On
>Behalf Of Robert A. (Bob) Morris
>Sent: Tuesday, March 20, 2001 10:07 AM
>To: TDWG-SDD at USOBI.ORG
>Subject: Re: XML gateways
>
>
>Jim Croft writes:
> > Date:         Wed, 21 Mar 2001 00:25:09 +1100
> > From: Jim Croft <jrc at anbg.gov.au>
> > To: TDWG-SDD at usobi.org
> > Subject:      Re: XML gateways
> >
...

>
>Sad, but true. We should kick around other ideas to solve this
>problem, though a conventional port may be the easiest for data
>sources to implement. How about:
>
>- A convention whereby if
>    http://<host>/<filePath>/<cgiquery>
>yields HTML then
>    http://<host>/<filePath>/xml/<cgiquery>
>yields XML
>
>-A convention whereby people set up virtual hosts---pretty easy in
>most web servers---so that if
>    http://<host>/<filePath>/<cgiquery>
>yields HTML then something like
>     http://xml-<host>/<filePath>/<cgiquery>
>yields XML
>
>Hopefully all of this is temporary, since a well crafted GBIF should
>provide for discovery of URL, query syntax, and return schema. My real
>point is that current versions of Oracle, SQL-Server, FileMaker, and
>Access(?) can already emit XML without much(?) effort on the part of
>the data source operators, and doing so would let people proceed to
>build interesting distributed applications.
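
A minimal Python sketch of the two URL conventions quoted above, just to
make them concrete.  The function names and the example URL are invented
for illustration, and it assumes <cgiquery> is the last path segment plus
its query string.

```python
from urllib.parse import urlsplit, urlunsplit

def xml_url_by_path(html_url):
    """Convention 1: insert an 'xml' segment before the last path segment."""
    parts = urlsplit(html_url)
    head, _, tail = parts.path.rpartition("/")
    return urlunsplit(parts._replace(path=f"{head}/xml/{tail}"))

def xml_url_by_host(html_url):
    """Convention 2: prefix the host name with 'xml-' (a virtual host).

    Assumes the URL carries no user:password@ part in front of the host.
    """
    parts = urlsplit(html_url)
    return urlunsplit(parts._replace(netloc="xml-" + parts.netloc))

html_url = "http://example.org/data/search.cgi?genus=Quercus"
path_style = xml_url_by_path(html_url)
host_style = xml_url_by_host(html_url)
```

Either way the client can derive the XML address from the HTML one without
any lookup, which is the whole appeal of a convention.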

Initiatives under way in several industry sectors address each of the
issues raised here:
1. Service discovery is addressed by UDDI (Universal Description,
Discovery and Integration).
2. Service description is addressed by WSDL (Web Services Description
Language), an XML document format for describing the methods and data
exposed by a web service.
3. IGIR (Interface for Generic Information Retrieval) provides a generic
format for information retrieval via HTTP GET or POST requests that is easy
to implement.  The queries are sent as a structure easily translated into a
number of common query syntaxes (SQL for example).  Results are contained
within a consistently formatted document.  IGIR is optionally stateful (it's
up to the server to decide), so a very large result set need not be sent
back to the client all in one hit.  IGIR also supports a SOAP interface
(that's what it was originally designed for), so programmatic access to
information resources becomes pretty simple.
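
To illustrate what "a structure easily translated into SQL" might look
like in item 3, here is a rough Python sketch.  The (field, operator,
value) term layout, the ACCESS_POINTS mapping, and the table and column
names are all made up for this example; this is not IGIR's actual request
format.

```python
import sqlite3

# Hypothetical mapping from public access points to local column names.
ACCESS_POINTS = {"genus": "genus_nm", "collector": "coll_name"}
# Hypothetical operator names and their SQL equivalents.
OPERATORS = {"eq": "=", "like": "LIKE"}

def to_sql(terms, table="specimens"):
    """Translate [(field, op, value), ...] into a parameterized SELECT."""
    clauses, params = [], []
    for field, op, value in terms:
        column = ACCESS_POINTS[field]  # raises KeyError for unknown fields
        clauses.append(f"{column} {OPERATORS[op]} ?")
        params.append(value)
    where = " AND ".join(clauses) or "1=1"
    return f"SELECT * FROM {table} WHERE {where}", params

# A throwaway in-memory database standing in for a provider's real one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE specimens (genus_nm TEXT, coll_name TEXT)")
conn.executemany(
    "INSERT INTO specimens VALUES (?, ?)",
    [("Quercus", "Smith"), ("Quercus", "Jones"), ("Acer", "Smith")],
)

sql, params = to_sql([("genus", "eq", "Quercus"), ("collector", "like", "S%")])
rows = conn.execute(sql, params).fetchall()
```

The values travel as bound parameters rather than being pasted into the
SQL string, and only fields listed as access points can be queried.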

A search and retrieval scenario would work something like this:
A client queries the UDDI directory, looking for services that support IGIR
and are "biologically relevant".  A list of services supporting IGIR is
returned, and the client broadcasts a query to them and waits for results to
start coming back.  Since the records are all formatted in a similar manner,
it is a simple process for the client to merge the results and present a
nicely formatted set to the user.
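
The broadcast-and-merge step in that scenario could be sketched in Python
roughly as follows.  The services here are plain in-memory functions
standing in for remote IGIR endpoints, and every name (broadcast_query,
make_service, museum_a, the record fields) is hypothetical; no real
directory lookup or network call is made.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def broadcast_query(services, query):
    """Send one query to every service and merge records as they arrive."""
    merged = []
    with ThreadPoolExecutor(max_workers=max(1, len(services))) as pool:
        futures = {pool.submit(search, query): name
                   for name, search in services.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                records = future.result()
            except Exception:
                continue  # one failing service should not sink the search
            for record in records:
                merged.append({"source": name, **record})
    return merged

def make_service(records):
    """Build a stand-in service that filters an in-memory record list."""
    def search(query):
        return [r for r in records if r["genus"] == query["genus"]]
    return search

services = {
    "museum_a": make_service([{"genus": "Quercus", "species": "alba"}]),
    "museum_b": make_service([{"genus": "Quercus", "species": "rubra"},
                              {"genus": "Acer", "species": "rubrum"}]),
}
results = broadcast_query(services, {"genus": "Quercus"})
```

Records are merged in whatever order the services answer, so a client that
cares about ordering would sort the merged list itself.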

Anyway, that's how things should work.  To support a broadcast query,
obviously the targets must understand the context of the query terms, so
groups within a particular domain (such as biological collections, for
example) will need a set of "well known" access points (searchable fields).
Similarly, to simplify the merging of results that come back from such a
query, it would be convenient if there were an agreed record structure
shared between data providers.  The XML output by most database vendors
is a hack designed to allow easy transport of database records over HTTP.
Things are getting better with XSD and so forth, but the structure of the
resulting XML documents will still reflect the structure of the databases
rather than any agreed-upon common format.
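
As a small illustration of that last point, a provider (or a gateway in
front of it) would need something like the following to turn
database-shaped rows into a shared record.  This is only a sketch: the
provider names, the local column names, and the three shared fields are
invented here and are not an agreed profile.

```python
# Hypothetical per-provider maps: local column name -> shared field name.
FIELD_MAPS = {
    "museum_a": {"GENUS_NM": "genus", "SP_EPITHET": "species",
                 "COLL": "collector"},
    "museum_b": {"genus": "genus", "specific_epithet": "species",
                 "collected_by": "collector"},
}
# Hypothetical agreed record structure.
SHARED_FIELDS = ("genus", "species", "collector")

def to_shared_record(provider, row):
    """Rename a provider's columns to the shared fields, dropping the rest."""
    mapping = FIELD_MAPS[provider]
    record = dict.fromkeys(SHARED_FIELDS)  # fields a provider lacks stay None
    for local_name, value in row.items():
        shared_name = mapping.get(local_name)
        if shared_name is not None:
            record[shared_name] = value
    return record

a = to_shared_record("museum_a",
                     {"GENUS_NM": "Quercus", "SP_EPITHET": "alba", "ACC_NO": 17})
b = to_shared_record("museum_b",
                     {"genus": "Quercus", "specific_epithet": "alba",
                      "collected_by": "Smith"})
```

With an agreed structure, every provider writes one such mapping once,
instead of every client learning every provider's schema.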

Which brings us back to the "biological collections profile" concept that
was raised at the last TDWG meeting.  Would anybody like to volunteer a set
of access points (searchable fields) that are relevant for biological
collections?  Similarly for the record structure?  The DarwinCore elements
start to address these issues, but are somewhat deficient in several
respects.  They might make a reasonable starting place, though.  The
description, with a comment section (add your gripes etc.), is available at:
http://tsadev.speciesanalyst.net/DarwinCore/Darwin_core.asp

Most of you know that the DarwinCore was originally developed as a Z39.50
profile, hence some of the language may seem a little odd.  The main thing
to look at is the list of access points: which are useful from a biological
collections point of view, and what's missing?

There are obviously other domains of biological databases besides
collection databases.  Any suggestions for commonality between them all?

Cheers,
  Dave V.

=========================
David A. Vieglais
University of Kansas
Natural History Museum &
Biodiversity Research Center
http://www.nhm.ukans.edu