[tdwg-tag] Any TCS users with experiences to report?

Tue Nov 6 05:45:44 CET 2012

Hi Rod,

Really it is up to the major players / DBs who currently provide web services to respond to this, but I will add a few comments in passing:

[Rod P.:]

Querying multiple sources on the fly ("federation") seems to me to be doomed to fail. I tried it in 2005 with the now defunct "Taxonomic Search Engine" and the performance hit of multiple HTTP requests, multiple, changeable interfaces and variable uptime of the source databases made it hard work. I think at the scale we operate, centralisation is the way forward. The arguments against centralisation tend to boil down to the interests of the data providers outweighing those of the users, which is a bad thing.

[Tony:]
I encountered similar problems trying to run real-time distributed data queries within the OBIS system in the early 2000s. On the other hand, a number of the data contributors were small(-ish) players, often in museums, without a very well resourced or robust infrastructure for data publishing. The same should hopefully not apply to the "major players" being addressed here such as Catalogue of Life, ITIS, NCBI, GBIF, Australian NSLs and so on.

Interestingly, the web mapping clients I mentioned earlier maintain an entirely distributed query model (since no one portal has the capacity to host all the data locally), which in the main seems to work fairly well, as most contributors perhaps take their data publishing obligations a bit more seriously - e.g. test that their services work, and commit to maintaining "up" services as far as reasonably possible.

[Rod P.:]

We have a very large, centralised taxonomy, namely the GBIF classification (it's easily the biggest around), itself based on an aggregation of lots of taxonomies. Why not focus on making that the best documented classification we can? There are mechanisms (such as GitHub) that we could use to enable people to download it, improve it, fork it if they wish, and so on. GBIF has names connected to actual data, and data that arguably is useful outside taxonomy, so it would seem a sensible place to focus resources. If not GBIF, then who?

[Tony:]
Well, there is some territory/overlap to deal with here, since as well as GBIF, other players would presumably like to claim the high ground in this space, most notably Catalogue of Life, GNA and others (NCBI, wikispecies, The Plant List/IPNI for plants...) - at present there is no obvious "one stop shop". Also the Open Tree of Life project (OTTOL) is building its own "master taxonomy" using some of the same sources as GBIF, as I understand, possibly also with a view to being user editable (?)

Ultimately, if as you contend there is no market for unifying or even providing web services from distributed systems, why would the following exist (presumably with efforts to maintain them and continue to develop them), such as:

http://www.itis.gov/web_service.html
http://webservice.catalogueoflife.org/
https://www.anbg.gov.au/confluence/display/bdv/NSL+Services
http://www.marinespecies.org/aphia.php?p=webservice

and so on?

- Tony (waiting for other relevant persons to chime in here, maybe...)

From: Roderic Page [mailto:r.page at bio.gla.ac.uk]
Sent: Monday, 5 November 2012 8:17 PM
To: Rees, Tony (CMAR, Hobart)
Cc: J.Kennedy at napier.ac.uk; mdoering at gbif.org; deepreef at bishopmuseum.org; pmurray at anbg.gov.au; eotuama at gbif.org; tdwg-tag at lists.tdwg.org; Pigot, Simon (CMAR, Hobart)
Subject: Re: [tdwg-tag] Any TCS users with experiences to report?

Hi Tony,

A few quick comments.

Querying multiple sources on the fly ("federation") seems to me to be doomed to fail. I tried it in 2005 with the now defunct "Taxonomic Search Engine" and the performance hit of multiple HTTP requests, multiple, changeable interfaces and variable uptime of the source databases made it hard work. I think at the scale we operate, centralisation is the way forward. The arguments against centralisation tend to boil down to the interests of the data providers outweighing those of the users, which is a bad thing.

We have a very large, centralised taxonomy, namely the GBIF classification (it's easily the biggest around), itself based on an aggregation of lots of taxonomies. Why not focus on making that the best documented classification we can? There are mechanisms (such as GitHub) that we could use to enable people to download it, improve it, fork it if they wish, and so on. GBIF has names connected to actual data, and data that arguably is useful outside taxonomy, so it would seem a sensible place to focus resources. If not GBIF, then who?

There is, however, one major problem with GBIF, and indeed most other classifications. They bear little relationship to evolutionary history, especially at deeper levels (it doesn't help that there isn't a "tree of life"). In one sense this is fine, as I think we need to keep phylogeny and classification separated otherwise we conflate two rather different things. But we do need to integrate evolutionary information. The NCBI classification will continue to grow and be central to organising genomic information, therefore we need a mapping between GBIF and NCBI. Much of this will be done via names, but a lot won't, and will rely on other links, such as specimens. We also need to integrate phylogenies themselves, which is a different challenge. Unless we deal with genomics and phylogenetics the taxonomic database community risks being even more marginalised.

My own feeling is that we've spent a lot of time fussing with standards, etc., without working out what would be the best landscape for the people who use taxonomic information. IMHO we should be building a Google for biodiversity information. Until we do, we're basically just mucking about.

Regards

Rod

On 5 Nov 2012, at 00:33, <Tony.Rees at csiro.au> wrote:

Hi Rod,

Questioning the value of taxonomic databases while on a TDWG list is a separate discussion...

I think we have to accept that at present there is no unified, curated, up-to-date taxonomic treatment for all life: meaning that in order to retrieve taxonomic information about "any" taxon, we (either as a human client or a remote app) may well need to query more than one taxonomic DB to locate relevant content. So I guess the essence of my question is, can we simplify / standardise things so that such resources can be queried in a standardised way (with only the destination / resource name changing) and, having done so, receive consistently structured responses (whether TCS, DwC, or other). The answer at present appears to be "no" which begs the question of what incentives there are or are not to do so, and thence whether TDWG as the "biodiversity standards" body, has a reason to exist in this space.
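
As a rough illustration of the idea above (one call signature, with only the resource name changing, and one response structure regardless of source), here is a minimal Python sketch. Everything in it is hypothetical: the source names `source_a` and `source_b`, the `TaxonRecord` fields, the in-memory stand-ins for remote services, and the placeholder taxon names. It does not reflect TCS, DwC, or any real provider's API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class TaxonRecord:
    """Common response shape returned for every source (hypothetical)."""
    source: str
    taxon_id: str
    accepted_name: str
    synonyms: Tuple[str, ...]


# In-memory stand-ins for two remote services with different native formats.
# Names and identifiers are placeholders, not real data.
_SOURCE_A = {"Aus bus": {"id": "A-1", "syn": ["Aus cus"]}}
_SOURCE_B = [{"key": 42, "name": "Aus bus", "synonymy": "Aus cus|Xus bus"}]


def _query_source_a(name: str) -> List[TaxonRecord]:
    """Adapter: translate source A's dict layout into TaxonRecord."""
    hit = _SOURCE_A.get(name)
    if hit is None:
        return []
    return [TaxonRecord("source_a", hit["id"], name, tuple(hit["syn"]))]


def _query_source_b(name: str) -> List[TaxonRecord]:
    """Adapter: translate source B's row layout into TaxonRecord."""
    return [
        TaxonRecord(
            "source_b",
            str(row["key"]),
            row["name"],
            tuple(s for s in row["synonymy"].split("|") if s),
        )
        for row in _SOURCE_B
        if row["name"] == name
    ]


# Registry mapping a resource name to its adapter.
ADAPTERS: Dict[str, Callable[[str], List[TaxonRecord]]] = {
    "source_a": _query_source_a,
    "source_b": _query_source_b,
}


def query_taxon(source: str, name: str) -> List[TaxonRecord]:
    """Same call for every source; only the `source` argument changes."""
    if source not in ADAPTERS:
        raise ValueError(f"unknown source: {source!r}")
    return ADAPTERS[source](name)


if __name__ == "__main__":
    for source in ADAPTERS:
        print(query_taxon(source, "Aus bus"))
```

In this sketch the per-source differences are confined to the adapters, so a client only ever sees `TaxonRecord`. An agreed standard would in effect move that translation work from each client to each provider.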

The reasons most obvious to me are (1) querying multiple taxonomic data sources in order to build a more complete picture than any one of them can currently supply on its own; (2) comparing different viewpoints or current treatments of a particular taxon between sources of "expertise", bearing in mind that these may differ and between them provide more insight than a single "received view"; (3) providing access to ancillary information / "taxon pages" specific to the data source in question which may for example provide attribute, distribution, literature information associated with the taxa in addition to just the names; and (4) treating the remote information as an expert source which can be queried remotely on demand rather than having to host all the same information locally - in the same way as querying any other remote data source, maintained by relevant experts, may have a place in system design as opposed to hosting everything internally - think Google Maps or whatever - and just returning the subset of information relevant to a particular query at a particular time. In other words we outsource the data collation and ongoing management to someone whose mission (and hopefully resourcing) it is to do this and concentrate on what we can do with the data once received.

I would have thought that none of the above is rocket science and has indeed already been achieved in other domains, for example the OGC web mapping services already mentioned, the data standards required by OBIS and GBIF for participation in their data aggregating networks, and so on. What we have here is a parallel "taxonomic information aggregating" activity which similarly would ideally need standards for data interchange if the poor consumer is not to deal with a multiplicity of uncontrolled local data structures and query/response syntaxes. Indeed the parallel with OGC standards is not completely theoretical in that OGC WFS (web feature service) can be adapted to map to taxonomic information (just without the spatial component) without difficulty if only the community could agree on a relevant schema - in other words tools exist already (GeoServer, DeeGree) which could handle the requests/responses I believe, but they have no defined standards to work with unless you roll-your-own...

Just my 2 cents of course... I imagine the "global names" folks and their associates would have more to say on this matter of standardising access to distributed taxonomic data sources.

Regards - Tony

-----Original Message-----
From: Roderic Page [mailto:r.page at bio.gla.ac.uk]
Sent: Saturday, 3 November 2012 4:58 PM
To: Rees, Tony (CMAR, Hobart)
Cc: J.Kennedy at napier.ac.uk; mdoering at gbif.org; deepreef at bishopmuseum.org; pmurray at anbg.gov.au; eotuama at gbif.org; tdwg-tag at lists.tdwg.org; Pigot, Simon (CMAR, Hobart)
Subject: Re: [tdwg-tag] Any TCS users with experiences to report?

Playing devil's advocate I think there are several issues here:

1. The example you gave of an OGC query illustrates what for me is a
major limitation of existing approaches (such as DiGiR and TAPIR): they
focus on standardising queries, not identifiers. Hence we can query
databases in a consistent (if cumbersome) way, but have no easy way to
refer to the things (taxa, specimens, etc.) we retrieve. Having stable,
reusable, resolvable identifiers would be a step forward.

2. Taxonomic concepts aren't much use unless connected to data.
Arguably the most widely used taxonomic database in biodiversity is the
NCBI taxonomy database, which has stable identifiers, an API, and taxa
that are connected to data (sequences and publications). The GBIF
backbone classification is also connected to data (specimens and
observations) although its taxon identifiers (like its occurrence ids)
aren't terribly stable.

3. I think the standards-first approach tends to put the cart before
the horse. I'm not sure it's the lack of standards that is the problem,
it's the lack of usable information in taxonomic databases. Apart from
NCBI and GBIF, what science can I do with taxonomic databases? What
questions do they allow me to ask?

Regards

Rod

Sent from my iPhone

On 3 Nov 2012, at 03:41, <Tony.Rees at csiro.au> wrote:

Hi Jessie, also others who have responded thus far,

You said:

I think it would be great if the major databases that describe taxa
(not just list names) described their data as concepts and allowed
people to link to their databases when identifying specimens and when
sequencing etc, this would be the start of a really useful
biodiversity network.

Agreed! And also the databases that "just list names" are dealing
with concepts as we know, comprising a valid name plus all listed
synonyms in these cases...

My feeling is that there is not yet any standardization in this area
because every data resource does its own thing, using its own home-grown
schema in the main (that is, presuming a web service is even offered),
and the "standards group" (TDWG) has not pushed a model of any sort of
standard client which expects to be able to access distributed
taxonomic information in a standard way, so there is no incentive for
the sources to provide this. Sort of like a fax machine with no-one on
the other end wishing to communicate with it. By contrast (for example)
the OGC has defined standards for geospatial web services which, once
adhered to, allow one's own data to be accessed by standards-compliant
remote client apps in a standard way, so if I publish a layer
(map) from my geoserver here (http://www.cmar.csiro.au/geoserver/ ) as
layer name = bioreg:CAAB37020002 then any remote client can access it
via standard syntax which will retrieve it in a specified format, for
example

http://www.cmar.csiro.au/geoserver/wms?service=WMS&version=1.1.0&request=GetMap&layers=bioreg:CAAB37020002&styles=&bbox=109.0,-44.5,156.5,-8.5&width=512&height=388&srs=EPSG:4326&format=image/gif
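
To make the "standard syntax" point concrete, the request above can be assembled from its named parameters. The following is a minimal Python sketch using only the standard library. The helper name `build_getmap_url` and its argument names are illustrative assumptions; the parameter names and values are taken from the URL above. It only builds the URL and does not contact the server.

```python
from urllib.parse import urlencode


def build_getmap_url(base_url, layer, bbox, width, height,
                     srs="EPSG:4326", image_format="image/gif",
                     version="1.1.0"):
    """Assemble a WMS GetMap URL from named parameters (illustrative helper)."""
    params = {
        "service": "WMS",
        "version": version,
        "request": "GetMap",
        "layers": layer,
        "styles": "",  # empty value, as in the example URL
        "bbox": ",".join(str(v) for v in bbox),  # minx,miny,maxx,maxy
        "width": width,
        "height": height,
        "srs": srs,
        "format": image_format,
    }
    # Leave ':' ',' and '/' unescaped so the result matches the URL above.
    return base_url + "?" + urlencode(params, safe=":,/")


if __name__ == "__main__":
    print(build_getmap_url(
        "http://www.cmar.csiro.au/geoserver/wms",
        "bioreg:CAAB37020002",
        (109.0, -44.5, 156.5, -8.5),
        512,
        388,
    ))
```

Only the base URL and the layer name are specific to the publisher here; the remaining parameters are the same for any compliant server, and that uniformity is what the taxonomic services currently lack.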

So maybe for TCS, DwC and so on, a missing part of the task is
to define the syntax for such calls (plus relevant expected responses)
for taxonomic data and then create some example software, both
publishing and retrieving (client), to exercise this - provided there
is an interest in doing so of course!

More soon,

---------------------------------------------------------
Roderic Page
Professor of Taxonomy
Institute of Biodiversity, Animal Health and Comparative Medicine
College of Medical, Veterinary and Life Sciences
Graham Kerr Building
University of Glasgow
Glasgow G12 8QQ, UK

Email: r.page at bio.gla.ac.uk
Tel: +44 141 330 4778
Fax: +44 141 330 2792
Skype: rdmpage
Facebook: http://www.facebook.com/rdmpage
Twitter: http://twitter.com/rdmpage
Blog: http://iphylo.blogspot.com
Home page: http://taxonomy.zoology.gla.ac.uk/rod/rod.html
Citations: http://scholar.google.co.uk/citations?hl=en&user=4Z5WABAAAAAJ
