Various thoughts

Sun Nov 6 07:24:58 CET 2005

Here are some thoughts on some recent posts (I got sidetracked on my
iSpecies toy -- http://iSpecies.org -- which manages to show the power
of searching multiple information sources, and the limitations of
taxonomic names as a means to link those sources).

2 problems
---------------

Regarding Ricardo's post (Can we break down GUID problem into 2
hierarchical sub-problems?), I think the answer is clearly yes. For
some further insight, Lincoln Stein's article in Nature Reviews
Genetics (http://dx.doi.org/10.1038/nrg1065 -- there are PDFs online in
various places:
http://scholar.google.com/scholar?hl=en&lr=&c2coff=1&safe=off&cluster=8495832052860440576) outlines
Lincoln's notion that biological databases could be integrated using a
combination of "knuckles" and "nodes". Nodes are databases, knuckles
are services that map items in different databases.

I think we have lots of "nodes", and part of our challenge is to
integrate these. Step one is to have GUIDs for items in our databases,
so that we can always uniquely identify and retrieve data. Step two is
to map the items we are interested in. This is where I think much of
the taxonomic concept stuff comes in.

What gets a GUID?
------------------------

I think it's clear from Richard's responses that he's thought a lot
more about names than I have. Perhaps I'm too wedded to keeping things
simple, but is it reasonable to suppose that there are basically three
levels that we might operate at:

1. Names as text strings (orthographic variants may belong here)
2. Names as elements of nomenclature (i.e., date and authorship)
3. Names as concepts (i.e., linked to explicit usage)

So, why not have GUIDs at all three levels, and have databases be
explicit about what they serve? For example, my understanding is that
uBio is essentially about name strings, IPNI is a database of plant
nomenclature, whereas TROPICOS/VAST is about concepts (in the sense of
what is the accepted name for a plant). So

1. uBio
2. IPNI
3. TROPICOS/VAST

(Ducks incoming brickbats)

So, one could imagine that a nomenclatural database could link its
names to namestring GUIDs, and a concept database could link to
nomenclatural GUIDs. From the user's perspective, they can choose what
kind of entity they want to refer to.
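As a rough illustration of how the three levels might chain together, here is a minimal sketch in Python. The GUID formats ("ns:1", "nom:1", "con:1"), the class and field names, and the sample records (author "Smith", year 1900, "K. Roe") are all made up for the example; none of this reflects how uBio, IPNI or TROPICOS/VAST actually store their data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NameString:
    """Level 1: a name as a bare text string."""
    guid: str
    text: str


@dataclass(frozen=True)
class NomenclaturalName:
    """Level 2: a name with authorship and date, linked to a name string."""
    guid: str
    name_string_guid: str
    author: str
    year: int


@dataclass(frozen=True)
class Concept:
    """Level 3: a name as used by a particular source (sensu ...)."""
    guid: str
    nomenclatural_guid: str
    according_to: str


# Hypothetical records, keyed by hypothetical GUIDs.
name_strings = {"ns:1": NameString("ns:1", "Aus bus")}
nomenclature = {"nom:1": NomenclaturalName("nom:1", "ns:1", "Smith", 1900)}
concepts = {
    "con:1": Concept("con:1", "nom:1", "J. Doe"),
    "con:2": Concept("con:2", "nom:1", "K. Roe"),
}


def resolve_to_name_string(guid):
    """Follow links downward from any level to the underlying name string."""
    if guid in concepts:
        guid = concepts[guid].nomenclatural_guid
    if guid in nomenclature:
        guid = nomenclature[guid].name_string_guid
    return name_strings[guid]


def possible_meanings(name_string_guid):
    """Return the nomenclatural names and concepts that link to a name string."""
    noms = [n for n in nomenclature.values()
            if n.name_string_guid == name_string_guid]
    nom_guids = {n.guid for n in noms}
    cons = [c for c in concepts.values()
            if c.nomenclatural_guid in nom_guids]
    return noms, cons
```

A user who only has the string can call possible_meanings("ns:1") to see the range of candidate meanings, while a user who cites "con:1" has said exactly which usage they mean.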

Why so many levels? Well, if we want to use names in the rest of
biology we'll need to accommodate all levels of certainty about what a
name means. At a minimum, we would map a name in a paper or some
database to a name GUID. Through links to nomenclatural and concept
databases, other users would at least see the range of possible
meanings for that name. If the source was explicit about a name (e.g.,
"Aus bus" sensu "J. Doe", or a connection to another object such as a
specimen or a sequence), we could link at the concept level instead.

I think Richard is concerned that if we just have GUIDs for name
strings then we're not gaining much. I disagree. We need to support
all kinds of names, and in many cases name strings are all we have. One
could ask whether we need GUIDs for names if we can simply search
nomenclatural and concept databases using text strings (i.e., why
should names get GUIDs?). Well, I guess the answer is that I think
we need a database of names that has information on where the name came
from, in what context it was used (e.g., in a book on flies, so it's
probably a fly name, or a web-based checklist of plants of some island),
and so on. Plus there are a lot of informal names ("A. sp"), names in
different naming systems (Phylocode and less formal phylogenetic
names), and so on. These won't necessarily be captured in nomenclatural
databases.

I think we get some benefits from this:

1. Everybody is explicit about what they provide

2. People can get on with doing what they want to do (and/or do best).
uBio grabs names from wherever it can find them, IPNI provides detail
on the authorship of names, TROPICOS/VAST suggests which name to use.

3. Users decide what level is appropriate for them. They may just link
to a name, they may link to a concept.

4. Interlinking becomes easier because the nature of the link could be
made explicit (e.g., this name occurs in my database as this concept,
the name also occurs in ITIS, but I'm linking solely by name).
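To make point 4 a little more concrete, a link record could carry its own "kind", so that anyone following it knows how much to trust it. This is only a sketch: the LinkKind values, the Link fields and the identifiers ("my-database:record-42", "con:1", "ns:1") are invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class LinkKind(Enum):
    """How strongly a record is tied to the thing it links to."""
    NAME_STRING = "name string only"
    NOMENCLATURAL = "nomenclatural"
    CONCEPT = "concept"


@dataclass(frozen=True)
class Link:
    source: str       # identifier of the record making the link (hypothetical)
    target_guid: str  # GUID being linked to (hypothetical format)
    kind: LinkKind    # the explicit nature of the link


links = [
    # "This name occurs in my database as this concept..."
    Link("my-database:record-42", "con:1", LinkKind.CONCEPT),
    # "...the name also occurs elsewhere, but I'm linking solely by name."
    Link("my-database:record-42", "ns:1", LinkKind.NAME_STRING),
]


def links_of_kind(all_links, kind):
    """Select only the links asserted at a given level."""
    return [link for link in all_links if link.kind is kind]
```

A consumer who only trusts concept-level assertions can then filter with links_of_kind(links, LinkKind.CONCEPT) and ignore the weaker string matches.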

Regarding Richard's option 2: "How to use GUIDs to dramatically change
the way biological data is exchanged (higher cost, slower
implementation, fundamental improvements)."

I feel this is only going to happen if it's simple and easy to use, and
does in fact yield demonstrable improvements. Which leads me to the
next point.

Who is going to use this stuff?
--------------------------------------

If we think the audience is other taxonomists, and other taxonomic
database providers, then we are being far too inward looking. I think
we need to consider all sorts of things, particularly major
bioinformatics databases (think GenBank), geographers (think Google
Earth), and information providers (think journal publishers).

For example, some journals (such as those hosted by BioOne) have all
strings that look like taxonomic names linked to ITIS (whether the name
exists or not!). This provides added value -- click on the name and you
learn more. Now, publishers are very keen on adding value, and simple,
resolvable identifiers would be a bonus (after all, they got behind
DOIs). But asking publishers to tag using concepts is going to be
asking too much. Asking authors of papers is going to be a stretch as
well: although some may well do so, many may balk at yet more stuff
getting in the way of their work (which is doing the fun stuff and
writing it up).

In my opinion we need to keep thinking of simple solutions (they might
not be simple under the hood) that work, and that scale, and that have
a low barrier to entry.

Concept bank
-------------------

Sally's (not entirely serious) suggestion of "ConceptBank" bears
thinking about, especially as one could populate it initially by
automated searches. For example, studies that use DNA sequences will
have GenBank accession numbers for sequences, which we could use to
link together different studies on the same organism (i.e., if you use
the same sequence you're talking about the same thing - probably).
Likewise, one could text mine PubMed and build lists of
(name,publication) pairs. There are limits to this of course, but it
would at least be a starting point.
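The accession-number idea can be sketched in a few lines of Python: build an index from accession number to the studies that cite it, then report every pair of studies sharing at least one sequence. The study labels and accession strings below are placeholders, not real GenBank records, and the function name is my own.

```python
from collections import defaultdict
from itertools import combinations


def link_studies_by_accession(study_accessions):
    """Map each pair of studies to the accession numbers they share.

    study_accessions: dict of study identifier -> set of accession numbers.
    Returns a dict of (study_a, study_b) -> set of shared accessions,
    with each pair in sorted order.
    """
    # Invert the input: accession number -> studies that used it.
    by_accession = defaultdict(set)
    for study, accessions in study_accessions.items():
        for accession in accessions:
            by_accession[accession].add(study)

    # Any two studies citing the same accession are (probably) about
    # the same organism, so record the evidence for each such pair.
    links = defaultdict(set)
    for accession, studies in by_accession.items():
        for a, b in combinations(sorted(studies), 2):
            links[(a, b)].add(accession)
    return dict(links)


# Placeholder data: three studies and made-up accession strings.
studies = {
    "study-A": {"ACC0001", "ACC0002"},
    "study-B": {"ACC0002", "ACC0003"},
    "study-C": {"ACC0004"},
}
shared = link_studies_by_accession(studies)
```

Here study-A and study-B end up linked through "ACC0002", while study-C stays unlinked. Keeping the shared accessions as evidence lets a later curator judge how strong each automated link is, in keeping with accepting a degree of error.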

For me a major issue in all of this is scale. If we're serious we need
to think big, look at automation, think about distributed, social
approaches, and be willing to accept a degree of error. These aren't
attributes of the typical taxonomist.

Regards

Rod

------------------------------------------------------------------------
Professor Roderic D. M. Page
Editor, Systematic Biology
DEEB, IBLS
Graham Kerr Building
University of Glasgow
Glasgow G12 8QP
United Kingdom

Phone:    +44 141 330 4778
Fax:      +44 141 330 2792
email:    r.page at bio.gla.ac.uk
web:      http://taxonomy.zoology.gla.ac.uk/rod/rod.html
reprints: http://taxonomy.zoology.gla.ac.uk/rod/pubs.html

Subscribe to Systematic Biology through the Society of Systematic
Biologists Website:  http://systematicbiology.org
Search for taxon names at http://darwin.zoology.gla.ac.uk/~rpage/portal/
Find out what we know about a species at http://ispecies.org