
Here are some thoughts on some recent posts (I got sidetracked on my iSpecies toy -- http://iSpecies.org -- which manages to show both the power of searching multiple information sources and the limitations of taxonomic names as a means to link those sources).

2 problems
---------------

Regarding Ricardo's post (Can we break down GUID problem into 2 hierarchical sub-problems?), I think the answer is clearly yes. For some further insight, Lincoln Stein's article in Nature Reviews Genetics (http://dx.doi.org/10.1038/nrg1065 -- there are PDFs online in various places: http://scholar.google.com/scholar?hl=en&lr=&c2coff=1&safe=off&cluster=8495832052860440576) outlines Lincoln's notion that biological databases could be integrated using a combination of "knuckles" and "nodes". Nodes are databases, knuckles are services that map items in different databases.

I think we have lots of "nodes", and part of our challenge is to integrate these. Step one is to have GUIDs for items in our databases, so that we can always uniquely identify and retrieve data. Step two is to map the items we are interested in across databases. This is where I think much of the taxonomic concept stuff comes in.

What gets a GUID?
------------------------

I think it's clear from Richard's responses that he's thought a lot more about names than I have. Perhaps I'm too wedded to keeping things simple, but is it reasonable to suppose that there are basically three levels that we might operate at:

1. Names as text strings (orthographic variants may belong here)
2. Names as elements of nomenclature (i.e., date and authorship)
3. Names as concepts (i.e., linked to explicit usage)

So, why not have GUIDs at all three levels, and have databases be explicit about what they serve? For example, my understanding is that uBio is essentially about name strings, IPNI is a database of plant nomenclature, whereas TROPICOS/VAST is about concepts (in the sense of what is the accepted name for a plant). So

1. uBio
2. IPNI
3. TROPICOS/VAST

(Ducks incoming brickbats)

So, one could imagine that a nomenclatural database could link its names to name string GUIDs, and a concept database could link to nomenclatural GUIDs. From the user's perspective, they can choose what kind of entity they want to refer to.

Why so many levels? Well, if we want to use names in the rest of biology we'll need to accommodate all levels of certainty about what a name means. At a minimum, we would map a name in a paper or some database to a name GUID. Through links to nomenclatural and concept databases, other users would at least see the range of possible meanings for that name. If the source was explicit about a name (e.g., "Aus bus" sensu "J. Doe", or a connection to another object such as a specimen or a sequence), then we could map it to a nomenclatural or concept GUID instead.

I think Richard is concerned that if we just have GUIDs for name strings then we're not gaining much. I disagree. We need to support all kinds of names, and in many cases name strings are all we have.

One could ask whether we need GUIDs for names if we can simply search nomenclatural and concept databases using text strings (i.e., why should names get GUIDs?). Well, I guess the answer is that I think we need a database of names that has information on where the name came from, in what context it was used (e.g., in a book on flies so it's probably a fly name, a web-based checklist of plants of some island), and so on. Plus there are a lot of informal names ("A. sp"), names in different naming systems (Phylocode and less formal phylogenetic names), and so on. These won't necessarily be captured in nomenclatural databases.

I think we get some benefits from this:

1. Everybody is explicit about what they provide.
2. People can get on with doing what they want to do (and/or do best). uBio grabs names from wherever it can find them, IPNI provides detail on the authorship of names, TROPICOS/VAST suggests which name to use.
3. Users decide what level is appropriate for them. They may just link to a name, they may link to a concept.
4. Interlinking becomes easier because the nature of the link could be made explicit (e.g., this name occurs in my database as this concept, the name also occurs in ITIS, but I'm linking solely by name).

Regarding Richard's option 2: "How to use GUIDs to dramatically change the way biological data is exchanged (higher cost, slower implementation, fundamental improvements)." I feel this is only going to happen if it's simple and easy to use, and does in fact yield demonstrable improvements. Which leads me to the next point.

Who is going to use this stuff?
--------------------------------------

If we think the audience is other taxonomists, and other taxonomic database providers, then we are being far too inward looking. I think we need to consider all sorts of users, particularly major bioinformatics databases (think GenBank), geographers (think Google Earth), and information providers (think journal publishers).

For example, some journals (such as those hosted by BioOne) have all strings that look like taxonomic names linked to ITIS (whether the name exists or not!). This provides added value -- click on the name and you learn more. Now, publishers are very keen on adding value, and simple, resolvable identifiers would be a bonus (after all, they got behind DOIs).

Now, asking publishers to tag using concepts is going to be asking too much. Asking authors of papers is going to be a stretch as well: although some may well do so, many may balk at yet more stuff getting in the way of their work (which is doing the fun stuff and writing it up).

In my opinion we need to keep thinking of simple solutions (they might not be simple under the hood) that work, that scale, and that have a low barrier to entry.

Concept bank
-------------------

Sally's (not entirely serious) suggestion of "ConceptBank" bears thinking about, especially as one could populate it initially by automated searches. For example, studies that use DNA sequences will have GenBank accession numbers for sequences, which we could use to link together different studies on the same organism (i.e., if you use the same sequence you're talking about the same thing - probably). Likewise, one could text mine PubMed and build lists of (name, publication) pairs. There are limits to this of course, but it would at least be a starting point.

For me a major issue in all of this is scale. If we're serious we need to think big, look at automation, think about distributed, social approaches, and be willing to accept a degree of error. These aren't attributes of the typical taxonomist.

Regards

Rod

----------------------------------------------------------------------------------------------------------------
Professor Roderic D. M. Page
Editor, Systematic Biology
DEEB, IBLS
Graham Kerr Building
University of Glasgow
Glasgow G12 8QP
United Kingdom

Phone: +44 141 330 4778
Fax: +44 141 330 2792
email: r.page@bio.gla.ac.uk
web: http://taxonomy.zoology.gla.ac.uk/rod/rod.html
reprints: http://taxonomy.zoology.gla.ac.uk/rod/pubs.html

Subscribe to Systematic Biology through the Society of Systematic Biologists Website: http://systematicbiology.org
Search for taxon names at http://darwin.zoology.gla.ac.uk/~rpage/portal/
Find out what we know about a species at http://ispecies.org