Good discussion Rod....
What gets a GUID?
I think it's clear from Richard's responses that he's thought a lot more about names than I have. Perhaps I'm too wedded to keeping things simple, but is it reasonable to suppose that there are basically three levels that we might operate at:
- Names as text strings (orthographic variants may belong here)
- Names as elements of nomenclature (i.e., date and authorship)
- Names as concepts (i.e., linked to explicit usage)
So, why not have GUIDs at all three levels, and have databases be explicit about what they serve? For example, my understanding is that uBio is essentially about name strings, IPNI is a database of plant nomenclature, whereas TROPICOS/VAST is about concepts (in the sense of what is the accepted name for a plant). So
- uBio
- IPNI
- TROPICOS/VAST
I would agree here with Rod's 3 levels. Richard has gone into much more detail on what a name might be, but I think that Rod's is good enough for our purposes: (1) name strings that supposedly refer to some concept but don't adhere to any of the biological codes of nomenclature, (2) names that were published correctly according to a biological code of nomenclature (and have by definition a type specimen), and (3) concepts (a name + publication, where the name may be (1) or (2) depending on the "quality" of the concept). Many of Richard's name types could be sub-types of 1, 2, or 3, as he suggests, I think.
So, one could imagine that a nomenclatural database could link its names to namestring GUIDs,
I couldn't honestly imagine a nomenclatural database linking to name string GUIDs, but I could maybe imagine a name search facility linking to nomenclatural or concept databases...
and a concept database could link to nomenclatural GUIDs.
Yes, this is what SEEK would want to do... (for its prototype - we're not seeing ourselves as the long-term concept bank (yet...))
From the user's perspective, they can choose what kind of entity they want to refer to.
Agree here, but they really need to understand the problem if they are to make good use of these services. If a user doesn't understand that a scientific name will not uniquely identify what kind of organism they are referring to, and therefore that their data will be ambiguous (probably now, and more likely in the future, thereby lowering its value), then we're doing them a disservice.
Why so many levels? Well, if we want to use names in the rest of biology we'll need to accommodate all levels of certainty about what a name means. At a minimum, we would map a name in a paper or some database to a name GUID. Through links to nomenclatural and concept databases, other users would at least see the range of possible meanings for that name. If the source was explicit about a name (e.g., "Aus bus" sensu "J. Doe", or a connection to another object such as a specimen or a sequence).
Yes, I agree we need to accommodate different levels of uncertainty. I guess this is where I've been promoting nominal concepts, for when you know the name but don't know the meaning, rather than using a name - basically to express more explicitly that we don't know the meaning of the name.
I think Richard is concerned that if we just have GUIDs for name strings then we're not gaining much. I disagree. We need to support all kinds of names, and in many cases name strings are all we have.
One could ask do we need GUIDs for names if we can simply search nomenclatural and concept databases using text strings (i.e., why should names get GUIDs?). Well, I guess the answer is because I think we need a database of names that has information on where the name came from, in what context it was used (e.g., in a book on flies so it's probably a fly name, a web-based checklist of plants of some island), and so on. Plus there are a lot of informal names ("A. sp"), names in different naming systems (Phylocode and less formal phylogenetic names), and so on. These won't necessarily be captured in nomenclatural databases.
These are more like concepts to me than name string GUIDs, except that the concepts don't have code-approved names... But if we're really only storing name strings and giving them a GUID, then that's more like what I thought you were originally proposing for name string GUIDs.
I think we get some benefits from this:
- Everybody is explicit about what they provide
Yes
- People can get on with doing what they want to do (and/or do best). uBio grabs names from wherever it can find them, IPNI provides detail on the authorship of names, TROPICOS/VAST suggests which name to use.
Chuck could correct me here about TROPICOS, but I guess they don't just suggest what names to use; they describe concepts, and part of that is assigning the appropriate code-compliant name to them... i.e. they have a clear meaning for each of the names they use. But some other database might have a just as valid but different meaning for the same name, possibly because of geography.
- Users decide what level is appropriate for them. They may just link to a name, they may link to a concept.
Yes - but only as long as they know that if they link to a name they're not being clear about what they mean. It is possible to link to concepts of different degrees of accuracy (genus, sp., subsp., etc.), which is different from linking to a name, which has a degree of ambiguity and possibly the granularity issue too (except for the implicit inclusion of type specimens in the meaning of names). Hope that made sense...
- Interlinking becomes easier because the nature of the link could be made explicit (e.g., this name occurs in my database as this concept, the name also occurs in ITIS, but I'm linking solely by name).
Yes if done correctly...
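To make the "explicit link" idea concrete, here is a minimal sketch in Python of the three levels discussed above, with a link that records the level at which two records are claimed to match. Everything in it (the class names, the provider labels, the rule that a link cannot claim more than the weaker of its two records supports) is an assumption for illustration only, not a description of how uBio, IPNI, TROPICOS or SEEK actually work.

```python
import uuid
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    """The three levels from the discussion, ordered by how much they assert."""
    NAME_STRING = 1    # just the text string
    NOMENCLATURAL = 2  # name with date and authorship under a code
    CONCEPT = 3        # name linked to an explicit usage


@dataclass(frozen=True)
class Record:
    guid: str
    level: Level
    label: str
    provider: str  # hypothetical provider label, not a real service


@dataclass(frozen=True)
class Link:
    from_guid: str
    to_guid: str
    claim: Level  # the level at which the two records are said to match


def new_record(level: Level, label: str, provider: str) -> Record:
    """Create a record with a freshly generated identifier."""
    return Record(str(uuid.uuid4()), level, label, provider)


def make_link(a: Record, b: Record, claim: Level) -> Link:
    """Link two records, refusing a claim stronger than either record supports."""
    weakest = min(a.level, b.level)
    if claim > weakest:
        raise ValueError(
            f"cannot claim a {claim.name} match: one record is only {weakest.name}"
        )
    return Link(a.guid, b.guid, claim)
```

With this, "I'm linking solely by name" becomes a link whose claim is `Level.NAME_STRING`, and anyone following the link can see that no statement about concepts is being made.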
Who is going to use this stuff?
If we think the audience is other taxonomists, and other taxonomic database providers, then we are being far too inward looking. I think we need to consider all sorts of things, particularly major bioinformatics databases (think GenBank), geographers (think Google Earth), and information providers (think journal publishers).
No, we in SEEK have been looking at this problem from an ecologist's point of view. If we are to integrate biodiversity data spanning spatial and temporal ranges, and we know that different parts of the globe use different meanings for the same taxa and that these have also changed across time, then integrating data on names will give inaccurate results for analysis. Genomic work is more recent, but even it is starting to suffer, I believe, from this same problem. So we should try to educate people now, and look to future as well as past data.
For example, some journals (such as those hosted by BioOne) have all strings that look like taxonomic names linked to ITIS (whether the name exists or not!). This provides added value -- click on the name and you learn more.
Yes but it might also add more confidence to the error! So a name is linked to a meaning but it's not what the author meant! The author should decide....and we should provide tools to help them do so.
Now, publishers are very keen on adding value, and simple, resolvable identifiers would be a bonus (after all, they got behind DOIs). But asking publishers to tag using concepts is going to be asking too much. Asking authors of papers is going to be a stretch as well: although some may well do so, many may balk at yet more stuff getting in the way of their work (which is doing the fun stuff and writing it up).
Agreed if we don't find a good way of doing it. But I still believe people need to be educated on this.
In my opinion we need to keep thinking of simple solutions (they might not be simple under the hood) that work, and that scale, and that have a low barrier to entry.
I agree, but I don't think we should pull the wool over people's eyes about the problem...
Concept bank
Sally's (not entirely serious) suggestion of "ConceptBank" bears thinking about, especially as one could populate it initially by automated searches. For example, studies that use DNA sequences will have GenBank accession numbers for sequences, which we could use to link together different studies on the same organism (i.e., if you use the same sequence you're talking about the same thing - probably).
But why will you use/have the same sequence? Because you've sequenced it yourself and got the same sequence? Or because you thought it was Aus bus or whatever, looked it up in GenBank and copied the sequence? Again, this highlights the importance of being clear what we mean when we use a name - or a sequence...
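As a rough illustration of the automated linking suggested above, the sketch below groups studies by the accession numbers they cite. The study labels and accession strings are made-up placeholders, and the function name is hypothetical. Note that the output only says two studies cite the same sequence; it cannot tell whether each study sequenced the organism independently or copied the record, which is exactly the caveat raised here.

```python
from collections import defaultdict


def shared_accessions(study_accessions):
    """Map each accession cited by more than one study to the set of citing studies.

    study_accessions: dict of study label -> set of accession strings.
    """
    by_accession = defaultdict(set)
    for study, accessions in study_accessions.items():
        for accession in accessions:
            by_accession[accession].add(study)
    # Keep only accessions that actually connect two or more studies.
    return {acc: studies for acc, studies in by_accession.items() if len(studies) > 1}


# Placeholder data, not real accession numbers.
studies = {
    "study-A": {"ACC-1", "ACC-2"},
    "study-B": {"ACC-2"},
    "study-C": {"ACC-3"},
}
print(shared_accessions(studies))
```

Here only `ACC-2` is reported, because it is the only accession cited by more than one study.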
Likewise, one could text mine PubMed and build lists of (name,publication) pairs. There are limits to this of course, but it would at least be a starting point.
Yes this would be useful for some things....but definitely limited because of the existing problems in use of names.....
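A very crude sketch of the (name, publication) mining idea follows, assuming abstracts are already available as plain text keyed by a publication identifier. The regular expression and function name are assumptions for illustration; real name-finding tools are far more careful. The pattern simply picks up any capitalised word followed by a lowercase word, so it also catches ordinary sentence openings, which shows how easily "strings that look like taxonomic names" get included whether or not the name exists.

```python
import re

# Capitalised word followed by a lowercase word of 3+ letters: a binomial-like string.
BINOMIAL_LIKE = re.compile(r"\b[A-Z][a-z]+ [a-z]{3,}\b")


def name_publication_pairs(abstracts):
    """Return a set of (candidate name string, publication id) pairs.

    abstracts: dict of publication id -> abstract text.
    """
    pairs = set()
    for pub_id, text in abstracts.items():
        for match in BINOMIAL_LIKE.finditer(text):
            pairs.add((match.group(0), pub_id))
    return pairs


# Placeholder abstracts with hypothetical identifiers.
abstracts = {
    "pub-1": "We sampled Aus bus on the island.",
    "pub-2": "Aus bus was absent.",
}
for pair in sorted(name_publication_pairs(abstracts)):
    print(pair)
```

On this input the pairs include "Aus bus" for both publications, but also the false positive "We sampled" for `pub-1`, so any list built this way would need checking before it is treated as a set of concepts.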
For me a major issue in all of this is scale. If we're serious we need to think big, look at automation, think about distributed, social approaches, and be willing to accept a degree of error. These aren't attributes of the typical taxonomist.
I agree we need to think big, and I agree we need automation - but automation of what? Tools to help people get their meaningful names right would be the best thing. Getting people to accept error is one thing; making sure they realise the error they are incorporating into their valuable research is just as important (if they want to share with others when they hadn't planned to beforehand).
Jessie