Various thoughts

Sun Nov 6 20:07:33 CET 2005

Good discussion Rod....

> 
> What gets a GUID?
> ------------------------
> 
> I think it's clear from Richard's responses that he's thought a lot
> more about names than I have. Perhaps I'm too wedded to keeping things
> simple, but is it reasonable to suppose that there are basically three
> levels that we might operate at:
> 
> 1. Names as text strings (orthographic variants may belong here)
> 2. Names as elements of nomenclature (i.e., date and authorship)
> 3. Names as concepts (i.e., linked to explicit usage)
> 
> So, why not have GUIDs at all three levels, and have databases be
> explicit about what they serve? For example, my understanding is that
> uBio is essentially about name strings, IPNI is a database of plant
> nomenclature, whereas TROPICOS/VAST is about concepts (in the sense of
> what is the accepted name for a plant). So
> 
> 1. uBio
> 2. IPNI
> 3. TROPICOS/VAST

I would agree here with Rod's 3 levels. Richard has gone into much more
detail on what a name might be, but I think that Rod's is good enough for
our purposes: (1) name strings that supposedly refer to some concept but
don't adhere to any of the biological codes of nomenclature, (2) names
that were published correctly according to a biological code of
nomenclature (and have by definition a type specimen), and (3) concepts (a
name + publication, where the name may be (1) or (2) depending on the
"quality" of the concept). Many of Richard's name types could be
sub-types of 1, 2, or 3, as he suggests, I think.
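
As a rough illustration only, the three levels could be sketched as three
record types, each with its own GUID and a link to the level below. Every
class name, field name and identifier here is hypothetical (none comes
from uBio, IPNI or TROPICOS/VAST), and the downward link from a
nomenclatural name to a name string follows Rod's proposal rather than
any existing system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NameString:
    """Level 1: a bare text string, with no claim about nomenclature."""
    guid: str
    text: str


@dataclass(frozen=True)
class NomenclaturalName:
    """Level 2: a name published under a code of nomenclature."""
    guid: str
    name_string_guid: str   # link down to level 1
    authorship: str
    year: int


@dataclass(frozen=True)
class Concept:
    """Level 3: a name as used in a particular publication."""
    guid: str
    name_guid: str          # may point at a level 1 or a level 2 record
    publication: str


def resolve_text(concept, names, strings):
    """Follow the links from a concept down to its text string."""
    guid = concept.name_guid
    if guid in names:       # level 2 name: hop down to its name string
        guid = names[guid].name_string_guid
    return strings[guid].text


# Hypothetical records; these identifiers exist in no real service.
name_string = NameString(guid="string:1", text="Aus bus")
nomen = NomenclaturalName(guid="nomen:1", name_string_guid=name_string.guid,
                          authorship="Placeholder", year=1900)
concept = Concept(guid="concept:1", name_guid=nomen.guid,
                  publication="J. Doe")
# A lower "quality" concept that only has a name string to hang on to.
weak_concept = Concept(guid="concept:2", name_guid=name_string.guid,
                       publication="unpublished checklist")

names = {nomen.guid: nomen}
strings = {name_string.guid: name_string}
print(resolve_text(concept, names, strings))
print(resolve_text(weak_concept, names, strings))
```

Both concepts resolve to the same text string, which is exactly why the
string alone cannot tell a user which meaning was intended.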

> 
> So, one could imagine that a nomenclatural database could link its
> names to namestring GUIDs, 
I couldn't honestly imagine a nomenclatural database linking to name
string GUIDs, but I could maybe imagine a name search facility linking
to nomenclatural or concept databases...
> and a concept database could link to
> nomenclatural GUIDs. 
Yes this is what SEEK would want to do... (for its prototype - we're not
seeing ourselves being the long term concept bank (yet...) )
> From the user's perspective, they can choose what
> kind of entity they want to refer to.

Agree here, but they really need to understand the problem if they are to
make good use of these services. If a user doesn't understand that a
scientific name will not uniquely identify what kind of organism they are
referring to, and that their data will therefore be ambiguous (probably
now, and more likely in the future, thereby lowering its value), then
we're doing them a disservice.

> 
> Why so many levels? Well, if we want to use names in the rest of
> biology we'll need to accommodate all levels of certainty about what a
> name means. At a minimum, we would map a name in a paper or some
> database to a name GUID. Through links to nomenclatural and concept
> databases, other users would at least see the range of possible
> meanings for that name. If the source was explicit about a name (e.g.,
> "Aus bus" sensu "J. Doe", or a connection to another object such as a
> specimen or a sequence).
Yes, I agree we need to accommodate different levels of uncertainty. I
guess this is where I've been promoting nominal concepts for when you know
the name but don't know the meaning, rather than using a name: basically
to express more explicitly that we don't know the meaning of the name.
> 
> I think Richard is concerned that if we just have GUIDs for name
> strings then we're not gaining much. I disagree.  We need to support
> all kinds of names, and in many cases name strings are all we have. One
> could ask do we need GUIDs for names if we can simply search
> nomenclatural and concept databases using text strings (i.e., why
> should names get GUIDs?). Well, I guess the answer is because I think
> we need a database of names that has information on where the name came
> from, in what context it was used (e.g., in a book on flies so it's
> probably a fly name, a web-based checklist of plants of some island),
> and so on. Plus there are a lot of informal names ("A. sp"), names in
> different naming systems (Phylocode and less formal phylogenetic
> names), and so on. These won't necessarily be captured in nomenclatural
> databases.
These are more like concepts to me than name string GUIDs, except that
the concepts don't have code-approved names. But if we're really only
storing name strings and giving them a GUID, then that's closer to what I
thought you were originally proposing with name string GUIDs.

> 
> I think we get some benefits from this:
> 
> 1. Everybody is explicit about what they provide

Yes

> 
> 2. People can get on with doing what they want to do (and/or do best).
> uBio grabs names from wherever it can find them, IPNI provides detail
> on the authorship of names, TROPICOS/VAST suggests which name to use.

Chuck could correct me here about TROPICOS, but I guess they don't just
suggest what names to use: they describe concepts, and part of that is
assigning the appropriate code-compliant name to them, i.e. they have
a clear meaning for each of the names they use. But some other database
might have a just as valid but different meaning for the same name,
possibly because of geography.

> 
> 3. Users decide what level is appropriate for them. They may just link
> to a name, they may link to a concept.

Yes, but only as long as they know that if they link to a name they're
not being clear about what they mean. It is possible to link to concepts
of different degrees of accuracy (genus, sp., subsp., etc.), which is
different from linking to a name, which has a degree of ambiguity and
possibly the granularity issue too (except for the implicit inclusion of
type specimens in the meaning of names). Hope that made sense...
> 
> 4. Interlinking becomes easier because the nature of the link could be
> made explicit (e.g., this name occurs in my database as this concept,
> the name also occurs in ITIS, but I'm linking solely by name).
Yes if done correctly...
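
One way to picture "making the nature of the link explicit" is to store
the kind of link alongside the link itself. The sketch below is only an
assumption about how that might look: the `LinkKind` values, the `Link`
fields and the GUID strings are all invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class LinkKind(Enum):
    """What the linker is actually claiming (hypothetical vocabulary)."""
    NAME_STRING_ONLY = "matched on the text string alone"
    NOMENCLATURAL = "same published name (authorship and date)"
    CONCEPT = "same concept (name as used in a stated publication)"


@dataclass(frozen=True)
class Link:
    source_record: str   # identifier of the record in my own database
    target_guid: str     # GUID in the other provider's database
    kind: LinkKind


def states_meaning(link):
    """Only a concept link says what the linker meant by the name."""
    return link.kind is LinkKind.CONCEPT


# Rod's example: the name is a concept in my database, but the link to
# ITIS is made solely by name. Both GUIDs are placeholders.
links = [
    Link("mydb:42", "conceptbank:placeholder-1", LinkKind.CONCEPT),
    Link("mydb:42", "itis:placeholder-2", LinkKind.NAME_STRING_ONLY),
]

for link in links:
    print(link.target_guid, "|", link.kind.value,
          "| states meaning:", states_meaning(link))
```

A consumer of such links can then decide for itself whether a link made
solely by name is good enough for its analysis.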

> 
> Who is going to use this stuff?
> --------------------------------------
> 
> If we think the audience is other taxonomists, and other taxonomic
> database providers, then we are being far too inward looking. I think
> we need to consider all sorts of things, particularly major
> bioinformatics databases (think GenBank), geographers (think Google
> Earth), and information providers (think journal publishers).

No, we in SEEK have been looking at this problem from an ecologist's
point of view. If we are to integrate biodiversity data spanning spatial
and temporal ranges, and we know that different parts of the globe use
different meanings for the same taxa and that these have also changed
over time, then integrating data on names will give inaccurate results
for analysis. Genomic work is more recent, but I believe even it is
starting to suffer from this same problem. So we should try to educate
people now, and look to future data as well as past data.
> 
> For example, some journals (such as those hosted by BioOne) have all
> strings that look like taxonomic names linked to ITIS (whether the name
> exists or not!). This provides added value -- click on the name and you
> learn more.
Yes but it might also add more confidence to the error! So a name is
linked to a meaning but it's not what the author meant!
The author should decide....and we should provide tools to help them do
so.

> Now, publishers are very keen on adding value, and simple,
> resolvable identifiers would be a bonus (after all, they got behind
> DOIs). Now, asking publishers to tag using concepts is going to be
> asking too much. Asking authors of papers is going to be a stretch as
> well, although some may well do so, many may balk at yet more stuff
> getting in the way of their work (which is doing the fun stuff and
> writing it up).

Agreed if we don't find a good way of doing it. But I still believe
people need to be educated on this.
> 
> In my opinion we need to keep thinking of simple solutions (they might
> not be simple under the hood) that work, and that scale, and that have
> a low barrier to entry.

I agree, but I don't think we should pull the wool over people's eyes
about the problem...

> 
> Concept bank
> -------------------
> 
> Sally's (not entirely serious) suggestion of "ConceptBank" bears
> thinking about, especially as one could populate it initially by
> automated searches. For example, studies that use DNA sequences will
> have GenBank accession numbers for sequences, which we could use to
> link together different studies on the same organism (i.e., if you use
> the same sequence you're talking about the same thing - probably).

But why will you use/have the same sequence? Because you've sequenced
it yourself and got the same sequence? Or because you thought it was
Aus bus or whatever, looked it up in GenBank and copied the sequence?
Again highlighting the importance of being clear what we mean when we
use a name - or sequence....
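
For what it's worth, the automated linking Rod describes might look
something like the sketch below: group studies by the accession numbers
they cite and keep the accessions shared by more than one study. The
study labels and accession strings are invented placeholders, not real
GenBank records, and the grouping says nothing about why two studies
share a sequence, which is the caveat raised above.

```python
from collections import defaultdict


def studies_sharing_sequences(citations):
    """Map each accession cited by two or more studies to those studies.

    `citations` is an iterable of (study, accession) pairs.
    """
    studies_by_accession = defaultdict(set)
    for study, accession in citations:
        studies_by_accession[accession].add(study)
    return {
        accession: sorted(studies)
        for accession, studies in studies_by_accession.items()
        if len(studies) > 1
    }


# Placeholder data: "ACC0001" etc. are not real accession numbers.
citations = [
    ("study A", "ACC0001"),
    ("study B", "ACC0001"),
    ("study B", "ACC0002"),
    ("study C", "ACC0003"),
]

shared = studies_sharing_sequences(citations)
print(shared)
```

Here only the first accession links two studies; whether they really
mean the same organism still depends on how each study came by the
sequence.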

> Likewise, one could text mine PubMed and build lists of
> (name,publication) pairs. There are limits to this of course, but it
> would at least be a starting point.

Yes this would be useful for some things....but definitely limited
because of the existing problems in use of names.....

> 
> For me a major issue in all of this is scale. If we're serious we need
> to think big, look at automation, think about distributed, social
> approaches, and be willing to accept a degree of error. These aren't
> attributes of the typical taxonomist.

I agree we need to think big, and I agree we need automation, but
automation of what? Tools to help people get their meaningful names
right would be the best thing. Getting people to accept error is one
thing; making sure they realise the error they are incorporating into
their valuable research is just as important (if they later want to
share it with others they hadn't planned to share with beforehand).

Jessie