Various thoughts

Sun Nov 6 00:17:30 CET 2005

Thanks, Rod -- great stuff!

> What gets a GUID?
> ------------------------
>
> I think it's clear from Richard's responses that he's thought a lot
> more about names than I have. Perhaps I'm too wedded to keeping things
> simple, but is it reasonable to suppose that there are basically three
> levels that we might operate at:
>
> 1. Names as text strings (orthographic variants may belong here)
> 2. Names as elements of nomenclature (i.e., date and authorship)
> 3. Names as concepts (i.e., linked to explicit usage)
>
> So, why not have GUIDs at all three levels, and have databases be
> explicit about what they serve?

Unfortunately, there are more than three levels in current use (from least
to most taxonomically specific):

1. Names as raw text strings (including orthographic variants, homonyms not
distinguished)

2. Names as raw text strings (including orthographic variants, homonyms
distinguished)

3. Names as complete sets of one or more name-units (including orthographic
variants, homonyms distinguished)

4. Names as complete sets of one or more name-units (excluding orthographic
variants, homonyms distinguished)

5. Names as semi-complete sets of one or more name-units (including
orthographic variants, homonyms distinguished)

6. Names as semi-complete sets of one or more name-units (excluding
orthographic variants, homonyms distinguished)

7. Names as terminal units only (excluding orthographic variants, homonyms
distinguished)

8. Names in context of usage (Name SEC Usage instance)

Note: Monomials have one name-unit, binomials have two, trinomials have
three, the name "Centropyge (Xiphypops) fisheri flavicauda" has four, and so
on.

There are actually more possibilities than these (e.g., whether or not
autonyms/nominotypical names are treated as distinct), but the above list
covers the main range of what I know to already be "out there". (Let's not
even open the Pandora's box of hybrid name formulae...)

I believe that your work and that of uBio typically involves #1 & #2 (#1 is
a Google search; #2 is usually in the form of Name+authors). Taxonomic
authorities for specimen databases sometimes operate on #3 & #4. ITIS works
at #5 ("semi-complete" because infrageneric names are not strictly included
for names at rank of species and lower, and they also allow only one level
of infraspecific name-unit).  #6 is the botanical approach, and I *think* the
way that IPNI and/or IF do it (in this case, semi-complete excludes
infrageneric name-units for species and lower, but I'm not sure how many
simultaneous levels of infraspecific names they accommodate -- Sally??).  #7
is the zoological perspective, and #8 is concepts.

Actually, #8 is only one flavor of concept -- but concepts don't really even
belong on this list at all.  It seems that most people can agree that a
concept is represented by a Name-Usage instance [yes, Jessie -- "Usage" can
be sensu-stricto here! :-) ] -- e.g., "Name+Publication".  We haven't begun
the war over what constitutes a "Publication" (I prefer the term
"Documentation"), but that war doesn't need to be started here.  I think we
can reasonably safely say that concepts can be thought of as
Name+Publication, and so for this discussion we really only need to focus on
what a "Name" is.

So....I think we can probably reduce the list of 1-7 above down to perhaps 3
different items -- but I wouldn't go with exactly the three you listed.  We
can eliminate #1 of my list, because the name-string itself is the GUID (or
rather, the provider+namestring itself can serve as the GUID). The main
difference between #2 and #3 is that #3 parses and distinguishes the ranks
of the individual name-units. In other words, it's smart enough to know that
the name:
"Centropyge (Xiphypops) fisheri flavicauda (Fraser-Brunner 1933)"
consists of GenusName+SubgenusName+SpeciesEpithet+SubspeciesEpithet -- which
is slightly more informative than #2, which understands only a text string
consisting of 63 consecutive ASCII characters.
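As a rough illustration of the difference between #2 and #3 (a sketch only;
the NameUnit type and the rank labels are hypothetical and not taken from any
existing schema), the same name can be held either as one opaque string or as
a sequence of ranked name-units plus authorship:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NameUnit:
    rank: str  # hypothetical rank label, e.g. "genus" or "species"
    text: str  # the name-unit as written


# Level #2: the name is only an opaque text string.
raw = "Centropyge (Xiphypops) fisheri flavicauda (Fraser-Brunner 1933)"

# Level #3: the same name, with each name-unit tagged by rank.
parsed = (
    NameUnit("genus", "Centropyge"),
    NameUnit("subgenus", "Xiphypops"),
    NameUnit("species", "fisheri"),
    NameUnit("subspecies", "flavicauda"),
)
authorship = "(Fraser-Brunner 1933)"


def render(units, authorship):
    """Rebuild the display string from ranked units (subgenus in parentheses)."""
    parts = []
    for unit in units:
        parts.append(f"({unit.text})" if unit.rank == "subgenus" else unit.text)
    parts.append(authorship)
    return " ".join(parts)
```

Going from the parsed form back to the string is trivial (render() above);
going the other way is where the extra work for a data provider lies.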

Also, ITIS technically could be represented as #3 (instead of #5), because
it actually tracks infrageneric name-units via hierarchical "parent_tsn"
links. Also, #3 requires less work to sort out than #5 (e.g., #3 doesn't
need to recognize that "Centropyge flavicauda" and "Centropyge flavicaudus"
derive from the same basionym), so more existing datasets could conform to
it with less work to scrutinize each name record.  #4 is the least common of
the first five (in my experience), and could also be "dumbed down" to fit
into #3 fairly painlessly, or possibly even "smartened up" to fit into #6 in
some cases.

So....of the first five in the list, we could probably get by with just #3
(which would more or less correspond to your #1, only a bit more explicitly
defined). I think we should definitely keep #6 as a GUID candidate, because
it fits the botanical practice.  Same with #7 for zoological practice.
Thus, I do have hope that we could reduce the list to these three name-GUID
domains:

1. Names as complete sets of one or more name-units (including orthographic
variants, homonyms distinguished)
(for names-only records, inclusive of authors to distinguish homonyms)

2. Names as semi-complete sets of one or more name-units (excluding
orthographic variants, homonyms distinguished)
(for IPNI/IF and botanical datasets)

3. Names as terminal units only (excluding orthographic variants, homonyms
distinguished)
(for zoological datasets -- particularly ZooBank).
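One possible reading of these three domains, sketched in Python. Everything
here is an assumption made for illustration: the Unit type, the rank sets, the
function names, and especially the "canonical" lookup table, which stands in
for the real (and much harder) work of recognizing orthographic variants.
Whether rank or authorship belongs in each key is exactly the kind of explicit
rule that would still need to be agreed.

```python
from typing import NamedTuple


class Unit(NamedTuple):
    rank: str
    text: str


# Assumed rank labels; a real standard would have to enumerate these.
INFRAGENERIC = {"subgenus", "section", "series"}
SPECIES_AND_BELOW = {"species", "subspecies", "variety", "form"}


def complete_key(units, authorship):
    """Domain 1: every name-unit as written, plus authorship for homonyms."""
    return (tuple(units), authorship)


def semi_complete_key(units, authorship, canonical):
    """Domain 2: drop infrageneric units for names at species rank or lower,
    and fold orthographic variants onto one spelling."""
    terminal_rank = units[-1].rank
    kept = [
        u for u in units
        if not (terminal_rank in SPECIES_AND_BELOW and u.rank in INFRAGENERIC)
    ]
    folded = tuple(Unit(u.rank, canonical.get(u.text, u.text)) for u in kept)
    return (folded, authorship)


def terminal_key(units, authorship, canonical):
    """Domain 3: the terminal name-unit only, variants folded."""
    last = units[-1]
    return (canonical.get(last.text, last.text), authorship)


# Hypothetical variant table and two records for the same basionym.
canonical = {"flavicaudus": "flavicauda"}
auth = "Fraser-Brunner 1933"
a = [Unit("genus", "Centropyge"), Unit("subgenus", "Xiphypops"),
     Unit("species", "flavicauda")]
b = [Unit("genus", "Centropyge"), Unit("species", "flavicaudus")]
```

Under these assumptions, records a and b get different keys in domain 1 but
the same key in domains 2 and 3.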

There are several other reasons why these three "domains" of name-GUIDs make
sense to me (this message is too long as it is, so I won't go into those
reasons now). But bear in mind that each one of them would need some
explicit rules about things like autonyms/nominotypical names and such.  And
eventually we *will* need to open the "hybrid" Pandora's box....but I think
this set of three could be a useful start.

Any one of the three could be placed in the context of a Usage instance
(=Publication), and thereby represent a concept.

The main question I have for you, Rod, is: would the "Names as complete sets
of one or more name-units" be too information-rich to be practical?  In
other words, is it reasonable to force name providers (who want to conform
to a TDWG Name-GUID standard) to minimally: 1) distinguish homonyms, and 2)
identify ranks of individual name-units in a multi-unit name (e.g.,
distinguish a trinomial of Genus+Subgenus+species [Octopus (Octopus)
octopus] as different from a trinomial of Genus+species+subspecies [Octopus
octopus octopus])? If these bits of metadata are not unreasonable, then we
might just find a path to mutual agreement on this issue!
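To show how small that second requirement is for simple cases, here is a
minimal sketch in Python. The function name and rank labels are hypothetical,
and it assumes a plain zoological-style name without authorship, in which a
parenthesized token is a subgenus and the remaining tokens are species and
then subspecies. Anything more complicated (more epithets, hybrids, botanical
rank markers) is out of scope.

```python
def name_units(text):
    """Assign a rank to each whitespace-separated token of a simple name."""
    tokens = text.split()
    units = [("genus", tokens[0])]
    # Only two epithet levels are handled; a third epithet raises StopIteration.
    epithet_ranks = iter(["species", "subspecies"])
    for token in tokens[1:]:
        if token.startswith("(") and token.endswith(")"):
            units.append(("subgenus", token[1:-1]))
        else:
            units.append((next(epithet_ranks), token))
    return units


with_subgenus = name_units("Octopus (Octopus) octopus")
with_subspecies = name_units("Octopus octopus octopus")
```

Both results have three name-units with the same spelling (ignoring case), and
they differ only in the ranks attached to the second and third units.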

I hope the fact that I embrace your "three alternate GUIDs for names"
approach (i.e., the bulk of your last message) is self-evident from all the
text I have written above!

As for the section of your email under this heading:

> Who is going to use this stuff?
> --------------------------------------

I agree with everything you wrote.  My thoughts regarding "How to use GUIDs
to dramatically change the way biological data is exchanged" are focused not
just on helping taxonomists, but also on concealing the complexities and
subtleties of taxonomy from non-taxonomists, to allow more direct access to
the information they are most likely to be searching for.

> Concept bank
> -------------------

If I'm not mistaken, this is one of the things the SEEK folks have set out
to do (Jessie?)

As for the rest of your message that I did not comment on above, it is clear
that you and I agree on MUCH more than we disagree on!

Aloha,
Rich

Richard L. Pyle, PhD
Database Coordinator for Natural Sciences
  and Associate Zoologist in Ichthyology
Department of Natural Sciences, Bishop Museum
1525 Bernice St., Honolulu, HI 96817
Ph: (808)848-4115, Fax: (808)847-8252
email: deepreef at bishopmuseum.org
http://hbs.bishopmuseum.org/staff/pylerichard.html