Rich,
This is very useful, and your list of different levels ties in with some of the things I have been trying to think about in developing a central data index that will be flexible enough to guide users to relevant data based on a mapping between their request parameters and the variable name information that may be included in different data records.
My guess is that many of the levels you describe will be handled by dynamic processing within different systems and will therefore not normally ever receive their own GUIDs. It will be the responsibility of projects such as uBio, Species 2000, ITIS and GBIF to develop algorithms that can cluster name strings in appropriate ways.
I would say that the three levels that Rod identifies (raw name strings, nomenclatural records, taxon concepts) are the most important levels for us to refine and the ones on which we need to focus. The last of these is clearly the most contentious, but I believe that the TaxonConcept element as currently defined in TCS is an excellent compromise that should support a wide range of applications.
Donald
---------------------------------------------------------------
Donald Hobern (dhobern@gbif.org)
Programme Officer for Data Access and Database Interoperability
Global Biodiversity Information Facility Secretariat
Universitetsparken 15, DK-2100 Copenhagen, Denmark
Tel: +45-35321483 Mobile: +45-28751483 Fax: +45-35321480
---------------------------------------------------------------
-----Original Message-----
From: Taxonomic Databases Working Group GUID Project [mailto:TDWG-GUID@LISTSERV.NHM.KU.EDU] On Behalf Of Richard Pyle
Sent: 06 November 2005 11:18
To: TDWG-GUID@LISTSERV.NHM.KU.EDU
Subject: Re: Various thoughts
Thanks, Rod -- great stuff!
What gets a GUID?
I think it's clear from Richard's responses that he's thought a lot more about names than I have. Perhaps I'm too wedded to keeping things simple, but is it reasonable to suppose that there are basically three levels that we might operate at:
- Names as text strings (orthographic variants may belong here)
- Names as elements of nomenclature (i.e., date and authorship)
- Names as concepts (i.e., linked to explicit usage)
So, why not have GUIDs at all three levels, and have databases be explicit about what they serve?
Unfortunately, there are more than three levels in current use (from least to most taxonomically specific):
1. Names as raw text strings (including orthographic variants, homonyms not distinguished)
2. Names as raw text strings (including orthographic variants, homonyms distinguished)
3. Names as complete sets of one or more name-units (including orthographic variants, homonyms distinguished)
4. Names as complete sets of one or more name-units (excluding orthographic variants, homonyms distinguished)
5. Names as semi-complete sets of one or more name-units (including orthographic variants, homonyms distinguished)
6. Names as semi-complete sets of one or more name-units (excluding orthographic variants, homonyms distinguished)
7. Names as terminal units only (excluding orthographic variants, homonyms distinguished)
8. Names in context of usage (Name SEC Usage instance)
Note: Monomials have one name-unit, binomials have two, trinomials have three, the name "Centropyge (Xiphypops) fisheri flavicauda" has four, and so on.
There are actually more possibilities than these (e.g., whether or not autonyms/nominotypical names are treated as distinct), but the above list covers the main range of what I know to already be "out there". (Let's not even open the Pandora's box of hybrid name formulae...)
I believe that your work and that of uBio typically involves #1 & #2 (#1 is a Google search; #2 is usually in the form of Name+authors). Taxonomic authorities for specimen databases sometimes operate on #3 & #4. ITIS works at #5 ("semi-complete" because infrageneric names are not strictly included for names at rank of species and lower, and they also allow only one level of infraspecific name-unit). #6 is the botanical approach, and I *think* the way that IPNI and/or IF do it (in this case, semi-complete excludes infrageneric name-units for species and lower, but I'm not sure how many simultaneous levels of infraspecific names they accommodate -- Sally??). #7 is the zoological perspective, and #8 is concepts.
Actually, #8 is only one flavor of concept -- but concepts don't really even belong on this list at all. It seems that most people can agree that a concept is represented by a Name-Usage instance [yes, Jessie -- "Usage" can be sensu-stricto here! :-) ] -- e.g., "Name+Publication". We haven't begun the war over what constitutes a "Publication" (I prefer the term "Documentation"), but that war doesn't need to be started here. I think we can reasonably safely say that concepts can be thought of as Name+Publication, and so for this discussion we really only need to focus on what a "Name" is.
So....I think we can probably reduce the list of 1-7 above down to perhaps 3 different items -- but I wouldn't go with exactly the three you listed. We can eliminate #1 of my list, because the name-string itself is the GUID (or rather, the provider+namestring itself can serve as the GUID). The main difference between #2 and #3 is that #3 parses and distinguishes the ranks of the individual name-units. In other words, it's smart enough to know that the name: "Centropyge (Xiphypops) fisheri flavicauda (Fraser-Brunner 1933)" consists of GenusName+SubgenusName+SpeciesEpithet+SubspeciesEpithet -- which is slightly more informative than #2, which understands only a text string consisting of 63 consecutive ASCII characters.
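As a rough illustration of the difference between #2 and #3, here is a minimal Python sketch. The names NAME_PATTERN, ParsedName, parse_name and count_name_units are hypothetical, and the pattern assumes only the simple zoological shapes discussed in this message (capitalized genus, optional parenthesized subgenus, up to two lowercase epithets, optional trailing authorship). It is not meant as a general name parser.

```python
import re
from typing import NamedTuple, Optional


class ParsedName(NamedTuple):
    """Ranked name-units (level #3), as opposed to one flat string (level #2)."""
    genus: str
    subgenus: Optional[str]
    species: Optional[str]
    subspecies: Optional[str]
    authorship: Optional[str]


# Hypothetical pattern: an assumption for this sketch only. It does not handle
# hybrids, hyphenated epithets, botanical rank markers, or autonyms.
NAME_PATTERN = re.compile(
    r"^(?P<genus>[A-Z][a-z]+)"
    r"(?:\s+\((?P<subgenus>[A-Z][a-z]+)\))?"
    r"(?:\s+(?P<species>[a-z]+))?"
    r"(?:\s+(?P<subspecies>[a-z]+))?"
    r"(?:\s+(?P<authorship>\(?[A-Z].*))?$"
)


def parse_name(name_string: str) -> ParsedName:
    """Split a raw name string into ranked name-units plus authorship."""
    match = NAME_PATTERN.match(name_string.strip())
    if match is None:
        raise ValueError(f"unrecognized name string: {name_string!r}")
    return ParsedName(**match.groupdict())


def count_name_units(parsed: ParsedName) -> int:
    """Count the name-units present, ignoring authorship."""
    units = (parsed.genus, parsed.subgenus, parsed.species, parsed.subspecies)
    return sum(1 for unit in units if unit is not None)


EXAMPLE = "Centropyge (Xiphypops) fisheri flavicauda (Fraser-Brunner 1933)"
```

At level #2 all that is known about EXAMPLE is its text (len(EXAMPLE) is 63); at level #3, parse_name(EXAMPLE) additionally says which unit is the genus, subgenus, species epithet and subspecies epithet.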
Also, ITIS technically could be represented as #3 (instead of #5), because it actually tracks infrageneric name-units via hierarchical "parent_tsn" links. Also, #3 requires less work to sort out than #5 (e.g., #3 doesn't need to recognize that "Centropyge flavicauda" and "Centropyge flavicaudus" derive from the same basionym), so more existing datasets could conform to it with less work to scrutinize each name record. #4 is the least common of the first five (in my experience), and could also be "dumbed down" to fit into #3 fairly painlessly, or possibly even "smartened up" to fit into #6 in some cases.
So....of the first five in the list, we could probably get by with just #3 (which would more or less correspond to your #1, only a bit more explicitly defined). I think we should definitely keep #6 as a GUID candidate, because it fits the botanical practice. Same with #7 for zoological practice. Thus, I do have hope that we could reduce the list to these three name-GUID domains:
1. Names as complete sets of one or more name-units (including orthographic variants, homonyms distinguished) (for names-only records, inclusive of authors to distinguish homonyms)
2. Names as semi-complete sets of one or more name-units (excluding orthographic variants, homonyms distinguished) (for IPNI/IF and botanical datasets)
3. Names as terminal units only (excluding orthographic variants, homonyms distinguished) (for zoological datasets -- particularly ZooBank).
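One way to picture the three domains is as three different comparison keys computed from the same name record. The sketch below is only an illustration under stated assumptions: NameRecord, normalized_terminal and the three key functions are hypothetical names, the normalized spelling is assumed to be supplied by the provider rather than derived, and authorship stands in for whatever a real dataset uses to distinguish homonyms.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class NameRecord:
    genus: str
    subgenus: Optional[str]
    species: Optional[str]
    subspecies: Optional[str]
    authorship: str  # used here only to tell homonyms apart
    # Hypothetical field: a provider-chosen normalized spelling of the
    # terminal epithet, so that orthographic variants can be collapsed.
    normalized_terminal: str


def complete_key(r: NameRecord) -> Tuple:
    """Domain 1: every name-unit as spelled; orthographic variants stay distinct."""
    return (r.genus, r.subgenus, r.species, r.subspecies, r.authorship)


def semi_complete_key(r: NameRecord) -> Tuple:
    """Domain 2: infrageneric unit dropped, terminal spelling normalized."""
    units = [u for u in (r.genus, r.species, r.subspecies) if u is not None]
    units[-1] = r.normalized_terminal
    return (*units, r.authorship)


def terminal_key(r: NameRecord) -> Tuple:
    """Domain 3: the terminal unit only, with normalized spelling."""
    return (r.normalized_terminal, r.authorship)


AUTHOR = "Fraser-Brunner 1933"
with_subgenus = NameRecord("Centropyge", "Xiphypops", "flavicauda", None, AUTHOR, "flavicauda")
variant = NameRecord("Centropyge", "Xiphypops", "flavicaudus", None, AUTHOR, "flavicauda")
no_subgenus = NameRecord("Centropyge", None, "flavicauda", None, AUTHOR, "flavicauda")
```

Under these assumptions the three records have three different keys in domain 1, but share a single key in domain 2 and in domain 3.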
There are several other reasons why these three "domains" of name-GUIDs make sense to me (this message is too long as it is, so I won't go into those reasons now). But bear in mind that each one of them would need some explicit rules about things like autonyms/nominotypical names and such. And eventually we *will* need to open the "hybrid" Pandora's box....but I think this set of three could be a useful start.
Any one of the three could be placed in the context of a Usage instance (=Publication), and thereby represent a concept.
The main question I have for you, Rod, is: would the "Names as complete sets of one or more name-units" be too information-rich to be practical? In other words, is it reasonable to force name providers (who want to conform to a TDWG Name-GUID standard) to minimally: 1) distinguish homonyms, and 2) identify ranks of individual name-units in a multi-unit name (e.g., distinguish a trinomial of Genus+Subgenus+species [Octopus (Octopus) octopus] as different from a trinomial of Genus+species+subspecies [Octopus octopus octopus])? If these bits of metadata are not unreasonable, then we might just find a path to mutual agreement on this issue!
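To make requirement 2 concrete, here is a small Python sketch of the Octopus case. The representation (a list of rank/unit pairs) and the helper names are assumptions chosen for illustration, not a proposed standard.

```python
from typing import List, Tuple

# Hypothetical representation: each name-unit is stored with its rank.
RankedUnit = Tuple[str, str]

subgeneric: List[RankedUnit] = [
    ("genus", "Octopus"), ("subgenus", "Octopus"), ("species", "octopus"),
]
subspecific: List[RankedUnit] = [
    ("genus", "Octopus"), ("species", "octopus"), ("subspecies", "octopus"),
]


def bare_units(name: List[RankedUnit]) -> List[str]:
    """The units with rank, capitalization and parentheses thrown away."""
    return [unit.lower() for _, unit in name]


def rank_signature(name: List[RankedUnit]) -> str:
    """The sequence of ranks, e.g. 'genus+subgenus+species'."""
    return "+".join(rank for rank, _ in name)
```

Here bare_units gives the same three-element list for both names, so a provider that drops rank information cannot tell them apart, while rank_signature gives "genus+subgenus+species" for one and "genus+species+subspecies" for the other.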
I hope the fact that I embrace your "three alternate GUIDs for names" approach (i.e., the bulk of your last message) is self-evident from all the text I have written above!
As for the section of your email under this heading:
Who is going to use this stuff?
I agree with everything you wrote. My thoughts regarding "How to use GUIDs to dramatically change the way biological data is exchanged" are focused not just on helping taxonomists, but also on concealing the complexities and subtleties of taxonomy from non-taxonomists, to allow more direct access to the information they are most likely to be searching for.
Concept bank
If I'm not mistaken, this is one of the things the SEEK folks have set out to do (Jessie?).
As for the rest of your message that I did not comment on above, it is clear that you and I agree on MUCH more than we disagree on!
Aloha, Rich
Richard L. Pyle, PhD
Database Coordinator for Natural Sciences
and Associate Zoologist in Ichthyology
Department of Natural Sciences, Bishop Museum
1525 Bernice St., Honolulu, HI 96817
Ph: (808)848-4115, Fax: (808)847-8252
email: deepreef@bishopmuseum.org
http://hbs.bishopmuseum.org/staff/pylerichard.html