Re: [tdwg-content] Name is species concept thinking

13 Jun 2010

      Hi Dave,
...
By linking to a populated GNUB it would also have an improved 
means to provide the protonym circumscription of the concept, 
as you describe in (5).
Just to be clear, when you say "protonym circumscription of the concept",
you mean a concept circumscription whose boundaries are defined by the set
of included protonyms (as opposed to the concept circumscription established
for the Protonym-usage instance; i.e., original description).  Correct?
Although such concept/circumscription definitions (effectively represented
by the set of type specimens implied by the set of protonyms) are not as
high-resolution as concept/circumscription definitions that are defined by a
broader suite of specimens, populations, or characters; they are, I believe,
the "best bang for the buck" in that they give us 80% of the benefit for 20%
of the work.
...
In addition,  we would like to support the inclusion of
bibliographic data,
Already included via GNUB.
...
specimens,
In my mind, a *key* value of GNUB/GNA is to serve as a taxon authority for
specimen collections (i.e., the anchorpoints for specimen/observation
taxonomic identifications).
...
geospatial information,
Inherited from the specimens/observations.
...
and general
descriptive data.
Inherited from the PLAZI treatments anchored to the publications, as well as
the published and unpublished character data anchored through specimens.
...
In (5) you describe the protonym-based circumscription to 
evaluate the relative agreement of the identified concepts (via 'meta-
authorities').    This provides the basis for expanding the potential
set of names for a subsequent data retrieval from GBIF (for 
example) to include all the related nomenclatural and lexical 
variants for those names (of course checking for homonym 
conflicts among them).
Yes, exactly!
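
To make the name-expansion step concrete, here is a minimal sketch in Python.
Everything in it is an assumption for illustration: the lookup table, the
ProtonymIDs ("P1", "P2", "P9") and the name strings other than "Aus bus" are
made up, and it is not the actual GNI/GNUB interface.  It gathers every name
string linked to the protonyms in a concept, then sets aside any string that
is also linked to a protonym outside the concept (a homonym conflict).

```python
# Hypothetical index: ProtonymID -> every name string linked to that protonym
# (original combination, later combinations, lexical variants).
NAMES_BY_PROTONYM = {
    "P1": {"Aus bus", "Bus bus", "Aus buss"},
    "P2": {"Aus cus", "Zus cus"},
    "P9": {"Zus cus"},  # unrelated protonym sharing a name string (homonym)
}


def expand_names(concept_protonyms):
    """Return (search_names, homonym_conflicts) for a set of ProtonymIDs."""
    names = set()
    for pid in concept_protonyms:
        names |= NAMES_BY_PROTONYM.get(pid, set())

    # A name string is a conflict if a protonym outside the concept uses it too.
    conflicts = set()
    for pid, variants in NAMES_BY_PROTONYM.items():
        if pid not in concept_protonyms:
            conflicts |= names & variants

    return names - conflicts, conflicts


safe, flagged = expand_names({"P1", "P2"})
print(sorted(safe))     # names that can be sent to the index as search terms
print(sorted(flagged))  # names that need a homonym check first
```

The flagged strings are not thrown away; they are simply the ones that need a
closer look before being used as search terms.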
...
In (6) it appears the output of the Taxon Concept resolution 
process is either an expanded set of name strings or an array of
protonymIDs.
Before the content is built, the name-strings can be fed back into GNI to
snoop out additional possible protonym links.  However, in a data-populated
paradigm, it would be an array of ProtonymIDs.
...
If the latter,  I
can see how this would provide a more precise 
concept-informed but name-based retrieval method and probably 
the best we can expect from
large indices like GBIF.    But I don't see how it will support a
strict concept-based retrieval.
If you are content with a protonym-based concept circumscription definition,
it has all you need.  Each Taxon Name Usage instance in GNUB represents an
array of (minimally one) ProtonymIDs -- that is, the set of all protonyms
representing the asserted taxon concept in the usage instance.  Like I said,
it's not as high-resolution as specimen/population/character-based
concept/circumscription definitions, but I think it gets us most of the way
there, with the least amount of effort (not to say that it requires little
effort to get us that far -- just that trying to define concept boundaries
at higher resolution requires *MUCH* more effort).
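
As a rough picture of what I mean by a usage instance carrying an array of
ProtonymIDs, here is a small Python sketch.  The class name, field names and
IDs are hypothetical and are not the actual GNUB schema; the only rule it
encodes is the "minimally one" protonym per usage instance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaxonNameUsage:
    """Hypothetical stand-in for a Taxon Name Usage (TNU) instance."""
    name: str                # name string as used, e.g. "Aus bus"
    sec: str                 # who used it ("sec."), e.g. "A"
    protonym_ids: frozenset  # protonyms asserted to fall within the concept

    def __post_init__(self):
        # Every usage instance covers at least one protonym.
        if not self.protonym_ids:
            raise ValueError("a usage instance needs at least one ProtonymID")


usage = TaxonNameUsage("Aus bus", "A", frozenset({"P1", "P2", "P3"}))
print(sorted(usage.protonym_ids))
```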

So, the question is, what concept boundaries are fuzzy when you use
Protonym-based definitions?

Imagine an example where we have 7 protonyms of something in the Pacific;
three described from type specimens collected in the eastern Pacific, and
four from specimens collected throughout the western Pacific.  We also have
a bunch of specimens from the central Pacific, but no Protonyms typified
from that region.

Taxonomist "A" declares that the three protonyms from the eastern Pacific
represent one valid species (Aus bus), and the four from the west represent
a second valid species (Aus xus).  Taxonomist "B" declares the exact same
thing.  Using Protonym-based circumscriptions, we can infer that the taxon
concepts of "Aus bus" and "Aus xus" are each congruent between the two
taxonomists.

The fuzziness comes in for the central Pacific populations:

1) Suppose that Taxonomist "A" explicitly cited the populations in the
central Pacific, and declared them to be "Aus bus"; but Taxonomist "B" never
mentioned them.  In that case, we would probably want to establish the
concept relationship as "Aus bus sec. A <includes> Aus bus sec. B" (as
opposed to "is congruent with", as would be the case for a Protonym-based
circumscription).

2) Suppose that Taxonomist "A" explicitly cited the populations in the
central Pacific, and declared them to be "Aus bus"; but Taxonomist "B" cited
those same populations as belonging to "Aus xus".  In that case, we would
probably want to establish the concept relationship as "Aus bus sec. A
<overlaps with> Aus bus sec. B". Again, the Protonym-based circumscription
in this case would give us an imprecise representation of the concept
mappings.
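
The difference between the two levels of resolution can be sketched with
plain set comparisons.  This is only an illustration under simple
assumptions: the member IDs ("E1", "W1", "central", etc.) are made up, a
circumscription is treated as a flat set of members, and the relationship
labels are just the ones used informally above.  For scenario 2 the sketch
compares A's "Aus bus" with B's "Aus xus", since those are the two concepts
that both claim the central populations.

```python
def relation(a, b):
    """Set-based relationship between two circumscriptions."""
    a, b = frozenset(a), frozenset(b)
    if a == b:
        return "is congruent with"
    if a > b:
        return "includes"
    if a < b:
        return "is included in"
    return "overlaps with" if a & b else "excludes"


EAST = {"E1", "E2", "E3"}        # three eastern Pacific protonyms (types)
WEST = {"W1", "W2", "W3", "W4"}  # four western Pacific protonyms (types)
CENTRAL = {"central"}            # central Pacific populations, no protonym

# Protonym-only view of "Aus bus": A and B agree.
protonym_view = relation(EAST, EAST)              # "is congruent with"

# Scenario 1: A adds the central populations to "Aus bus"; B is silent.
scenario_1 = relation(EAST | CENTRAL, EAST)       # "includes"

# Scenario 2: A's "Aus bus" vs. B's "Aus xus", protonyms only...
scenario_2_protonyms = relation(EAST, WEST)       # "excludes"
# ...and again with the central populations that both of them claim.
scenario_2_full = relation(EAST | CENTRAL, WEST | CENTRAL)  # "overlaps with"

print(protonym_view, scenario_1, scenario_2_protonyms, scenario_2_full, sep="\n")
```

In both scenarios the protonym-only answer is coarser than the answer you get
once the central populations are part of the circumscription.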

However, in my experience (working in the Pacific, where this sort of
circumstance of eastern vs. western vs. central population differences
happens a LOT), it's actually a very rare problem.  That is, in scenario 1,
it's most likely the case that Taxonomist B would have included the central
populations the same way that Taxonomist A would have.  As for scenario 2,
I'm struggling to think of even a single example of this.  I suspect that
it's just very rare.

So the point is, I think that protonym-based circumscription definitions are
perfectly adequate for the vast majority of use cases.
...
The real world example that forms my litmus test is the 
blue-headed vireo,  Vireo solitarius (Wilson 1810) which was 
originally called Muscicapa solitaria and has also been 
combined to form Vireosylvia
solitaria and Lanivireo solitarius.   Of course there are lexical
variants as well (Google "Lanivireo solitaria" for example).   These,
properly structured, would be the sort of useful set of 
lexical/ nomenclatural content I would hope as a response 
from a  GNI/GNUB resolution service based on protonymID.
Send me a bunch of usage instances involving all the different name
variants, and involving various concept definitions, and I can create a
sample GNUB dataset that would illustrate how this would work.  The
name-mapping thing is trivial, once the TNU instances have been populated.
The concept mapping stuff is a bit more complex -- but still relatively
simple compared to algorithms for, say, oxygen control systems in
rebreathers..... :-)
...
One current view of the taxon (concept C1) has this species occupying
the eastern part of the US.   Another species, Vireo plumbeus Coues,
1866, (concept C2) occupies the middle west USA, and a third 
species, Vireo cassini Xántus de Vesey, 1858 (concept C3) is 
on the western coast.
Another view lumps all three of these into a single species 
which, based on the rule of priority, has the valid name 
Vireo solitarius and results in a new concept (C4).  This 
concept includes C1, C2,  and
C3.   Both concepts have the scientific name of Vireo solitarius.
We can access and represent these in a consistent fashion 
using our CLB and probably others can too in their own index models.
So, now we have a specimen of Vireo solitarius that was captured in
Minnesota.   It might be an errant instance of C1, Vireo solitarius
sensu stricto, that strayed a bit west of normal.   It might be (C4)
Vireo solitarius, sensu lato.     The specimen would need that concept
identifier tied to the record to make this explicit.    So,  let's say
that the identifier was made using the lumped concept (C4).  
Of course, if this doesn't make it into the record, we are 
stuck with the name alone.
Right -- this sounds like the same as the hypothetical example I made above.
But like I say, I think this example is the exception, rather than the rule
(i.e., it falls in the missing 20% of the "benefit" in the 80% benefit/20%
work ratio).
...
Using the method (6) you described would allow a user to 
discover the different treatments of Vireo solitarius (C1 and 
C4) and provide some means to discriminate them via concept 
resolution.
- C4 includes C1, C2, and C3 which would include all the names above.
- C1 would only include the nomenclatural/lexical variants 
for Vireo solitarius.
Resolution will enable us to perform a significantly more 
useful and concept-informed search.  It will, however,  
include the specimen I referenced above in BOTH cases because 
"Vireo solitarius" or its protonymID will be a search term 
in both cases.
Right -- until someone else comes along and provides a more explicit
identification for that specimen.
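
Here is a small Python sketch of why the Minnesota specimen turns up under
both concepts.  The protonym keys, record IDs and the Colorado and California
records are invented for illustration, and the lookup is only a toy, not a
GNUB or GBIF query.  Each record carries nothing but the protonym its name
resolves to, so the search can only ask whether that protonym falls inside
the chosen concept.

```python
# Concepts as sets of protonyms (keys are hypothetical shorthand).
CONCEPTS = {
    "C1": {"solitaria"},                         # Vireo solitarius sensu stricto
    "C4": {"solitaria", "plumbeus", "cassini"},  # Vireo solitarius sensu lato
}

# Specimen records identified by name only, resolved to a protonym.
SPECIMENS = [
    {"id": "MN-1", "locality": "Minnesota", "protonym": "solitaria"},
    {"id": "CO-1", "locality": "Colorado", "protonym": "plumbeus"},
    {"id": "CA-1", "locality": "California", "protonym": "cassini"},
]


def search(concept_id):
    """Return the IDs of specimens whose protonym lies within the concept."""
    wanted = CONCEPTS[concept_id]
    return [s["id"] for s in SPECIMENS if s["protonym"] in wanted]


print(search("C1"))  # ['MN-1']
print(search("C4"))  # ['MN-1', 'CO-1', 'CA-1']
```

MN-1 comes back either way, because nothing in the record says which sense of
"Vireo solitarius" the identifier had in mind.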
...
A more precise concept based system would utilise a required 
taxon concept identifier in the specimen record to 
discriminate different uses of the SAME NAME.
Sure!  That would be fantastic -- and maybe someday we'll get to the point
where all specimen/observation identification events come in the form of
"Aus bus sec. Smith 1955", rather than simply "Aus bus" (as the vast
majority are now).  This, in my mind, is the single greatest and most
consistent informatics failure within legacy taxonomic works and specimen
databases.  But I think the good news is that we can still get 80% of the
benefit by going only as far as protonyms (which we *can* derive from a name
alone -- once we get past homonymy and gross misspellings).
...
In other 
words,  if you did a search of Vireo solitarius and the 
concept resolver indicated the different concepts above and 
you chose the sensu stricto (split) version,  you would get 
the C1 labelled records but the C4 labelled records would be 
excluded or at least come with a warning (may not be what you 
are looking for).  This of course requires our specimen 
records to have a concept
identifier.   Or,  the concept definition itself will include
additional annotations to enable us to make inferences
I think the best we can do is flag those cases, and rely on caveat emptor.
...
Publication date of the concept - If the split didn't happen 
until 1980 and the specimen is from 1960 then we could infer C4.
Distribution information for the concept - if we disregard 
errant specimens then we might infer a 1985 Minnesota 
specimen is a C2 in spite of the different name.
The date one could work within the GNUB architecture, because the dates are
all there (as long as the specimen identification was also dated).  With the
right integration with GBIF, the distribution one *might* be derivable
algorithmically, but it would depend on the nature of the data.
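
The date-based inference is simple enough to sketch in a few lines of Python.
This is only an illustration of the rule as stated in the example (split
published in 1980, specimen from 1960); the function name and return values
are made up, and it assumes the identification date is known and that no
other concepts are in play.

```python
def infer_concept(identified_year, split_year):
    """Guess which concept an undifferentiated 'Vireo solitarius' ID meant."""
    if identified_year is None:
        return "unknown"   # undated identification: nothing to infer from
    if identified_year < split_year:
        return "C4"        # the split did not exist yet, so sensu lato
    return "C1 or C4"      # after the split, either sense is possible


print(infer_concept(1960, 1980))  # C4
print(infer_concept(1985, 1980))  # C1 or C4
```

Anything identified after the split stays ambiguous on date alone, which is
where the distribution data would have to come in.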
...
In sum,  we are on track for achieving this and I believe our 
data mobilisation strategy will support getting these sort of data
published.   When Markus returns from paternity leave I would hope we
could include his thoughts on how we might expose these as 
RDF via our indices to support all aspects of this discussion.
Keep on a keepin' on....

Rich

P.S. Congrats to Markus!  I was unaware!

Richard Pyle