Re: [tdwg-content] Name is species concept thinking

13 Jun 2010

Dude...this conversation has been ongoing for more than 20 years (at least
that's how long I've been participating in it -- the conversation has
actually been going on since the dawn of biodiversity informatics).  I doubt
that we're going to resolve it now.  But I do agree we need to get this kind
of information captured in a more easily accessible, better-summarized
archival form.  Whether that should be in DwC wikispace, or GNA wikispace,
however, is not clear.  

But for now, the conversation is still hot, and in my experience, nothing
throws a bucket of water on a hot topic of conversation more effectively
than porting it from a push-based email list to a pull-based wiki forum. My
greatest hope is that we can:

1) Get past the crude vocabulary and semantics (I'm using both of these
terms in the vernacular sense here, not the technical sense) so that we can
figure out if we really are all on the same page (or not); and

2) Spark the generation of some sort of summary document that can live on
the appropriate web-based discussion forum, with associated dialog &
discussion.

Rich
...
-----Original Message-----
From: gtuco.btuco@gmail.com [mailto:gtuco.btuco@gmail.com] On Behalf Of John Wieczorek
Sent: Sunday, June 13, 2010 10:55 AM
To: Richard Pyle
Cc: David Remsen (GBIF); tdwg-content@lists.tdwg.org
Subject: Re: [tdwg-content] Name is species concept thinking
Silly me to think that you might actually be approaching done with
this conversation. ;-)
...
Hi Dave,
...
By linking to a populated GNUB it would also have an improved means
to provide the protonym circumscription of the concept, as you
describe in (5).
Just to be clear, when you say "protonym circumscription of the
concept", you mean a concept circumscription whose boundaries are
defined by the set of included protonyms (as opposed to the concept
circumscription established for the Protonym-usage instance; i.e.,
original description).  Correct?
Although such concept/circumscription definitions (effectively
represented by the set of type specimens implied by the set of
protonyms) are not as high-resolution as concept/circumscription
definitions that are defined by a broader suite of specimens,
populations, or characters, they are, I believe, the "best bang for
the buck" in that they give us 80% of the benefit for 20% of the work.
...
In addition, we would like to support the inclusion of bibliographic
data,
Already included via GNUB.
...
specimens,
In my mind, a *key* value of GNUB/GNA is to serve as a taxon authority
for specimen collections (i.e., the anchorpoints for
specimen/observation taxonomic identifications).
...
geospatial information,
Inherited from the specimens/observations.
...
and general
descriptive data.
Inherited from the PLAZI treatments anchored to the publications, as
well as the published and unpublished character data
anchored through specimens.
...
In (5) you describe the protonym-based circumscription to evaluate
the relative agreement of the identified concepts (via
'meta-authorities').  This provides the basis for expanding the
potential set of names for a subsequent data retrieval from GBIF (for
example) to include all the related nomenclatural and lexical
variants for those names (of course checking for homonym conflicts
among them).
Yes, exactly!
...
In (6) it appears the output of the Taxon Concept resolution process
is either an expanded set of name strings or an array of protonymIDs.
On Sun, Jun 13, 2010 at 1:50 PM, Richard Pyle
<deepreef@bishopmuseum.org> wrote:
...
Before the content is built, the name-strings can be fed back into GNI
to snoop out additional possible protonym links.  However, in a
data-populated paradigm, it would be an array of ProtonymIDs.
...
If the latter, I can see how this would provide a more precise
concept-informed but name-based retrieval method and probably the best
we can expect from large indices like GBIF.  But I don't see how it
will support a strict concept-based retrieval.
If you are content with a protonym-based concept circumscription
definition, it has all you need.  Each Taxon Name Usage instance in
GNUB represents an array of (minimally one) ProtonymIDs -- that is,
the set of all protonyms representing the asserted taxon concept in
the usage instance.  Like I said, it's not as high-resolution as
specimen/population/character-based concept/circumscription
definitions, but I think it gets us most of the way there, with the
least amount of effort (not to say that it requires little effort to
get us that far -- just that trying to define concept boundaries at
higher resolution requires *MUCH* more effort).
So, the question is, what concept boundaries are fuzzy when you use 
Protonym-based definitions?
Imagine an example where we have 7 protonyms of something in the
Pacific; three described from type specimens collected in the eastern
Pacific, and four from specimens collected throughout the western
Pacific.  We also have a bunch of specimens from the central Pacific,
but no Protonyms typified from that region.
Taxonomist "A" declares that the three protonyms from the eastern
Pacific represent one valid species (Aus bus), and the four from the
west represent a second valid species (Aus xus).  Taxonomist "B"
declares the exact same thing.  Using Protonym-based circumscriptions,
we can infer that the taxon concepts of "Aus bus" and "Aus xus" are
each congruent between the two taxonomists.
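To make the inference above concrete, here is a minimal sketch that treats each concept as nothing more than a set of protonym IDs and derives the relationship from plain set comparison. The IDs (E1-E3, W1-W4), the `Relation` names and the `compare` function are all invented for illustration; none of this is an actual GNUB structure or API.

```python
from enum import Enum


class Relation(Enum):
    CONGRUENT = "is congruent with"
    INCLUDES = "includes"
    INCLUDED_IN = "is included in"
    OVERLAPS = "overlaps with"
    EXCLUDES = "excludes"


def compare(a, b):
    """Relate two concepts, each given as a set of protonym IDs."""
    a, b = frozenset(a), frozenset(b)
    if a == b:
        return Relation.CONGRUENT
    if a > b:  # a is a proper superset of b
        return Relation.INCLUDES
    if a < b:  # a is a proper subset of b
        return Relation.INCLUDED_IN
    if a & b:  # some, but not all, protonyms are shared
        return Relation.OVERLAPS
    return Relation.EXCLUDES


# Hypothetical protonym IDs: E1-E3 typified in the east, W1-W4 in the west.
aus_bus_sec_a = {"E1", "E2", "E3"}
aus_bus_sec_b = {"E1", "E2", "E3"}
aus_xus_sec_a = {"W1", "W2", "W3", "W4"}

print("Aus bus sec. A", compare(aus_bus_sec_a, aus_bus_sec_b).value, "Aus bus sec. B")
print("Aus bus sec. A", compare(aus_bus_sec_a, aus_xus_sec_a).value, "Aus xus sec. A")
```

Note that the comparison only ever sees protonyms, so any specimen or population that is not a type has no influence on the result.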
The fuzziness comes in for the central Pacific populations:
1) Suppose that Taxonomist "A" explicitly cited the populations in the
central Pacific, and declared them to be "Aus bus"; but Taxonomist "B"
never mentioned them.  In that case, we would probably want to
establish the concept relationship as "Aus bus sec. A <includes> Aus
bus sec. B" (as opposed to "is congruent with", as would be the case
for a Protonym-based circumscription).
2) Suppose that Taxonomist "A" explicitly cited the populations in the
central Pacific, and declared them to be "Aus bus"; but Taxonomist "B"
cited those same populations as belonging to "Aus xus".  In that case,
we would probably want to establish the concept relationship as "Aus
bus sec. A <overlaps with> Aus bus sec. B". Again, the Protonym-based
circumscription in this case would give us an imprecise representation
of the concept mappings.
However, in my experience (working in the Pacific, where this sort of
circumstance of eastern vs. western vs. central population
differences happens a LOT), it's actually a very rare problem.  That
is, in scenario 1, it's most likely the case that Taxonomist B would
have included the central populations the same way that Taxonomist A
would have.  As for scenario 2, I'm struggling to think of even a
single example of this.  I suspect that it's just very rare.
So the point is, I think that protonym-based circumscription
definitions are perfectly adequate for the vast majority of use cases.
...
The real world example that forms my litmus test is the blue-headed
vireo, Vireo solitarius (Wilson 1810), which was originally called
Muscicapa solitaria and has also been combined to form Vireosylvia
solitaria and Lanivireo solitarius.  Of course there are lexical
variants as well (Google "Lanivireo solitaria" for example).  These,
properly structured, would be the sort of useful set of
lexical/nomenclatural content I would hope for as a response from a
GNI/GNUB resolution service based on protonymID.
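As a rough sketch of what such a response could look like, the snippet below groups the name strings mentioned above under one protonym and resolves any of them back to the full set. The protonym ID, the "kind" labels and the functions `build_index` and `resolve` are made up for illustration; this is not the GNI or GNUB service interface.

```python
from collections import defaultdict

# (name string, protonym ID, kind). The ID "protonym:solitaria" is made up.
USAGES = [
    ("Muscicapa solitaria", "protonym:solitaria", "original combination"),
    ("Vireo solitarius", "protonym:solitaria", "combination"),
    ("Vireosylvia solitaria", "protonym:solitaria", "combination"),
    ("Lanivireo solitarius", "protonym:solitaria", "combination"),
    ("Lanivireo solitaria", "protonym:solitaria", "lexical variant"),
]


def normalize(name):
    """Collapse whitespace and case so trivially different strings match."""
    return " ".join(name.split()).lower()


def build_index(usages):
    name_to_protonyms = defaultdict(set)
    protonym_to_names = defaultdict(list)
    for name, protonym_id, kind in usages:
        name_to_protonyms[normalize(name)].add(protonym_id)
        protonym_to_names[protonym_id].append((name, kind))
    return name_to_protonyms, protonym_to_names


def resolve(name, name_to_protonyms, protonym_to_names):
    """Return {protonym_id: [(name, kind), ...]} for one name string.

    More than one key in the result means the string matches several
    protonyms, i.e. a possible homonym that needs a human decision.
    """
    hits = name_to_protonyms.get(normalize(name), ())
    return {pid: protonym_to_names[pid] for pid in sorted(hits)}


names, protonyms = build_index(USAGES)
for pid, variants in resolve("lanivireo  SOLITARIA", names, protonyms).items():
    print(pid)
    for variant, kind in variants:
        print("  ", variant, "(" + kind + ")")
```

Keeping a set of protonym IDs per name string, rather than a single ID, is what leaves room for the homonym check.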
Send me a bunch of usage instances involving all the different name
variants, and involving various concept definitions, and I can create
a sample GNUB dataset that would illustrate how this would work.  The
name-mapping thing is trivial, once the TNU instances have been
populated.
The concept mapping stuff is a bit more complex -- but still
relatively simple compared to algorithms for, say, oxygen control
systems in rebreathers..... :-)
...
One current view of the taxon (concept C1) has this species occupying
the eastern part of the US.  Another species, Vireo plumbeus Coues,
1866, (concept C2) occupies the middle west USA, and a third species,
Vireo cassini Xántus de Vesey, 1858 (concept C3) is on the western
coast.
Another view lumps all three of these into a single species which,
based on the rule of priority, has the valid name Vireo solitarius
and results in a new concept (C4).  This concept includes C1, C2, and
C3.  Both concepts have the scientific name of Vireo solitarius.
We can access and represent these in a consistent fashion using our
CLB and probably others can too in their own index models.
So, now we have a specimen of Vireo solitarius that was captured in
Minnesota.  It might be an errant instance of C1, Vireo solitarius
sensu stricto, that strayed a bit west of normal.  It might be (C4)
Vireo solitarius, sensu lato.  The specimen would need that concept
identifier tied to the record to make this explicit.  So, let's say
that the identifier was made using the lumped concept (C4).
Of course, if this doesn't make it into the record, we are stuck with
the name alone.
Right -- this sounds like the same as the hypothetical example I made
above.
But like I say, I think this example is the exception, rather than the
rule (i.e., it falls in the missing 20% of the "benefit" in the 80%
benefit/20% work ratio).
...
Using the method (6) you described would allow a user to discover the
different treatments of Vireo solitarius (C1 and C4) and provide some
means to discriminate them via concept resolution.
- C4 includes C1, C2, and C3 which would include all the names above.
- C1 would only include the nomenclatural/lexical variants for Vireo
solitarius.
Resolution will enable us to perform a significantly more useful and
concept-informed search.  It will, however, include the specimen I
referenced above in BOTH cases because "Vireo solitarius" or its
protonymID will be a search term in both cases.
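The behaviour described above can be sketched in a few lines: each concept is a set of protonyms, each protonym carries its name variants, and a search expands the chosen concept into name strings. The protonym IDs, specimen records and function names below are invented for illustration and do not reflect the CLB or GBIF data models.

```python
# Names attached to each protonym (IDs are made up for this sketch).
PROTONYM_NAMES = {
    "p:solitaria": {"Muscicapa solitaria", "Vireo solitarius",
                    "Vireosylvia solitaria", "Lanivireo solitarius"},
    "p:plumbeus": {"Vireo plumbeus"},
    "p:cassini": {"Vireo cassini"},
}

# Concepts as protonym sets: C4 is the lumped view of C1 + C2 + C3.
CONCEPTS = {
    "C1": {"p:solitaria"},
    "C2": {"p:plumbeus"},
    "C3": {"p:cassini"},
    "C4": {"p:solitaria", "p:plumbeus", "p:cassini"},
}

# Invented records carrying a name only, with no concept identifier.
SPECIMENS = [
    {"id": "MN-001", "name": "Vireo solitarius", "locality": "Minnesota"},
    {"id": "CA-001", "name": "Vireo cassini", "locality": "California"},
]


def search_terms(concept_id):
    """Expand a concept into every name string tied to its protonyms."""
    terms = set()
    for protonym_id in CONCEPTS[concept_id]:
        terms |= PROTONYM_NAMES[protonym_id]
    return terms


def search(concept_id, specimens=SPECIMENS):
    """Return IDs of specimens whose name is among the concept's terms."""
    terms = search_terms(concept_id)
    return [s["id"] for s in specimens if s["name"] in terms]


print("C1:", search("C1"))
print("C4:", search("C4"))
```

The Minnesota record comes back for both C1 and C4, because the name string is all the record has to offer.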
Right -- until someone else comes along and provides a more explicit
identification for that specimen.
...
A more precise concept based system would utilise a required taxon
concept identifier in the specimen record to discriminate different
uses of the SAME NAME.
Sure!  That would be fantastic -- and maybe someday we'll get to the
point where all specimen/observation identification events come in the
form of "Aus bus sec. Smith 1955", rather than simply "Aus bus" (as
the vast majority are now).  This, in my mind, is the single greatest
and most consistent informatics failure within legacy taxonomic works
and specimen databases.  But I think the good news is that we can
still get 80% of the benefit by going only as far as protonyms (which
we *can* derive from a name alone -- once we get past homonymy and
gross misspellings).
...
In other words, if you did a search of Vireo solitarius and the concept
resolver indicated the different concepts above and you chose the
sensu stricto (split) version, you would get the C1 labelled records
but the C4 labelled records would be excluded or at least come with a
warning (may not be what you are looking for).  This of course
requires our specimen records to have a concept identifier.  Or, the
concept definition itself will include additional annotations to
enable us to make inferences
I think the best we can do is flag those cases, and rely on caveat
emptor.
...
Publication date of the concept - If the split didn't happen until 
1980 and the specimen is from 1960 then we could infer C4.
Distribution information for the concept - if we disregard errant 
specimens then we might infer a 1985 Minnesota specimen is a C2 in 
spite of the different name.
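The date-based inference can be sketched as a simple filter: a concept can only have been meant if it had been published by the time the specimen was identified. The year 1980 is the hypothetical split date used above; the year 1900 for the lumped concept is a pure placeholder, and the function names are invented for illustration. Undated identifications are left ambiguous rather than guessed.

```python
# Year each concept was published. 1980 is the hypothetical split date from
# the text; 1900 for the lumped concept is a placeholder, not a real date.
CONCEPT_YEAR = {"C1": 1980, "C4": 1900}


def candidate_concepts(identified_year, concept_year=CONCEPT_YEAR):
    """Concepts that already existed when the identification was made."""
    if identified_year is None:
        # An undated identification cannot rule anything out.
        return sorted(concept_year)
    return sorted(c for c, year in concept_year.items()
                  if year <= identified_year)


def infer_concept(identified_year):
    """Return (concept_id or None, note) for one identification date."""
    candidates = candidate_concepts(identified_year)
    if len(candidates) == 1:
        return candidates[0], "inferred from date"
    if not candidates:
        return None, "no candidate concept"
    return None, "ambiguous: " + ", ".join(candidates)


print(1960, infer_concept(1960))
print(1985, infer_concept(1985))
print(None, infer_concept(None))
```

A distribution-based rule would need locality data for each concept and is not attempted here.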
The date one could work within the GNUB architecture, because the
dates are all there (as long as the specimen identification was also
dated).  With the right integration with GBIF, the distribution one
*might* be derivable algorithmically, but it would depend on the
nature of the data.
...
...
In sum, we are on track for achieving this and I believe our data
mobilisation strategy will support getting this sort of data
published.  When Markus returns from paternity leave I would hope we
could include his thoughts on how we might expose these as RDF via
our indices to support all aspects of this discussion.
Keep on a keepin' on....
Rich
P.S. Congrats to Markus!  I was unaware!
_______________________________________________
tdwg-content mailing list
tdwg-content@lists.tdwg.org
http://lists.tdwg.org/mailman/listinfo/tdwg-content