[tdwg-content] Name is species concept thinking

John Wieczorek tuco at berkeley.edu
Sun Jun 13 22:54:43 CEST 2010

Silly me to think that you might actually be approaching done with
this conversation. ;-)

On Sun, Jun 13, 2010 at 1:50 PM, Richard Pyle <deepreef at bishopmuseum.org> wrote:
> Hi Dave,
>> By linking to a populated GNUB it would also have an improved
>> means to provide the protonym circumscription of the concept,
>> as you describe in (5).
> Just to be clear, when you say "protonym circumscription of the concept",
> you mean a concept circumscription whose boundaries are defined by the set
> of included protonyms (as opposed to the concept circumscription established
> for the Protonym-usage instance; i.e., original description).  Correct?
> Although such concept/circumscription definitions (effectively represented
> by the set of type specimens implied by the set of protonyms) are not as
> high-resolution as concept/circumscription definitions that are defined by a
> broader suite of specimens, populations, or characters; they are, I believe,
> the "best bang for the buck" in that they give us 80% of the benefit for 20%
> of the work.
>> In addition,  we would like to support the inclusion of
>> bibliographic data,
> Already included via GNUB.
>> specimens,
> In my mind, a *key* value of GNUB/GNA is to serve as a taxon authority for
> specimen collections (i.e., the anchorpoints for specimen/observation
> taxonomic identifications).
>> geospatial information,
> Inherited from the specimens/observations.
>> and general
>> descriptive data.
> Inherited from the PLAZI treatments anchored to the publications, as well as
> the published and unpublished character data anchored through specimens.
>> In (5) you describe the protonym-based circumscription to
>> evaluate the relative agreement of the identified concepts (via 'meta-
>> authorities').    This provides the basis for expanding the potential
>> set of names for a subsequent data retrieval from GBIF (for
>> example) to include all the related nomenclatural and lexical
>> variants for those names (of course checking for homonym
>> conflicts among them).
> Yes, exactly!
>> In (6) it appears the output of the Taxon Concept resolution
>> process is either an expanded set of name strings or an array of
>> protonymIDs.
> Before the content is built, the name-strings can be fed back into GNI to
> snoop out additional possible protonym links.  However, in a data-populated
> paradigm, it would be an array of ProtonymIDs.
>> If the latter,  I
>> can see how this would provide a more precise
>> concept-informed but name-based retrieval method and probably
>> the best we can expect from
>> large indices like GBIF.    But I don't see how it will support a
>> strict concept-based retrieval.
> If you are content with a protonym-based concept circumscription definition,
> it has all you need.  Each Taxon Name Usage instance in GNUB represents an
> array of (minimally one) ProtonymIDs -- that is, the set of all protonyms
> representing the asserted taxon concept in the usage instance.  Like I said,
> it's not as high-resolution as specimen/population/character-based
> concept/circumscription definitions, but I think it gets us most of the way
> there, with the least amount of effort (not to say that it requires little
> effort to get us that far -- just that trying to define concept boundaries
> at higher resolution requires *MUCH* more effort).
> So, the question is, what concept boundaries are fuzzy when you use
> Protonym-based definitions?
> Imagine an example where we have 7 protonyms of something in the Pacific;
> three described from type specimens collected in the eastern Pacific, and
> four from specimens collected throughout the western Pacific.  We also have
> a bunch of specimens from the central Pacific, but no Protonyms typified
> from that region.
> Taxonomist "A" declares that the three protonyms from the eastern Pacific
> represent one valid species (Aus bus), and the four from the west represent
> a second valid species (Aus xus).  Taxonomist "B" declares the exact same
> thing.  Using Protonym-based circumscriptions, we can infer that the
> taxon concepts of "Aus bus" and "Aus xus" are each congruent between the two
> taxonomists.
> The fuzziness comes in for the central Pacific populations:
> 1) Suppose that Taxonomist "A" explicitly cited the populations in the
> central Pacific, and declared them to be "Aus bus"; but Taxonomist "B" never
> mentioned them.  In that case, we would probably want to establish the
> concept relationship as "Aus bus sec. A <includes> Aus bus sec. B" (as
> opposed to "is congruent with", as would be the case for a Protonym-based
> circumscription).
> 2) Suppose that Taxonomist "A" explicitly cited the populations in the
> central Pacific, and declared them to be "Aus bus"; but Taxonomist "B" cited
> those same populations as belonging to "Aus xus".  In that case, we would
> probably want to establish the concept relationship as "Aus bus sec. A
> <overlaps with> Aus xus sec. B". Again, the Protonym-based circumscription
> in this case would give us an imprecise representation of the concept
> mappings.
> However, in my experience (working in the Pacific, where this sort of
> circumstance of eastern vs. western vs. central population differences
> happens a LOT), it's actually a very rare problem.  That is, in scenario 1,
> it's most likely the case that Taxonomist B would have included the central
> populations the same way that Taxonomist A would have.  As for scenario 2,
> I'm struggling to think of even a single example of this.  I suspect that
> it's just very rare.
> So the point is, I think that protonym-based circumscription definitions are
> perfectly adequate for the vast majority of use cases.
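
A minimal sketch of the protonym-based comparison described above, assuming each concept is reduced to the set of protonym IDs it includes. The function name, the labels "is included in" and "excludes", and the IDs P1..P7 are hypothetical placeholders, not anything defined by GNUB:

```python
def concept_relationship(a, b):
    """Compare two concepts, each given as a set of protonym IDs."""
    a, b = frozenset(a), frozenset(b)
    if a == b:
        return "is congruent with"
    if a > b:                      # a is a proper superset of b
        return "includes"
    if a < b:                      # a is a proper subset of b
        return "is included in"
    if a & b:                      # some, but not all, protonyms shared
        return "overlaps with"
    return "excludes"

# Hypothetical IDs for the seven Pacific protonyms in the example.
EASTERN = {"P1", "P2", "P3"}
WESTERN = {"P4", "P5", "P6", "P7"}

aus_bus_sec_a = EASTERN            # Taxonomist A
aus_bus_sec_b = EASTERN            # Taxonomist B
print("Aus bus sec. A", concept_relationship(aus_bus_sec_a, aus_bus_sec_b),
      "Aus bus sec. B")
```

Because the central Pacific populations carry no protonym, scenarios 1 and 2 would both still come out as "is congruent with" here, which is exactly the fuzziness discussed above.
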
>> The real world example that forms my litmus test is the
>> blue-headed vireo,  Vireo solitarius (Wilson 1810) which was
>> originally called Muscicapa solitaria and has also been
>> combined to form Vireosylvia
>> solitaria and Lanivireo solitarius.   Of course there are lexical
>> variants as well (Google "Lanivireo solitaria" for example).   These,
>> properly structured, would be the sort of useful set of
>> lexical/ nomenclatural content I would hope as a response
>> from a  GNI/GNUB resolution service based on protonymID.
> Send me a bunch of usage instances involving all the different name
> variants, and involving various concept definitions, and I can create a
> sample GNUB dataset that would illustrate how this would work.  The
> name-mapping thing is trivial, once the TNU instances have been populated.
> The concept mapping stuff is a bit more complex -- but still relatively
> simple compared to algorithms for, say, oxygen control systems in
> rebreathers..... :-)
>> One current view of the taxon (concept C1) has this species occupying
>> the eastern part of the US.   Another species, Vireo plumbeus Coues,
>> 1866, (concept C2) occupies the middle west USA, and a third
>> species, Vireo cassini Xántus de Vesey, 1858 (concept C3) is
>> on the western coast.
>> Another view lumps all three of these into a single species
>> which, based on the rule of priority, has the valid name
>> Vireo solitarius and results in a new concept (C4).  This
>> concept includes C1, C2,  and
>> C3.   Both concepts have the scientific name of Vireo solitarius.
>> We can access and represent these in a consistent fashion
>> using our CLB and probably others can too in their own index models.
>> So, now we have a specimen of Vireo solitarius that was captured in
>> Minnesota.   It might be an errant instance of C1, Vireo solitarius
>> sensu stricto, that strayed a bit west of normal.   It might be (C4)
>> Vireo solitarius, sensu lato.     The specimen would need that concept
>> identifier tied to the record to make this explicit.    So,  let's say
>> that the identifier was made using the lumped concept (C4).
>> Of course, if this doesn't make it into the record, we are
>> stuck with the name alone.
> Right -- this sounds like the same as the hypothetical example I made above.
> But like I say, I think this example is the exception, rather than the rule
> (i.e., it falls in the missing 20% of the "benefit" in the 80% benefit/20%
> work ratio).
>> Using the method (6) you described would allow a user to
>> discover the different treatments of Vireo solitarius (C1 and
>> C4) and provide some means to discriminate them via concept
>> resolution.
>> - C4 includes C1, C2, and C3 which would include all the names above.
>> - C1 would only include the nomenclatural/lexical variants
>> for Vireo solitarius.
>> Resolution will enable us to perform a significantly more
>> useful and concept-informed search.  It will, however,
>> include the specimen I referenced above in BOTH cases because
>> "Vireo solitarius" or its protonymID will be a search term
>> in both cases.
> Right -- until someone else comes along and provides a more explicit
> identification for that specimen.
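
A rough sketch of the name expansion for the vireo example, assuming each concept is stored as a set of protonym IDs and each protonym carries its known combinations and lexical variants. The protonym keys and both function names are made up for illustration; the names and concept labels C1..C4 are the ones given in this thread:

```python
# Names attached to each (hypothetical) protonym ID.
NAMES_BY_PROTONYM = {
    "P-solitaria": ["Muscicapa solitaria", "Vireo solitarius",
                    "Vireosylvia solitaria", "Lanivireo solitarius",
                    "Lanivireo solitaria"],
    "P-plumbeus": ["Vireo plumbeus"],
    "P-cassini": ["Vireo cassini"],
}

# Concepts as sets of protonym IDs: C4 lumps C1, C2 and C3.
CONCEPTS = {
    "C1": {"P-solitaria"},
    "C2": {"P-plumbeus"},
    "C3": {"P-cassini"},
    "C4": {"P-solitaria", "P-plumbeus", "P-cassini"},
}

def search_terms(concept_id):
    """All name strings to send to an index for one concept."""
    names = set()
    for protonym_id in CONCEPTS[concept_id]:
        names.update(NAMES_BY_PROTONYM[protonym_id])
    return sorted(names)

def matching_concepts(name):
    """Concepts whose expanded name set contains the given name string."""
    return sorted(c for c in CONCEPTS if name in search_terms(c))

print(matching_concepts("Vireo solitarius"))
```

A specimen labelled only "Vireo solitarius" matches both C1 and C4, so it comes back under either search, as noted above.
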
>> A more precise concept based system would utilise a required
>> taxon concept identifier in the specimen record to
>> discriminate different uses of the SAME NAME.
> Sure!  That would be fantastic -- and maybe someday we'll get to the point
> where all specimen/observation identification events come in the form of
> "Aus bus sec. Smith 1955", rather than simply "Aus bus" (as the vast
> majority are now).  This, in my mind, is the single greatest and most
> consistent informatics failure within legacy taxonomic works and specimen
> databases.  But I think the good news is that we can still get 80% of the
> benefit by going only as far as protonyms (which we *can* derive from a name
> alone -- once we get past homonymy and gross misspellings).
>> In other
>> words,  if you did a search of Vireo solitarius and the
>> concept resolver indicated the different concepts above and
>> you chose the sensu stricto (split) version,  you would get
>> the C1 labelled records but the C4 labelled records would be
>> excluded or at least come with a warning (may not be what you
>> are looking for).  This of course requires our specimen
>> records to have a concept
>> identifier.   Or,  the concept definition itself will include
>> additional annotations to enable us to make inferences
> I think the best we can do is flag those cases, and rely on caveat emptor.
>> Publication date of the concept - If the split didn't happen
>> until 1980 and the specimen is from 1960 then we could infer C4.
>> Distribution information for the concept - if we disregard
>> errant specimens then we might infer a 1985 Minnesota
>> specimen is a C2 in spite of the different name.
> The date one could work within the GNUB architecture, because the dates are
> all there (as long as the specimen identification was also dated).  With the
> right integration with GBIF, the distribution one *might* be derivable
> algorithmically, but it would depend on the nature of the data.
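
The date-based inference could be sketched roughly as follows, assuming the split was published in 1980 (the hypothetical year used above) and that the identification carries a year. The function name and the default year are illustrative assumptions, not part of GNUB:

```python
def infer_concept(identification_year, split_year=1980):
    """Guess which Vireo solitarius concept an identification used.

    Returns "C4" (lumped, sensu lato) when the identification predates
    the split, and None when nothing can be inferred from the date.
    """
    if identification_year is None:
        return None                # undated identification: no inference
    if identification_year < split_year:
        return "C4"                # the split concept did not exist yet
    return None                    # after the split, either concept is possible

print(infer_concept(1960))
print(infer_concept(1985))
```

After the split year the date alone does not discriminate, which is where the distribution data would have to come in.
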
>> In sum,  we are on track for achieving this and I believe our
>> data mobilisation strategy will support getting these sort of data
>> published.   When Markus returns from paternity leave I would hope we
>> could include his thoughts on how we might expose these as
>> RDF via our indices to support all aspects of this discussion.
> Keep on a keepin' on....
> Rich
> P.S. Congrats to Markus!  I was unaware!
> _______________________________________________
> tdwg-content mailing list
> tdwg-content at lists.tdwg.org
> http://lists.tdwg.org/mailman/listinfo/tdwg-content