Re: [tdwg-content] DwC taxonomic terms

4 Sep 2009

Hi Peter,

Thank you for your email, but it seems evident that either I failed to
adequately explain how my proposed solution would work, or I fail to
understand what you would expect to extract from resolving a TaxonConceptID.
Perhaps you could describe for me what a TaxonConceptID would resolve to (or
point me to an example), and how it would be more effective at addressing
the needs of researchers? Also, can you be more explicit about what you mean
by "things" in the question, "Are these the same or different things?"  By
"things", do you mean that both refer to the same original description of
the species epithet "triseriatus", or do you mean that they represent
the same taxon concept circumscription? I suspect you mean the latter, in
which case it would also apply for two different references to "Aedes
triseriatus". 

Addressing the former, perhaps I didn't explain that a fundamental component
of an object resolved through a TaxonNameUsageID is a link to the Protonym
(~Basionym), and as such a link to a TaxonNameUsageID for Aedes triseriatus
would itself cross-link to all TaxonNameUsage instances of Ochlerotatus
triseriatus, thereby revealing the congruency of the "triseriatus" epithet.
As for the latter, services built on top of TaxonNameUsage instances would
go the next step and resolve whether or not two different references to the
name "Aedes triseriatus" (or one to A. triseriatus and one to O.
triseriatus) apply to the same taxon concept circumscription.
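The cross-linking described above can be sketched roughly as follows. This is only an illustration: the record layout, the field names (usage_id, protonym_id), the identifier values, the placeholder references and the name "Aedes exemplum" are all assumptions made for the example, not the actual GNUB structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaxonNameUsage:
    usage_id: int     # hypothetical TaxonNameUsageID
    full_name: str    # the name as it is used in the reference
    reference: str    # placeholder label for the publication
    protonym_id: int  # hypothetical ID of the Protonym (~Basionym) record


# Illustrative records only; the references are placeholders, not citations.
USAGES = [
    TaxonNameUsage(1, "Aedes triseriatus", "reference A", 100),
    TaxonNameUsage(2, "Aedes triseriatus", "reference B", 100),
    TaxonNameUsage(3, "Ochlerotatus triseriatus", "reference C", 100),
    TaxonNameUsage(4, "Aedes exemplum", "reference A", 200),  # made-up name
]


def usages_sharing_protonym(usage_id, usages):
    """Return every usage instance linked to the same Protonym as usage_id."""
    by_id = {u.usage_id: u for u in usages}
    protonym = by_id[usage_id].protonym_id
    return [u for u in usages if u.protonym_id == protonym]


linked = usages_sharing_protonym(1, USAGES)
names = sorted({u.full_name for u in linked})
print(names)
```

Starting from one usage of Aedes triseriatus, the shared Protonym link also reaches the usages of Ochlerotatus triseriatus. Whether two usages share the same concept circumscription is a separate question that this sketch does not attempt to answer.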

To be sure, these services do not exist yet, but they are unambiguously in
development right now, and such an infrastructure for taxon name resolution
was identified as a high priority at the eBiosphere conference, so there is
now at least a reasonably clear roadmap to developing and implementing this
infrastructure.  By contrast, I'm not sure I've ever found two people who
have exactly the same notion of what kind of object a TaxonConceptID would
resolve to, and how its associated metadata would answer the question you
pose below about Aedes triseriatus vs. Ochlerotatus triseriatus (which is
why it would be helpful for me to see a specific example of a
TaxonConceptID, and what it resolves to).

Aloha,
Rich

Richard L. Pyle, PhD
Database Coordinator for Natural Sciences
  and Associate Zoologist in Ichthyology
Department of Natural Sciences, Bishop Museum
1525 Bernice St., Honolulu, HI 96817
Ph: (808)848-4115, Fax: (808)847-8252
email: deepreef@bishopmuseum.org
http://hbs.bishopmuseum.org/staff/pylerichard.html

________________________________

	From: Peter DeVries [mailto:pete.devries@gmail.com] 
	Sent: Thursday, September 03, 2009 4:32 PM
	To: Richard Pyle
	Cc: tdwg-content@lists.tdwg.org
	Subject: Re: [tdwg-content] DwC taxonomic terms

	Richard, 

	Your proposed plan does not actually give researchers what they need
for large scale analysis.

	They need to know what is "meant" by a particular identifier.

	In the mosquito community we have a split between those who have
adopted Ochlerotatus as a genus and those who have not.
	For some this changed Aedes triseriatus to Ochlerotatus triseriatus;
others refuse to adopt the new name.

	Are these the same or different things?

	Under your scheme they are different things because the idea that an
entity is a species is merged
	with the particular taxonomic placement of that entity.

	How does your proposal solve this?

	What is needed is a linked data identifier that resolves to data
that help determine those instances of
	Aedes triseriatus and Ochlerotatus triseriatus that are the same,
and those instances that are different.

	In reference to the earlier discussion on separating identifiers
from resolution, how will a user determine
	if occurrences tagged with the Aedes triseriatus UUID or LSID and
those tagged with the Ochlerotatus
	triseriatus LSID are referring to the same species?

	The proposed solution leaves users with just a name and no clear way
of determining what the person identifying
	the specimen actually meant. The original species description is
amazingly non-informative.

	Most non-taxonomists don't care that much about what particular
genus something is in. They care that
	the specimens they collected with malaria parasites are linked to
other specimens of the same species.
	At those times when they do care, they want a quick way to look up the
current name (i.e., phylogenetic hypothesis)
	that can remain linked to their data.

	If you leave in the TaxonConceptID, then users have a choice of
filling it in or ignoring it. For those that would
	like to use something like this, it will dramatically improve data
integration and move disagreements about
	name changes into the background, a change that, I think, would
improve the relationship between taxonomists
	and other biological scientists.

	There were a number of other issues in previous emails that
suggested that the taxonomic community
	has chosen to rehash informatics issues that have already been
thoroughly discussed by the scientific
	informatics community. What is somewhat alarming is that they seem
to have come to completely
	opposite conclusions.

	Also the thread on "trust" seemed particularly misinformed. If the
writer intended to imply that by going to
	the current GBIF site they can "trust" the data, they are wrong. I
see no mechanism on the GBIF home
	page that allows me to determine that this is the "real" GBIF site.

	This is not meant to disparage GBIF, but to clarify the discussion.
In fact the person who seems to be
	the most concerned with "trust" does not have any way to
authenticate that his highly touted resolution
	service is the "real" one. 

	I suspect that the "trust" issue was either particularly uninformed
or a smoke screen for a different issue
	which may be about data and services from cronies vs. data and
services from non-cronies.

	If you don't trust a particular provider, you can just remove those
URIs from your data store by filtering by
	"context" or reification.
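A minimal sketch of filtering by "context", assuming each statement in the data store is kept as a (subject, predicate, object, context) tuple in which the context names the provider. The example.org URIs and the provider labels are hypothetical.

```python
# Each statement carries a fourth element naming its source ("context").
# All URIs and provider labels below are made up for illustration.
QUADS = [
    ("http://example.org/occurrence/1", "http://example.org/hasTaxon",
     "http://example.org/taxon/42", "providerA"),
    ("http://example.org/occurrence/2", "http://example.org/hasTaxon",
     "http://example.org/taxon/42", "providerB"),
    ("http://example.org/occurrence/3", "http://example.org/hasTaxon",
     "http://example.org/taxon/77", "providerA"),
]


def drop_providers(quads, untrusted):
    """Keep only the statements whose context is not an untrusted provider."""
    return [quad for quad in quads if quad[3] not in untrusted]


trusted = drop_providers(QUADS, {"providerB"})
print(len(trusted))
```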

	Respectfully,

	- Pete

---------------------------------------------------------------
	Pete DeVries <http://spiders.entomology.wisc.edu/pjd/index.html> 
	Department of Entomology
	University of Wisconsin - Madison
	445 Russell Laboratories
	1630 Linden Drive
	Madison, WI 53706
	Email: pdevries@wisc.edu
	GeoSpecies Knowledge Base <http://species.geospecies.org/> 
	About the GeoSpecies Knowledge Base <http://about.geospecies.org/> 
	------------------------------------------------------------

	On Wed, Sep 2, 2009 at 2:53 PM, Richard Pyle
<deepreef@bishopmuseum.org> wrote:

		Greetings (again)...

		With a slightly more rested brain, I'll provide some more
specific feedback
		on the DwC Taxonomy terms.  I'll use John's Aug 25 proposed
list of terms &
		definitions as a starting point.

		(Tim -- go get a cup of coffee before continuing....)

		> taxonID: An identifier for a specific taxon-related name
usage (a
		> Taxon record). May be a global unique identifier or an
identifier
		> specific to the data set.

		As I said in my previous post, I worry that "taxon" is too
familiar, and has
		too many meanings such that, without reviewing the
definition, people may
		jump to the wrong conclusion about what sort of data object
should be
		resolved through this ID.  As klunky as it is, I feel it
better to be
		unambiguous and use something like "taxonNameUsageID".  This
is the term GNUB
		has adopted; and while GNUB is still in early draft form, it
took literally
		decades of deliberation to finally arrive at that term.  If
GNA & GNUB gain
		the traction that many of us are hoping they will, I believe
that the term
		"TaxonNameUsage" will become much more familiar to managers
of taxonomic
		data in the future.  Thus, I would propose:

		taxonNameUsageID: An identifier for a specific taxon-related
name usage
		instance (a particular name as it is used within the context
of a particular
		publication or other documentation source). May be a global
unique
		identifier or an identifier specific to the data set.

		> acceptedTaxonID: A unique identifier for the
acceptedTaxon.

		I'm not exactly sure what this is supposed to represent, but
I gather that
		it is used in cases where the taxon name for this record is
not regarded as
		the accepted taxon name. Stan wrote:

		> In the context of an identification, yes, a taxon is
asserted
		> to be valid/accepted by the identifier (at the time), but
not
		> all identifications are accepted by the data manager, so
that
		> last statement isn't always true.  Also not all taxa are
		> accepted/valid within a classification (if it includes
		> synonymous taxa).

		If this is the purpose for the "acceptedTaxonID" (and I
agree it's important
		to represent this), then I think we need to be more explicit
about what is
		meant by accepted.  For example, consider these three
different meanings
		(I'll use the terms provided by John, rather than my
recommended terms):

		1. Accepted in the sense of name orthography
		A specimen was identified as "Centropyge loricula", so the
TaxonID resolves
		to this name.  The data manager knows that the correct
orthography is
		"Centropyge loriculus", so acceptedTaxonID resolves to that
name.

		2. Accepted in the sense of subjective synonymy
		A specimen was identified as "Centropyge flammeus", so the
TaxonID resolves
		to this name.  The data manager follows modern literature in
treating this
		name as a junior synonym of C. loriculus, so acceptedTaxonID
resolves to
		"Centropyge loriculus".

		3. Accepted in the sense of Concept Circumscription
		A specimen was identified as "Centropyge loriculus" and the
TaxonID resolves
		to the usage instance of "Centropyge loriculus Günther 1874
sec Woods &
		Schultz 1953", but the data manager feels this is not the
most appropriate
		circumscription for the taxon represented by the specimens,
so
		acceptedTaxonID resolves to the usage instance of
"Centropyge loriculus
		Günther 1874 sec Allen 1975".

		In my mind, all three of these would be appropriate use
cases for
		acceptedTaxonID; but I suspect some people would not regard
#3 as
		appropriate.  As long as taxonID and acceptedTaxonID both
point to Usage
		instances, it doesn't really matter, because a resolved
Usage Instance
		record will provide the full set of metadata to do whatever
comparison
		(orthography/synonymy/circumscription) the consumer of the
record wishes to
		do.  However, I do think the definition of the term should
address these
		different possible resolutions of meaning.

		The draft GNUB structure (which I can send to anyone who is
interested) has
		a field called "ValidUsageID", which is a recursive foreign
key to the same
		or a different Usage Instance, and is used explicitly for
synonym treatments
		(#2 in the above list).  Best to explain by example:

		Each row below represents a Taxon Name Usage Instance, and
"VUID" refers to
		ValidUsageID.

		TNUID  Reference             VUID  FullName
		=======================================================
		1      Günther 1874          1     Centropyge loriculus
		2      Woods & Schultz 1953  2     Centropyge flammeus
		3      Allen 1975            3     Centropyge loriculus
		4      Allen 1975            3     Centropyge flammeus
		=======================================================

		For the first three records, TNUID=VUID.  This means that
each of those
		publications treated each of those names as a valid species.
By contrast,
		TNUID 4 has VUID 3 (i.e., TNUID<>VUID), which means that
Allen 1975 treated
		the name "Centropyge flammeus" as a junior synonym of
"Centropyge
		loriculus".  Note that in the GNUB data model, the VUID
link must point to
		a TNUID within the same Reference.  For example, in row #4,
VUID=3; not 1. In
		simplest terms, row #4 translates to "Allen 1975 regarded
Centropyge
		flammeus as a junior synonym of Centropyge loriculus."  In
other words, this
		relationship applies specifically to use-case #2 in the list
above.
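The ValidUsageID relationship can be sketched in a few lines, using the four rows from the example. The dictionary layout and the function names are assumptions for illustration, not the GNUB schema.

```python
# (TNUID, Reference, VUID, FullName), taken from the four example rows.
ROWS = [
    (1, "Günther 1874", 1, "Centropyge loriculus"),
    (2, "Woods & Schultz 1953", 2, "Centropyge flammeus"),
    (3, "Allen 1975", 3, "Centropyge loriculus"),
    (4, "Allen 1975", 3, "Centropyge flammeus"),
]

USAGES = {
    tnuid: {"reference": reference, "vuid": vuid, "name": name}
    for tnuid, reference, vuid, name in ROWS
}


def describe(tnuid):
    """Translate one usage instance into a plain statement about validity."""
    usage = USAGES[tnuid]
    if usage["vuid"] == tnuid:
        return f'{usage["reference"]} treated {usage["name"]} as a valid species.'
    valid = USAGES[usage["vuid"]]
    return (f'{usage["reference"]} regarded {usage["name"]} '
            f'as a junior synonym of {valid["name"]}.')


def links_outside_reference():
    """List TNUIDs whose VUID points to a usage in a different Reference."""
    return [
        tnuid for tnuid, usage in USAGES.items()
        if USAGES[usage["vuid"]]["reference"] != usage["reference"]
    ]


for tnuid in USAGES:
    print(tnuid, describe(tnuid))
```

The second function checks the rule that a ValidUsageID must stay within its own Reference; for these four rows it returns an empty list.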

		As for the term itself, my recommendation would depend on
which of the three
		use-case examples listed above the term "acceptedTaxonID" is
intended to
		represent.  If it is really only meant for Use-case #2
(synonymy), then I
		would recommend following GNUB with "validUsageID".
However, I think it's
		probably best to leave the scope of meaning of the term open
to any of these
		use-cases, in which case I would recommend the term
"acceptedUsageID".  But
		in either case, I think the definition needs to be more
explicit.

		> higherTaxonID: A unique identifier for the taxon that is
the parent of
		> the scientificName.

		Again, why not be explicit?  Following the "taxon" root-stem
approach, this
		should probably be "parentTaxonID".  In the GNUB data model,
the field used
		for this exact same purpose is "ParentUsageID".  So,
accordingly, my
		recommendation for the DwC term would be "parentUsageID".

		> originalTaxonID: A unique identifier for the basionym
(botany),
		> basonym (bacteriology), or replacement of the
scientificName.

		I wrestled with this term a lot when developing the
Taxonomer data model,
		and launched several threads on Taxacom about it, and
discussed it
		extensively with many database nerds and taxonomy nerds of
all Code
		flavors.  "Protologue" was the closest existing term to what
this term is
		intended for, but the problem with "Protologue" (a term
familiar to
		botanical taxonomists) is that it may be spread across more
than one
		publication.  As I understand it, it's the set of Usage
Instances that
		collectively fulfill the criteria for a name being validly
published.  I
		finally decided on the term "Protonym". Although I later
discovered that
		this word had been defined in a different way in the context
of fungi
		taxonomy, I was assured by Paul Kirk (curator of Index
Fungorum) that my use
		of the term should take precedence.  Consequently, the term
we use in GNUB
		(Paul is one of the original architects of GNUB) is
"ProtonymID".

		I'm not necessarily pushing for DwC to adopt this term;
however, I am
		reasonably confident that GNUB will retain it, and depending
on the future
		success of GNUB, it may end up becoming solidified in our
community.  As
		such, I think "protonymID" is the best term to use for DwC.
However, if
		this is not acceptable, then I would suggest
"originalUsageID" as a more
		explicit alternative.

		> scientificName: The taxon name (with date and authorship
information
		> if applicable). When forming part of an Identification,
this should be
		> the name in the lowest level taxonomic rank that can be
determined.
		> This term should not contain Identification
qualifications, which
		> should instead be supplied in the IdentificationQualifier
term.

		This is probably fine, but it sort of depends on where DwC
settles on the
		definition of "acceptedTaxon(ID)/acceptedUsage(ID)".  If the
scope includes
		orthographic variants, then the definition of scientificName
should be
		expanded to explicitly refer to "exact orthography" (which
may or may not
		match the orthography represented by acceptedXXX).  In GNUB,
each usage has
		a field called "VerbatimNameString", which is intended to
capture the exact
		string of characters (as best as can be represented via
UTF-8) that appeared
		in the publication/reference.  However, I don't think this is
necessary for
		DwC.  But I do think the definition of scientificName should
make comment on
		orthography.

		> acceptedTaxon: The currently valid (zoological) or
accepted
		> (botanical) name for the scientificName.

		This definition suggests that this term applies only to my
use-case #2
		(synonymies).  As described earlier, in GNUB (which was
initially developed
		by two botanists and one zoologist), the term "valid" was
used instead of
		"accepted".  Either one will do, but I think it makes sense
to follow GNUB.
		In any case, I would propose the following:

		If the intent is only for taxonomic synonymies (use-case 2),
then go with
		"validUsage" to be consistent with GNUB, and recommend that
a full
		usage-instance string ("Centropyge loriculus Günther 1874
sec Allen 1975")
		be provided, if available.

		If the intent is less specific, and is open to
		orthographic/synonym/circumscription relationships, then go
with
		"acceptedUsage" (with the same full usage-instance string).

		> higherTaxon: The taxon that is the parent of the
scientificName.

		Again, I would go with "parentUsage", and recommend the full
usage-instance
		string.

		> originalTaxon: The basionym (botany), basonym
(bacteriology), or
		> replacement of the scientificName.

		As per above, I would go with "protonym" (which need only be
a name-string,
		such as "Centropyge loriculus Günther 1874"); but if not
protonym, then
		"originalUsage".

		> higherClassification: A list (concatenated and separated)
of the names
		> for the taxonomic ranks less specific than that given in
the
		> scientificName.

		I'm fine with this.

		> kingdom, phylum, class, order, family, genus, subgenus,
		> specificEpithet, infraspecificEpithet - all unchanged.

		Fine by me.

		> taxonRank: The taxonomic rank of the scientificName.
Recommended best
		> practice is to use a controlled vocabulary.

		Fine by me.

		> verbatimTaxonRank: The verbatim original taxonomic rank of the
		> scientificName.

		I think this is OK, but I'm not entirely sure how strictly
the term
		"verbatim" is applied.  For example, should this be verbatim
as it appears
		on the specimen label or original database record (e.g.,
"f." if it says
		"f."; "forma" if it says "forma", etc.)?  Or does it just
mean the
		"interpreted" rank (i.e., convert "f." to "forma")?  My
inclination is the
		former; but for most names (i.e., those without explicit
rank qualifiers
		embedded within the name-string), this would be blank.  For
example, all
		species and higher ranks would be blank, because nobody
explicitly writes
		"species" when listing a species name.  To a zoologist, a
subspecies name
		looks like "Centropyge loriculus flammeus", but to a
botanist it looks like
		"Centropyge loriculus subsp. flammeus".  Sensu stricto, the
use of the word
		"verbatim" would imply that the zoologist would leave this
item empty, but
		the botanist would enter "subsp."  Do I interpret this
correctly?  Or (as I
		suspect), do I misunderstand the purpose of this item?
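The two readings of "verbatim" can be contrasted with a short sketch. The marker table below is an assumption that covers only the rank markers mentioned in this message, and both function names are hypothetical.

```python
# Hypothetical lookup from a rank marker as written to its interpreted rank.
RANK_MARKERS = {
    "f.": "forma",
    "forma": "forma",
    "subsp.": "subspecies",
}


def verbatim_rank(name_string):
    """Strict reading: return the marker exactly as written, or '' if none."""
    for token in name_string.split():
        if token in RANK_MARKERS:
            return token
    return ""


def interpreted_rank(name_string):
    """Loose reading: normalize the marker (e.g. 'f.' becomes 'forma')."""
    return RANK_MARKERS.get(verbatim_rank(name_string), "")


zoological = "Centropyge loriculus flammeus"
botanical = "Centropyge loriculus subsp. flammeus"
print(repr(verbatim_rank(zoological)), repr(verbatim_rank(botanical)))
```

Under the strict reading the zoological form yields an empty value, because no rank marker is written in the name-string, while the botanical form yields "subsp." exactly as written.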

		> scientificNameAuthorship, nomenclaturalCode - unchanged

		Fine by me.

		> taxonPublicationID: A unique identifier for the
publication of the Taxon.

		Presumably this would be the publication to which the
specific usage
		instance for taxonID/taxonNameUsageID is anchored.  If so,
then I think the
		definition needs to be expanded.  As written, some people
might interpret
		the publication as always being the original publication
(i.e., the "Günther
		1874" of "Centropyge loriculus Günther 1874 sec Allen
1975").  Others might
		(more correctly, in my view) interpret it as the concept
definition
		publication (i.e., the "Allen 1975" of "Centropyge loriculus
Günther 1874
		sec Allen 1975").
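The two candidate publications can be pulled apart from a full usage-instance string with a simple split on " sec ". This is a sketch only: it assumes the string contains at most one " sec " separator, and the function name and dictionary keys are hypothetical.

```python
def split_usage_string(usage_string):
    """Split 'Name Author Year sec Reference' into its two parts."""
    name_part, separator, according_to = usage_string.partition(" sec ")
    return {
        # Name plus original authorship, e.g. the "Günther 1874" part.
        "name_with_authorship": name_part,
        # Concept-definition publication, or None if no " sec " is present.
        "according_to": according_to if separator else None,
    }


parts = split_usage_string("Centropyge loriculus Günther 1874 sec Allen 1975")
print(parts)
```

A definition of taxonPublicationID would need to say which of these two parts the identifier refers to.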

		> taxonPublication: A reference for the publication of the
Taxon.

		Same comment as above.

		> taxonomicStatus, nomenclaturalStatus, taxonAccordingTo,
taxonRemarks,
		> vernacularName - unchanged.

		I'm fine with all of these except possibly taxonAccordingTo,
which I need to
		think about some more.

		Sorry for the long post -- I'm just making up for having not
been part of
		this discussion earlier.  I am more than happy to help draft
revised
		definitions for all of these terms, but only after we
resolve their intended
		scope & meaning.

		By the way, where do I find the current draft definitions
for all these
		terms? When I go to
http://code.google.com/p/darwincore/wiki/Taxon, I only
		see definitions for three of the terms.

		Aloha,
		Rich
