Re: [tdwg-tag] [tdwg-rdf: 105] Re: Any TCS users with experiences to report?

27 Nov 2012

I read Rich's email as quoted in Nico's reply; I think Rich's post may 
not have gone out on the tdwg-tag or RDF group lists.  Rich mentions 
that he is swamped and will reply later.  For the moment, it may be 
helpful to cite an earlier email of Rich's, which took me some time to 
dig out of the tdwg-content email list:

http://lists.tdwg.org/pipermail/tdwg-content/2010-October/001703.html

In that post, Rich was responding to a thread that started when I asked 
how one would handle a real-life situation (the specimen pictured in 
http://images.cyberfloralouisiana.com/images/specimensheets/lsu/0/0/4/28/LSU...).  
The relevant part begins about half way down the page with "In the web 
example given by Steve, we have... ".  In that section, Rich notes that

"Eventually, a third party may be able to deduce (perhaps through a suite of
other, external information) a RelationshipAssertion that maps the TNU
"[Juncus] diffusissimus Buckl. sec L. Urbatsch 2009" to some other, perhaps
published and well-defined taxon concept (of the same or different name).
Also, if there are 100 specimens in the collection that L. Urbatsch
identified as "Juncus diffusissimus Buckl." in 2009, then anchoring all 100
Identification instances to the one TNU, allows all of those specimens to
inherit the mapping of the one "[Juncus] diffusissimus Buckl. sec L.
Urbatsch 2009" TNU instance to some other better-defined taxon concept."

From that post, I understood that a TNU (a.k.a. "assertion" in Pyle 
2004 http://systbio.org/files/phyloinformatics/1.pdf) can be as vague as 
an idea that some determiner had in his/her head about how 
organism/specimen instances should be mapped to a name.  I think from 
what Rich said there that there is the potential that we as metadata 
aggregators may at some later point be able to map how that idea in the 
determiner's head fits in with a more well-defined (e.g. published) 
taxon description which one may choose to call a taxon concept rather 
than a TNU. 
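
Rich's point about the 100 specimens can be sketched roughly as follows. This is only an illustration under my own assumptions: the class names, the relation label "is_congruent_with", the specimen catalog numbers, and the UUIDs are all hypothetical and do not come from GNUB or any real registry.

```python
from dataclasses import dataclass

# Hypothetical identifiers; neither comes from a real registry.
TNU_URBATSCH = "urn:uuid:00000000-0000-0000-0000-000000000001"
CONCEPT_PUBLISHED = "urn:uuid:00000000-0000-0000-0000-000000000002"


@dataclass(frozen=True)
class Identification:
    specimen_id: str   # hypothetical catalog number of the specimen
    tnu_id: str        # the TNU the determiner anchored the specimen to


@dataclass(frozen=True)
class RelationshipAssertion:
    source_tnu: str    # the vaguer TNU
    relation: str      # hypothetical label for a set-theory relationship
    target: str        # the better-defined taxon concept
    asserted_by: str   # whoever makes the mapping


# 100 specimens identified in 2009, all anchored to the one TNU.
identifications = [
    Identification(f"LSU-{n:05d}", TNU_URBATSCH) for n in range(1, 101)
]

# One later assertion by a third party maps the TNU to a published concept.
assertions = [
    RelationshipAssertion(TNU_URBATSCH, "is_congruent_with",
                          CONCEPT_PUBLISHED, "third party"),
]


def concepts_for(specimen_id):
    """Return the (relation, concept) pairs a specimen inherits via its TNU."""
    tnus = {i.tnu_id for i in identifications if i.specimen_id == specimen_id}
    return [(a.relation, a.target) for a in assertions if a.source_tnu in tnus]
```

Because every Identification points at the same TNU, the single RelationshipAssertion is enough for concepts_for() to return the published concept for any of the 100 specimens.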

As so often is the case, I think the problem here boils down to 
identifiers and the metadata that we associate with them.  Let's say in 
the real-life example above, somebody (we can say GNUB) assigns a 
persistent identifier (perhaps a URI constructed from an opaque UUID) to 
"Juncus diffusissimus Buckl. sec L. Urbatsch 2009".  We could say with 
an rdf:type statement that the resource identified by the URI is a TNU.  
We can give that resource a tc:hasName property linking it to the name 
which is represented by the string "Juncus diffusissimus Buckl.".  (I'm 
not sure what property we use to say that L. Urbatsch made the 
assertion.)  Now let's say that L. Urbatsch publishes a paper describing 
in detail her concept of Juncus diffusissimus Buckl.  We can now assign 
the resource identified by the URI a tc:accordingTo property whose value 
is the DOI of the paper she wrote.  If we want, we can replace the 
previous rdf:type statement with a different one stating that the resource 
is a taxon concept rather than a TNU, or if we believe that all taxon 
concepts are also TNUs we can leave the rdf:type statement that we had 
before and just add a second one saying that the resource is also of 
type taxon concept. 
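
As a rough sketch of that sequence, here are the statements written as plain (subject, predicate, object) tuples in Python. Several things are assumptions on my part: the TDWG Taxon Concept ontology namespace is given as I recall it and should be checked, the TNU class URI is invented because I don't know of a standard one, and the example.org URIs, the UUID, and the DOI are all hypothetical.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
TC = "http://rs.tdwg.org/ontology/voc/TaxonConcept#"  # assumed namespace
EX = "http://example.org/"                            # hypothetical base URI

# Persistent identifier built from an opaque (hypothetical) UUID.
tnu = EX + "tnu/3f2b1c4e-8a7d-4e5f-9b6a-1c2d3e4f5a6b"

graph = set()

# Stage 1: all we know is that this is a usage of a name.
graph.add((tnu, RDF_TYPE, EX + "terms/TaxonNameUsage"))  # hypothetical class
graph.add((tnu, TC + "hasName", EX + "name/juncus-diffusissimus-buckl"))

# Stage 2: a paper is published, so we add a sec. reference (hypothetical DOI)
# and a second type. The identifier itself does not change.
graph.add((tnu, TC + "accordingTo", "https://doi.org/10.9999/hypothetical"))
graph.add((tnu, RDF_TYPE, TC + "TaxonConcept"))


def objects(subject, predicate):
    """Return the sorted object values for one subject and predicate."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)
```

After stage 2, objects(tnu, RDF_TYPE) returns both types, which corresponds to the "all taxon concepts are also TNUs" option. Dropping the first rdf:type tuple instead would correspond to the "replace" option.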

The point I'm trying to make is that as long as this "thing" that we are 
variously calling "taxon name usage", "taxon concept", "shallow 
taxonomic concept", or "deep taxonomic concept" can be assigned an 
identifier, what really matters is the metadata we associate with it, 
not really what we call it.  The more metadata that we can connect with 
it, either through datatype properties like name strings or object 
properties that describe how the "thing" is related to other resources, 
the "deeper" the concept.  On the other extreme, we may know nothing 
more than the name string.  In that case we could call it a "nominal 
concept", but we could still assign it an identifier and maybe with luck 
we could associate more metadata with it (make it "deeper") at some 
point in the future. 

Returning to the original question of the thread (which was about the 
utility of TCS), TCS tries to deal with this problem using a thing 
called "signatures" (section 17.2, see 
http://bioimages.vanderbilt.edu/pages/TCS-Schema-UserGuide-v1.3.pdf) 
which are a somewhat crude attempt to make identifying strings unique by 
standardizing their format.  However, TCS was written in 2005-2006.  
Since then, the development of DOIs, the TDWG GUID Applicability 
Statement standard, and best practices in the Linked Data world have 
provided well-established and standardized ways to create persistent and 
dereferenceable identifiers.  So there isn't any reason I can see why we 
can't use them. 

I am going to be bold and say that we already have the minimum tools 
required to get started implementing TNUs/TaxonConcepts:
- URI GUIDs (which, if one preferred, could be UUIDs or LSIDs, HTTP 
proxied to make Linked Data people happy; see the TDWG GUID 
Applicability Statement standard if you don't know how to do this) to 
identify the TNU/concepts,
- the two terms tc:hasName and tc:accordingTo (from the TDWG Taxon 
Concept ontology) to relate the TNUs/TaxonConcepts to names and sec. 
references, and
- some sources for name and publication URI GUIDs. 
There are deficiencies all over the place for that last item, but they 
can be addressed over time by improving the scope of the relevant 
databases and the quality of the metadata provided.  uBio has URIs for 
almost every name I've ever looked for.  BHL has a growing collection of 
old literature which has been assigned identifiers by Rod Page's 
BioStor, new literature usually has an assigned, dereferenceable proxied 
DOI, and one can even make valid URIs from ISBNs of books (although they 
aren't resolvable).  I'm not sure how one should address the situation 
where the "sec." reference of a TNU is a person and date since there 
isn't a standard database of people (as far as I know).  But that could 
be remedied.  Ultimately, one could create the kinds of mapping tools 
that Nico and Rich are talking about which relate different taxon 
concepts/TNUs which have set theory relationships.  Whether that would 
be done with RDF, OWL, or something completely different I don't know, 
but the basic anchoring of persistent identifiers for the TNU/concepts 
to the names and sec. references wouldn't have to wait on that.  We 
could also get hung up about what terms to use to express the metadata 
describing the basic TNU/name/sec. resources, but there is nothing that 
says that metadata can't change or be improved over time.  It's the 
identifier that shouldn't change. 
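
To make the identifier options above concrete, here is a small sketch of how each kind of URI could be built. The proxy addresses are assumptions: http://lsid.tdwg.org/ is the LSID proxy as I understand the GUID Applicability Statement to recommend it, https://doi.org/ is the usual DOI proxy, and the example.org base and the sample LSID and ISBN are hypothetical.

```python
import uuid


def uuid_uri(base="http://example.org/tnu/"):
    """Mint an HTTP URI from a freshly generated opaque UUID (hypothetical base)."""
    return base + str(uuid.uuid4())


def proxied_lsid(lsid, proxy="http://lsid.tdwg.org/"):
    """Prefix an LSID with an HTTP proxy so that it can be dereferenced."""
    if not lsid.lower().startswith("urn:lsid:"):
        raise ValueError("not an LSID: " + lsid)
    return proxy + lsid


def proxied_doi(doi, proxy="https://doi.org/"):
    """Turn a bare DOI such as 10.9999/xyz into a dereferenceable HTTP URI."""
    return proxy + doi


def isbn_urn(isbn):
    """Build a valid (but not resolvable) URN from an ISBN."""
    return "urn:isbn:" + isbn.replace("-", "").replace(" ", "")
```

For example, proxied_lsid("urn:lsid:example.org:names:123") gives "http://lsid.tdwg.org/urn:lsid:example.org:names:123", and isbn_urn("978-0-00-000000-2") gives "urn:isbn:9780000000002". Whichever form is chosen, the string that comes out is the part that should never change afterwards.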

Am I wrong about this???

Steve

Nico Franz wrote:
...
Thank you, Rich.
So we seem to agree on something like this:
Rich                                    Nico
taxon name usage   <===>   "shallow" taxonomic concept
taxon concept         <===>   "deep" taxonomic concept
Both: labeling is via name sec. author
Both: authoring concepts/usages vs. identifying to those => slippery 
issue; ideally requires proper speaker awareness.
Why the latter? - well, because (again) the desirable effect of 
using concepts - the desirable situation where these would have a 
justification that goes beyond just really meticulous data management 
and advances to the level of facilitating better science qua more 
precise taxonomic semantics - only obtains if a great number of name 
occurrences in a wide range of shallow-ish sources is linked via 
identification to a presumably smaller number of occurrences where 
those names are well defined and where successive definitions of names 
are semantically linked. So there needs to be an emerging culture of 
minimizing concept inflation. Otherwise we obtain what we have now 
(mostly just names) and on top of that add new baggage (lots of really 
shallow concepts) that nobody can do good semantics with.
Here is where I think we disagree, perhaps just in terms of sales 
strategy:
You seem to suggest that making an a priori distinction between 
TNUs and concepts is (1) possible in a good number of cases, (2) is 
desirable perhaps in the form of registry, and (3) even necessary for 
building and populating databases, etc.
Here I disagree, for a number of reasons. First off I do believe 
that not defining certain things too soon or too narrowly is sometimes 
actually really good science and on the other hand, doing so can be a 
show stopper if other people don't share this narrowness and find it 
limiting. Second, while we can perhaps readily agree that a lengthy 
monograph published in American Museum Novitates rises to the level of 
authoring new concepts whereas a label saying "Family Carabidae" on a 
specimen submitted as part of an insect student collection does not, 
there are enough in-between cases where only time will tell.
Example: USDA Plants promotes a particular perspective of 
groundcherry taxonomy, genus-level concept Physalis - 
http://plants.usda.gov/java/profile?symbol=physa - with some 29 
species-level concepts recognized. ASU's herbarium curator Les Landrum 
is a bit of a groundcherry nerd (I say this with admiration). If you 
go here: http://swbiodiversity.org/seinet/index.php, then Search 
Collections => Select All => Next => Scientific Name = Physalis => 
Search, you get some 3700 pertinent specimen records. If you then 
switch to the Species List tab, you see 115 concepts listed overall. 
Switching to the USDA Plants Thesaurus will give you only 46 concepts 
that these 3700 specimens are mapped to. Using instead the ASU 
Taxonomic Thesaurus will yield 89 concepts linking variously to those 
specimens. This is based on Les' classification of groundcherries 
which is not further documented in the SEINet environment at this moment.
Now, saying a priori whether Les' list represents a set of TNUs 
versus concepts would presumably require you to assert that there is 
nobody who is Les or very much like him that can provide a 
semantically very accurate mapping of the 89 name usages in the 
SEINet-ASU Physalis list to the much more thoroughly circumscribed 
USDA Plants concepts. That could seem like a daring prediction given 
how little Les might think of the USDA perspective. At the very moment 
that Les or someone very much like him DOES provide the mapping, what 
looked like a list of TNUs then all of a sudden acquires - via the 
mapping - a much deeper semantic status where others can readily go 
from one classification to the next, even though each comes with very 
different amounts of information in its original appearance. Some 
people may prefer Les' concepts at least for Arizonan groundcherries, 
and in either case, the mapping puts both on an even playing field.
So this exemplifies IMO why so far the concept approach has been 
too abstract, the TCS has been too depauperate on the 
relationships/mapping side (worrying instead almost needlessly about 
what constitutes a concept per se), and definitions between 
identifications, name usages, shallow, deep concepts have been too 
abstract as well. I believe we should focus less discussion on those 
issues and more emphasis on building mapping tools that can carry a 
wide range of input and logically infer additional implied mappings 
from the initial expert-given set. The actual semantic properties of 
that input will emerge a posteriori and will be hard to predict in 
some cases. Some descriptions are lengthy but nobody understands them. 
Some name lists are profoundly informative if the context of their 
origin is well known to an expert.
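
A minimal sketch of what "inferring additional implied mappings" could look like, assuming only two relationship types, congruence ("==") and proper inclusion ("<"), and a small composition table. The concept labels are placeholders, not real SEINet or USDA Plants records, and a real tool would need the full set of set-theory relationships.

```python
# How two chained relationships combine: (A r1 B) and (B r2 C) imply (A r C).
COMPOSE = {
    ("==", "=="): "==",
    ("==", "<"): "<",
    ("<", "=="): "<",
    ("<", "<"): "<",
}


def infer(mappings):
    """Return the mappings implied by, but not present in, the expert-given set."""
    known = set(mappings)
    changed = True
    while changed:                      # repeat until nothing new is derived
        changed = False
        for (a, r1, b) in list(known):
            for (b2, r2, c) in list(known):
                if b != b2 or a == c:
                    continue
                r = COMPOSE.get((r1, r2))
                if r and (a, r, c) not in known:
                    known.add((a, r, c))
                    changed = True
    return known - set(mappings)


# Placeholder expert input: one species-level mapping, one parent relationship.
expert = {
    ("species X sec. ASU", "==", "species X sec. USDA"),
    ("species X sec. USDA", "<", "genus sec. USDA"),
}
implied = infer(expert)
# implied == {("species X sec. ASU", "<", "genus sec. USDA")}
```

The sketch does not add the reverse direction of "==" automatically; an expert-given set would have to state both directions, or the tool would need a symmetry rule.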
There will be some obvious overreaches in both directions (too many 
unconnected items, some items that are connected more precisely than 
their inherent information would seem to justify). I think these 
overreaches would be tolerable. What's less productive to me is a 
restrictive set of definitions that provide an early blockage in the 
way towards an environment where mapping is supposed to occur very 
frequently. We're not at the registry stage yet. More at the "can this 
work in principle" stage. As I mentioned before, the mappings ARE the 
concepts under a certain viewpoint. We don't want to pre-determine 
their fate by separating TNUs from concepts in a great number of cases.
I hope this was not a misrepresentation of your view and also a 
clarification of my view. In the end, we both advocate some sort of 
balance for the same concerns, but perhaps disagree only strategically 
about the moment where/when that balance will materialize - upfront 
via precise definitions and registration or later on via the 
presence/lack of actual mappings.
Best,
Nico
On Mon, Nov 26, 2012 at 5:18 PM, Richard Pyle 
<deepreef@bishopmuseum.org> wrote:
I want to get into this topic in more detail (going back to
    Steve’s original post), but this week is hell-week for me, so only
    a quick comment now.
I generally agree with everything Nico says, but I think we need
    to be a little clearer about what we mean by “name sec. author”.
The core unit of the data model we’ve been building towards (GNUB,
    which underlies ZooBank) uses as its fundamental unit something
    we’ve been calling a “Taxon Name Usage Instance” (TNU).  The scope
    of what can be a TNU is intentionally very broad – anything from
    an original taxon name description, to a mention in a newspaper
    article, and potentially even a scribbled hand-written label or
    letter.  The only requirement is that it be static – that is, a
    snapshot in time.  I mention this because database records can be
    represented as TNUs, but only as a static snapshot of the record. 
    If the essence of the database record changes over time (e.g., due
    to changing taxonomic opinion), then a new TNU is generated for a
    different snapshot in time.
A very small subset of the universe of TNUs represent
    Code-governed Nomenclatural Acts (original descriptions of new
    names and other code-governed nomenclatural actions). In the case
    of such TNUs involving the ICZN Code (for example), the TNUs are
    registered in ZooBank.  But the point is, one subset of all TNUs
    are those that involve actions governed by a Code of nomenclature.
The reason I mention this is that, if I read Nico’s email
    correctly, I think he’s saying that not all TNUs de-facto
    represent taxon concepts.  Rather, analogous to the nomenclatural
    subset of TNUs, there is a subset of TNUs that rise to the level
    of representing Taxon Concept definitions.  In the case of
    nomenclatural acts, someone must make some sort of declaration
    (assertion) that a specific TNU constitutes a Code-governed
    nomenclatural act, along with relevant metadata relating to that
    assertion and the nature of the Act.  In the case of zoological
    names, ZooBank is intended to facilitate this role (i.e., when a
    person registers a TNU in ZooBank, there is an implied assertion
    that the TNU represents a nomenclatural act under the ICZN Code).
What would be nice to have (and what TDWG could play a helpful
    role in facilitating), is a registry of sorts (analogous to
    ZooBank) for those TNUs that represent taxon concepts.  In other
    words, a mechanism for people to “register” the subset of all TNUs
    that represent taxon concepts. Secondarily, there would also be a
    mechanism to make assertions about how registered taxon concepts
    map to each other (via some sort of set theory relationship[s]).
In summary, my points are
1)      We should be clear when we say “name sec. author” whether
    we mean it sensu lato (i.e., all TNUs); or sensu stricto (i.e.,
    only those TNUs that rise to the level of representing taxon
    concepts).
2)      There ought to be a registry (perhaps administered by
    CoL?) for identifying the subset of TNUs that represent concept
    definitions, and it should include a mechanism for making
    set-theory relationship assertions among registered concept-TNUs.
3)      The two things mentioned in #2 should be separate; that
    is, one can assert that a particular TNU represents a taxon
    concept separately from (potentially multiple) assertions about
    how that taxon concept relates to other taxon concepts.
More later.
Aloha,
Rich
P.S. By my standards that WAS quick!
-- 
Steven J. Baskauf, Ph.D., Senior Lecturer
Vanderbilt University Dept. of Biological Sciences

postal mail address:
VU Station B 351634
Nashville, TN  37235-1634,  U.S.A.

delivery address:
2125 Stevenson Center
1161 21st Ave., S.
Nashville, TN 37235

office: 2128 Stevenson Center
phone: (615) 343-4582,  fax: (615) 343-6707
http://bioimages.vanderbilt.edu

Steve Baskauf