[tdwg-content] ITIS TSNID to uBio NamebankIDs mapping

Fri Jun 3 22:17:00 CEST 2011

Rich (et al.)... Just a quick comment re ITIS TSNs, since Rich posited:
=======================
My understanding (David N.: correct me if I'm wrong), is that all TSNs that correspond to "valid/accepted" names (where [taxonomic_units].[usage]='valid'|'accepted') essentially represent a taxon concept.  The rest of the TSNs (where [taxonomic_units].[usage]='invalid'|'not accepted') represent a variety of things, ranging from different combinations to alternate spellings to subjective synonyms, each of which is referable to one of the "valid/accepted" names.
=======================

I would say "sort of".... The TSNs do not themselves correspond with much of anything other than a unique, persistent, non-intelligent identifier for a "scientific name" (I realize that begs your next question/point of what that term "means") record in the context of the ITIS data system. See this linked from the "About ITIS" page:
http://www.itis.gov/pdf/faq_itis_tsn.pdf

Re "Scientific Name", as you hopefully see in the above document, the term in ITIS generally corresponds to what I see from the ICBN use (Art. 16-24) and the ICZN use (Art. 4-5 in particular, as the "combination" formation, rather than the more atomized uses like "specific name" which is like "epithet"). There are of course other things in ITIS with TSNs, like database artifacts, that are labeled as such and retained but hidden from most users, both to avoid confusion and so as not to strand any user who might already have the TSN.

By way of an example of the use of "name" fields.... Just recently I was given a nice "finished" world dataset for a modest animal-family-that-shall-remain-nameless, and the "name" fields were in some cases just as ITIS uses them, and in others there were additional things like authorship and so on lumped in with the name parts in those fields, though there were no years provided even then. So, usable, but the amount of work to essentially re-parse the data was surprising for just a couple hundred names, and even then they were inconsistent and incomplete, so someone now has to go collect all the missing details and go over it all again, and it clearly needs some smoothing around the edges as well. That was for just 200+ names from a single source. Ugh, thanks....

As to the relationship to taxon concept, if you squinted your eyes "just so" you could qualify as Rich did above and suggest that those TSNs that happen to represent names with usage=valid/accepted (and preferably those with some level of verification indicated, vs. the legacy data we're still dealing with!)  "essentially represent a taxon concept", but I don't really think that is appropriate at this point.... actually the closest thing in ITIS to a "taxon concept" would be certain entries in the reference_links table (the intersection between the scientific names entries and the reference entries), but even that is too abstract in my view. Since any number of references may be linked to a single TSN, that TSN won't necessarily yield something that maps to "a taxon concept" unless you're thinking "sensu ITIS v2011-05-31" or something of that ilk, which is I guess another way to think about it, with its own pros/cons.
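
As a rough illustration of the distinction above (a sketch only: the tables are toy in-memory stand-ins, the column names are assumptions rather than the actual ITIS schema, and the names and document IDs are invented), one can partition TSNs by usage and count how many references hang off each one:

```python
import sqlite3

# Toy stand-in for two ITIS tables. Column names and all rows are
# assumptions made for illustration, not real ITIS content.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE taxonomic_units (tsn INTEGER PRIMARY KEY,
                              complete_name TEXT,
                              "usage" TEXT);
CREATE TABLE reference_links (tsn INTEGER, doc_id TEXT);
INSERT INTO taxonomic_units VALUES
  (1, 'Aus bus', 'valid'),
  (2, 'Xus bus', 'invalid'),
  (3, 'Aus cus', 'accepted');
INSERT INTO reference_links VALUES
  (1, 'PUB-1'), (1, 'PUB-2'), (1, 'EXP-1'), (3, 'PUB-3');
""")


def tsns_by_usage(conn):
    """Split TSNs into valid/accepted ones and everything else."""
    rows = conn.execute(
        'SELECT tsn, "usage" FROM taxonomic_units ORDER BY tsn'
    ).fetchall()
    current = [tsn for tsn, usage in rows if usage in ("valid", "accepted")]
    other = [tsn for tsn, usage in rows if usage not in ("valid", "accepted")]
    return current, other


def reference_counts(conn):
    """Count how many reference_links rows point at each TSN."""
    rows = conn.execute(
        "SELECT tsn, COUNT(*) FROM reference_links GROUP BY tsn ORDER BY tsn"
    ).fetchall()
    return dict(rows)
```

In this toy data, TSN 1 is "valid" yet carries three linked references, so the TSN on its own does not single out one circumscription; only a (TSN, reference) pair comes closer to that.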

And I agree with Rich's warnings of many pitfalls below (dragons and such).

I'll leave it there. Oops. So much for the "quick" comment....

Best,
Dave

David Nicolson
Data Development Coordinator, Integrated Taxonomic Information System
Biologist, USGS Core Science Systems, Biological Informatics Program
nicolsod at si.edu    Office 202-633-2149    Fax 202-786-2934
http://www.itis.gov/
http://www.cbif.gc.ca/itis/
"Nihil sumas necesse est..."

-----Original Message-----
From: Richard Pyle [mailto:deepreef at bishopmuseum.org] 
Sent: Friday, June 03, 2011 3:16 PM
To: 'Steven J. Baskauf'; 'Kevin Richards'
Cc: tdwg-content at lists.tdwg.org; Orrell, Thomas; 'Alan J Hampson'; Nicolson, David; 'Gerald Guala'
Subject: RE: [tdwg-content] ITIS TSNID to uBio NamebankIDs mapping

Hi All,

I'm just catching up on email now, after a series of other work-related
obligations and virtual attendance at a cybertaxonomy/e-literature meeting
in Chicago this week.  I do not now have time to review the entire thread,
so I'll jump into the stream with Steve's recent post.

> I think that one reason why this question has been on my mind is that I've
> been waiting for GNUB (Global Name Use Bank) to come out.

Just a quick update: due to budgetary woes in the U.S. Federal Government,
NSF funding for awarded proposals has been pushed ever further back.  If
I'm not mistaken, something like 18 months passed between proposal
submission and availability of funds for the BiSciCol grant, which our
institution was only able to (finally!) start processing within the past few
months.  Why is this relevant to GNUB?  Because the BiSciCol grant includes
the most substantial funding yet for implementation of GNUB (indeed, the
only funding for GNUB by name). The good news is that, now that funding is
in hand and money (finally) flowing, development & implementation of GNUB is
ramping up quickly.  And the promise of more (and more substantial) funding
is just around the corner (watch this space).

> I'm not really up on how it is going to work, but my impression is that it
> was going to be based on the Global Name Index (GNI) which was mentioned in
> that earlier January thread.

Not exactly.  GNI and GNUB represent two ends of a spectrum.  GNI is at the
"minimal metadata/maximal content" end of the spectrum -- basically a
repository of any text-string purported to represent a taxon name that can
be linked via a resolvable identifier.  GNUB is at the "richly
metadata'd/carefully curated" end of the spectrum, representing a highly
normalized structure with permanent resolvable GUIDs and the potential for
robust information/data services.  In the vernacular, GNI is the "dirty
bucket", and GNUB is the "clean bucket". At the moment, the connection
between GNUB and GNI is unidirectional, in that the content of the
progenitor of GNUB has been indexed in GNI, but there is no mechanism (yet)
for GNI content to feed into GNUB.  The reason for this is fairly
straightforward: it's very easy to flatten out normalized content into
simple text strings (GNUB-->GNI), but it's much more difficult (impossible?)
to migrate metadata-poor, moderately parsed content into a highly structured
system.
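
A minimal sketch of why the flattening direction is the easy one (the `StructuredName` record and its fields are hypothetical and far simpler than anything GNUB would actually hold):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StructuredName:
    """Hypothetical, heavily simplified stand-in for a normalized record."""
    genus: str
    epithet: str
    author: Optional[str] = None
    year: Optional[int] = None


def flatten(name: StructuredName) -> str:
    """Collapse the structured record into a single text string."""
    text = f"{name.genus} {name.epithet}"
    if name.author:
        text += f" {name.author}"
        if name.year is not None:
            text += f", {name.year}"
    return text
```

Going the other way means guessing, from a bare string such as "Aus bus Smith", which tokens are name parts and which are authorship, and nothing in the string itself settles that.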

> At that point, the GNI names didn't have any identifiers that were exposed
> to the public as permanent GUIDs.  I'm assuming that if GNUB refers to GNI
> names, they will have some kind of identifiers.  So if that happens how is the
> GUID recommendation 8 going to be followed?  As Kevin said in
> http://lists.tdwg.org/pipermail/tdwg-content/2011-June/002499.html "What I
> take from recommendation 8 of the GUID applicability guide ... is that if you
> DON'T already have a record in your own database for a taxon name/concept, then
> reuse an existing one."  What we have here with GNI is a situation where
> none of the records have identifiers.  In my mind, the "best practice"
> according to recommendation 8 would be for the GNI to reuse existing identifiers
> where they exist and NOT make up new ones.  This is a bit more complicated
> because the ITIS identifiers (which are in common use) don't have an http URI
> version that is resolvable, and while the uBio identifiers have a resolvable
> http URI, it's in the form of a proxied LSID, which I've already complained is
> very ugly.  So I'd like to hear some ideas about how to have "reused"
> identifiers in the GNI.

In terms of GUIDs, the objects in GNUB and the objects in GNI are not the
same, and therefore cannot share identifiers.  The core object in GNI is a
text-string.  Indeed, the text string itself can be the actual identifier,
because it *is* the thing being identified.  In other words, because the
essential uniqueness of an instance (record) in GNI by definition *is* the
text string (i.e., the series of UTF-8-encoded characters), then that text
string represents a perfectly suitable unique identifier.  There is no need
to generate a surrogate identifier like an integer number or UUID or LSID or
whatever (except, perhaps, for internal use as a primary key for joining
tables; but those identifiers need not/should not be exposed to the outside
world).
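
To make that concrete, here is a small sketch under stated assumptions: the function names and the namespace value are invented for illustration, and nothing here is claimed to be how GNI is actually implemented. The external identifier is simply the UTF-8 bytes of the string, while a surrogate key is derived deterministically for internal joins only:

```python
import uuid

# Hypothetical namespace for internal keys; any fixed UUID would do.
_INTERNAL_NS = uuid.uuid5(uuid.NAMESPACE_DNS, "names.example.org")


def gni_identifier(name_string: str) -> bytes:
    """The identity of the record is the exact UTF-8 byte sequence."""
    return name_string.encode("utf-8")


def internal_key(name_string: str) -> uuid.UUID:
    """Deterministic surrogate key for table joins; not meant for exposure."""
    return uuid.uuid5(_INTERNAL_NS, name_string)
```

Because the surrogate is a pure function of the string, it adds no information of its own, which is the sense in which exposing it buys nothing.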

By contrast, the core object in GNUB is a taxon name usage instance -- which
is a purely abstract notion of the usage of a taxon name within some
documentation source (like a publication).  In this case, the text-string
name is merely a property of the GUID-identified object, and would be an
extremely BAD choice to use as a unique identifier.  This is why GNUB needs
to generate a unique identifier to represent this core data object.  The
form that identifier takes (UUID, LSID, integer, DOI, whatever) from the
perspective of the end user should be completely irrelevant, because the
user should rarely (if ever) see it, and should certainly *never* be in a
position to type it on a keyboard (we can discuss the appearance of ZooBank
LSIDs on printed pages separately). All that matters is that it is a
persistent, globally unique identifier that can be used to cross-link
information and can be conveniently resolved to the metadata of the object
it represents.
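
A sketch of the contrast, assuming a deliberately minimal, hypothetical `NameUsage` record (the field names and the choice of a random UUID are illustrative only, not GNUB's actual design):

```python
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class NameUsage:
    """Hypothetical usage instance: a name as used in one source."""
    name_string: str   # merely a property of the object
    source: str        # e.g. a publication; placeholder values below
    usage_id: uuid.UUID = field(default_factory=uuid.uuid4)  # opaque identifier


# Two usages of the same text string in two (invented) sources.
usage_a = NameUsage("Aus bus L.", "Hypothetical Publication A")
usage_b = NameUsage("Aus bus L.", "Hypothetical Publication B")
```

The two records share a name string but are distinct objects with distinct identifiers, which is why the string cannot serve as the key here.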

But the point is, recommendation 8 of the GUID applicability guide is not
being violated in the context of GNI and GNUB.

The real problem in all of this is the inconsistent meaning people apply to
the notion of a "taxon name".  In GNI-space, the name is simply a text
string.  In GNUB-space, the "name-object" is a code-compliant Protonym that
serves to cross-link Name-usages to each other.  ITIS is different still.
My understanding (David N.: correct me if I'm wrong), is that all TSNs that
correspond to "valid/accepted" names (where
[taxonomic_units].[usage]='valid'|'accepted') essentially represent a taxon
concept.  The rest of the TSNs (where
[taxonomic_units].[usage]='invalid'|'not accepted') represent a variety of
things, ranging from different combinations to alternate spellings to
subjective synonyms, each of which is referable to one of the
"valid/accepted" names.  CoL uses names as proxies to taxon concepts (not
sure how they handle synonyms vs. misspellings, etc.)  And there are other
variations as well -- to most botanists, "Aus bus L." and "Xus bus (L.)
Smith" represent "different names", whereas most zoologists (who would
not bother to include the "Smith") regard them as "different
combinations of the same name" (zoologists are less consistent than
botanists in this regard).

The point is, this inconsistency and heterogeneity of what is meant by a
"name" in taxonomy is, in my opinion, the single GREATEST obstacle in
achieving informatics harmony among biodiversity datasets.

> One thing that comes to my mind would be to have a "domain name" like 
> "http://purl.org/gni/" or "http://purl.org/tn/" ("tn" for "taxon name")
> and to follow it with a namespace/id combination similar to what is done with
> lsids.  So for example "itis/19408" and "ubio/448439" could be appended,
> creating http://purl.org/gni/itis/19408 and http://purl.org/gni/ubio/448439
> for "Quercus rubra L."  Both URIs could point to the same RDF and that RDF
> could indicate that the two identifiers are owl:sameAs.

This syntax is basically what ZooBank does (and GNUB will do), within their
own domain name.  But I like the idea of a common URL domain that allows
these qualified identifiers to be appended.  
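
A minimal sketch of that qualified-identifier syntax follows. The base URL is the one floated in the quoted message, not a service known to exist, and the function names are invented for illustration:

```python
from urllib.parse import quote, unquote

# Base proposed in the quoted message; treated here as a plain prefix.
BASE = "http://purl.org/gni/"


def qualified_uri(namespace: str, local_id, base: str = BASE) -> str:
    """Append a namespace/id pair to a shared base URL."""
    return f"{base}{quote(namespace, safe='')}/{quote(str(local_id), safe='')}"


def split_qualified_uri(uri: str, base: str = BASE):
    """Recover (namespace, local_id) from a URI built by qualified_uri."""
    if not uri.startswith(base):
        raise ValueError(f"not under {base}: {uri}")
    namespace, _, local_id = uri[len(base):].partition("/")
    return unquote(namespace), unquote(local_id)
```

For example, `qualified_uri("itis", 19408)` yields `http://purl.org/gni/itis/19408`. Note that building the URI says nothing about what kind of object the issuing database means by that identifier.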

The real problem is what you describe next:

> I realize from what Bob Morris has cautioned in the past that there are
> problems with owl:sameAs when the two things aren't actually the same thing
> (e.g. if the uBio ID refers to a name string only but the ITIS TSN refers to
> the name plus an "accepted" status and a relationship to parent taxa).

Do NOT underestimate the significance of this point.

> However, if there were an understanding that the GNI only refers to name
> strings, then one could still refer to http://purl.org/gni/itis/19408 as an
> identifier for the name string of the thing (whatever it is) that is referred
> to by an ITIS TSN of 19408.

Here be dragons -- for lots of reasons.  At this point, you might as well
just do a text-string match on the name.  The problem is, you'll miss the
match if authorship is not identical, but you risk a homonymy mismatch if
authorship is not included.
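
Both failure modes can be seen in a few lines. This is a sketch only: the names are placeholders, and stripping authorship by keeping the first two tokens is a deliberately naive assumption that would not survive real data.

```python
def exact_match(a: str, b: str) -> bool:
    """Match on the full string, authorship included."""
    return a == b


def canonical(name: str) -> str:
    """Naive authorship stripping: keep the first two whitespace tokens."""
    return " ".join(name.split()[:2])


def canonical_match(a: str, b: str) -> bool:
    """Match on the name with authorship dropped."""
    return canonical(a) == canonical(b)


# Same name, authorship written two ways: the exact match misses it.
missed = not exact_match("Aus bus L.", "Aus bus Linnaeus")

# Homonyms by different authors: the authorless match conflates them.
conflated = canonical_match("Aus bus L.", "Aus bus Smith")
```

Here both `missed` and `conflated` come out true, so neither matching rule is safe on its own.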

> I have no idea whether this would be a good idea or not, but I was really
> cringing to think about 19 million newly minted UUIDs appended to
> "http://gni.globalnames.org/" and figuring out how to connect those horrid
> things to the names and ITIS TSNs that I'm already using.  I think that I said
> this before, but using the purl.org domain rather than one like
> http://gni.globalnames.org/ would in the future allow somebody else to take
> over management of providing the metadata when the GUIDs are resolved without
> having to deal with issues of who "owns" the domain name.

As I said before, I think it's perfectly fine to generate UUIDs for internal
purposes within GNI for various performance reasons (or whatever), but I
don't think it's wise to expose those UUIDs externally.  Because the
uniqueness of a GNI record *is* the text string, then it makes more sense to
me to simply use the text string. However, that only works for
GNI/uBio/NameBank, where the essence of the record *is* the text string.
It's a non-starter for other datasets like GNUB, ITIS, CoL, and most others,
where the essence of the record is something altogether different.

Aloha,
Rich

Richard L. Pyle, PhD
Database Coordinator for Natural Sciences
Associate Zoologist in Ichthyology
Dive Safety Officer
Department of Natural Sciences, Bishop Museum
1525 Bernice St., Honolulu, HI 96817
Ph: (808)848-4115, Fax: (808)847-8252
email: deepreef at bishopmuseum.org
http://hbs.bishopmuseum.org/staff/pylerichard.html