[tdwg-content] ITIS TSNID to uBio NamebankIDs mapping

Richard Pyle deepreef at bishopmuseum.org
Tue Jun 7 09:09:45 CEST 2011


>> By contrast, the core object in GNUB is a taxon name usage instance --
>> which is a purely abstract notion of the usage of a taxon name within some
>> documentation source (like a publication).  In this case, the text-string
>> name is merely a property of the GUID-identified object, and would be an
>> extremely BAD choice to use as a unique identifier.  
>
> It is possible that I'm not understanding what you are saying here, but if you
> are saying that the only name-related property of your GNUB taxon
> instances will be one which has a name string literal as its object,

Goodness, no!!  The point I was making was that for GNI, the name-string
*is* the object.  For GNUB, the name-string is merely one (of MANY)
properties of the object.

> That will require any client using your taxon instance metadata to re-process
> the literal name string to cross reference it with lexical variants, parse it into
> its pieces, etc.

No -- that's definitely NOT the case.  GNUB is highly
normalized/atomized/parsed.

> That should only need to be done once and then referenced via a 
> GUID for the name (i.e. in the sense of tn:TaxonName).  

Yes, but the name-string is only one of the properties.  Other properties
include most of the other elements in dwc:Taxon (and more).

>> This is why GNUB needs
>> to generate a unique identifier to represent this core data object.  The
>> form that identifier takes (UUID, LSID, integer, DOI, whatever) from the
>> perspective of the end user should be completely irrelevant, because the
>> user should rarely (if ever) see it, and should certainly *never* be in a
>> position to type it on a keyboard (we can discuss the appearance of ZooBank
>> LSIDs on printed pages separately).
>
> OK, again maybe I'm not understanding what you are saying here, but 
> if you are saying that you don't intend to expose your unique GNUB 
> identifiers to the public, then as far as I'm concerned you are setting up
> GNUB to be irrelevant from the start.

Let me clarify: Obviously, GNUB identifiers will be fully exposed to the
"public", in the sense that anyone who WANTS to see them (developers, IT
specialists, hard-core name nerds, etc.), will be able to see them.  In
fact, anyone who wants a replicate copy of the ENTIRE dataset, including all
Identifiers, raw tables, etc., will be able to do so.  The idea is that you
can download a snapshot of the entire database (all tables in their native
structure; not dumbed down or flattened), and then set up a simple
replication service that allows your local copy to automatically stay in
synch with the "master" copy/ies. So yes, anyone who wants access to the
identifiers has full access to them.

The point I was making was that most end-users won't care what the
identifier is, or what kind it is, or how beautiful or ugly it is, or
whatever.  A good analogy is DNS:  All users ever see is "google.com".  They
never see "74.125.224.176" (which google.com maps to from my machine at this
moment). But the "ugly" "74.125.224.176" is what actually identifies the
server to which google.com takes you.  Analogously, users should only ever
see "Danaus plexippus (Linnaeus 1758)"; they should never need to see
"A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523".

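The analogy can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the NAME_INDEX table, the resolve_name and display_label functions, and the pairing of this particular name-string with this particular UUID are all hypothetical, and none of it describes an actual GNUB service.

```python
import uuid

# Hypothetical in-memory index: human-friendly name-string -> opaque identifier.
# The pairing below is illustrative only.
NAME_INDEX = {
    "Danaus plexippus (Linnaeus 1758)": uuid.UUID("A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523"),
}


def resolve_name(name_string):
    """Return the identifier for a name-string, or None if it is not indexed."""
    normalized = " ".join(name_string.split())  # collapse stray whitespace
    return NAME_INDEX.get(normalized)


def display_label(identifier):
    """Reverse lookup: the text a human should be shown for an identifier."""
    for label, guid in NAME_INDEX.items():
        if guid == identifier:
            return label
    return None


guid = resolve_name("Danaus  plexippus (Linnaeus 1758)")
print(guid)                 # what machines exchange
print(display_label(guid))  # what people read
```

The design point is the same as with DNS: the lookup layer absorbs the human-facing form, so the identifier itself never has to be typed or read.
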
> You mention a number of cool taxonomist-geek type things that you hope 
> to accomplish with GNUB.  But from my perspective as a non-taxonomist-geek,
> the main purpose I have for GNUB is as a place to anchor dwc:Identification instances
> so that I can indicate whether my identified resource is a representative of the same
> taxon that is being referred to by somebody else (or at least to make it possible for
> somebody to figure that out via computery cleverness, Semantic Web or otherwise).

Yes, exactly.  But remember, GNUB is just an index to information.  You will
anchor your dwc:Identification instances to GNUB identifiers, which will give
you a precise indication of the concept that was used for the
Identification.  For example, the field guide or taxonomic key that the
taxonomist used to make the identification in the first place. No
information on the field guide or key that was used when applying the
identification to the occurrence?  No problem -- generate a new TNU,
"authored" by the person/entity making the identification, and voila! You're
now plugged into the GNA matrix.  What does that give you? Well...a few
immediate options include:

- Access to the full literature citation and other nomenclatural details for
the name;
- Access to all other usages of that name, including variant spellings,
combinations, synonymy treatments, etc.
- Access to all other resources of relevance that are also plugged into the
GNA "matrix".

But more to your interests: 

- Cross-linking to other usage instances in a way that allows you to figure
out, via computery cleverness, whether you and someone else are referring
to the same taxon concept.  This little piece of magic can happen in two
ways:

1) Implicitly. By comparing other usages in the contexts of their collective
synonymies.  For example, suppose RefA and RefB both treated "Aus bus" and
"Aus xus" as two distinct species.  RefC treated "Aus xus" as a
heterotypic/junior synonym of "Aus bus".  If your identification of a
specimen as "Aus bus" links into the TNU associated with RefA, then
implicitly we can say that it's (likely to be) congruent with the concept
represented by the TNU for that name of RefB; but may or may not be
comparable to the concept represented by RefC. This is an example of
addressing the "many concepts for one name" problem. Conversely, suppose
your specimen identification is linked to the TNU for RefC.  In that case,
we can infer that your concept of "Aus bus" could apply to representatives
of either "Aus bus" *or* "Aus xus" as cited in RefA and RefB. This is
addressing the "many names for one concept" problem.  These are just two
very simple examples (of many possible examples) of the sort of computery
cleverness that can be used to infer implicit concept-mapping among TNUs.
Obviously, there are assumptions and caveats and such -- but it's still
better (a LOT better) than trying to make inferences based on the
text-string name only.
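
The RefA/RefB/RefC example can be sketched in Python as follows. This is a deliberately crude illustration, not how GNUB does it: it assumes each TNU can be reduced to the set of names its reference includes under the accepted name, and the TREATMENTS table, the compare function, and the wording of the result labels are all hypothetical.

```python
# Each reference maps an accepted name to the set of names it treats as that
# taxon (the accepted name plus any synonyms). Data follow the RefA/RefB/RefC
# example in the text.
TREATMENTS = {
    "RefA": {"Aus bus": {"Aus bus"}, "Aus xus": {"Aus xus"}},
    "RefB": {"Aus bus": {"Aus bus"}, "Aus xus": {"Aus xus"}},
    "RefC": {"Aus bus": {"Aus bus", "Aus xus"}},  # xus sunk into bus
}


def names_included(name, ref):
    """Names that a given usage (name sec. ref) covers."""
    return TREATMENTS[ref][name]


def compare(tnu_a, tnu_b):
    """Hedged, name-set-based guess at how two usages relate."""
    a = names_included(*tnu_a)
    b = names_included(*tnu_b)
    if a == b:
        return "likely congruent"
    if a > b:
        return "possibly broader"
    if a < b:
        return "possibly narrower"
    if a & b:
        return "overlapping"
    return "no shared names"


print(compare(("Aus bus", "RefA"), ("Aus bus", "RefB")))
print(compare(("Aus bus", "RefA"), ("Aus bus", "RefC")))
print(compare(("Aus bus", "RefC"), ("Aus xus", "RefA")))
```

The labels are hedged on purpose ("likely", "possibly"), because matching synonymies suggest, but do not prove, matching circumscriptions.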

2) Explicitly. In the same way that TNUs can serve as the "molecules" behind
nomenclatural services (like ZooBank, Index Fungorum, and possibly
IPNI/APNI/Tropicos, if/when they embrace GNA), these TNU molecules can also
underpin taxon concept services, such as those represented in TCS
RelationshipAssertions.  In other words, there can be a structure/service
that sits on top of GNUB that allows explicit declarations of the sort: TNU1
represents a concept circumscription that is congruent with TNU2; etc.
These third-party assertions about concept-concept mappings could provide a
very valuable service for making inferences involving both
many-names-for-same-concept issues and many-concepts-with-one-name issues,
presumably with greater precision and reliability than the implicit
mappings.
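
A service of that kind could be sketched in Python roughly as below. The assertion tuples, the relation labels "congruent" and "includes", the asserter names, and the congruent_set function are all hypothetical. The sketch also assumes that congruence can be treated as symmetric and transitive, which may not be safe when the assertions come from different third parties.

```python
from collections import defaultdict, deque

# Hypothetical third-party assertions: (subject TNU, relation, object TNU, asserter)
ASSERTIONS = [
    ("TNU1", "congruent", "TNU2", "curator-x"),
    ("TNU2", "congruent", "TNU3", "curator-y"),
    ("TNU4", "includes", "TNU1", "curator-x"),
]


def congruent_set(start, assertions):
    """All TNUs reachable from `start` through 'congruent' assertions."""
    graph = defaultdict(set)
    for subject, relation, obj, _asserter in assertions:
        if relation == "congruent":
            graph[subject].add(obj)
            graph[obj].add(subject)  # assumption: congruence is symmetric

    seen = {start}
    queue = deque([start])
    while queue:  # breadth-first walk over congruence links
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


print(sorted(congruent_set("TNU1", ASSERTIONS)))
```

Keeping the asserter on every tuple matters: a consumer can then decide whose mappings to trust instead of accepting the whole closure.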

> How am I going to do that if you don't provide me with a good 
> (i.e. meeting the 8 criteria of my last email) GUID to use as the 
> object of my dwc:Identification properties?  

Have we cleared up that misunderstanding?

> For over a year, I've heard you lament that the whole problem 
> is that people make identifications and don't indicate the sensu/sec. 
> reference for the names they use.  

Yes, exactly!  And that's the real problem with our information domain: one
of the key pieces of information needed to apply computery cleverness to
identifications of Occurrence instances is missing from the vast majority of
datasets.  That, unfortunately, means we're limited in our ability to make
inferences about concept mappings -- not because an informatic structure
doesn't exist to accommodate it, but because one of the key pieces of
information is lost (i.e., what *concept* of this taxon were you thinking
when you assigned this name to this occurrence instance?)

> You are now creating a system that would allow people to unambiguously 
> make it clear what taxon they mean but you aren't giving them any 
> way to say what it is?  Again, I may just be misunderstanding what you wrote here.

Indeed, it seems that you are.  Please let me know if I have not cleared up
the misunderstanding.

> Yes.  This "record based ID" can be anything you want.  I really don't and
> shouldn't have to care about that.  The "human friendly ID that allows people to
> discuss the same semantic thing" is precisely what the TDWG GUID Applicability
> Statement (a ratified TDWG standard, thanks to Kevin) is talking about.

Hmmm...my turn to worry that I'm misunderstanding something.  I'm fairly
certain that the TDWG GUID applicability statement applies primarily to what
you are referring to as the "record based ID".  I think (not sure) that what
Kevin meant by the other thing ("human friendly ID that allows people to
discuss the same semantic thing") was more of a human-friendly service that
accepts the human-friendly form of an "identifier" (e.g., the text-string
taxon name), and then converts that into the real GUID (our "record based
ID") for actual embedded linking purposes.  Sort of like how DNS converts
"google.com" (human-friendly representation of a domain name) to
"74.125.224.176" (actual "GUID" used to route to a specific server).

> As I read that standard, I don't see any requirement that a GUID be "human friendly",
> but I would consider "human friendliness" to be a desirable "best practice"
> (influenced somewhat by http://www.w3.org/Provider/Style/URI and
> http://www.w3.org/TR/cooluris/) - if we have a choice of creating externally exposed
> GUIDs that are either human-friendly or not human-friendly, and if either works
> equally well, why not choose ones that are human-friendly?

Here is where I completely disagree.  I've said it before, and I'll keep
saying it:  GUIDs are (should be) intended and necessary for
computer-computer communication; *NOT* for human-computer or human-human
communication.  Their beauty or ugliness should be determined by what's
beautiful or ugly to a computer, not to a human.  A consistent 128 bits is
"beautiful" to a computer, but a UUID is ugly to a human; whereas "Danaus
plexippus (Linnaeus 1758)" is beautiful to a human, but ugly to a computer
(for reasons Dima already outlined).

More fundamentally, one lesson of history that seems to be perpetually
repeated is the mistake of encoding human-interpretable information into
what is intended to be a stable, permanent identifier.  INEVITABLY, a system
that uses human-interpretable information as identifiers will include some
fraction of instances where the human-interpretable part is somehow "wrong"
(e.g., a Cyrillic "а" was accidentally entered instead of
a Latin "a", or a typographic error in a scientific name, or worst of all,
the assignment of a text-string name to a homonym due to a mix-up in
authorship).  The temptation to "fix" those "wrong" values is enormous. And,
of course, by "fixing" them, permanence is broken. 
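
The Cyrillic/Latin case is easy to demonstrate with Python's standard unicodedata module. This is just a small illustration; the describe helper and the two sample strings are made up for the purpose.

```python
import unicodedata

latin_a = "a"          # U+0061
cyrillic_a = "\u0430"  # U+0430, near-identical to the eye in most fonts


def describe(ch):
    """Code point and official Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


print(describe(latin_a))
print(describe(cyrillic_a))

genuine = "Danaus plexippus"
spoofed = "D" + cyrillic_a + "naus plexippus"  # second letter is Cyrillic

# Same length, same appearance, but not the same string:
print(len(genuine) == len(spoofed), genuine == spoofed)
```

A name-string used as an identifier would silently split into two "different" identifiers here, and correcting the bad one afterwards is exactly the kind of "fix" that breaks permanence.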

Almost by definition, then, a "beautiful" identifier for computer-computer
communication should be "ugly" to a pair of human eyeballs.

> It is interesting all this discussion of identifiers when in the end it doesn't
> matter too much what the identifier is, just that you have an identifier at all.

Yes and no.  I guess it depends on what the word "is" means in your "what
the identifier is" phrase (Channeling Bill Clinton here).  If by "is" you
include "is permanent", "is unique", or "is actionable", then it does matter
what the identifier "is".  If you mean "is a DOI" vs. "is an lsid", then it
may matter (see Rod's post), or it may not -- depending on what you want the
Identifier to be able to do. 

> The important thing is the semantics, the "are we talking about the same thing"
> question - so this is where I believe RDF/semantic web comes in - I might see if
> I can come up with some RDF/sem web example for TDWG that could demonstrate this, hmmm...
 
This is where the real problem in our community is.  We are *WAY* too fast
and loose with the definitions of what our "things" are.  We think that by
simply distinguishing "Taxon Names" from "Taxon Concepts" that we've removed
ambiguity.  Not even close. There are multiple flavors within each of those
two "domains", and far too few people in our community (both on the IT side
*and* the taxonomy side) have thought through the implications of defining
the different flavors, let alone trying to establish a "sameAs" between two
different flavors.
 
> Better yet, read the TDWG GUID Applicability Statement 
> http://www.tdwg.org/standards/150/ and 

I think I helped write that one, so I'm pretty sure I've got a lot of that
covered already (except the parts I vehemently disagree with...  :-)  )

> http://www.w3.org/TR/cooluris/
That one I didn't know about, so thanks for the link. Of course, GUIDs
(sensu lato) and "uris" are not necessarily the same thing.  But that's
another argument for another day.
  
> When I say "GUID" I am not throwing around a colloquial term.  
> I intend for it to have the exact technical meaning that it is given 
> in the TDWG standard.  

Fair enough -- I must have missed when you defined your use of "GUID"
specifically in the context of the TDWG standard.

> At this point in time (i.e. after we finally have a ratified standard on GUIDs),

Maybe I'm mistaken, but I don't think we do.  I don't think that an
"Applicability Statement" rises to the level of "ratified standard", in the
sense that TCS 1.0 and DwC are "ratified standards".  Someone with better
knowledge of the TDWG process can clarify this.

> nobody in our community has any business designing and 
> exposing GUIDs without having read this document and 
> completely understanding its requirements and recommendations.  

I certainly would agree with that statement.

> I should not have to be "explaining" any of this to anybody on the list.  

*Sigh*  I often feel the same way.  Too often, in fact. I hope you realize
that when I complimented you on your 8 points I was complimenting you on the
way you "paraphrase out of [your] head".

> It is explained clearly and concisely in the standard.  

...err "applicability statement".

> There has been a bit of a
> debate over the importance of embedding "actionability" into identifiers
> inherently (the Tim Berners-Lee perspective)

> Wrong.  "GUIDs should be resolvable" (direct quote of recommendation 
> 7 from the GUID applicability statement).

No, *NOT* wrong! I will say it again to be perfectly clear: There has been
(and continues to be) a bit of a debate over the importance of embedding
"actionability" into identifiers inherently. This is and continues to be a
true statement. The only extent to which that statement is "wrong" is that I
understated it with the words "bit of a".  I should have either eliminated
those words, or replaced them with "robust".

> Don't add more of them to the list.  Recommendation 3: 
> "Providers must assign at most one GUID to any particular object."  
> Recommendation 4: "Only one globally unique identifier should be assigned to each object".

*Exactly*.  That's why I think it's foolish to regard all of these different
resolution mechanisms as distinct "identifiers".  There is *ONE* GUID.  It
is: A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523.  There are ten different ways to
make it actionable. It therefore meets the recommendations of the
applicability statement.

I draw your attention to p.7 of the "TDWG GUID Applicability Statement",
under the heading "Uniqueness and Resolution", where it states the
following:
============================
The global uniqueness of an identifier is often confused with the issue of
resolution of the identifier.  These two attributes of GUIDs can be
distinguished and discussed separately.
For example a Universally Unique Identifier (UUID) is a globally unique
identifier, but there are no widely known and used protocols for resolving a
UUID over the Internet (unlike HTTP URIs). This form of GUID is perfectly
acceptable for uniquely identifying data objects within a dataset.
Some identifiers therefore provide uniqueness, but not resolvability.
============================

The part that's not written there, but I think should have been written
there (and that I argued strongly in favor of writing there when the
document was drafted), is that GUIDs that are not self-resolving (i.e., not
inherently actionable), can be *made* actionable when represented in the
context of resolution metadata.
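
The distinction between the identifier and its resolution context can be sketched in Python as below. This is a hedged illustration only: the resolver templates and the example.org hosts are placeholders rather than real ZooBank or GNUB endpoints, and the actionable_forms and identifier_in functions are hypothetical.

```python
import re
import uuid

# The one identifier (the example UUID used in this message).
GUID = uuid.UUID("A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523")

# Hypothetical resolution metadata: several ways to make the same GUID actionable.
RESOLVER_TEMPLATES = {
    "http-proxy": "http://resolver.example.org/{guid}",
    "lsid": "urn:lsid:example.org:act:{guid}",
    "query": "http://service.example.org/lookup?id={guid}",
}

UUID_PATTERN = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def actionable_forms(guid):
    """Wrap one identifier in each known resolution context."""
    text = str(guid).upper()
    return {kind: tpl.format(guid=text) for kind, tpl in RESOLVER_TEMPLATES.items()}


def identifier_in(actionable_form):
    """Strip the resolution context and recover the bare identifier."""
    match = UUID_PATTERN.search(actionable_form)
    return uuid.UUID(match.group(0)) if match else None


forms = actionable_forms(GUID)
for kind, form in forms.items():
    print(kind, form)

# Every form carries the same single identifier.
print({identifier_in(form) for form in forms.values()} == {GUID})
```

Under this view, adding or retiring a resolver changes only the template table; the identifier that gets stored and compared stays the same.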

> I would assert that what you "want" and what you have 
> in your mind is at odds with the TDWG standard for GUIDs.

I would assert otherwise.

> This may be your opinion, but it is at odds with the ratified 
> standard which says.

Again, I don't agree with you on this assertion.

> (recommendation 2) that "HTTP GET 
> resolution must be provided for non-self-resolving GUIDs".  

Yes, exactly -- and I trust you realize that this is exactly what ZooBank
does.  Note that the applicability statement does not say there must be
*only one* HTTP proxy for the non-self-resolving GUID.

> The problem here is caused by you when you create and 
> expose so many different HTTP URI forms of your UUID.  
> Stop doing that (recommendation 4).

And I disagree. ZooBank follows recommendation 4 *precisely*.  There is only
*ONE* globally unique *identifier* assigned to each object.  In this case,
that identifier is: A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523. Full stop. End of
story. 'Nuff said.

The problem is not that I create and expose so many different HTTP URI forms
of my UUID.  The problem is when people conflate the function of
*identification* with *resolution*.  This is where I part company with (what
I've been told is) the TB-L school of thought.  And no, I don't think I'm
smarter than TB-L.  If I were the only one who disagreed on this point of
conflating resolution metadata with unique and global identification, then I
would assume that I'm an idiot and would stop complaining about this.  But
the more I think about it, and read about it, and understand about it, the
more confident I am in standing my ground on this.

> There is no need for this.  Make a single HTTP URI version of 
> your UUID and stick with it.  Preferably one without the query 
> string and use Mod rewrite (or whatever it's called) to transform 
> the simple, clear, and permanent version of the URI into whatever 
> flavor of temporary URL you are liking at the moment.  Every 
> application today understands HTTP GET.  No need for a registry.

Of course every application understands HTTP GET.  That's not the point (at
all).  

> Go with the TCS standard and the TDWG ontology as it exists currently.

If you think that TCS has "the" answer to the "name" problem, then I don't
think you fully appreciate the magnitude of the problem.

> > While it's nice to see the explicit representation of a "name" as an object,
> > rather than a string; unfortunately that doesn't address the elephant in the
> > room; that is, that different people have different notions of what "a
> > single scientific biological name" is.  I'm not talking subtly different
> > shades of fundamentally the same thing; I'm talking about fundamentally
> > different things with different implied sets of properties. This is one of
> > the issues I continued to hammer on during the development of TCS, and the
> > one that gave me the biggest qualms about TCS 1.0.  My hope was that it
> > would be resolved in TCS 2.0.
>
> There ain't no TCS 2.0 .  There is only TCS 1.2 .  I'm sorry about it, 
> but that's the ratified standard.

Please understand, I'm trying to illustrate where the existing standards
fall short of what this community *needs* in order to move forward. Of
course we have the standards, and if we allowed our hands to be tied to
those standards, there wouldn't be any progress.  TCS 1.2 DOES NOT MEET THE
NEED.  I want to move in the direction of something that DOES meet the need.

> There have been any number of things that I would "like" to be the 
> way I want.  However, the point of standards is that they get 
> hammered out in a form that satisfies the community in a general way.  

Are you saying that the standards are written in stone, and we should be
happy with them, and simply live with their limitations?  If so, then you're
operating in a world that I don't want any part of. I don't think you are,
but frankly, the tone of this particular email exchange (by either of us)
has not been especially helpful.  OBVIOUSLY we should use the standards, as
they exist, as much as possible WHEN THEY MEET THE NEEDS.  What I was
talking about (perhaps in an overly friendly, informal and loose way) is
where we need to go to next.  We clearly disagree on a few specific
interpretations of the TDWG GUID applicability statement, but that's fine --
that's what we should be spending our time focused on.

> But GNUB is not an "old system".  It is being built from scratch and I
> would assert that where it comes to interfacing it with the outside 
> world, it should follow standards such as they exist at the moment.  

*Exactly*, and obviously it will, as much as feasible, practical and
desirable, accommodate the existing standards -- and even the applicability
statements -- within their inherent limitations.  But speaking as someone
who was a very active participant in the development of both the GUID
applicability statements and TCS back when those were new and on the cutting
edge, I have absolutely no interest in *limiting* what GNUB can do to what
those standards articulate.  We've moved along far enough now that it's time to start
pushing to the next level -- time to start overcoming the limitations those
existing standards imposed. Many of those limitations were recognized at the
time those documents were drafted, and the drafters acknowledged that some
of the improvements would need to wait for the next version.  With the
development of GNA/GNUB, it's time to move on to the next version.  We
obviously want the next level to be backward compatible with existing
standards, and obviously every effort will be made to maintain backward
compatibility. 

> At the moment, people are allowed to think about and describe 
> names without reducing them solely to usage instances as you would like.  

Yes -- which is why I keep emphasizing why GNI will remain an important
component.

> I spent about an hour yesterday composing a rant about how counterproductive
> it is for taxonomy and computer geeks to create tools and systems that won't
> ever actually be used by the people who need them.  I decided that it wasn't
> helpful to actually post it, but now I'm thinking that maybe I should have...

Perhaps you should -- but keep in mind that a statement like "won't ever
actually be used by the people who need them" is an awfully broad and bold
assertion.  Backing up such an assertion begs for an articulation of the
full scope of all possible users, and a deep understanding of the function
of the systems you are making such assertions about.

> dwc:Taxon doesn't really have much of any useful definition, so I'm with you there.
> tn:TaxonName is actually rather precisely defined, at least if you look at the RDF
> (http://code.google.com/p/tdwg-ontology/source/browse/trunk/ontology/voc/TaxonName.rdf)

Is this the definition you refer to:

"A scientific biological name. An object that represents a single scientific
biological name that either is governed by or appears to be governed by one
of the biological codes of nomenclature. These are not taxa. Taxa, whether
accepted or not, are represented by TaxonConcept objects."

If so, by that definition, how many TaxonName instances are included in the
following list?

Aus bus
Aus dus
Xus bus

Three? Four? Five?  I can defend all three of those answers within the scope
of the definition above.  Assuming no homonyms or misspellings are involved,
GNUB would establish four separate Protonyms, each of which can be thought
of as a "name object", each with Code-specific properties.  Additionally, if
these fell under the botanical Code, there would be at least one, and as
many as three additional nomenclaturally-relevant TNUs that would establish
combination(s) other than the original as distinct "name objects" under the
botanical Code.

> In my opinion, TCS (and by extension, the TDWG ontology) 
> puts a rather restrictive collar and leash on taxon names.  

Enthusiastically Agreed!  :-)

> I quote from the user guide page 9: "<TaxonName> elements 
> do not represent taxa.  They serve only as abstract nomenclatural 
> data structures that encapsulate the core rules of the different 
> nomenclatural codes.  Their purpose is to prevent nomenclatural 
> statements becoming confused with statements about the circumscription 
> of, and relationships between, different taxon concepts.  No 
> taxonomic opinion can be expressed using <TaxonName> elements in 
> TCS.  As a rule of thumb if you are dealing with anything beyond a 
> type specimen and references to it, you are talking about a 
> TaxonConcept of some form."  This does not seem like a broad 
> and imprecise definition to me.  One is allowed to describe the 
> pieces of the name and that's about it.

Yes, I know -- I helped write that.  Unfortunately, it's still not precise
enough (as is documented on some wiki somewhere, as we were defining what
was originally called "LinneanCore", which later was subsumed into what is
now TCS).

> When I look carefully at how the TDWG ontology deals with 
> taxon names and taxon concepts, it seems very simple and 
> "usable" to me.  

I'll definitely concede that point to you -- it strikes a good balance
between ideal and practical.  One of the over-arching goals within GNA
development is to nudge a bit further towards the "ideal" without
compromising on the "simple" and "usable".  Whether or not this is possible
remains to be seen.

> If one defines a Taxon to be composed of a name component 
> and a sensu/sec. component as several people (including you, I think) 
> on this list have done and as TCS has done (I think), then representing
> it in RDF becomes tractable.  

OK, good -- now I'm getting my head back into this conversation.  Yes, *my*
intent was to keep TCS open-ended such that any "[Name] sec. [Reference]"
(=TNU) could be represented through TCS.  That is the intention of GNUB.
This is where Jessie Kennedy and I had many long debates.  From her
perspective, only the subset of "[Name] sec. [Reference]" (=TNU) instances
that rise to the level of a "taxon definition" should be represented in TCS.
This comes down to the fuzzy distinction between an "Identification" and a
"Concept Definition".  In the latter, presumably one provides a suite of
information to help define the boundaries of a taxon-concept circumscription
(specimens, characters, synonymy, etc.). In the former, presumably one
simply assigns a name-string to an occurrence (or similar) instance of an
organism.  The problem is that every imaginable version between these two
endpoints exists in biodiversity-land, so there is no clear distinction
between which instances rise to the level of a "Taxon" and thus are
legitimately represented via TCS, and which do not.  In my mind, the
approach of GNUB should be to not try to establish a distinction, and just
accommodate any "[Name] sec. [Reference]" (=TNU) instance.

> One anchors the name part to a tn:TaxonName instance 
> (properly collared and chained and wearing a GUID as a dog tag).  
> How one anchors the sensu/sec. part is still a subject for discussion.  

This is the essence of a TNU.  Except in GNUB-speak, a "TaxonName" is
represented by another TNU -- specifically, the TNU that established the
name in the first place.  So, for example, Linnaeus (1758) established the
name "Aus bus".  Smith (1990) defines a taxon concept for "Aus bus L.".

TNU1: Aus bus Linnaeus 1758 sec. Linnaeus 1758
TNU2: Aus bus Linnaeus 1758 sec. Smith 1990

The Protonym is TNU1.  TNU2 links to TNU1 as the Protonym, and basically
translates to "Smith's taxon concept definition labeled with the name 'Aus
bus L.'"; or more simply: "Aus bus L. sec. Smith 1990".
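
As a rough Python sketch of that structure: the TNU class, its field names, and the convention that an empty protonym_id marks the Protonym itself are assumptions made for illustration, not the actual GNUB schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TNU:
    """A taxon name usage: a name as treated in one reference."""
    tnu_id: str
    name: str
    reference: str
    # Assumption for this sketch: None means "this TNU is itself the Protonym".
    protonym_id: Optional[str] = None

    def label(self):
        return f"{self.name} sec. {self.reference}"


tnu1 = TNU("TNU1", "Aus bus Linnaeus 1758", "Linnaeus 1758")
tnu2 = TNU("TNU2", "Aus bus Linnaeus 1758", "Smith 1990", protonym_id="TNU1")
INDEX = {t.tnu_id: t for t in (tnu1, tnu2)}


def protonym_of(tnu, index):
    """Follow the link back to the usage that established the name."""
    return index[tnu.protonym_id] if tnu.protonym_id else tnu


print(tnu2.label())
print(protonym_of(tnu2, INDEX).label())
```

The point of the shape is that the "name" is not a separate kind of object; it is simply the earliest usage that every later usage points back to.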

> I have been thinking about the following approach.  It is based on a Venn 
> diagram that I have in my head which I created from your descriptions of 
> TNUs on this list.  The Venn diagram has a big rectangle labeled 
> "nominal taxon".  

If I correctly understand what you mean by the "Nominal Taxon", I think this
equates in GNUB-speak to a Protonym.

> Inside that is a smaller rectangle named "taxon name usage (TNU)".  
> Inside that is an even smaller rectangle named "taxon concept".  

Hmmmm...maybe.  I need to digest this a bit.

> In this view, Taxon concepts are well-described/circumscribed by a 
> publication.  

Yes.

> TNUs (which include taxon concepts) are associated with a particular 
> person's idea of what the taxon is, but which may or may not be 
> described in a publication.  

Yes, I think.  I would state it this way: a subset of all TNUs are the TNUs
that represent well-defined, published definitions of taxon concepts.  That
is, all taxon concepts are anchored to (born as?) a TNU, but not all TNUs
rise to the level of Taxon Concepts.

Depending on how you distinguish "Publication" from non-publication, this
may be somewhat of a distracting parameter.  Generally, good taxon concept
definitions exist within documentation sources that are what most of us
would call "published"; but there's nothing inherent to "publication" that
is necessary for "good taxon concept definition".  Good taxon concept
definitions can certainly exist in what many of us would describe as
"unpublished" form; just as many published TNU's don't rise to the level of
good taxon concept definition.

> Nominal taxa are all instances of a scientific name use including those 
> where we have no idea who applied the name or what set of 
> organisms they intended to be included in the taxon.  

Yes!  In GNUB, this is represented by the fact that all the relevant TNUs
are anchored to the same Protonym (e.g., Aus bus L. sec. Linnaeus 1758).

> In terms of RDF metadata:
> 1. Go ahead and let the rdf:type of the thing be tc:Taxon

Ok.  But how does that map to dwc:Taxon?

> 2. Make the object of tc:hasName be a GUID (i.e. as described 
> by the TDWG GUID Applicability Statement, not some other 
> kind of GUID)-identified resource, preferably from a 
> well-known source like uBio.

Not sure.  I don't see uBio as a source of "name objects" so much as
"name-strings".  I think a better GUID link would be to a GNUB TNU that is a
Protonym. This is what is currently registered in ZooBank: Protonyms (the
most common kind of Nomenclatural Act; that is, the TNU that represents the
establishment of a new scientific name).

> 3. If the sensu/sec. is described in a publication (in my mind 
> a true taxon concept), then the object of tc:accordingTo 
> is an HTTP proxied DOI, HTTP URI of a BHL-scanned publication, 
> or if both of those fail, something non-resolvable but globally-
> unique like an ISBN or URL of a stable web page.

OK, yes, I think so.  Translated into GNUB-speak, I would say that if the
TNU (treatment of a taxon name within a documentation source, like a
publication) includes a robust definition of a Taxon Concept, then the
linked ReferenceID (GNUB-generated GUID) would ideally be cross-mapped to a
content-rich rendering of the identified reference, such as a DOI
(presumably resolving to a PDF), an HTTP URI to a set of BHL page-images, or
a PLAZI Handle for an XML-marked-up taxon treatment (or any or all of the
above).

> 4. If the sensu/sec. is not described in a publication, but is 
> associated with a particular person (in my mind a TNU that 
> isn't a true taxon concept), then the object of tc:accordingTo 
> could be the URI of a foaf:Person or foaf:Group.

Well, that's not exactly how GNUB would handle it -- but close.  Basically,
a "Reference" in GNUB represents some form of documentation of information
that has been authored (e.g., foaf:Person), and is static as of some moment
in time (e.g., publication date).  Again, I don't think "publication" is the
right parameter to distinguish "taxon concept" from non-taxon-concept. There
are many, many TNUs appearing in published works that do not really rise to
the level of taxon concept definition.  In any case, whether it's published
or not, and whether it represents a good taxon definition or not, are two
different things that may be correlated, but not hard-linked.  Also,
regardless of whether it's published, any kind of documentation has the
potential of authorship (attribution) and some point in time....in other
words, a gnub:Reference instance. There's no reason to use the class of
"thing" to which a TNU is linked (e.g., publication object vs. Agent object,
as you seem to be suggesting) as the delimiter of what should be treated as
a "Taxon Concept" and what should not.

> 5. If the sensu/sec. is completely unknown, then the taxon 
> is a nominal taxon that is not a TNU.  I don't know whether 
> it is better for the taxon to simply lack a tc:accordingTo property 
> or to have a tc:accordingTo property that somehow says 
> "we don't know anything about the sensu/sec.".  

Agreed!  In GNUB-speak, the ReferenceID would be null or (my preference from
an implementation perspective) "0" (which translates to "we don't have any
information about the specific implied usage, so treat it as a nominal
taxon").
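
Points 1 through 5 can be sketched roughly as follows. This is only an illustration: triples are plain tuples of prefixed names rather than real RDF, the identifiers are placeholders, the "0" sentinel is stored as a string here, and leaving out tc:accordingTo for a nominal taxon is just one of the two options raised in point 5.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]

# Sentinel meaning "no information about the specific implied usage".
UNKNOWN_REFERENCE = "0"


def is_nominal_taxon(reference_id: Optional[str]) -> bool:
    """True when the reference is null or the '0' sentinel."""
    return reference_id is None or reference_id == UNKNOWN_REFERENCE


def taxon_triples(taxon_uri: str, name_guid: str,
                  reference_id: Optional[str]) -> List[Triple]:
    """Build the minimal description of a taxon from points 1 to 5."""
    triples: List[Triple] = [
        (taxon_uri, "rdf:type", "tc:Taxon"),    # point 1
        (taxon_uri, "tc:hasName", name_guid),   # point 2
    ]
    if not is_nominal_taxon(reference_id):
        # Points 3 and 4: a publication or a person/group identifier.
        triples.append((taxon_uri, "tc:accordingTo", reference_id))
    # Point 5: for a nominal taxon, tc:accordingTo is simply omitted here.
    return triples


with_source = taxon_triples("urn:example:taxon:1", "urn:example:name:1",
                            "urn:example:reference:1")
nominal = taxon_triples("urn:example:taxon:2", "urn:example:name:1", "0")
```

A consumer that follows the same convention can tell a nominal taxon from a TNU by checking whether a tc:accordingTo triple is present.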

> I realize that you probably aren't going to like this because 
> it isn't as sophisticated and nuanced as you would like for 
> your GNUB TNUs to be.  

No, actually I think it's perfectly fine.  The reason I like normalized
back-end data structures is that they give you much greater flexibility in
offering any range of services, from extremely simple to as complex as the
back-end data model allows.  Moreover, as you said:

> However, there would be nothing that would prohibit you 
> from creating and adding a myriad of clever properties 
> to the tc:Taxon instance RDF to make it do all of the 
> things you want.  

Exactly.

> The practice I have described would break down the act 
> of defining a taxon into  well-known, standardized pieces 
> and it is a practice that could fairly easily be followed 
> by people without sophisticated IT resources.  It would 
> allow for the transfer and comparison of taxa information 
> and make the possibility of reconciling at some central 
> location (like GNUB) the taxa that are described in a 
> distributed network of users.  Doing something like this is, 
> I believe, the entire reason for the existence of TCS, the 
> TDWG ontology, old TDWG TAG roadmaps, etc.  

We are in full agreement!

> Please apply some self-discipline to follow the ratified 
> standards or risk blowing us all back to 2005 where we 
> would have to discuss all of the settled things again.  

I guess this is where we differ.  Besides the semantic issue of "ratified
standard" vs. "applicability statement", and the fact that we seem to have
somewhat different interpretations of what the GUID applicability statement
is actually recommending, I have a somewhat opposite perspective from you on
this. In my view, constraining ourselves to TCS 1.2 is forcing us to STAY
back in 2005, which had a somewhat different biodiversity informatics
landscape from today, and even more different from what (I *hope*) we see
emerge over the next 2-3 years. As I said, we want to maintain backward
compatibility with TCS 1.2, and we certainly want to adhere to the
recommendations of the GUID applicability statement (which I believe I do,
except for the specific known issues that are on the "to do" list), but also
push forward to overcome the limitations of those technologies as a way to
prototype the next generation of these equivalent standards &
recommendations.

> In some ways what I'm talking about here is really 
> (as I understand it) the principle that underlies REST.  

Yes!  Ever since I had REST explained to me, I've been anxious to implement
those kinds of services.  Rob Whitton is already at work on ZooBank 2.0,
which will be a complete ground-up re-write, and will be services-based.

> Within your big GNUB kingdom and my little Bioimages 
> kingdom, we are free to do whatever clever things we 
> want, structure databases as we wish, do clever 
> programming stuff or whatever.  But when you and I talk, 
> we follow commonly established rules, namely we talk 
> using the HTTP protocol 

Total agreement!

> and identify the things that we want to talk about 
> using HTTP URIs.  

Errr..sort of.  I say we identify things using GUIDs, and provide services
that resolve those GUIDs via actionable HTTP URIs (or, if you prefer,
embedding those GUIDs within a resolution metadata "wrapper").  Yes, I know
it's all the rage to collapse the functions of actionability and globally
unique identification into the same text-string URI (what I've been
referring to as the TB-L perspective).  But to be perfectly blunt, I see
this as a mistake that will, in the long run, slow down our progress.
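
A rough sketch of the separation described here, assuming the GUID is a UUID and that a resolver exposes it as the last path segment of a URL. Both resolver base URLs are hypothetical placeholders, not real services.

```python
import uuid
from urllib.parse import urlsplit

# Hypothetical resolver endpoint; any service could play this role.
RESOLVER_BASE = "http://resolver.example.org/guid/"


def to_actionable_uri(guid: uuid.UUID, base: str = RESOLVER_BASE) -> str:
    """Wrap a bare GUID in a resolvable HTTP URI."""
    return base + str(guid)


def guid_from_uri(uri: str) -> uuid.UUID:
    """Recover the bare GUID from the last path segment of a wrapper URI."""
    last_segment = urlsplit(uri).path.rstrip("/").rsplit("/", 1)[-1]
    return uuid.UUID(last_segment)


guid = uuid.UUID("12345678-1234-5678-1234-567812345678")
uri_a = to_actionable_uri(guid)
uri_b = to_actionable_uri(guid, "http://mirror.example.net/id/")

# Two different actionable URIs, one underlying identifier.
same_object = guid_from_uri(uri_a) == guid_from_uri(uri_b) == guid
```

In this arrangement the identity of the object lives in the GUID alone, so a resolver can move or be mirrored without changing what the identifier refers to.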

> Since we are talking specifically about biodiversity informatics, 
> we should choose to follow more restrictive rules about the 
> identifiers themselves (following the TDWG GUID applicability 
> statement) and the nature of the RDF (following the GUID 
> applicability statement, well-known vocabularies such as the 
> TDWG ontology, FOAF, DCMI, Darwin Core, geo, etc.).  
> If we fail to do that, then every interaction that I have with 
> another entity requires me to establish in advance the rules 
> of that interaction.  The Web works well because people follow 
> a defined set of rules about URLs and HTML.  I would assert 
> that we now (at last) have a similar model available to us 
> in the biodiversity informatics community if organizations 
> would just have the self-discipline to use it.  

Agreed!  I think when we distill this entire exchange, we'll find that we
have slightly different interpretations about what the GUID applicability
statement actually says & means, and a non-trivial amount of
miscommunication, but otherwise (as was the case the last time we had such a
voluminous exchange), we're actually more on the same page than not.

> So I'm actually pretty optimistic about the whole venture 
> assuming that we can get people and organizations to 
> actually read and try to follow the standards that we 
> have already agreed upon.  

I think it's nice to end this email on a point of strong agreement!

Aloha,
Rich


Richard L. Pyle, PhD
Database Coordinator for Natural Sciences
Associate Zoologist in Ichthyology
Dive Safety Officer
Department of Natural Sciences, Bishop Museum
1525 Bernice St., Honolulu, HI 96817
Ph: (808)848-4115, Fax: (808)847-8252
email: deepreef at bishopmuseum.org
http://hbs.bishopmuseum.org/staff/pylerichard.html





