[tdwg-content] ITIS TSNID to uBio NamebankIDs mapping

Sun Jun 5 20:56:40 CEST 2011

Hi Steve,

Excellent post!

I like your list of what we want "GUIDs" (see below) to do, and I think it's
an excellent starting point for a bar we should all strive for.  I'm
particularly grateful to learn that the existing ZooBank service fails so
many of them.  I've forwarded your post to Rob Whitton, who will be working
on Gen-2 of ZooBank in the coming weeks, and asked him if we can use your 8
tests as a metric to adhere to.  Watch this space.

Meanwhile...

> "But really, from the perspective of the end-user, does it matter 
> if it's an identifier or a service?  Ultimately, they ask the questions, 
> and the answers appear on their computer screens."
>
> I would answer this question by saying "yes, it does matter!" - 
> it is important that a well-designed GUID do more than just throw 
> something up onto a human user's web browser.  

I absolutely agree with you, but that's not the distinction I was making in
my quoted text.  I was only talking about whether we call something an
"identifier" (not GUID, which has more specific implications), or a
"service", in the context of human-machine conversations.  I think your
enumeration of things we want GUIDs to do is a very good framework for
discussion.  I would only caution that "GUID" means different things to
different people (some people use it synonymously with UUID, for example),
and also that GUID does not imply "actionable".  There has been a bit of a
debate over the importance of embedding "actionability" into identifiers
inherently (the Tim Berners-Lee perspective), vs thinking about
"identification" separately from how we perform some action on it.  For
example, UUIDs and Social Security numbers are extremely useful identifiers,
even though they are not inherently actionable.  It's amazingly easy to
perform action on a non-actionable identifier by simply appending it to an
actionable prefix.  For example, going back to the list of "identifiers":

A. A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
B. urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
C. http://zoobank.org/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
D. http://zoobank.org/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
E. http://lsid.tdwg.org/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
F. http://www.google.com/search?q=Danaus+plexippus+(Linnaeus+1758)
G. http://lsid.tdwg.org/summary/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
H. http://darwin.zoology.gla.ac.uk/~rpage/lsid/tester/?q=urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523&submit=Go

There are two different ways of looking at this:

1) There are 8 different identifiers
2) There is one identifier (A), and 6 ways to perform action on it (B-E,
G-H).

If you treat them all as distinct identifiers, then let me add a few more to
the list:

I. http://zoobank.org/?lsid=urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
J. http://zoobank.org/?id=urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
K. http://zoobank.org/?id=A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
L. http://zoobank.org/?uuid=A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523

Note that all four of the above, plus B-D in the original list, are
resolved through zoobank.org.  Why are there so many different ways to
perform action on the "same" identifier? Because I wanted the ZooBank
resolution service to be flexible. And, because in my mind, there is only
one identifier (A); and lots of different ways to retrieve the metadata of
the object it represents.
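
To make the "one identifier, many wrappers" view concrete, here is a minimal
Python sketch.  It assumes the only thing that matters is the UUID embedded in
each form; the function name bare_identifier and the FORMS table are invented
for illustration and are not part of any ZooBank service.

```python
import re

# Matches a UUID in the usual 8-4-4-4-12 hexadecimal layout.
UUID_PATTERN = re.compile(
    r"[0-9A-Fa-f]{8}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{12}"
)


def bare_identifier(text):
    """Return the UUID embedded in `text` (upper-cased), or None if there is none."""
    match = UUID_PATTERN.search(text)
    return match.group(0).upper() if match else None


UUID = "A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523"
LSID = "urn:lsid:zoobank.org:act:" + UUID

# The twelve forms A-L from the lists above, rebuilt from their shared parts.
FORMS = {
    "A": UUID,
    "B": LSID,
    "C": "http://zoobank.org/" + LSID,
    "D": "http://zoobank.org/" + UUID,
    "E": "http://lsid.tdwg.org/" + LSID,
    "F": "http://www.google.com/search?q=Danaus+plexippus+(Linnaeus+1758)",
    "G": "http://lsid.tdwg.org/summary/" + LSID,
    "H": "http://darwin.zoology.gla.ac.uk/~rpage/lsid/tester/?q=" + LSID + "&submit=Go",
    "I": "http://zoobank.org/?lsid=" + LSID,
    "J": "http://zoobank.org/?id=" + LSID,
    "K": "http://zoobank.org/?id=" + UUID,
    "L": "http://zoobank.org/?uuid=" + UUID,
}

# Every form except F carries the same UUID; F carries none at all.
CORES = {label: bare_identifier(form) for label, form in FORMS.items()}
```

Under that assumption, eleven of the twelve strings collapse to identifier A,
and only F (the Google search) yields nothing.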

Now consider this from the TB-L perspective. Eleven different identifiers
for the same object (excluding F).  Does that mean we need to generate
owl:sameAs statements for all pair-wise relationships?  That's a lot of
owl:sameAs statements!  Perhaps I'm the bad guy for foolishly allowing so many
different ways to resolve ZooBank identifiers, and for needlessly fabricating
so many "different" identifiers for the same thing.  Fair enough.
But I still think we're a lot better off by disentangling identifiers from
the services we use to perform action on them.
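
For a sense of scale, the pair-wise count can be worked out in a few lines of
Python.  This is only a back-of-the-envelope sketch: it assumes one owl:sameAs
statement per unordered pair among the eleven forms (A-L, excluding F), and the
variable names are made up for illustration.

```python
from itertools import combinations

# Labels of the eleven forms that point at the same object (F is excluded).
labels = [letter for letter in "ABCDEFGHIJKL" if letter != "F"]

# One statement per unordered pair: (A, B), (A, C), ..., (K, L).
pairs = list(combinations(labels, 2))
pair_count = len(pairs)  # 55

# If each direction had to be stated separately, the count would double.
directed_count = 2 * pair_count  # 110
```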

One of the arguments on the TB-L side is that a non-actionable identifier by
itself is useless if you cannot inherently perform action on it.  For
example, if you were walking through the park and stumbled upon a slip of
paper with "A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523" written on it, you
probably wouldn't be able to do much with it.  But in reality, that's not
what happens.  We never expose identifiers as simple context-free
identifiers in their non-resolvable form.  These identifiers are *always*
exposed in some context.  The problem is that if you treat the "resolution
metadata" (as I call it -- e.g., "urn:lsid:zoobank.org:act:" or
"http://zoobank.org/") as *part* of the identifier (as you have to do if you
make things like "urn:lsid:ubio.org:namebank:11815"), then it becomes
difficult for an application to distinguish between
"http://zoobank.org/?id=A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523", and
"http://zoobank.org/?uuid=A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523", which, to a
human, obviously refer to the same thing.  In other words, absent all those
owl:sameAs statements, an application could break if it harvests content
from different sources that use different resolution metadata for the "same"
(sensu Pyle) identifier.

Maybe what we need to think about is a registry of "persistent resolution
services", which our community relies on.  That way, we can apply the
owl:sameAs statements to the resolution services, rather than to every
single individual identifier.
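
As a rough sketch of what such a registry might look like (purely
hypothetical: the REGISTRY structure, the "zoobank-act" label, and the function
names are assumptions made for illustration, not an existing service), the
equivalence is declared once per resolution prefix, and individual identifiers
are compared only after the registered prefix has been stripped:

```python
# Hypothetical registry: each prefix listed under a service is declared, once,
# to be an equivalent way of resolving that service's identifiers.
REGISTRY = {
    "zoobank-act": [
        "urn:lsid:zoobank.org:act:",
        "http://zoobank.org/urn:lsid:zoobank.org:act:",
        "http://zoobank.org/",
        "http://zoobank.org/?lsid=urn:lsid:zoobank.org:act:",
        "http://zoobank.org/?id=urn:lsid:zoobank.org:act:",
        "http://zoobank.org/?id=",
        "http://zoobank.org/?uuid=",
        "http://lsid.tdwg.org/urn:lsid:zoobank.org:act:",
    ],
}

# Flatten to (prefix, service) pairs, longest prefix first, so that
# "http://zoobank.org/?id=" is tried before the shorter "http://zoobank.org/".
_PREFIXES = sorted(
    (
        (prefix, service)
        for service, prefixes in REGISTRY.items()
        for prefix in prefixes
    ),
    key=lambda item: len(item[0]),
    reverse=True,
)


def normalize(identifier):
    """Return (service, bare_id) for a registered form, else (None, identifier)."""
    for prefix, service in _PREFIXES:
        if identifier.startswith(prefix):
            return service, identifier[len(prefix):]
    return None, identifier


def same_object(a, b):
    """True when both strings share a registered service and the same bare ID."""
    service_a, id_a = normalize(a)
    service_b, id_b = normalize(b)
    return service_a is not None and (service_a, id_a) == (service_b, id_b)
```

With this in place, an application harvesting the "?id=" form from one source
and the "?uuid=" form from another could treat them as the same thing without
any per-identifier owl:sameAs statements.  The sketch does no validation of
what follows the prefix; a real registry would need that, and much else.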

> An important question that I think has been underlying much of this
> discussion is whether GUIDs are actually needed for names.

I think the answer is clearly  "yes". The problem is defining what is meant
by the word "name".

> If one takes the position 
> that a "name" can never be more than a string without 
> crossing the line into being something more complicated 
> like a TNU or TaxonConcept, then I think one could make 
> the case that the answer to this is "no".  

Perhaps, but I don't know of anyone who takes that position.
GNI/uBio/NameBank exist for a very specific purpose, and in that very narrow
context, the "name" is equivalent to the UTF-8-encoded string of
characters.  The architects of these systems would be the first to say that
this is a very limited context for what a "name" is, and *none* of them
would assert that a "name" can never be more than this.  Everyone I know
understands that all other flavors of "name" imply something much, much more
than the string of text characters.

> There isn't a whole lot that one would want to know about the 
> string that couldn't just be imparted by letting it be a string literal.  
> If one takes this position, then "Quercus alba L." is a different "thing" 
> (i.e. resource) from "Quercus alba" or "Quercus alba Linnaeus".  
> It seems that something like this is the position that Rich and the 
> GNI are taking.  Under this scenario, there is little point in creating 
> URI GUIDs for the name strings.

I only took that position in the *very narrow* context of GNI, which is
unusual among the millions of taxonomic datasets in treating a "name" as a
distinct text string.  And I backed off from that position after reading
Dima's post.
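
To illustrate what "a name is a distinct text string" means in that narrow
context, here is a tiny Python sketch.  It assumes identity is nothing more
than the UTF-8 bytes of the string; the helper string_identity is invented for
illustration and is not how GNI is actually implemented.

```python
def string_identity(name):
    """Hypothetical helper: a name string's identity is simply its UTF-8 bytes."""
    return name.encode("utf-8")


name_strings = ["Quercus alba L.", "Quercus alba", "Quercus alba Linnaeus"]

# Under the name-as-string view, each distinct byte sequence is a distinct "thing".
identities = {string_identity(name) for name in name_strings}
distinct_count = len(identities)  # 3: no two of the strings are byte-identical
```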

> On the other hand, if one takes the position that a name can be a 
> conceptual entity that has properties which include its name string(s) 

...as, I think, everyone does...

> and parts thereof, then it does make sense to apply GUIDs to that 
> kind of entity.  I am thinking about a tn:TaxonName as defined in the 
> TDWG ontology (see
> http://code.google.com/p/tdwg-ontology/source/browse/trunk/ontology/voc/TaxonName.rdf),
> which comes out of the TCS schema (see
> http://code.google.com/p/darwin-sw/wiki/ClassTaxon for info and links
> regarding TCS).
> A tn:TaxonName is "An object that represents a single scientific
> biological name..." i.e. an "object"
> NOT defined as a string.

While it's nice to see the explicit representation of a "name" as an object
rather than a string, unfortunately that doesn't address the elephant in the
room: different people have different notions of what "a single scientific
biological name" is.  I'm not talking about subtly different shades of
fundamentally the same thing; I'm talking about fundamentally
different things with different implied sets of properties. This is one of
the issues I continued to hammer on during the development of TCS, and the
one that gave me the biggest qualms about TCS 1.0.  My hope was that it
would be resolved in TCS 2.0. I wanted to reduce both names and concepts to
the same core entity: usage instances.  That's exactly what we're doing with
GNUB.

> But if the GNI is only a "dirty bucket" that accumulates every name string
> that anybody has ever used in history but with little or no metadata, then
> I can't see that I have any use for a URI point to it, at least as
> something to which I would refer in RDF.

I think it's helpful to see GNI and GNUB as a yin-and-yang sort of thing.
There *needs* to be a service at the dirty end of the spectrum, because for
the vast majority of existing biodiversity data (digitized or not), the only
link we have to a taxon concept is a text-string name. There needs to be a
service that manages names-as-text strings.  GNUB, at the other end of the
spectrum, has the rich full-context metadata that I think you are interested
in, allowing for unambiguous reconciliation of different text strings as
applied to type specimens, or enumerating all spelling variants of the
"same" name, etc., etc.  What's missing (but DEFINITELY planned and already
sketched out), are the services that connect GNUB and GNI together.  As soon
as we hear definitively from NSF (should be soon now), we'll have the
resources to start building those services.

> I'm not saying that there isn't a use for the GNI.  I think what I'm
> saying is that there doesn't seem to be any point in worrying about how to
> create URIs for the GNI when those URIs don't "do" anything different from
> what a string literal does.
> I think this is essentially what Rich was saying: "that text string
> represents a perfectly suitable unique identifier.  There is no need to
> generate a surrogate identifier like an integer number or UUID or LSID or
> whatever".

Yes, I think that's exactly what I was saying.  Dima's post has forced me to
reconsider this somewhat, but even still, more broadly, I never saw GNI as a
service in need of "GUIDs" (in the sense that you outlined at the beginning
of your post). Certainly there is value in having internal data structures
to perform certain functions, but as far as I can tell, the interface
between GNI and the outside world should probably be limited to
human-readable name-strings. 

> Although Rich has been very cautionary about maintaining the distinction
> between ITIS TSNs, which he believes to represent some kind of minimal TNU

I would defer to Dave N.'s post concerning what a TSN is, and represents.

> and uBio IDs which he believes to represent a name string, 
> I haven't been able to find any evidence that it would be "naughty" 
> to assert that either one is a tn:TaxonName.  

That's only true to the extent that tn:TaxonName may be too broadly
(imprecisely) defined (just like dwc:Taxon).

Aloha,
Rich