[tdwg-content] ITIS TSNID to uBio NamebankIDs mapping

Mon Jun 6 16:37:22 CEST 2011

Comments regarding several emails inline:

Richard Pyle wrote:
> By contrast, the core object in GNUB is a taxon name usage instance -- which
> is a purely abstract notion of the usage of a taxon name within some
> documentation source (like a publication).  In this case, the text-string
> name is merely a property of the GUID-identified object, and would be an
> extremely BAD choice to use as a unique identifier.  
It is possible that I'm not understanding what you are saying here, but 
if you are saying that the only name-related property of your GNUB taxon 
instances will be one which has a name string literal as its object, 
then I think that is a big mistake.  That will require any client using 
your taxon instance metadata to re-process the literal name string to 
cross reference it with lexical variants, parse it into its pieces, 
etc.  That should only need to be done once and then referenced via a 
GUID for the name (i.e. in the sense of tn:TaxonName). 
> This is why GNUB needs
> to generate a unique identifier to represent this core data object.  The
> form that identifier takes (UUID, LSID, integer, DOI, whatever) from the
> perspective of the end user should be completely irrelevant, because the
> user should rarely (if ever) see it, and should certainly *never* be in a
> position to type it on a keyboard (we can discuss the appearance of ZooBank
> LSIDs on printed pages separately). 
OK, again maybe I'm not understanding what you are saying here, but if 
you are saying that you don't intend to expose your unique GNUB 
identifiers to the public, then as far as I'm concerned you are setting 
up GNUB to be irrelevant from the start.  You mention a number of cool 
taxonomist-geek type things that you hope to accomplish with GNUB.  But 
from my perspective as a non-taxonomist-geek, the main purpose I have 
for GNUB is as a place to anchor dwc:Identification instances so that I 
can indicate whether my identified resource is a representative of the 
same taxon that is being referred to by somebody else (or at least to 
make it possible for somebody to figure that out via computery 
cleverness, Semantic Web or otherwise).  How am I going to do that if 
you don't provide me with a good (i.e. meeting the 8 criteria of my last 
email) GUID to use as the object of my dwc:Identification properties?  
For over a year, I've heard you lament that the whole problem is that 
people make identifications and don't indicate the sensu/sec. reference 
for the names they use.  You are now creating a system that would allow 
people to unambiguously make it clear what taxon they mean but you 
aren't giving them any way to say what it is?  Again, I may just be 
misunderstanding what you wrote here.

Kevin Richards wrote:
> Oh, now that I have read Rich's email here, it seems we are in agreement, of sorts.  I think there is obviously a need for both of these "identifier" approaches - ie a record based ID that no one should really ever need to interact with directly, and a human friendly "ID" that allows people to discuss the same semantic "thing".
>   
Yes.  This "record based ID" can be anything you want.  I really 
don't and shouldn't have to care about that.  The "human friendly ID 
that allows people to discuss the same semantic thing" is precisely what 
the TDWG GUID Applicability Statement (a ratified TDWG standard, thanks 
to Kevin) is talking about.  As I read that standard, I don't see any 
requirement that a GUID be "human friendly", but I would consider "human 
friendliness" to be a desirable "best practice" (influenced somewhat by 
http://www.w3.org/Provider/Style/URI and http://www.w3.org/TR/cooluris/) 
- if we have a choice of creating externally exposed GUIDs that are 
either human-friendly or not human-friendly, and if either works equally 
well, why not choose ones that are human-friendly?
> It is interesting all this discussion of identifiers when in the end it doesn't matter too much what the identifier is, just that you have an identifier at all.  The important thing is the semantics, the "are we talking about the same thing" question - so this is where I believe RDF/semantic web comes in - I might see if I can come up with some RDF/sem web example for TDWG that could demonstrate this, hmmm...
>   
Already done in the context of tc:Taxon and tn:TaxonName and posted on 
this list in January: 
http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002204.html .
http://biodiversity.org.au/apni.taxon/118883
is an identifier that is friendly to both humans and computers.  Through 
content negotiation a computer gets
http://biodiversity.org.au/apni.taxon/118883.rdf
and the human gets
http://biodiversity.org.au/apni.taxon/118883.html
The resource itself has rdf:type tc:TaxonConcept (defined in the 
ontology to be equivalent to tc:Taxon), well-known because it is part of 
the TDWG ontology.  In these examples, the approach for referring to 
name strings through tc:hasName, the subsequent reference to a name 
record (http://biodiversity.org.au/apni.name/36530), and the structure 
of that name record in RDF 
(http://biodiversity.org.au/apni.name/36530.rdf) follow the approach of 
the TCS standard (as incarnated in the TDWG ontology) very precisely.  I 
can't see anything in these examples that doesn't follow TDWG standards 
and what I know of as "best practices".  Thank you, Paul...  Also we 
have many examples of appropriate HTTP URI GUID use from Pete, although 
not involving tc:Taxon and tn:TaxonName specifically. 
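
To make the content negotiation described above concrete, here is a minimal sketch in Python.  It only shows the decision a server might make from the HTTP Accept header.  The function name negotiate and the media-type-to-suffix table are assumptions of mine for illustration, not a description of how biodiversity.org.au is actually implemented, and a real server would typically answer with a redirect rather than a bare string.

```python
def negotiate(guid_uri, accept_header):
    """Pick a representation URL for a GUID from an HTTP Accept header.

    Hypothetical helper: the suffix table below simply mirrors the
    .rdf/.html pattern seen in the example URIs.
    """
    suffix_for = {
        "application/rdf+xml": ".rdf",
        "text/html": ".html",
    }
    preferences = []
    for position, item in enumerate(accept_header.split(",")):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0].lower()
        quality = 1.0  # default weight when no q parameter is given
        for param in parts[1:]:
            if param.startswith("q="):
                try:
                    quality = float(param[2:])
                except ValueError:
                    quality = 0.0
        # Sort by descending quality, then by order of appearance.
        preferences.append((-quality, position, media_type))
    for neg_quality, _, media_type in sorted(preferences):
        if neg_quality == 0:
            continue  # q=0 means "not acceptable"
        if media_type in suffix_for:
            return guid_uri + suffix_for[media_type]
    # Assumption: fall back to the human-readable page for */* or unknown types.
    return guid_uri + ".html"


guid = "http://biodiversity.org.au/apni.taxon/118883"
print(negotiate(guid, "application/rdf+xml"))
print(negotiate(guid, "text/html,application/xhtml+xml;q=0.9,*/*;q=0.8"))
```

The point of the design is that only the single GUID is ever cited; the .rdf and .html URLs are just representations reached through it.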

Richard Pyle wrote:
> I like your list of what we want "GUIDs" (see below) to do, and I think it's
> an excellent starting point for a bar we should all strive for.  I'm
> particularly grateful to learn that the existing ZooBank service fails so
> many of them.  I've forwarded your post to Rob Whitton, who will be working
> on Gen-2 of ZooBank in the coming weeks, and asked him if we can use your 8
> tests as a metric to adhere to.  
Better yet, read the TDWG GUID Applicability Statement 
http://www.tdwg.org/standards/150/ and http://www.w3.org/TR/cooluris/ .  
My 8 points are just a paraphrase out of my head.  Striving is not good 
enough.  Follow the standard.
>   
>> "But really, from the perspective of the end-user, does it matter
>> if it's an identifier or a service?  Ultimately, they ask the questions,
>> and the answers appear on their computer screens."
>>
>> I would answer this question by saying "yes, it does matter!" -
>> it is important that a well-designed GUID do more than just throw
>> something up onto a human user's web browser.
>>     
>
> I absolutely agree with you, but that's not the distinction I was making in
> my quoted text.  I was only talking about whether we call something an
> "identifier" (not GUID, which has more specific implications), or a
> "service", in the context of human-machine conversations.  I think your
> enumeration of things we want GUIDs to do is a very good framework for
> discussion.  I would only caution that "GUID" means different things to
> different people (some people use it synonymously with UUID, for example),
> and also that GUID does not imply "actionable".  
Again I would say read http://www.tdwg.org/standards/150/ .  When I say 
"GUID" I am not throwing around a colloquial term.  I intend for it to 
have the exact technical meaning that it is given in the TDWG standard.  
At this point in time (i.e. after we finally have a ratified standard on 
GUIDs), nobody in our community has any business designing and exposing 
GUIDs without having read this document and completely understanding its 
requirements and recommendations.  I should not have to be "explaining" 
any of this to anybody on the list.  It is explained clearly and 
concisely in the standard.  I really am somewhat flabbergasted about how 
participants in TDWG, which I think is supposed to be a biodiversity 
standards organization, generally don't seem to read and follow the 
ratified standards.  I think the process could be helped somewhat if the 
TDWG website were cleaned up a bit to make the obsolete stuff less easy 
to find and the important, current stuff easier to find.  Also, I don't 
understand why all important documents aren't linked to the permanent 
URI page (e.g. http://www.tdwg.org/standards/150/) in pdf format.  That 
would allow users to view the page directly in a web browser rather than 
having to open a zip file and then open a Word document. 

> There has been a bit of a
> debate over the importance of embedding "actionability" into identifiers
> inherently (the Tim Berners-Lee perspective)
Wrong.  "GUIDs should be resolvable" (direct quote of recommendation 7 
from the GUID applicability statement).
> , vs thinking about
> "identification" separately from how we perform some action on it.  For
> example, UUIDs and Social Security numbers are extremely useful identifiers,
> even though they are not inherently actionable.  It's amazingly easy to
> perform action on a non-actionable identifier by simply appending it to a
> actionable prefix.  For example, going back to the list of "identifiers":
>
> A. A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
> B. urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
> C.
> http://zoobank.org/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF4
> 1523
> D. http://zoobank.org/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
> E.
> http://lsid.tdwg.org/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5B
> F41523
> F. http://www.google.com/search?q=Danaus+plexippus+(Linnaeus+1758)
> G.
> http://lsid.tdwg.org/summary/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB
> 4-EA8E5BF41523
> H.
> http://darwin.zoology.gla.ac.uk/~rpage/lsid/tester/?q=urn:lsid:zoobank.org:a
> ct:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523&submit=Go
>
> There are two different ways of looking at this:
>
> 1) There are 8 different identifiers
> 2) There is one identifier (A)
A is an identifier but A does not meet the requirement of the GUID 
Applicability statement.  Quote recommendation 2: "HTTP GET resolution 
*must* be provided for non-self resolving GUIDs".  Pick one of your 
proxied HTTP URIs, call it your GUID and stop there.  (Note: the 
emphasis on "must" is present in the standards document, not added by me.)
> , and 6 ways to perform action on it (B-E,
> G-H).
>
> If you treat them all as distinct identifiers, then let me add a few more to
> the list:
>
> I.
> http://zoobank.org/?lsid=urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA
> 8E5BF41523
> J.
> http://zoobank.org/?id=urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E
> 5BF41523
> K. http://zoobank.org/?id=A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
> L. http://zoobank.org/?uuid=A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
>   
Don't add more of them to the list.  Recommendation 3: "Providers *must* 
assign at most one GUID to any particular object."  Recommendation 4: 
"Only one globally unique identifier should be assigned to each object". 
> Note that all four of the above, plus B-D in the original list, are all
> resolved through zoobank.org.  Why are there so many different ways to
> perform action on the "same" identifier? Because I wanted the ZooBank
> resolution service to be flexible. And, because in my mind, there is only
> one identifier (A); and lots of different ways to retrieve the metadata of
> the object it represents.
>   
I would assert that what you "want" and what you have in your mind is at 
odds with the TDWG standard for GUIDs.
> Now consider this from the TB-L perspective. Eleven different identifiers
> for the same object (excluding F).  Does that mean we need to generate
> owl:sameAs statements for all pair-wise relationships?  That's a lot of
> owl:sameAs statements! Even if I'm the bad guy in foolishly allowing so many
> different ways to resolve ZooBank identifiers, and needlessly fabricated so
> many "different" identifiers for the same thing unnecessarily.  Fair enough.
> But I still think we're a lot better off by disentangling identifiers from
> the services we use to perform action on them.
>   
This may be your opinion, but it is at odds with the ratified standard 
which says (recommendation 2) that "HTTP GET resolution *must* be 
provided for non-self-resolving GUIDs". 
> One of the arguments on the TB-L side is that a non-actionable identifier by
> itself is useless if you cannot inherently perform action on it.  For
> example, if you were walking through the park and stumbled upon a slip of
> paper with "A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523" written on it, you
> probably wouldn't be able to do much with it.  But in reality, that's not
> what happens.  We never expose identifiers as a simple context-free
> identifiers in their non-resolvable form.  These identifiers are *always*
> exposed in some context.  The problem is that if you treat the "resolution
> metadata" (as I call it -- e.g., "urn:lsid:zoobank.org:act:" or
> "http://zoobank.org/") as *part* of the identifier (as you have to do if you
> make things like "urn:lsid:ubio.org:namebank:11815"), then it becomes
> difficult for an application to distinguish between
> "http://zoobank.org/?id=A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523", and
> "http://zoobank.org/?uuid=A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523"; which, to a
> human, obviously refers to the same thing.  In other words, absent all those
> owl:sameAs statements, an application could break if it harvests content
> from different sources that use different resolution metadata for the "same"
> (sensu Pyle) identifier.
>   
The problem here is caused by you when you create and expose so many 
different HTTP URI forms of your UUID.  Stop doing that (recommendation 4).
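
To illustrate the burden that multiple exposed forms place on a consuming application, here is a rough Python sketch of the clean-up a harvester would need in order to treat the variants quoted above as one thing.  The choice of the "http://zoobank.org/" + UUID form as the canonical one is purely my assumption for the example, and canonical_guid is a hypothetical helper, not part of any ZooBank service.

```python
import re

# Matches the 8-4-4-4-12 hexadecimal layout of a UUID anywhere in a string.
UUID_PATTERN = re.compile(
    r"[0-9A-F]{8}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{12}",
    re.IGNORECASE,
)

# Assumption: one HTTP URI form has been chosen as "the" GUID.
CANONICAL_PREFIX = "http://zoobank.org/"


def canonical_guid(identifier):
    """Reduce any variant that embeds a UUID to the single chosen HTTP URI.

    Returns None when no UUID can be found in the string.
    """
    match = UUID_PATTERN.search(identifier)
    if match is None:
        return None
    return CANONICAL_PREFIX + match.group(0).upper()


variants = [
    "A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523",
    "urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523",
    "http://zoobank.org/?id=A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523",
    "http://zoobank.org/?uuid=A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523",
]
# All four collapse to one value, so the set has a single member.
print({canonical_guid(v) for v in variants})
```

Every client would have to carry logic like this for every provider, which is exactly the work that exposing a single form avoids.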
> Maybe what we need to think about is a registry of "persistent resolution
> services", which our community relies on.  That way, we can apply the
> owl:sameAs statements to the resolution services, rather than to every
> single individual identifier.
>   
There is no need for this.  Make a single HTTP URI version of your UUID 
and stick with it.  Preferably use one without the query string, and use 
mod_rewrite (or whatever it's called) to transform the simple, clear, and 
permanent version of the URI into whatever flavor of temporary URL you 
are liking at the moment.  Every application today understands HTTP 
GET.  No need for a registry.
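
As a sketch of the rewriting idea, the following Python function maps the path of a clean, permanent URI onto whatever internal query-string URL happens to be in use.  It stands in for a server-side rewrite rule; the function name rewrite, the "/?uuid=" template, and the path layout are all assumptions of mine, not ZooBank's actual configuration.

```python
import re

# Path of the permanent URI: a slash followed by nothing but a UUID.
CLEAN_PATH = re.compile(
    r"^/([0-9A-Fa-f]{8}(?:-[0-9A-Fa-f]{4}){3}-[0-9A-Fa-f]{12})$"
)


def rewrite(path, internal_template="/?uuid={uuid}"):
    """Translate a permanent path into the current internal URL.

    internal_template is a hypothetical stand-in for whatever temporary
    URL flavor the provider uses today; it can change freely without
    touching the public GUID.  Returns None for paths that are not GUIDs.
    """
    match = CLEAN_PATH.match(path)
    if match is None:
        return None
    return internal_template.format(uuid=match.group(1).upper())


print(rewrite("/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523"))
print(rewrite("/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523", "/lookup?id={uuid}"))
```

Only the template changes when the back end changes; the URI that everyone else has stored stays the same.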
>   
>> An important question that I think has been underlying much of this
>>     
> discussion
>   
>> is whether GUIDs are actually needed for names.
>>     
>
> I think the answer is clearly  "yes". The problem is defining what is meant
> by the word "name".
>   
Go with the TCS standard and the TDWG ontology as it exists currently.
>> and parts thereof, then it does make sense to apply GUIDs to that
>> kind of entity.  I am thinking about a tn:TaxonName as defined in the
>> TDWG ontology (see
>>
>>     
> http://code.google.com/p/tdwg-ontology/source/browse/trunk/ontology/voc/Taxo
> nName.rdf),
>   
>> which comes out of the TCS schema (see
>> http://code.google.com/p/darwin-sw/wiki/ClassTaxon for info and links
>>     
> regarding TCS).
>   
>> A tn:TaxonName is "An object that represents a single scientific
>>     
> biological name..." i.e. an "object"
>   
>> NOT defined as a string.
>>     
>
> While it's nice to see the explicit representation of a "name" as an object,
> rather than a string; unfortunately that doesn't address the elephant in the
> room; that is, that different people have different notions of what "a
> single scientific biological name" is.  I'm not talking subtly different
> shades of fundamentally the same thing; I'm talking about fundamentally
> different things with different implied sets of properties. This is one of
> the issues I continued to hammer on during the development of TCS, and the
> one that gave me the biggest qualms about TCS 1.0.  My hope was that it
> would be resolved in TCS 2.0. 
There ain't no TCS 2.0 .  There is only TCS 1.2 .  I'm sorry about it, 
but that's the ratified standard.
> I wanted to reduce both names and concepts to
> the same core entity: usage instances.  That's exactly what we're doing with
> GNUB.
>
>   
There have been any number of things that I would "like" to be the way I 
want.  However, the point of standards is that they get hammered out in 
a form that satisfies the community in a general way.  Individual people 
often are left without everything that they wanted.  From within our own 
personal projects, we can do anything we darn well please.  But when it 
comes to communicating with others, we should discipline ourselves to 
follow the standards.  I understand that for existing systems, there is 
considerable time and money required to retrofit old systems to a new 
standard.  But GNUB is not an "old system".  It is being built from 
scratch and I would assert that where it comes to interfacing it with 
the outside world, it should follow standards such as they exist at the 
moment.  At the moment, people are allowed to think about and describe 
names without reducing them solely to usage instances as you would 
like.  I spent about an hour yesterday composing a rant about how 
counterproductive it is for taxonomy and computer geeks to create tools 
and systems that won't ever actually be used by the people who need 
them.  I decided that it wasn't helpful to actually post it, but now I'm 
thinking that maybe I should have...
>
> That's only true to the extent that tn:TaxonName may be too broadly
> (imprecisely) defined (just like dwc:Taxon).
>
>
>   
dwc:Taxon doesn't really have much of any useful definition, so I'm with 
you there.  tn:TaxonName is actually rather precisely defined, at least 
if you look at the RDF 
(http://code.google.com/p/tdwg-ontology/source/browse/trunk/ontology/voc/TaxonName.rdf) 
and relate it to the TCS documents on which it is based 
(http://www.tdwg.org/standards/117/ , again it would be extremely useful 
to have a pdf version of the User Guide directly linked to that page so 
that people could look at it in their browsers rather than having to 
download a zip archive.  Note also Kennedy et al. 2005 
http://www.springerlink.com/content/7bv5pa3falxwrrvx/ which I found 
helpful for understanding the rationale for TCS).  In my opinion, TCS 
(and by extension, the TDWG ontology) puts a rather restrictive collar 
and leash on taxon names.  I quote from the user guide page 9: 
"<TaxonName> elements do not represent taxa.  They serve only as 
abstract nomenclatural data structures that encapsulate the core rules 
of the different nomenclatural codes.  Their purpose is to prevent 
nomenclatural statements becoming confused with statements about the 
circumscription of, and relationships between, different taxon 
concepts.  No taxonomic opinion can be expressed using <TaxonName> 
elements in TCS.  As a rule of thumb if you are dealing with anything 
beyond a type specimen and references to it, you are talking about a 
TaxonConcept of some form."  This does not seem like a broad and 
imprecise definition to me.  One is allowed to describe the pieces of 
the name and that's about it.

When I look carefully at how the TDWG ontology deals with taxon names 
and taxon concepts, it seems very simple and "usable" to me.  If one 
defines a Taxon to be composed of a name component and a sensu/sec. 
component as several people (including you, I think) on this list have 
done and as TCS has done (I think), then representing it in RDF becomes 
tractable.  One anchors the name part to a tn:TaxonName instance 
(properly collared and chained and wearing a GUID as a dog tag).  How 
one anchors the sensu/sec. part is still a subject for discussion.  I 
have been thinking about the following approach.  It is based on a Venn 
diagram that I have in my head which I created from your descriptions of 
TNUs on this list.  The Venn diagram has a big rectangle labeled 
"nominal taxon".  Inside that is a smaller rectangle named "taxon name 
usage (TNU)".  Inside that is an even smaller rectangle named "taxon 
concept".  In this view, Taxon concepts are well-described/circumscribed 
by a publication.  TNUs (which include taxon concepts) are associated 
with a particular person's idea of what the taxon is, but which may or 
may not be described in a publication.  Nominal taxa are all instances 
of a scientific name use including those where we have no idea who 
applied the name or what set of organisms they intended to be included 
in the taxon.  In terms of RDF metadata:
1. Go ahead and let the rdf:type of the thing be tc:Taxon
2. Make the object of tc:hasName be a GUID (i.e. as described by the 
TDWG GUID Applicability Statement, not some other kind of 
GUID)-identified resource, preferably from a well-known source like uBio.
3. If the sensu/sec. is described in a publication (in my mind a true 
taxon concept), then the object of tc:accordingTo is an HTTP proxied 
DOI, HTTP URI of a BHL-scanned publication, or if both of those fail, 
something non-resolvable but globally-unique like an ISBN or URL of a 
stable web page.
4. If the sensu/sec. is not described in a publication, but is 
associated with a particular person (in my mind a TNU that isn't a true 
taxon concept), then the object of tc:accordingTo could be the URI of a 
foaf:Person or foaf:Group.
5. If the sensu/sec. is completely unknown, then the taxon is a nominal 
taxon that is not a TNU.  I don't know whether it is better for the 
taxon to simply lack a tc:accordingTo property or to have a 
tc:accordingTo property that somehow says "we don't know anything about 
the sensu/sec.". 
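
The five steps above can be sketched in a few lines of Python that emit N-Triples.  This is only an illustration under stated assumptions: the tc: namespace URI is the one I believe the TDWG ontology uses, every example.org URI is a placeholder rather than a real resource, and describe_taxon and to_ntriples are hypothetical helpers.  For case 5 the sketch simply omits tc:accordingTo, which is just one of the two options mentioned.

```python
# Assumed namespace for the TDWG TaxonConcept vocabulary (tc:).
TC = "http://rs.tdwg.org/ontology/voc/TaxonConcept#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def describe_taxon(taxon_uri, name_uri, according_to_uri=None):
    """Build (subject, predicate, object) triples for one taxon.

    according_to_uri may be a publication URI (step 3), a foaf:Person or
    foaf:Group URI (step 4), or None for a nominal taxon (step 5).
    """
    triples = [
        (taxon_uri, RDF_TYPE, TC + "Taxon"),     # step 1
        (taxon_uri, TC + "hasName", name_uri),   # step 2
    ]
    if according_to_uri is not None:
        triples.append((taxon_uri, TC + "accordingTo", according_to_uri))
    return triples


def to_ntriples(triples):
    """Serialize URI-only triples as N-Triples, one statement per line."""
    return "\n".join("<%s> <%s> <%s> ." % triple for triple in triples)


# Placeholder URIs, not real GUIDs.
concept = describe_taxon(
    "http://example.org/taxon/1",
    "http://example.org/name/1",
    "http://example.org/publication/1",
)
nominal = describe_taxon("http://example.org/taxon/2", "http://example.org/name/1")
print(to_ntriples(concept))
print(to_ntriples(nominal))
```

Two taxa that share a tc:hasName object but differ in tc:accordingTo are then distinguishable by any client without re-parsing a name string.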

I realize that you probably aren't going to like this because it isn't 
as sophisticated and nuanced as you would like for your GNUB TNUs to 
be.  However, there would be nothing that would prohibit you from 
creating and adding a myriad of clever properties to the tc:Taxon 
instance RDF to make it do all of the things you want.  The practice I 
have described would break down the act of defining a taxon into  
well-known, standardized pieces, and it is a practice that could 
fairly easily be followed by people without sophisticated IT resources.  
It would allow for the transfer and comparison of taxon information and 
make it possible to reconcile, at some central location (like GNUB), 
the taxa that are described in a distributed network of users.  Doing 
something like this is, I believe, the entire reason for the existence 
of TCS, the TDWG ontology, old TDWG TAG roadmaps, etc.  Please apply 
some self-discipline to follow the ratified standards or risk blowing us 
all back to 2005 where we would have to discuss all of the settled 
things again.  If that is going to happen, I will give up on TDWG 
because I'll be retired before it is done over again.

In some ways what I'm talking about here is really (as I understand it) 
the principle that underlies REST.  Within your big GNUB kingdom and my 
little Bioimages kingdom, we are free to do whatever clever things we 
want, structure databases as we wish, do clever programming stuff or 
whatever.  But when you and I talk, we follow commonly established 
rules, namely we talk using the HTTP protocol and identify the things 
that we want to talk about using HTTP URIs.  Since we are talking 
specifically about biodiversity informatics, we should choose to follow 
more restrictive rules about the identifiers themselves (following the 
TDWG GUID applicability statement) and the nature of the RDF (following 
the GUID applicability statement, well-known vocabularies such as the 
TDWG ontology, FOAF, DCMI, Darwin Core, geo, etc.).  If we fail to do 
that, then every interaction that I have with another entity requires me 
to establish in advance the rules of that interaction.  The Web works 
well because people follow a defined set of rules about URLs and HTML.  
I would assert that we now (at last) have a similar model available to 
us in the biodiversity informatics community if organizations would just 
have the self-discipline to use it. 

Roderic Page wrote:
> Reading this thread makes me despair. It's as if we are determined not to make progress, forever debating identifiers and what they identify, with seemingly little hope of resolution, and no clear vision of what the goals are. We wallow in acronym soup, and enjoy the technical challenges, but don't actually get anywhere
I have to say that I'm not as pessimistic as Rod is.  Maybe that's just 
because I haven't been involved in the process as long as he has and 
haven't had sufficient time to develop appropriate cynicism.  But I 
think there has been real progress, even in the couple years I've been 
tracking TDWG.  We DO have a GUID Applicability Statement Standard now.  
We DO have a Darwin Core standard that defines terms which could be used 
to describe properties of biodiversity resources.  We DO have DOIs that 
are HTTP proxied and which return real metadata.  We DO have people in 
our community who know how to write RDF and set up content negotiation 
for GUIDs as described in standards and best practices.  I would also 
say that we do have a relatively clear vision of what the goals are.  
When I look at the old TAG roadmaps from 2006-2008
http://www.tdwg.org/uploads/media/TAG_Roadmap_01.doc (2006)
http://www.tdwg.org/fileadmin/subgroups/tag/TAG_Roadmap_2007_final.pdf 
(2007)
http://www.tdwg.org/fileadmin/subgroups/tag/TAG_Roadmap_2008.pdf (2008)
the goals laid out there are the same ones I hear people talking about 
now.  The difference is that we now have the tools and standards to do 
what was desired in 2006-8.  We also have a funded project (BiSciCol) 
that is making progress toward developing a system that will track when 
changes occur in metadata for resources that are described by GUIDs.  So 
I'm actually pretty optimistic about the whole venture assuming that we 
can get people and organizations to actually read and try to follow the 
standards that we have already agreed upon. 

Steve

-- 
Steven J. Baskauf, Ph.D., Senior Lecturer
Vanderbilt University Dept. of Biological Sciences

postal mail address:
VU Station B 351634
Nashville, TN  37235-1634,  U.S.A.

delivery address:
2125 Stevenson Center
1161 21st Ave., S.
Nashville, TN 37235

office: 2128 Stevenson Center
phone: (615) 343-4582,  fax: (615) 343-6707
http://bioimages.vanderbilt.edu
