Comments regarding several emails inline:

Richard Pyle wrote:

By contrast, the core object in GNUB is a taxon name usage instance -- which
is a purely abstract notion of the usage of a taxon name within some
documentation source (like a publication).  In this case, the text-string
name is merely a property of the GUID-identified object, and would be an
extremely BAD choice to use as a unique identifier.

It is possible that I'm not understanding what you are saying here, but if you are saying that the only name-related property of your GNUB taxon instances will be one which has a name string literal as its object, then I think that is a big mistake. That will require any client using your taxon instance metadata to re-process the literal name string to cross reference it with lexical variants, parse it into its pieces, etc. That should only need to be done once and then referenced via a GUID for the name (i.e. in the sense of tn:TaxonName).

This is why GNUB needs
to generate a unique identifier to represent this core data object.  The
form that identifier takes (UUID, LSID, integer, DOI, whatever) from the
perspective of the end user should be completely irrelevant, because the
user should rarely (if ever) see it, and should certainly *never* be in a
position to type it on a keyboard (we can discuss the appearance of ZooBank
LSIDs on printed pages separately).

OK, again maybe I'm not understanding what you are saying here, but if you are saying that you don't intend to expose your unique GNUB identifiers to the public, then as far as I'm concerned you are setting up GNUB to be irrelevant from the start. You mention a number of cool taxonomist-geek type things that you hope to accomplish with GNUB. But from my perspective as a non-taxonomist-geek, the main purpose I have for GNUB is as a place to anchor dwc:Identification instances so that I can indicate whether my identified resource is a representative of the same taxon that is being referred to by somebody else (or at least to make it possible for somebody to figure that out via computery cleverness, Semantic Web or otherwise). How am I going to do that if you don't provide me with a good (i.e. meeting the 8 criteria of my last email) GUID to use as the object of my dwc:Identification properties? For over a year, I've heard you lament that the whole problem is that people make identifications and don't indicate the sensu/sec. reference for the names they use. You are now creating a system that would allow people to unambiguously make it clear what taxon they mean but you aren't giving them any way to say what it is? Again, I may just be misunderstanding what you wrote here.

Kevin Richards wrote:

Oh, now that I have read Rich's email here, it seems we are in agreement, of sorts.  I think there is obviously a need for both of these "identifier" approaches - ie a record based ID that no one should really ever need to interact with directly, and a human friendly "ID" that allows people to discuss the same semantic "thing".

Yes. This "record based ID" can be anything you want. I really don't and shouldn't have to care about that. The "human friendly ID that allows people to discuss the same semantic thing" is precisely what the TDWG GUID Applicability Statement (a ratified TDWG standard, thanks to Kevin) is talking about. As I read that standard, I don't see any requirement that a GUID be "human friendly", but I would consider "human friendliness" to be a desirable "best practice" (influenced somewhat by http://www.w3.org/Provider/Style/URI and http://www.w3.org/TR/cooluris/) - if we have a choice of creating externally exposed GUIDs that are either human-friendly or not human-friendly, and if either works equally well, why not choose ones that are human-friendly?

It is interesting all this discussion of identifiers when in the end it doesn’t matter too much what the identifier is, just that you have an identifier at all.  The important thing is the semantics, the "are we talking about the same thing" question - so this is where I believe RDF/semantic web comes in - I might see if I can come up with some RDF/sem web example for TDWG that could demonstrate this, hmmm...

Already done in the context of tc:Taxon and tn:TaxonName and posted on this list in January: http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002204.html . Consider
http://biodiversity.org.au/apni.taxon/118883
which is an identifier that is friendly to both humans and computers. Through content negotiation a computer gets
http://biodiversity.org.au/apni.taxon/118883.rdf
and the human gets
http://biodiversity.org.au/apni.taxon/118883.html
The resource itself has rdf:type tc:TaxonConcept (defined in the ontology to be equivalent to tc:Taxon), well-known because it is part of the TDWG ontology. In these examples, the approach for referring to name strings through tc:hasName, the subsequent reference to a name record (http://biodiversity.org.au/apni.name/36530), and the structure of that name record in RDF (http://biodiversity.org.au/apni.name/36530.rdf) follow the approach of the TCS standard (as incarnated in the TDWG ontology) very precisely. I can't see anything in these examples that doesn't follow TDWG standards and what I know of as "best practices". Thank you, Paul... Also we have many examples of appropriate HTTP URI GUID use from Pete, although not involving tc:Taxon and tn:TaxonName specifically.
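
To make the content negotiation step concrete, here is a minimal sketch of how a server might choose between the two representations. It is only an illustration under my own assumptions: the function name negotiate is made up, the ".rdf"/".html" suffixes simply mirror the URLs above, and a real server would normally answer with an HTTP redirect rather than return a string.

```python
def negotiate(uri: str, accept_header: str) -> str:
    """Pick a representation URL for a generic resource URI (hypothetical helper)."""
    # Parse items like "text/html;q=0.9" into (q, media_type) pairs.
    prefs = []
    for item in accept_header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0].lower()
        q = 1.0  # q defaults to 1.0 when the client gives no weight
        for p in parts[1:]:
            if p.startswith("q="):
                try:
                    q = float(p[2:])
                except ValueError:
                    q = 0.0
        if media_type:
            prefs.append((q, media_type))
    # Highest q first; the sort is stable, so ties keep header order.
    prefs.sort(key=lambda pair: pair[0], reverse=True)
    for q, media_type in prefs:
        if q <= 0:
            continue
        if media_type == "application/rdf+xml":
            return uri + ".rdf"
        if media_type in ("text/html", "application/xhtml+xml", "*/*"):
            return uri + ".html"
    return uri + ".html"  # assumption: fall back to the human-readable page


taxon = "http://biodiversity.org.au/apni.taxon/118883"
print(negotiate(taxon, "application/rdf+xml"))
print(negotiate(taxon, "text/html,application/xhtml+xml;q=0.9,*/*;q=0.8"))
```

The point is that the client only ever needs the one generic URI; which document comes back depends on what it says it can accept.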

Richard Pyle wrote:

I like your list of what we want "GUIDs" (see below) to do, and I think it's
an excellent starting point for a bar we should all strive for.  I'm
particularly grateful to learn that the existing ZooBank service fails so
many of them.  I've forwarded your post to Rob Whitton, who will be working
on Gen-2 of ZooBank in the coming weeks, and asked him if we can use your 8
tests as a metric to adhere to.

Better yet, read the TDWG GUID Applicability Statement http://www.tdwg.org/standards/150/ and http://www.w3.org/TR/cooluris/ . My 8 points are just a paraphrase out of my head. Striving is not good enough. Follow the standard.

"But really, from the perspective of the end-user, does it matter
if it's an identifier or a service?  Ultimately, they ask the questions,
and the answers appear on their computer screens."

I would answer this question by saying "yes, it does matter!" -
it is important that a well-designed GUID do more than just throw
something up onto a human user's web browser.


I absolutely agree with you, but that's not the distinction I was making in
my quoted text.  I was only talking about whether we call something an
"identifier" (not GUID, which has more specific implications), or a
"service", in the context of human-machine conversations.  I think your
enumeration of things we want GUIDs to do is a very good framework for
discussion.  I would only caution that "GUID" means different things to
different people (some people use it synonymously with UUID, for example),
and also that GUID does not imply "actionable".

Again I would say read http://www.tdwg.org/standards/150/ . When I say "GUID" I am not throwing around a colloquial term. I intend for it to have the exact technical meaning that it is given in the TDWG standard. At this point in time (i.e. after we finally have a ratified standard on GUIDs), nobody in our community has any business designing and exposing GUIDs without having read this document and completely understanding its requirements and recommendations. I should not have to be "explaining" any of this to anybody on the list. It is explained clearly and concisely in the standard. I really am somewhat flabbergasted about how participants in TDWG, which I think is supposed to be a biodiversity standards organization, generally don't seem to read and follow the ratified standards. I think the process could be helped somewhat if the TDWG website were cleaned up a bit to make the obsolete stuff less easy to find and the important, current stuff easier to find. Also, I don't understand why all important documents aren't linked to the permanent URI page (e.g. http://www.tdwg.org/standards/150/) in pdf format. That would allow users to view the page directly in a web browser rather than having to open a zip file and then open a Word document.

There has been a bit of a
debate over the importance of embedding "actionability" into identifiers
inherently (the Tim Berners-Lee perspective)

Wrong. "GUIDs should be resolvable" (direct quote of recommendation 7 from the GUID applicability statement).

, vs thinking about
"identification" separately from how we perform some action on it.  For
example, UUIDs and Social Security numbers are extremely useful identifiers,
even though they are not inherently actionable.  It's amazingly easy to
perform action on a non-actionable identifier by simply appending it to a
actionable prefix.  For example, going back to the list of "identifiers":

A. A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
B. urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
C. http://zoobank.org/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
D. http://zoobank.org/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
E. http://lsid.tdwg.org/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
F. http://www.google.com/search?q=Danaus+plexippus+(Linnaeus+1758)
G. http://lsid.tdwg.org/summary/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
H. http://darwin.zoology.gla.ac.uk/~rpage/lsid/tester/?q=urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523&submit=Go

There are two different ways of looking at this:

1) There are 8 different identifiers
2) There is one identifier (A)

A is an identifier but A does not meet the requirement of the GUID Applicability statement. Quote recommendation 2: "HTTP GET resolution must be provided for non-self resolving GUIDs". Pick one of your proxied HTTP URIs, call it your GUID and stop there. (Note: the emphasis on "must" is present in the standards document, not added by me.)

, and 6 ways to perform action on it (B-E,
G-H).

If you treat them all as distinct identifiers, then let me add a few more to
the list:

I. http://zoobank.org/?lsid=urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
J. http://zoobank.org/?id=urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
K. http://zoobank.org/?id=A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
L. http://zoobank.org/?uuid=A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523

Don't add more of them to the list. Recommendation 3: "Providers must assign at most one GUID to any particular object." Recommendation 4: "Only one globally unique identifier should be assigned to each object".

Note that all four of the above, plus B-D in the original list, are all
resolved through zoobank.org.  Why are there so many different ways to
perform action on the "same" identifier? Because I wanted the ZooBank
resolution service to be flexible. And, because in my mind, there is only
one identifier (A); and lots of different ways to retrieve the metadata of
the object it represents.

I would assert that what you "want" and what you have in your mind is at odds with the TDWG standard for GUIDs.

Now consider this from the TB-L perspective. Eleven different identifiers
for the same object (excluding F).  Does that mean we need to generate
owl:sameAs statements for all pair-wise relationships?  That's a lot of
owl:sameAs statements! Even if I'm the bad guy in foolishly allowing so many
different ways to resolve ZooBank identifiers, and needlessly fabricated so
many "different" identifiers for the same thing unnecessarily.  Fair enough.
But I still think we're a lot better off by disentangling identifiers from
the services we use to perform action on them.

This may be your opinion, but it is at odds with the ratified standard which says (recommendation 2) that "HTTP GET resolution must be provided for non-self-resolving GUIDs".
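
For a sense of scale, the pairwise owl:sameAs bookkeeping mentioned above can be counted with a few lines of Python. This is just a sketch: the function name is made up, the identifier strings are placeholders for the eleven forms (A-E and G-L), and each pair is counted once regardless of direction.

```python
from itertools import combinations


def same_as_statements(identifiers):
    """Return one (subject, 'owl:sameAs', object) triple per unordered pair."""
    return [(a, "owl:sameAs", b) for a, b in combinations(identifiers, 2)]


# Placeholders standing in for the eleven identifier forms, F excluded.
forms = [f"form-{letter}" for letter in "ABCDEGHIJKL"]
print(len(same_as_statements(forms)))  # 55
```

With a single exposed GUID per object the list has one member and the count drops to zero.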

One of the arguments on the TB-L side is that a non-actionable identifier by
itself is useless if you cannot inherently perform action on it.  For
example, if you were walking through the park and stumbled upon a slip of
paper with "A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523" written on it, you
probably wouldn't be able to do much with it.  But in reality, that's not
what happens.  We never expose identifiers as a simple context-free
identifiers in their non-resolvable form.  These identifiers are *always*
exposed in some context.  The problem is that if you treat the "resolution
metadata" (as I call it -- e.g., "urn:lsid:zoobank.org:act:" or
"http://zoobank.org/") as *part* of the identifier (as you have to do if you
make things like "urn:lsid:ubio.org:namebank:11815"), then it becomes
difficult for an application to distinguish between
"http://zoobank.org/?id=A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523", and
"http://zoobank.org/?uuid=A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523"; which, to a
human, obviously refers to the same thing.  In other words, absent all those
owl:sameAs statements, an application could break if it harvests content
from different sources that use different resolution metadata for the "same"
(sensu Pyle) identifier.

The problem here is caused by you when you create and expose so many different HTTP URI forms of your UUID. Stop doing that (recommendation 4).

Maybe what we need to think about is a registry of "persistent resolution
services", which our community relies on.  That way, we can apply the
owl:sameAs statements to the resolution services, rather than to every
single individual identifier.

There is no need for this. Make a single HTTP URI version of your UUID and stick with it. Preferably choose one without a query string, and use mod_rewrite (or whatever it's called) to transform the simple, clear, and permanent version of the URI into whatever flavor of temporary URL you are liking at the moment. Every application today understands HTTP GET. No need for a registry.
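
As a rough illustration of "pick one and stick with it", the sketch below collapses any of the forms A-E and G-L listed earlier to a single HTTP URI. It is written in Python rather than as a rewrite rule, and it rests on assumptions of mine: that form D (http://zoobank.org/ followed by the UUID) is the one chosen as the GUID, and that pulling the UUID out with a regular expression is acceptable. The names UUID_RE, CANONICAL_PREFIX and canonical_uri are made up for the example.

```python
import re

# Matches the 8-4-4-4-12 hexadecimal layout of a UUID.
UUID_RE = re.compile(
    r"[0-9A-Fa-f]{8}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{12}"
)

# Assumption: form D is the one form that gets exposed as the GUID.
CANONICAL_PREFIX = "http://zoobank.org/"


def canonical_uri(identifier: str):
    """Return the single canonical HTTP URI, or None if no UUID is found."""
    match = UUID_RE.search(identifier)
    if match is None:
        return None
    return CANONICAL_PREFIX + match.group(0).upper()


print(canonical_uri("http://zoobank.org/?uuid=A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523"))
print(canonical_uri("urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523"))
```

A server could use the same mapping to redirect old URL forms to the permanent one, and a harvester could use it to clean up identifiers that are already in circulation.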

An important question that I think has been underlying much of this
discussion is whether GUIDs are actually needed for names.


I think the answer is clearly  "yes". The problem is defining what is meant
by the word "name".

Go with the TCS standard and the TDWG ontology as it exists currently.

and parts thereof, then it does make sense to apply GUIDs to that
kind of entity.  I am thinking about a tn:TaxonName as defined in the
TDWG ontology (see
http://code.google.com/p/tdwg-ontology/source/browse/trunk/ontology/voc/TaxonName.rdf),
which comes out of the TCS schema (see
http://code.google.com/p/darwin-sw/wiki/ClassTaxon for info and links
regarding TCS).

A tn:TaxonName is "An object that represents a single scientific
biological name..." i.e. an "object" NOT defined as a string.


While it's nice to see the explicit representation of a "name" as an object,
rather than a string; unfortunately that doesn't address the elephant in the
room; that is, that different people have different notions of what "a
single scientific biological name" is.  I'm not talking subtly different
shades of fundamentally the same thing; I'm talking about fundamentally
different things with different implied sets of properties. This is one of
the issues I continued to hammer on during the development of TCS, and the
one that gave me the biggest qualms about TCS 1.0.  My hope was that it
would be resolved in TCS 2.0.

There ain't no TCS 2.0 . There is only TCS 1.2 . I'm sorry about it, but that's the ratified standard.

I wanted to reduce both names and concepts to
the same core entity: usage instances.  That's exactly what we're doing with
GNUB.

There have been any number of things that I would "like" to be the way I want. However, the point of standards is that they get hammered out in a form that satisfies the community in a general way. Individual people often are left without everything that they wanted. From within our own personal projects, we can do anything we darn well please. But when it comes to communicating with others, we should discipline ourselves to follow the standards. I understand that for existing systems, there is considerable time and money required to retrofit old systems to a new standard. But GNUB is not an "old system". It is being built from scratch and I would assert that where it comes to interfacing it with the outside world, it should follow standards such as they exist at the moment. At the moment, people are allowed to think about and describe names without reducing them solely to usage instances as you would like. I spent about an hour yesterday composing a rant about how counterproductive it is for taxonomy and computer geeks to create tools and systems that won't ever actually be used by the people who need them. I decided that it wasn't helpful to actually post it, but now I'm thinking that maybe I should have...


That's only true to the extent that tn:TaxonName may be too broadly
(imprecisely) defined (just like dwc:Taxon).

dwc:Taxon doesn't really have much of any useful definition, so I'm with you there. tn:TaxonName is actually rather precisely defined, at least if you look at the RDF (http://code.google.com/p/tdwg-ontology/source/browse/trunk/ontology/voc/TaxonName.rdf) and relate it to the TCS documents on which it is based (http://www.tdwg.org/standards/117/ , again it would be extremely useful to have a pdf version of the User Guide directly linked to that page so that people could look at it in their browsers rather than having to download a zip archive. Note also Kennedy et al. 2005 http://www.springerlink.com/content/7bv5pa3falxwrrvx/ which I found helpful for understanding the rationale for TCS). In my opinion, TCS (and by extension, the TDWG ontology) puts a rather restrictive collar and leash on taxon names. I quote from the user guide page 9: "<TaxonName> elements do not represent taxa. They serve only as abstract nomenclatural data structures that encapsulate the core rules of the different nomenclatural codes. Their purpose is to prevent nomenclatural statements becoming confused with statements about the circumscription of, and relationships between, different taxon concepts. No taxonomic opinion can be expressed using <TaxonName> elements in TCS. As a rule of thumb if you are dealing with anything beyond a type specimen and references to it, you are talking about a TaxonConcept of some form." This does not seem like a broad and imprecise definition to me. One is allowed to describe the pieces of the name and that's about it.

When I look carefully at how the TDWG ontology deals with taxon names and taxon concepts, it seems very simple and "usable" to me. If one defines a Taxon to be composed of a name component and a sensu/sec. component as several people (including you, I think) on this list have done and as TCS has done (I think), then representing it in RDF becomes tractable. One anchors the name part to a tn:TaxonName instance (properly collared and chained and wearing a GUID as a dog tag). How one anchors the sensu/sec. part is still a subject for discussion. I have been thinking about the following approach. It is based on a Venn diagram that I have in my head which I created from your descriptions of TNUs on this list. The Venn diagram has a big rectangle labeled "nominal taxon". Inside that is a smaller rectangle named "taxon name usage (TNU)". Inside that is an even smaller rectangle named "taxon concept". In this view, taxon concepts are well-described/circumscribed by a publication. TNUs (which include taxon concepts) are associated with a particular person's idea of what the taxon is, but which may or may not be described in a publication. Nominal taxa are all instances of a scientific name use including those where we have no idea who applied the name or what set of organisms they intended to be included in the taxon. In terms of RDF metadata:
1. Go ahead and let the rdf:type of the thing be tc:Taxon
2. Make the object of tc:hasName be a resource identified by a GUID (i.e. a GUID as described by the TDWG GUID Applicability Statement, not some other kind of GUID), preferably from a well-known source like uBio.
3. If the sensu/sec. is described in a publication (in my mind a true taxon concept), then the object of tc:accordingTo is an HTTP proxied DOI, HTTP URI of a BHL-scanned publication, or if both of those fail, something non-resolvable but globally-unique like an ISBN or URL of a stable web page.
4. If the sensu/sec. is not described in a publication, but is associated with a particular person (in my mind a TNU that isn't a true taxon concept), then the object of tc:accordingTo could be the URI of a foaf:Person or foaf:Group.
5. If the sensu/sec. is completely unknown, then the taxon is a nominal taxon that is not a TNU. I don't know whether it is better for the taxon to simply lack a tc:accordingTo property or to have a tc:accordingTo property that somehow says "we don't know anything about the sensu/sec.".
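
The five points above can be sketched as a few lines of Python that emit N-Triples. Treat this only as an illustration under stated assumptions: the tc: namespace URI is the one I believe the TDWG ontology uses, the function names are made up, the taxon, publication and person URIs are hypothetical placeholders, and the APNI name URI is borrowed from the earlier example simply as a stand-in for a GUID-identified name. Case 5 is handled by omitting tc:accordingTo, which is only one of the two options mentioned.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
# Assumption: namespace of the TDWG ontology's TaxonConcept vocabulary.
TC = "http://rs.tdwg.org/ontology/voc/TaxonConcept#"


def describe_taxon(taxon_uri, name_uri, according_to=None):
    """Build (subject, predicate, object) triples for points 1-5."""
    triples = [
        (taxon_uri, RDF_TYPE, TC + "Taxon"),    # point 1
        (taxon_uri, TC + "hasName", name_uri),  # point 2
    ]
    if according_to is not None:
        # Points 3 and 4: a publication URI, or the URI of a person or group.
        triples.append((taxon_uri, TC + "accordingTo", according_to))
    # Point 5: sensu/sec. unknown, so no tc:accordingTo triple is added.
    return triples


def to_ntriples(triples):
    """Serialize URI-only triples, one N-Triples statement per line."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


name = "http://biodiversity.org.au/apni.name/36530"  # stand-in name GUID
print(to_ntriples(describe_taxon(
    "http://example.org/taxon/1",             # hypothetical taxon URI
    name,
    "http://dx.doi.org/10.9999/example",      # hypothetical proxied DOI
)))
print(to_ntriples(describe_taxon("http://example.org/taxon/2", name)))
```

Anything more sophisticated can be layered on as additional properties of the same tc:Taxon instance without disturbing these core triples.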

I realize that you probably aren't going to like this because it isn't as sophisticated and nuanced as you would like for your GNUB TNUs to be. However, there would be nothing that would prohibit you from creating and adding a myriad of clever properties to the tc:Taxon instance RDF to make it do all of the things you want. The practice I have described would break down the act of defining a taxon into well-known, standardized pieces and it is a practice that could fairly easily be followed by people without sophisticated IT resources. It would allow for the transfer and comparison of taxon information and make it possible to reconcile at some central location (like GNUB) the taxa that are described by a distributed network of users. Doing something like this is, I believe, the entire reason for the existence of TCS, the TDWG ontology, old TDWG TAG roadmaps, etc. Please apply some self-discipline to follow the ratified standards or risk blowing us all back to 2005 where we would have to discuss all of the settled things again. If that is going to happen, I will give up on TDWG because I'll be retired before it is done over again.

In some ways what I'm talking about here is really (as I understand it) the principle that underlies REST. Within your big GNUB kingdom and my little Bioimages kingdom, we are free to do whatever clever things we want, structure databases as we wish, do clever programming stuff or whatever. But when you and I talk, we follow commonly established rules, namely we talk using the HTTP protocol and identify the things that we want to talk about using HTTP URIs. Since we are talking specifically about biodiversity informatics, we should choose to follow more restrictive rules about the identifiers themselves (following the TDWG GUID applicability statement) and the nature of the RDF (following the GUID applicability statement, well-known vocabularies such as the TDWG ontology, FOAF, DCMI, Darwin Core, geo, etc.). If we fail to do that, then every interaction that I have with another entity requires me to establish in advance the rules of that interaction. The Web works well because people follow a defined set of rules about URLs and HTML. I would assert that we now (at last) have a similar model available to us in the biodiversity informatics community if organizations would just have the self-discipline to use it.

Roderic Page wrote:

Reading this thread makes me despair. It's as if we are determined not to make progress, forever debating identifiers and what they identify, with seemingly little hope of resolution, and no clear vision of what the goals are. We wallow in acronym soup, and enjoy the technical challenges, but don't actually get anywhere

I have to say that I'm not as pessimistic as Rod is. Maybe that's just because I haven't been involved in the process as long as he has and haven't had sufficient time to develop appropriate cynicism. But I think there has been real progress, even in the couple of years I've been tracking TDWG. We DO have a GUID Applicability Statement Standard now. We DO have a Darwin Core standard that defines terms which could be used to describe properties of biodiversity resources. We DO have DOIs that are HTTP proxied and which return real metadata. We DO have people in our community who know how to write RDF and set up content negotiation for GUIDs as described in standards and best practices. I would also say that we do have a relatively clear vision of what the goals are. When I look at the old TAG roadmaps from 2006-2008
http://www.tdwg.org/uploads/media/TAG_Roadmap_01.doc (2006)
http://www.tdwg.org/fileadmin/subgroups/tag/TAG_Roadmap_2007_final.pdf (2007)
http://www.tdwg.org/fileadmin/subgroups/tag/TAG_Roadmap_2008.pdf (2008)
the goals laid out there are the same ones I hear people talking about now. The difference is that we now have the tools and standards to do what was desired in 2006-8. We also have a funded project (BiSciCol) that is making progress toward developing a system that will track when changes occur in metadata for resources that are described by GUIDs. So I'm actually pretty optimistic about the whole venture assuming that we can get people and organizations to actually read and try to follow the standards that we have already agreed upon.

Steve

-- 
Steven J. Baskauf, Ph.D., Senior Lecturer
Vanderbilt University Dept. of Biological Sciences

postal mail address:
VU Station B 351634
Nashville, TN  37235-1634,  U.S.A.

delivery address:
2125 Stevenson Center
1161 21st Ave., S.
Nashville, TN 37235

office: 2128 Stevenson Center
phone: (615) 343-4582,  fax: (615) 343-6707
http://bioimages.vanderbilt.edu