Hi Steve,
Excellent post!
I like your list of what we want "GUIDs" (see below) to do, and I think it's an excellent starting point for a bar we should all strive for. I'm particularly grateful to learn that the existing ZooBank service fails so many of them. I've forwarded your post to Rob Whitton, who will be working on Gen-2 of ZooBank in the coming weeks, and asked him if we can use your 8 tests as a metric to adhere to. Watch this space.
Meanwhile...
"But really, from the perspective of the end-user, does it matter if it's an identifier or a service? Ultimately, they ask the questions, and the answers appear on their computer screens."
I would answer this question by saying "yes, it does matter!" - it is important that a well-designed GUID do more than just throw something up onto a human user's web browser.
I absolutely agree with you, but that's not the distinction I was making in my quoted text. I was only talking about whether we call something an "identifier" (not GUID, which has more specific implications), or a "service", in the context of human-machine conversations. I think your enumeration of things we want GUIDs to do is a very good framework for discussion. I would only caution that "GUID" means different things to different people (some people use it synonymously with UUID, for example), and also that GUID does not imply "actionable". There has been a bit of a debate over the importance of embedding "actionability" into identifiers inherently (the Tim Berners-Lee perspective), vs thinking about "identification" separately from how we perform some action on it. For example, UUIDs and Social Security numbers are extremely useful identifiers, even though they are not inherently actionable. It's amazingly easy to perform action on a non-actionable identifier by simply appending it to an actionable prefix. For example, going back to the list of "identifiers":
A. A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
B. urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
C. http://zoobank.org/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
D. http://zoobank.org/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
E. http://lsid.tdwg.org/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
F. http://www.google.com/search?q=Danaus+plexippus+(Linnaeus+1758)
G. http://lsid.tdwg.org/summary/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
H. http://darwin.zoology.gla.ac.uk/~rpage/lsid/tester/?q=urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523&submit=Go
There are two different ways of looking at this:
1) There are 8 different identifiers
2) There is one identifier (A), and 6 ways to perform action on it (B-E, G-H).
If you treat them all as distinct identifiers, then let me add a few more to the list:
I. http://zoobank.org/?lsid=urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
J. http://zoobank.org/?id=urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
K. http://zoobank.org/?id=A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
L. http://zoobank.org/?uuid=A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
Note that all four of the above, plus B-D in the original list, are resolved through zoobank.org. Why are there so many different ways to perform action on the "same" identifier? Because I wanted the ZooBank resolution service to be flexible. And because, in my mind, there is only one identifier (A), and lots of different ways to retrieve the metadata of the object it represents.
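To make the "one identifier, many actions" view concrete, here is a minimal sketch in Python. The function name is made up for illustration, and the prefixes are simply the zoobank.org forms from the lists in this message; this is not how the ZooBank service itself is implemented.

```python
# Hypothetical illustration: one bare identifier (A), and several
# "actionable" forms built by prepending resolution metadata to it.
BARE_ID = "A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523"

def actionable_forms(bare_id):
    """Return the zoobank.org forms B-D and I-L for a bare UUID."""
    lsid = "urn:lsid:zoobank.org:act:" + bare_id
    return [
        lsid,                                   # B
        "http://zoobank.org/" + lsid,           # C
        "http://zoobank.org/" + bare_id,        # D
        "http://zoobank.org/?lsid=" + lsid,     # I
        "http://zoobank.org/?id=" + lsid,       # J
        "http://zoobank.org/?id=" + bare_id,    # K
        "http://zoobank.org/?uuid=" + bare_id,  # L
    ]

forms = actionable_forms(BARE_ID)
for form in forms:
    print(form)
```

Every string produced ends with the same bare identifier; only the resolution metadata in front of it changes.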
Now consider this from the TB-L perspective. Eleven different identifiers for the same object (excluding F). Does that mean we need to generate owl:sameAs statements for all pair-wise relationships? That's a lot of owl:sameAs statements! Maybe I'm the bad guy for foolishly allowing so many different ways to resolve ZooBank identifiers, and for needlessly fabricating so many "different" identifiers for the same thing. Fair enough. But I still think we're a lot better off by disentangling identifiers from the services we use to perform action on them.
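For a rough sense of scale, here is a small Python sketch that counts the pair-wise relationships among the eleven forms (A-L, excluding F). It assumes one statement per unordered pair; asserting each pair in both directions would double the number.

```python
from itertools import combinations

# The eleven forms A-L from this message, leaving out F (the Google search).
labels = [c for c in "ABCDEFGHIJKL" if c != "F"]

# One owl:sameAs statement per unordered pair of forms.
pairs = list(combinations(labels, 2))

print(len(labels), "forms ->", len(pairs), "pair-wise statements")
```

That works out to n(n-1)/2 = 55 statements for a single object, and the count is repeated for every object that is exposed in all eleven forms.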
One of the arguments on the TB-L side is that a non-actionable identifier by itself is useless if you cannot inherently perform action on it. For example, if you were walking through the park and stumbled upon a slip of paper with "A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523" written on it, you probably wouldn't be able to do much with it. But in reality, that's not what happens. We never expose identifiers as simple context-free identifiers in their non-resolvable form. These identifiers are *always* exposed in some context. The problem is that if you treat the "resolution metadata" (as I call it -- e.g., "urn:lsid:zoobank.org:act:" or "http://zoobank.org/") as *part* of the identifier (as you have to do if you make things like "urn:lsid:ubio.org:namebank:11815"), then it becomes difficult for an application to distinguish between "http://zoobank.org/?id=A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523", and "http://zoobank.org/?uuid=A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523"; which, to a human, obviously refer to the same thing. In other words, absent all those owl:sameAs statements, an application could break if it harvests content from different sources that use different resolution metadata for the "same" (sensu Pyle) identifier.
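The failure mode is easy to show with a small Python sketch. A naive string comparison treats the two URLs as different things, while pulling out the embedded UUID shows they are the "same". The helper name and the regular expression are my own illustration, and the trick only works where the bare identifier happens to be a UUID.

```python
import re

# Matches the usual 8-4-4-4-12 hexadecimal UUID layout.
UUID_RE = re.compile(r"[0-9A-Fa-f]{8}(?:-[0-9A-Fa-f]{4}){3}-[0-9A-Fa-f]{12}")

def bare_identifier(text):
    """Return the first embedded UUID (upper-cased), or None if there is none."""
    match = UUID_RE.search(text)
    return match.group(0).upper() if match else None

a = "http://zoobank.org/?id=A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523"
b = "http://zoobank.org/?uuid=A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523"

print(a == b)                                    # compared as whole strings
print(bare_identifier(a) == bare_identifier(b))  # compared as bare identifiers
```

An identifier like "urn:lsid:ubio.org:namebank:11815" contains no UUID, so this helper returns None for it; pattern matching alone cannot separate the identifier from its resolution metadata in that case.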
Maybe what we need to think about is a registry of "persistent resolution services", which our community relies on. That way, we can apply the owl:sameAs statements to the resolution services, rather than to every single individual identifier.
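Here is one way such a registry might look, as a minimal Python sketch. The registry contents, the function name, and the prefix-stripping approach are all assumptions made for illustration; no such community registry exists in this form as far as this message is concerned.

```python
# Hypothetical registry of resolution-service prefixes. More specific
# prefixes are listed before the shorter prefixes they extend.
REGISTRY = [
    "http://zoobank.org/?uuid=",
    "http://zoobank.org/?lsid=",
    "http://zoobank.org/?id=",
    "http://lsid.tdwg.org/summary/",
    "http://lsid.tdwg.org/",
    "http://zoobank.org/",
    "urn:lsid:zoobank.org:act:",
]

def strip_resolution_metadata(text):
    """Repeatedly remove registered prefixes until none applies."""
    changed = True
    while changed:
        changed = False
        for prefix in REGISTRY:
            if text.startswith(prefix):
                text = text[len(prefix):]
                changed = True
    return text

examples = [
    "urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523",
    "http://zoobank.org/?id=urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523",
    "http://lsid.tdwg.org/summary/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523",
    "http://zoobank.org/?uuid=A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523",
]
for example in examples:
    print(strip_resolution_metadata(example))
```

The equivalence is stated once per service (one registry entry each) rather than once per identifier. Forms that also carry a suffix, such as H with its "&submit=Go", would need a slightly richer registry entry than a bare prefix.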
An important question that I think has been underlying much of this discussion is whether GUIDs are actually needed for names.
I think the answer is clearly "yes". The problem is defining what is meant by the word "name".
If one takes the position that a "name" can never be more than a string without crossing the line into being something more complicated like a TNU or TaxonConcept, then I think one could make the case that the answer to this is "no".
Perhaps, but I don't know of anyone who takes that position. GNI/uBio/NameBank exist for a very specific purpose, and in that very narrow context, the "name" is equivalent to the UTF-8-encoded string of characters. The architects of these systems would be the first to say that this is a very limited context for what a "name" is, and *none* of them would assert that a "name" can never be more than this. Everyone I know understands that all other flavors of "name" imply something much, much more than the string of text characters.
There isn't a whole lot that one would want to know about the string that couldn't just be imparted by letting it be a string literal. If one takes this position, then "Quercus alba L." is a different "thing" (i.e. resource) from "Quercus alba" or "Quercus alba Linnaeus". It seems that something like this is the position that Rich and the GNI are taking. Under this scenario, there is little point in creating URI GUIDs for the name strings.
I only took that position in the *very narrow* context of GNI, which is unusual among the millions of taxonomic datasets in treating a "name" as a distinct text string. And I backed off from that position after reading Dima's post.
On the other hand, if one takes the position that a name can be a conceptual entity that has properties which include its name string(s)
...as, I think, everyone does...
and parts thereof, then it does make sense to apply GUIDs to that kind of entity. I am thinking about a tn:TaxonName as defined in the TDWG ontology (see http://code.google.com/p/tdwg-ontology/source/browse/trunk/ontology/voc/TaxonName.rdf), which comes out of the TCS schema (see http://code.google.com/p/darwin-sw/wiki/ClassTaxon for info and links regarding TCS). A tn:TaxonName is "An object that represents a single scientific biological name..." i.e. an "object" NOT defined as a string.
While it's nice to see the explicit representation of a "name" as an object rather than a string, unfortunately that doesn't address the elephant in the room: different people have different notions of what "a single scientific biological name" is. I'm not talking about subtly different shades of fundamentally the same thing; I'm talking about fundamentally different things with different implied sets of properties. This is one of the issues I continued to hammer on during the development of TCS, and the one that gave me the biggest qualms about TCS 1.0. My hope was that it would be resolved in TCS 2.0. I wanted to reduce both names and concepts to the same core entity: usage instances. That's exactly what we're doing with GNUB.
But if the GNI is only a "dirty bucket" that accumulates every name string that anybody has ever used in history but with little or no metadata, then I can't see that I have any use for a URI point to it, at least as something to which I would refer in RDF.
I think it's helpful to see GNI and GNUB as a yin-and-yang sort of thing. There *needs* to be a service at the dirty end of the spectrum, because for the vast majority of existing biodiversity data (digitized or not), the only link we have to a taxon concept is a text-string name. There needs to be a service that manages names-as-text strings. GNUB, at the other end of the spectrum, has the rich full-context metadata that I think you are interested in, allowing for unambiguous reconciliation of different text strings as applied to type specimens, or enumerating all spelling variants of the "same" name, etc., etc. What's missing (but DEFINITELY planned and already sketched out) are the services that connect GNUB and GNI together. As soon as we hear definitively from NSF (should be soon now), we'll have the resources to start building those services.
I'm not saying that there isn't a use for the GNI. I think what I'm saying is that there doesn't seem to be any point in worrying about how to create URIs for the GNI when those URIs don't "do" anything different from what a string literal does. I think this is essentially what Rich was saying: "that text string represents a perfectly suitable unique identifier. There is no need to generate a surrogate identifier like an integer number or UUID or LSID or whatever".
Yes, I think that's exactly what I was saying. Dima's post has forced me to reconsider this somewhat, but even still, more broadly, I never saw GNI as a service in need of "GUIDs" (in the sense that you outlined at the beginning of your post). Certainly there is value in having internal data structures to perform certain functions, but as far as I can tell, the interface between GNI and the outside world should probably be limited to human-readable name-strings.
Although Rich has been very cautionary about maintaining the distinction between ITIS TSNs, which he believes to represent some kind of minimal TNU
I would defer to Dave N.'s post concerning what a TSN is, and represents.
and uBio IDs which he believes to represent a name string, I haven't been able to find any evidence that it would be "naughty" to assert that either one is a tn:TaxonName.
That's only true to the extent that tn:TaxonName may be too broadly (imprecisely) defined (just like dwc:Taxon).
Aloha, Rich