Rich, I should say that my inclusion of references, acronym definitions etc. is not to insinuate that you are unaware of those things, but is a recognition that this is a discussion on a public list and that some of the readers may have never heard of these things and may not be aware of the references. Also, the message to which you responded was a response to chunks of several emails - I guess a bad practice intended to cut down on the number of postings and to group related thoughts. I thought I had included enough of the "Roderic Page wrote:" and "Kevin Richards wrote:" headings to make it clear to which message I was referring. A couple of the statements to which you responded were written by Kevin and not me.
For the purposes of clarity, any time I say "GUID" here, I intend it in the sense of the TDWG GUID Applicability Statement. In the GBIF "Adoption of Persistent Identifiers for Biodiversity Informatics" document (http://www2.gbif.org/Persistent-Identifiers.pdf), the term "persistent actionable identifiers" is used instead of GUID, but in the interest of brevity I'll use GUID.
Thanks for taking the time to explain more about how GNUB will work. I am anxious to see it come to fruition and to use it. I have additional comments and questions relative to your description of it, but they will have to wait for another email. I think it would be best to focus this post on the subject of GUIDs because I think that this is the crux of our disagreement here.
First a word about the TDWG GUID Applicability Statement. You were expressing some reservations about calling it a "standard". If you go to http://www.tdwg.org/standards/, you will find it listed under "Current Standards". My understanding is (and I may be corrected by those who know better) that a TDWG Standard can be either an Applicability Statement or a Technical Definition (like Darwin Core). In either case, the standard has gone through the review process, been subjected to public comment, and approved by the TDWG Executive. So I consider either an Applicability Statement or a Technical Definition to have considerably more "weight" than something like a blog post or ad hoc usage guide. One problem with the GUID A.S. (Applicability Statement) is found on the title page. It says "there is, or will be, a separate document for the applicability of each specific GUID technology". Unfortunately, the "there is" part currently only applies to LSIDs - no other GUID technology has its own document. So an understanding of the "appropriate" way to apply something like a UUID must be inferred from the general statements and examples about UUIDs, by "reading between the lines" by considering how general recommendations about GUIDs would impact the handling of UUIDs, and by analogy to how LSIDs (another non-HTTP URI-based GUID) are handled.
You quoted p.7 of the guide:
============================
The global uniqueness of an identifier is often confused with the issue of resolution of the identifier. These two attributes of GUIDs can be distinguished and discussed separately. For example a Universally Unique Identifier (UUID) is a globally unique identifier, but there are no widely known and used protocols for resolving a UUID over the Internet (unlike HTTP URIs). This form of GUID is perfectly acceptable for uniquely identifying data objects within a dataset. Some identifiers therefore provide uniqueness, but not resolvability.
============================
So based on this, you are correct to call a UUID a GUID. However, the part that I disagree with is:
... I think it's foolish to regard all of these different resolution mechanisms as distinct "identifiers". There is *ONE* GUID. It is: A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523. There are ten different ways to make it actionable. It therefore meets the recommendations of the applicability statement.
The problem is that when you create an HTTP URI out of a UUID, you are creating an identifier whether you think you are or not. I suppose as a matter of semantics, you could say "I don't intend for the ten ways I showed of making my UUID actionable to be GUIDs", but if I encounter one of them, how am I supposed to know that? You may not think that an HTTP proxied non-HTTP URI GUID (e.g. an HTTP proxied UUID) is a GUID, but anyone who is interested in describing the properties of the identified resource in RDF (which should be everyone, GUID A.S. recommendation 10) will think so. The GUID A.S. does not contain any RDF examples (unfortunately) but the LSID Applicability Statement talks in detail about how LSIDs should be used in RDF. Recommendation 29 of the LSID A.S. states that "objects must be identified by an LSID in its standard form using the rdf:about attribute". You can do this with an LSID because it is a URN (a subset of the more generic URI) and therefore a describable thing in RDF. However, a UUID cannot be used similarly in an rdf:about attribute because it is not any kind of URI. It is just a globally unique string. Recommendation 31 says "All references to objects identified by LSIDs using the rdf:resource attribute must use a proxy version of the LSID." This is because neither an LSID nor a UUID can be used by a client to retrieve information about the object of the property (the value of the rdf:resource attribute). That can only be done if the GUID is an HTTP URI. Recommendation 30 says that the description of all objects identified by an LSID must contain an owl:sameAs, owl:equivalentProperty or owl:equivalentClass statement expressing the equivalence between the object identifier in its standard form and its proxy version. The RDF example given on page 18 shows how this is to be accomplished (fragment shown here):
<rdf:Description rdf:about="urn:lsid:ubio.org:namebank:11815">
  <dc:identifier>urn:lsid:ubio.org:namebank:11815</dc:identifier>
  <owl:sameAs rdf:resource="http://lsid.tdwg.org/urn:lsid:ubio.org:namebank:11815"/>
  ...
</rdf:Description>
In this example, the HTTP URI "http://lsid.tdwg.org/urn:lsid:ubio.org:namebank:11815" is not just a "resolution mechanism". It IS an identifier whether you want it to be or not. I suppose you could try to define it out of a role as a "GUID" but that would be playing with semantics (no pun intended). Semantic clients would consider it to be just as much an identifier as the unproxied LSID.

Now consider how the example you were giving would need to be handled in RDF. I am extrapolating here because, as I said, there is no "UUID Applicability Statement". To handle all of the identifiers you listed:
A. A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
B. urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
C. http://zoobank.org/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF4...
D. http://zoobank.org/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
E. http://lsid.tdwg.org/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5B...
F. http://www.google.com/search?q=Danaus+plexippus+(Linnaeus+1758)
G. http://lsid.tdwg.org/summary/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB...
H. http://darwin.zoology.gla.ac.uk/~rpage/lsid/tester/?q=urn:lsid:zoobank.org:a...
I. http://zoobank.org/?lsid=urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA...
J. http://zoobank.org/?id=urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E...
K. http://zoobank.org/?id=A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
L. http://zoobank.org/?uuid=A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
one would write this:
<rdf:Description rdf:about="urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523">
  <dc:identifier>A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523</dc:identifier>
  <owl:sameAs rdf:resource="http://zoobank.org/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF4...
  <owl:sameAs rdf:resource="http://zoobank.org/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523"/>
  <owl:sameAs rdf:resource="http://lsid.tdwg.org/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5B...
  <owl:sameAs rdf:resource="http://www.google.com/search?q=Danaus+plexippus+(Linnaeus+1758)"/>
  <owl:sameAs rdf:resource="http://lsid.tdwg.org/summary/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB...
  <owl:sameAs rdf:resource="http://darwin.zoology.gla.ac.uk/~rpage/lsid/tester/?q=urn:lsid:zoobank.org:a...
  <owl:sameAs rdf:resource="http://zoobank.org/?lsid=urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA...
  <owl:sameAs rdf:resource="http://zoobank.org/?id=urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E...
  <owl:sameAs rdf:resource="http://zoobank.org/?id=A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523"/>
  <owl:sameAs rdf:resource="http://zoobank.org/?uuid=A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523"/>
  ...
</rdf:Description>
Note that it would not be necessary (nor in my opinion a good idea) to use the LSID in the rdf:about attribute. Any of the 10 HTTP URIs could have been switched with it. (Well, the google.com one really shouldn't be there because it represents a web page, not a name.) However the UUID can NOT be used in the rdf:about attribute, nor can it be used in an rdf:resource attribute. From the standpoint of the RDF, it has no use as an identifier that the client can "understand" (i.e. use as a subject or object of any object property).
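To illustrate the distinction, here is a rough Python sketch (not anything prescribed by the GUID A.S.) that checks whether a string carries a URI scheme at all. The function name has_uri_scheme is purely hypothetical, and the check is deliberately crude: it only asks whether the standard library's URL parser finds a scheme such as "urn" or "http" at the front of the string.

```python
from urllib.parse import urlparse


def has_uri_scheme(identifier: str) -> bool:
    """Return True if the string starts with a URI scheme (e.g. 'urn', 'http').

    Hypothetical helper for illustration only; it does not validate the
    rest of the URI.
    """
    return bool(urlparse(identifier).scheme)


# The three forms discussed above: bare UUID, LSID, and HTTP proxied UUID.
for candidate in (
    "A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523",
    "urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523",
    "http://zoobank.org/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523",
):
    print(candidate, has_uri_scheme(candidate))
```

The bare UUID is the only one of the three with no scheme, which is why it can only appear in the RDF as a literal value.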
I don't think you were seriously suggesting that all 12 of the identifiers on the list would actually be used in "real life". You were making a point about how a UUID could be made actionable. But my point is that you simply cannot meet the requirements of the GUID A.S. with ONLY a UUID. You MUST have an HTTP proxied version of it in order to "do the right thing" (i.e. GUID A.S. rec 10) and provide metadata in the form of RDF serialized as XML. That HTTP proxied version isn't *just* going to be seen as a "resolution mechanism". It is going to be the ONLY identifier of any relevance in terms of the operation of the RDF, which will see the UUID in the dc:identifier property as nothing more than a string literal. If you and GNUB are going to participate in BiSciCol as I understand it to be developing (and I believe that you are), you will HAVE to have an HTTP URI version of your UUIDs and in that context the raw UUID will be relatively irrelevant.
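As a minimal sketch of what I mean, assuming nothing about how GNUB or ZooBank actually generate their RDF, the following Python uses only the standard library to build a description in which the HTTP proxied URI is the subject and the raw UUID appears only as a dc:identifier literal. The function name describe and its parameters are hypothetical; the namespace URIs are the standard RDF, Dublin Core and OWL ones.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"
OWL = "http://www.w3.org/2002/07/owl#"


def describe(subject_uri, uuid_literal, same_as=()):
    """Build an rdf:Description as an RDF/XML string (illustrative only)."""
    # Ask ElementTree to use the conventional prefixes when serializing.
    for prefix, namespace in (("rdf", RDF), ("dc", DC), ("owl", OWL)):
        ET.register_namespace(prefix, namespace)
    root = ET.Element(f"{{{RDF}}}RDF")
    # The subject must be a URI; here it is the HTTP proxied form.
    desc = ET.SubElement(
        root, f"{{{RDF}}}Description", {f"{{{RDF}}}about": subject_uri}
    )
    # The raw UUID can only be carried as a plain string literal.
    identifier = ET.SubElement(desc, f"{{{DC}}}identifier")
    identifier.text = uuid_literal
    # Every other exposed URI needs its own owl:sameAs statement.
    for uri in same_as:
        ET.SubElement(desc, f"{{{OWL}}}sameAs", {f"{{{RDF}}}resource": uri})
    return ET.tostring(root, encoding="unicode")


print(
    describe(
        "http://zoobank.org/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523",
        "A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523",
        same_as=["http://zoobank.org/?uuid=A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523"],
    )
)
```

Each additional URI passed in same_as adds one more statement that has to be maintained for as long as that URI is in circulation.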
My point is that you should decide on just one of these HTTP URIs and use that as your identifier when you communicate with the outside world. My preference would be "http://zoobank.org/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523" as the shortest and least complex one that would do everything that needs to get done. I guess that there isn't a problem with the other nine existing, but from my point of view there is nothing but harm to be done by exposing them to the outside world. If you do, there is a chance that people will think that you intend for them to be an HTTP URI GUID for the object and you will be stuck forever having to put owl:sameAs statements about them in your RDF. You noted that the GUID A.S. says about UUIDs: "This form of GUID is perfectly acceptable for uniquely identifying data objects within a dataset." I would put emphasis on the word "within". Outside of that dataset, the UUID is not as useful as its HTTP proxied version. You could (from the standpoint of the outside world) refer to your object by both "http://zoobank.org/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523" and "A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523", or you could ONLY refer to your object in the outside world as "http://zoobank.org/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523". You can't ONLY refer to your object to the outside world as "A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523" and describe it in RDF. From this point of view, why would you want to expose two identifiers when you only need to expose one? This is what I meant when I said you should just pick one and stick with it.
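"Pick one and stick with it" could be sketched roughly like this in Python. This is only an illustration under my own assumptions: the function name canonical_uri is hypothetical, the base "http://zoobank.org/" is simply my preferred form from above, and the choice of upper case just matches how the UUID has been written in this thread.

```python
import re
import uuid

# Matches the usual 8-4-4-4-12 hexadecimal text form of a UUID.
UUID_RE = re.compile(
    r"[0-9A-Fa-f]{8}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{12}"
)
BASE = "http://zoobank.org/"  # assumed base for the single exposed URI


def canonical_uri(identifier: str) -> str:
    """Map any variant containing the UUID to the one exposed HTTP URI."""
    match = UUID_RE.search(identifier)
    if match is None:
        raise ValueError(f"no UUID found in {identifier!r}")
    # uuid.UUID parses the matched text; str() gives lower case, so we
    # upper-case it to settle on a single spelling.
    return BASE + str(uuid.UUID(match.group(0))).upper()


print(canonical_uri("urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523"))
print(canonical_uri("http://zoobank.org/?uuid=A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523"))
```

Whatever variants are used internally, both calls print the same string, and that one string is the only identifier the outside world ever needs to see.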
The other point which I was trying to make is: why would you choose to expose to the outside world an identifier that only does part of the desirable things that we want (i.e. my list of 8 desirable attributes of a GUID), when you could use a modification of that identifier that would do everything you want? You mention how GUIDs for names are primarily of interest to machines. That is undoubtedly true. But with virtually no additional cost (15 minutes of time from somebody who knows how to create a single 3 kB XSLT file) an HTTP URI GUID could resolve to something readable by humans in addition to the more useful machine-readable RDF/XML.
I would assert the same thing about LSIDs. Why would you create an identifier that is part of (what seems to me to be universally recognized as) a dead technology when you could create a simpler HTTP URI that would do the same thing and potentially more? In the case of uBio and Biodiversity Collections Index, they were set up when LSIDs were believed to be the "Next Big Thing". That did not turn out to be the case, so those organizations are stuck with painful HTTP URIs like "http://biocol.org/urn:lsid:biocol.org:col:35115" and "http://www.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank:9..." when they could have had "http://biocol.org/35115" and "http://www.ubio.org/9479554". I would say "lesson learned" - we know how to construct good HTTP URI GUIDs that will do everything people want so why not just do that? If it turns out that Linked Data and the Semantic Web are also "The Next Big Thing" that turns out to be a flop, we still have globally unique strings that are not actionable. But I think that the demonstrations of multiple members of our community show that at least to some degree LOD/Semantic Web technologies "work" and can be implemented by almost anybody.
You said:
Here is where I completely disagree. I've said it before, and I'll keep saying it: GUIDs are (should be) intended and necessary for computer-computer communication; *NOT* for human-computer or human-human communication. Their beauty or ugliness should be determined by what's beautiful or ugly to a computer, not to a human. A consistent 128 bits is "beautiful" to a computer, but a UUID is ugly to a human; whereas "Danaus plexippus (Linnaeus 1758)" is beautiful to a human, but ugly to a computer (for reasons Dima already outlined).
More fundamentally, one lesson of history that seems to be perpetually repeated is the mistake of encoding human-interpretable information into what is intended to be a stable, permanent identifier. INEVITABLY, a system that uses human-interpretable information as identifiers will include some fraction of instances where the human-interpretable part is somehow "wrong" (e.g., a Cyrillic "а" was accidentally entered instead of a Latin "a", or a typographic error in a scientific name, or worst of all, the assignment of a text-string name to a homonym due to a mix-up in authorship). The temptation to "fix" those "wrong" values is enormous. And, of course, by "fixing" them, permanence is broken.
Almost by definition, then, a "beautiful" identifier for computer-computer communication should be "ugly" to a pair of human eyeballs.
I disagree with you completely here. If you haven't read the "Cool URIs" piece, you should before we talk about this more. It is full of examples that are easy to read and type and are intended to be "understood" by both humans and computers. The piece at http://www.w3.org/Provider/Style/URI is an even easier read. GUIDs CAN be easy to "read" and type, although they don't have to be. The degree to which it "matters" whether a GUID is human readable or not depends primarily on the likelihood that humans will see it in print or type it in the URL box of a web browser. In the examples of GUIDs for names that you provided, I will agree that it's not very likely that humans will be seeing them. But if the GUID is of a specimen, an image, or a tree (which could easily appear in print or be written down by somebody to look at its web page), I would argue that readability is desirable, e.g. http://bioimages.vanderbilt.edu/uncg/966 . I realize that not everyone agrees with me on this, particularly the fans of UUIDs. As far as I know, there isn't any rule about what characters should be in an HTTP URI. But there is a general understanding that it is a best practice that an HTTP URI that is intended as an identifier should do content negotiation and produce both HTML for humans and RDF for machines.
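For anyone on the list who hasn't seen content negotiation in action, here is a much simplified Python sketch of the decision a server makes when a client dereferences an HTTP URI. The function name choose_representation and the OFFERS table are hypothetical, and the sketch ignores wildcards such as "*/*" and the finer precedence rules of real Accept header handling; it only compares the q-values the client gives for the two media types on offer.

```python
# Media types this hypothetical server can produce, and what each one means.
OFFERS = {"application/rdf+xml": "rdf", "text/html": "html"}


def choose_representation(accept_header: str, default: str = "html") -> str:
    """Pick 'rdf' or 'html' from an HTTP Accept header (simplified)."""
    best_kind, best_q = default, 0.0
    for item in accept_header.split(","):
        parts = [part.strip() for part in item.split(";")]
        media_type = parts[0].lower()
        q = 1.0  # a media type listed without a q parameter has weight 1
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed weight as "not acceptable"
        kind = OFFERS.get(media_type)
        if kind is not None and q > best_q:
            best_kind, best_q = kind, q
    return best_kind


# A semantic client asks for RDF; a browser-like client asks for HTML first.
print(choose_representation("application/rdf+xml"))
print(choose_representation("text/html,application/xhtml+xml;q=0.9"))
```

The same URI thus serves the human with a web page and the machine with RDF, which is the behavior I am arguing a single well-chosen HTTP URI GUID should have.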
[lots of stuff cut out here that will have to wait for another email]
Errr..sort of. I say we identify things using GUIDs, and provide services that resolve those GUIDs via actionable HTTP URIs (or, if you prefer, embedding those GUIDs within a resolution metadata "wrapper"). Yes, I know it's all the rage to collapse the functions of actionability and globally unique identification into the same text-string URI (what I've been referring to as the TB-L perspective). But to be perfectly blunt, I see this as a mistake that will, in the long run, slow down our progress.
Why does this slow down our progress? I don't get that at all. I see your viewpoint as the one impeding progress because non-HTTP GUIDs make it difficult or impossible to describe things in RDF.
... Agreed! I think when we distill this entire exchange, we'll find that we have slightly different interpretations about what the GUID applicability statement actually says & means, and a non-trivial amount of miscommunication, but otherwise (as was the case the last time we had such a voluminous exchange), we're actually more on the same page than not.
I'm sure this is probably the case! I hope that I am not coming across as rude or disrespectful in this kind of discussion. When I question your statements and those of others, I expect to often be shown to be wrong and learn from the experience. I also expect that my statements will be subjected to the same scrutiny and criticism that I may dish out! :-)
So I'm actually pretty optimistic about the whole venture assuming that we can get people and organizations to actually read and try to follow the standards that we have already agreed upon.
I think it's nice to end this email on a point of strong agreement!
Likewise! Steve
Aloha, Rich
Richard L. Pyle, PhD Database Coordinator for Natural Sciences Associate Zoologist in Ichthyology Dive Safety Officer Department of Natural Sciences, Bishop Museum 1525 Bernice St., Honolulu, HI 96817 Ph: (808)848-4115, Fax: (808)847-8252 email: deepreef@bishopmuseum.org http://hbs.bishopmuseum.org/staff/pylerichard.html
tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content