[tdwg-content] Why UUIDs alone are not adequate as GUIDs, was Re: ITIS TSNID to uBio NamebankIDs mapping

Steve Baskauf steve.baskauf at vanderbilt.edu
Wed Jun 8 00:05:26 CEST 2011


Rich,
I should say that my inclusion of references, acronym definitions etc. 
is not to insinuate that you are unaware of those things, but is a 
recognition that this is a discussion on a public list and that some of 
the readers may have never heard of these things and may not be aware of 
the references.  Also, the message to which you responded was a response 
to chunks of several emails - I guess a bad practice intended to cut 
down on the number of postings and to group related thoughts.  I thought 
I had included enough of the "Roderic Page wrote:" and "Kevin 
Richards wrote:" headings to make it clear to which message I was 
referring.  A couple of the statements to which you responded were written 
by Kevin and not me. 

For the purposes of clarity, any time I say "GUID" here, I intend it in 
the sense of the TDWG GUID Applicability Statement.  In the GBIF 
"Adoption of Persistent Identifiers for Biodiversity Informatics" 
document (http://www2.gbif.org/Persistent-Identifiers.pdf), the term 
"persistent actionable identifiers" is used instead of GUID, but in the 
interest of brevity I'll use GUID. 

Thanks for taking the time to explain more about how GNUB will work.  I 
am anxious to see it come to fruition and to use it.  I have additional 
comments and questions relative to your description of it, but they will 
have to wait for another email.  I think it would be best to focus this 
post on the subject of GUIDs because I think that this is the crux of 
our disagreement here. 

First a word about the TDWG GUID Applicability Statement.  You were 
expressing some reservations about calling it a "standard". If you go to 
http://www.tdwg.org/standards/, you will find it listed under "Current 
Standards".  My understanding is (and I may be corrected by those who 
know better) that a TDWG Standard can be either an Applicability 
Statement or a Technical Definition (like Darwin Core).  In either case, 
the standard has gone through the review process, been subjected to 
public comment, and approved by the TDWG Executive.  So I consider 
either an Applicability Statement or a Technical Definition to have 
considerably more "weight" than something like a blog post or ad hoc 
usage guide.  One problem with the GUID A.S. (Applicability Statement) 
is found on the title page.  It says "there is, or will be, a separate 
document for the applicability of each specific GUID technology".  
Unfortunately, the "there is" part currently only applies to LSIDs - no 
other GUID technology has its own document.  So an understanding of the 
"appropriate" way to apply something like a UUID must be inferred from 
the general statements and examples about UUIDs, by "reading between the 
lines" by considering how general recommendations about GUIDs would 
impact the handling of UUIDs, and by analogy to how LSIDs (another 
non-HTTP URI-based GUID) are handled. 

You quoted p.7 of the guide:

============================
The global uniqueness of an identifier is often confused with the issue of
resolution of the identifier.  These two attributes of GUIDs can be
distinguished and discussed separately.
For example a Universally Unique Identifier (UUID) is a globally unique
identifier, but there are no widely known and used protocols for resolving a
UUID over the Internet (unlike HTTP URIs). This form of GUID is perfectly
acceptable for uniquely identifying data objects within a dataset.
Some identifiers therefore provide uniqueness, but not resolvability.
============================

So based on this, you are correct to call a UUID a GUID.  However, the 
part that I disagree with is:

... I think it's foolish to regard all of these different
resolution mechanisms as distinct "identifiers".  There is *ONE* GUID.  It
is: A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523.  There are ten different ways to
make it actionable. It therefore meets the recommendations of the
applicability statement.

The problem is that when you create an HTTP URI out of a UUID, you are 
creating an identifier whether you think you are or not.  I suppose as a 
matter of semantics, you could say "I don't intend for the ten ways I 
showed of making my UUID actionable to be GUIDs", but if I encounter one 
of them, how am I supposed to know that?  You may not think that an HTTP 
proxied non-HTTP URI GUID (e.g. an HTTP proxied UUID) is a GUID, but 
anyone who is interested in describing the properties of the identified 
resource in RDF (which should be everyone, GUID A.S. recommendation 10) 
will think so.  The GUID A.S. does not contain any RDF examples 
(unfortunately) but the LSID Applicability Statement talks in detail 
about how LSIDs should be used in RDF.  Recommendation 29 of the LSID 
A.S. states that "objects must be identified by an LSID in its standard 
form using the rdf:about attribute".  You can do this with an LSID 
because it is a URN (a subset of the more generic URI) and therefore a 
describable thing in RDF.  However, a UUID cannot be used similarly in 
an rdf:about attribute because it is not any kind of URI.  It is just a 
globally unique string.  Recommendation 31 says "All references to 
objects identified by LSIDs using the rdf:resource attribute must use a 
proxy version of the LSID."  This is because neither an LSID nor a UUID can 
be used by a client to retrieve information about the object of the 
property (the value of the rdf:resource attribute).  That can only be 
done if the GUID is an HTTP URI.  Recommendation 30 says that the 
description of all objects identified by an LSID must contain an 
owl:sameAs, owl:equivalentProperty or owl:equivalentClass statement 
expressing the equivalence between the object identifier in its standard 
form and its proxy version.  The RDF example given on page 18 shows how 
this is to be accomplished (fragment shown here):

  <rdf:Description rdf:about="urn:lsid:ubio.org:namebank:11815">
    <dc:identifier>urn:lsid:ubio.org:namebank:11815</dc:identifier>
    <owl:sameAs 
rdf:resource="http://lsid.tdwg.org/urn:lsid:ubio.org:namebank:11815"/>
    ...
  </rdf:Description>

In this example, the HTTP URI 
"http://lsid.tdwg.org/urn:lsid:ubio.org:namebank:11815" is not just a 
"resolution mechanism".  It IS an identifier whether you want it to be 
or not.  I suppose you could try to define it out of a role as a "GUID" 
but that would be playing with semantics (no pun intended).  Semantic 
clients would consider it to be just as much an identifier as the 
unproxied LSID.  Now consider how the example you were giving would need 
to be handled in RDF.  I am extrapolating here because as I said, there 
is no "UUID Applicability guide".  To handle all of the identifiers you 
listed:

A. A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
B. urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
C. http://zoobank.org/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
D. http://zoobank.org/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
E. http://lsid.tdwg.org/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
F. http://www.google.com/search?q=Danaus+plexippus+(Linnaeus+1758)
G. http://lsid.tdwg.org/summary/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
H. http://darwin.zoology.gla.ac.uk/~rpage/lsid/tester/?q=urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523&submit=Go
I. http://zoobank.org/?lsid=urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
J. http://zoobank.org/?id=urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
K. http://zoobank.org/?id=A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
L. http://zoobank.org/?uuid=A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523

one would write this:

  <rdf:Description 
rdf:about="urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523">
    <dc:identifier>A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523</dc:identifier>
    <owl:sameAs 
rdf:resource="http://zoobank.org/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523"/>
    <owl:sameAs 
rdf:resource="http://zoobank.org/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523"/>
    <owl:sameAs 
rdf:resource="http://lsid.tdwg.org/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523"/>
    <owl:sameAs 
rdf:resource="http://www.google.com/search?q=Danaus+plexippus+(Linnaeus+1758)"/>
    <owl:sameAs 
rdf:resource="http://lsid.tdwg.org/summary/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523"/>
    <owl:sameAs 
rdf:resource="http://darwin.zoology.gla.ac.uk/~rpage/lsid/tester/?q=urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523&amp;submit=Go"/>
    <owl:sameAs 
rdf:resource="http://zoobank.org/?lsid=urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523"/>
    <owl:sameAs 
rdf:resource="http://zoobank.org/?id=urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523"/>
    <owl:sameAs 
rdf:resource="http://zoobank.org/?id=A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523"/>
    <owl:sameAs 
rdf:resource="http://zoobank.org/?uuid=A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523"/>
    ...
  </rdf:Description>

Note that it would not be necessary (nor in my opinion a good idea) to 
use the LSID in the rdf:about attribute.  Any of the 10 HTTP URIs could 
have been switched with it.  (Well, the google.com one really shouldn't 
be there because it represents a web page, not a name.)  However the 
UUID can NOT be used in the rdf:about attribute, nor can it be used in 
an rdf:resource attribute.  From the standpoint of the RDF, it has no 
use as an identifier that the client can "understand" (i.e. use as a 
subject or object of any object property). 
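
To make the distinction concrete, here is a minimal Python sketch (my own illustration, not something taken from either Applicability Statement).  It assumes that "usable in rdf:about or rdf:resource" can be approximated by "has a URI scheme", and the function names is_absolute_uri and describe are hypothetical.  It builds a description like the one above with the standard library's ElementTree, which also takes care of escaping the "&" in the query-string URI.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
OWL = "http://www.w3.org/2002/07/owl#"
DC = "http://purl.org/dc/elements/1.1/"


def is_absolute_uri(text):
    """Rough check (an assumption of this sketch): a URI carries a scheme."""
    return bool(urlparse(text).scheme)


def describe(subject_uri, uuid_literal, same_as_uris):
    """Serialize one rdf:Description as RDF/XML.

    The raw UUID is only ever emitted as a dc:identifier string literal;
    the subject and every owl:sameAs target must be a URI.
    """
    for uri in [subject_uri, *same_as_uris]:
        if not is_absolute_uri(uri):
            raise ValueError(f"not usable in rdf:about or rdf:resource: {uri!r}")
    for prefix, namespace in (("rdf", RDF), ("owl", OWL), ("dc", DC)):
        ET.register_namespace(prefix, namespace)
    root = ET.Element(f"{{{RDF}}}RDF")
    desc = ET.SubElement(
        root, f"{{{RDF}}}Description", {f"{{{RDF}}}about": subject_uri}
    )
    ET.SubElement(desc, f"{{{DC}}}identifier").text = uuid_literal
    for uri in same_as_uris:
        ET.SubElement(desc, f"{{{OWL}}}sameAs", {f"{{{RDF}}}resource": uri})
    return ET.tostring(root, encoding="unicode")
```

Calling describe() with "http://zoobank.org/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523" as the subject works, while passing the bare UUID as the subject raises ValueError, which is the point being made above: the UUID can ride along as a literal, but it cannot be the subject or object of a statement.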

I don't think you were seriously suggesting that all 12 of the 
identifiers on the list would actually be used in "real life".  You were 
making a point about how a UUID could be made actionable.  But my point 
is that you simply cannot meet the requirements of the GUID A.S. with 
ONLY a UUID.  You MUST have an HTTP proxied version of it in order to 
"do the right thing" (i.e. GUID A.S. rec 10) and provide metadata in the 
form of RDF serialized as XML.  That HTTP proxied version isn't *just* 
going to be seen as a "resolution mechanism".  It is going to be the 
ONLY identifier of any relevance in terms of the operation of the RDF 
which will see the UUID in the dc:identifier property as nothing more 
than a string literal.  If you and GNUB are going to participate in 
BiSciCol as I understand it to be developing (and I believe that you 
are), you will HAVE to have an HTTP URI version of your UUIDs and in 
that context the raw UUID will be relatively irrelevant. 

My point is that you should decide on just one of these HTTP URIs and 
use that as your identifier when you communicate with the outside 
world.  My preference would be 
"http://zoobank.org/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523" as the 
shortest and least complex one that would do everything that needs to 
get done.  I guess that there isn't a problem with the other nine 
existing, but from my point of view there is nothing but harm to be done 
by exposing them to the outside world.  If you do, there is a chance 
that people will think that you intend for them to be an HTTP URI GUID  
for the object and you will be stuck forever having to put owl:sameAs 
statements about them in your RDF.  You noted that the GUID A.S. says 
about UUIDs: "This form of GUID is perfectly acceptable for uniquely 
identifying data objects within a dataset."  I would put emphasis on the 
word "within".  Outside of that dataset, the UUID is not as useful as 
its HTTP proxied version.   You could (from the standpoint of the 
outside world) refer to your object by both 
"http://zoobank.org/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523" and 
"A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523", or you could ONLY refer to your 
object in the outside world as 
"http://zoobank.org/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523".  You can't 
ONLY refer to your object to the outside world as 
"A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523" and describe it in RDF.  From 
this point of view, why would you want to expose two identifiers when 
you only need to expose one?  This is what I meant when I said you 
should just pick one and stick with it. 
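
A rough sketch of what "pick one and stick with it" could look like in practice follows.  It is only an illustration under my own assumptions: the base "http://zoobank.org/" is taken from the URI I said I preferred, the helper names to_http_uri and to_raw_uuid are hypothetical, and I am assuming the upper-case spelling of the UUID used in this thread is the form to publish.

```python
import uuid

# Base of the single public URI form; taken from the preferred example above.
BASE = "http://zoobank.org/"


def to_http_uri(raw_uuid, base=BASE):
    """Map a UUID string onto the one HTTP URI exposed to the outside world."""
    parsed = uuid.UUID(raw_uuid)  # raises ValueError if this is not a UUID
    return base + str(parsed).upper()


def to_raw_uuid(http_uri, base=BASE):
    """Recover the internal UUID from the public HTTP URI."""
    if not http_uri.startswith(base):
        raise ValueError(f"not one of our identifiers: {http_uri!r}")
    return str(uuid.UUID(http_uri[len(base):])).upper()
```

With a mapping like this, the database can keep using the raw UUID internally as its key, and everything that leaves the database carries only the one HTTP URI.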

The other point which I was trying to make is: why would you choose to 
expose to the outside world an identifier that only does part of the 
desirable things that we want (i.e. my list of 8 desirable attributes of 
a GUID), when you could use a modification of that identifier that would 
do everything you want?  You mention how GUIDs for names are primarily 
of interest to machines.  That is undoubtedly true.  But with virtually 
no additional cost (15 minutes of time from somebody who knows how to 
create a single 3 kB XSLT file) an HTTP URI GUID could resolve to 
something readable by humans in addition to the more useful 
machine-readable RDF/XML. 

I would assert the same thing about LSIDs.  Why would you create an 
identifier that is part of (what seems to me to be universally 
recognized as) a dead technology when you could create a simpler HTTP 
URI that would do the same thing and potentially more?  In the case of 
uBio and Biodiversity Collections Index, they were set up when LSIDs 
were believed to be the "Next Big Thing".  That did not turn out to be 
the case, so those organizations are stuck with painful HTTP URIs like 
"http://biocol.org/urn:lsid:biocol.org:col:35115" and 
"http://www.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank:9479554" 
when they could have had "http://biocol.org/35115" and 
"http://www.ubio.org/9479554".  I would say "lesson learned" - we know 
how to construct good HTTP URI GUIDs that will do everything people want 
so why not just do that?  If it turns out that Linked Data and the 
Semantic Web are also "The Next Big Thing" that turns out to be a flop, 
we still have globally unique strings that are not actionable.  But I 
think that the demonstrations of multiple members of our community show 
that at least to some degree LOD/Semantic Web technologies "work" and 
can be implemented by almost anybody. 

You said:
> Here is where I completely disagree.  I've said it before, and I'll keep
> saying it:  GUIDs are (should be) intended and necessary for
> computer-computer communication; *NOT* for human-computer or human-human
> communication.  Their beauty or ugliness should be determined by what's
> beautiful or ugly to a computer, not to a human.  A consistent 128 bits is
> "beautiful" to a computer, but a UUID is ugly to a human; whereas " Danaus
> plexippus (Linnaeus 1758)" is beautiful to a human, but ugly to a computer
> (for reasons Dima already outlined).
>
> More fundamentally, one lesson of history that seems to be perpetually
> repeated is the mistake of encoding human-interpretable information into
> what is intended to be a stable, permanent identifier.  INEVITABLY, a system
> that uses human-interpretable information as identifiers will include some
> fraction of instances where the human-interpretable part is somehow "wrong"
> (e.g., a Cyrillic "а" was accidentally entered instead of
> a Latin "a", or a typographic error in a scientific name, or worst of all,
> the assignment of a text-string name to a homonym due to a mix-up in
> authorship).  The temptation to "fix" those "wrong" values is enormous. And,
> of course, by "fixing" them, permanence is broken.
>
> Almost by definition, then, a "beautiful" identifier for computer-computer
> communication should be "ugly" to a pair of human eyeballs.
>   
I disagree with you completely here.  If you haven't read the "Cool 
URIs" piece, you should before we talk about this more.  It is full of 
examples that are easy to read and type and are intended to be 
"understood" by both humans and computers.  The piece at 
http://www.w3.org/Provider/Style/URI is an even easier read.  GUIDs CAN 
be easy to "read" and type, although they don't have to be.  The degree 
to which it "matters" whether a GUID is human readable or not depends 
primarily on the likelihood that humans will see it in print or type it 
in the URL box of a web browser.  In the examples of GUIDs for names 
that you provided, I will agree that it's not very likely that humans 
will be seeing them.  But if the GUID is of a specimen, an image, or a 
tree (which could easily appear in print or be written down by somebody 
to look at its web page), I would argue that readability is desirable, 
e.g. http://bioimages.vanderbilt.edu/uncg/966 .  I realize that not everyone 
agrees with me on this, particularly the fans of UUIDs.  As far 
as I know, there isn't any rule about what characters should be in an 
HTTP URI.  But there is a general understanding that it is a best 
practice that an HTTP URI that is intended as an identifier should do 
content negotiation and produce both HTML for humans and RDF for machines. 
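
For readers who have not seen content negotiation, the idea can be sketched in a few lines of Python.  This is a simplified illustration and not how any particular server does it: the function name choose_representation is hypothetical, the two offered media types are my assumption about what a provider would serve, and the sketch ignores the finer precedence rules for Accept headers.

```python
def choose_representation(accept_header,
                          offered=("text/html", "application/rdf+xml")):
    """Pick the offered media type with the highest q-value in an Accept header.

    Simplified: wildcards are honoured, ties go to the earlier entry in
    `offered`, and the first offered type is the fallback.
    """
    preferences = []
    for part in accept_header.split(","):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        quality = 1.0  # default weight when no q parameter is given
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        preferences.append((media_type, quality))

    best, best_quality = None, 0.0
    for candidate in offered:
        type_wildcard = candidate.split("/")[0] + "/*"
        for media_type, quality in preferences:
            if media_type in (candidate, type_wildcard, "*/*") and quality > best_quality:
                best, best_quality = candidate, quality
    return best or offered[0]
```

A browser that sends something like "text/html,application/xhtml+xml,*/*;q=0.8" would get the HTML page, while a semantic client that sends "application/rdf+xml" would get the RDF, both from the same HTTP URI.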

[lots of stuff cut out here that will have to wait for another email]
> Errr..sort of.  I say we identify things using GUIDs, and provide services
> that resolve those GUIDs via actionable HTTP URIs (or, if you prefer,
> embedding those GUIDs within a resolution metadata "wrapper").  Yes, I know
> it's all the rage to collapse the functions of actionability and globally
> unique identification into the same text-string URI (what I've been
> referring to as the TB-L perspective).  But to be perfectly blunt, I see
> this as a mistake that will, in the long run, slow down our progress.
>   
Why does this slow down our progress?  I don't get that at all.  I see 
your viewpoint as the one impeding progress because non-HTTP GUIDs make 
it difficult or impossible to describe things in RDF.
> ...
> Agreed!  I think when we distill this entire exchange, we'll find that we
> have slightly different interpretations about what the GUID applicability
> statement actually says & means, and a non-trivial amount of
> miscommunication, but otherwise (as was the case the last time we had such a
> voluminous exchange), we're actually more on the same page than not.
>   
I'm sure this is probably the case!  I hope that I am not coming across 
as rude or disrespectful in this kind of discussion.  When I question 
your statements and those of others, I expect to often be shown to be 
wrong and learn from the experience.  I also expect that my statements 
will be subjected to the same scrutiny and criticism that I may dish 
out! :-)
>   
>> So I'm actually pretty optimistic about the whole venture
>> assuming that we can get people and organizations to
>> actually read and try to follow the standards that we
>> have already agreed upon.
>>     
>
> I think it's nice to end this email on a point of strong agreement!
>   
Likewise!
Steve
> Aloha,
> Rich
>
>
> Richard L. Pyle, PhD
> Database Coordinator for Natural Sciences
> Associate Zoologist in Ichthyology
> Dive Safety Officer
> Department of Natural Sciences, Bishop Museum
> 1525 Bernice St., Honolulu, HI 96817
> Ph: (808)848-4115, Fax: (808)847-8252
> email: deepreef at bishopmuseum.org
> http://hbs.bishopmuseum.org/staff/pylerichard.html
>

-- 
Steven J. Baskauf, Ph.D., Senior Lecturer
Vanderbilt University Dept. of Biological Sciences

postal mail address:
VU Station B 351634
Nashville, TN  37235-1634,  U.S.A.

delivery address:
2125 Stevenson Center
1161 21st Ave., S.
Nashville, TN 37235

office: 2128 Stevenson Center
phone: (615) 343-4582,  fax: (615) 343-6707
http://bioimages.vanderbilt.edu
