[tdwg-content] Why UUIDs alone are not adequate as GUIDs, was Re: ITIS TSNID to uBio NamebankIDs mapping

Thu Jun 9 01:45:42 CEST 2011

Yes, ratified following the official TDWG process.  Ratified is the final status of a proposed standard that has followed the process all the way through and has become "official".  This process replaced the deprecated "member voting" formerly conducted at the annual meeting. 

Chuck

-----Original Message-----
From: tdwg-content-bounces at lists.tdwg.org [mailto:tdwg-content-bounces at lists.tdwg.org] On Behalf Of Kevin Richards
Sent: Wednesday, June 08, 2011 4:16 PM
To: Richard Pyle; 'Steve Baskauf'
Cc: tdwg-content at lists.tdwg.org
Subject: Re: [tdwg-content] Why UUIDs alone are not adequate as GUIDs, was Re: ITIS TSNID to uBio NamebankIDs mapping

Answering your question about whether the GUID applicability statements are "ratified" standards...

To be honest, I am not sure of the difference between an un-ratified standard and a ratified standard.  My impression was that we have done all we need to with the applicability statements in the standards process, so perhaps they are ratified??

An email from 22 February (ironic date - the date of our devastating earthquake) about the applicability statements:

"The TDWG Executive Committee has approved the Life Sciences Identifiers Applicability Statement (LSID_AS) and the Globally Unique Identifiers (GUID_AS) Applicability Statement as new TDWG standards.

The Executive committee acknowledges Kevin Richards of Landcare New Zealand as author of the GUID Applicability Statement. Likewise, Kevin Richards, Ricardo Pereira (TDWG Infrastructure Project), Donald Hobern (Atlas of Living Australia), Roger Hyam (TDWG Infrastructure Project), Lee Belbin (TDWG Infrastructure Project) and Stan Blum (California Academy of Sciences) as co-authors of the LSID Applicability Statement.

The committee also greatly appreciated the patience and perseverance of Ben Richardson of the Department of Environment and Conservation of Western Australia who was the Review Manager for these standards. The process, as can be seen from the institutional associations, was in this case longer than all would have liked, but we hope that the standards will prove useful to the Biodiversity Informatics community.

We would also thank all those who were involved as formal or public reviewers of these standards. Your input was greatly appreciated and was in various ways, incorporated into the final standards.

These standards can be downloaded from http://www.tdwg.org/standards/150/download/.

Chuck Miller
TDWG Chair
On behalf of the TDWG Executive Committee"

Kevin

-----Original Message-----
From: tdwg-content-bounces at lists.tdwg.org [mailto:tdwg-content-bounces at lists.tdwg.org] On Behalf Of Richard Pyle
Sent: Thursday, 9 June 2011 8:46 a.m.
To: 'Steve Baskauf'
Cc: tdwg-content at lists.tdwg.org
Subject: Re: [tdwg-content] Why UUIDs alone are not adequate as GUIDs, was Re: ITIS TSNID to uBio NamebankIDs mapping

Hi Steve,

First of all, I owe you (and the list at large) a sincere apology for my excessively long and largely discombobulated email.  I was distracted by many things, and I ended up writing it in chunks over the course of a full day.  You were very clear on who you were responding to in each section, but I probably lost track of that because of the discontinuous mode of my response.  Another problem is that I converted my reply to plain text, and that caused me to lose track in a few places whom I was responding to.  Again, my sincere apologies.

> For the purposes of clarity, any time I say "GUID" here, I intend it 
> in the sense of the TDWG GUID Applicability Statement.

OK thanks.  That became clear as I responded, but somehow I didn't pick up on that when I first started responding.  But even the TDWG GUID Applicability Statement (TGAS) is not perfectly clear or consistent in its use of the term GUID.  In some cases, the term implies self-actionability; in other cases, it says what to do when GUIDs are not self-actionable.

> In the GBIF "Adoption of Persistent Identifiers for Biodiversity 
> Informatics" document 
> (http://www2.gbif.org/Persistent-Identifiers.pdf),
> the term "persistent actionable identifiers" is used instead of GUID, 
> but in the interest of brevity I'll use GUID.

OK, fair enough.  The GBIF document was the most recent one I contributed to, so I was thinking in those terms for using the qualified "persistent actionable identifiers" language in contrast to "GUID"; but I'm perfectly happy using the term "GUID" now that we have it (reasonably) well-defined.

> Thanks for taking the time to explain more about how GNUB will work.
> I am anxious to see it come to fruition and to use it.

I'm hoping that by late summer we'll have it functioning with several core services, and perhaps you and others on this list can help test those services and provide suggestions for new services.  Before that can be a productive use of everyone's time, though, we need to hammer out some technical documentation. As I am writing this from my hotel room at Disney's Caribbean Beach Resort in Orlando (while my family naps after a long flight in preparation for some serious Magic Kingdom action tonight), I'm not really in a position to  delve into this in too much detail right now.  But I'll take a stab at it.

> First a word about the TDWG GUID Applicability Statement.
> You were expressing some reservations about calling it a "standard".
> If you go to http://www.tdwg.org/standards/, you will find it listed 
> under "Current Standards".

My reservations were mostly about calling it a "ratified standard".  I honestly don't know if it is or isn't, but I don't remember a vote on it (like there was for TCS and for the "ratified" DwC).  Perhaps Kevin Richards or someone else at TDWG can clarify (for both of us).

> So an understanding of the "appropriate" way to apply something like a 
> UUID must be inferred from the general statements and examples about 
> UUIDs, by "reading between the lines" by considering how general 
> recommendations about GUIDs would impact the handling of UUIDs, and by 
> analogy to how LSIDs (another non-HTTP URI-based GUID) are handled.

Perhaps, rather than our having to read between the lines, the discussion surrounding the drafting of the "TGAS" is available online somewhere.  That would include details about the thinking behind the final wording.

> So based on this, you are correct to call a UUID a GUID.  However, the 
> part that I disagree with is:
>
> ... I think it's foolish to regard all of these different resolution 
> mechanisms as distinct "identifiers".  There is *ONE* GUID.  It
> is: A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523.  There are ten different 
> ways to make it actionable. It therefore meets the recommendations of 
> the applicability statement.

You are not alone in disagreeing with me on this.

> The problem is that when you create an HTTP URI out of a UUID, you are 
> creating an identifier whether you think you are or not.

Fair enough; but by that definition & logic, *every* HTTP URI (sensu the "Contemporary View" explained at http://www.w3.org/TR/uri-clarification/; i.e., inclusive of things we sometimes call URN or URL) is an identifier.  But I think that goes well beyond the scope of the discussion we're having here about GUIDs.

> I suppose as a matter of semantics, you could say "I don't intend for 
> the ten ways I showed of making my UUID actionable to be GUIDs", but 
> if I encounter one of them, how am I supposed to know that?

That is *exactly* the point I was trying to get at in my earlier message.  Right now, everything that resolves via HTTP GET must be treated as a GUID.  But it's not guaranteed to be persistent (thinking again in terms of the more explicit "persistent actionable identifiers"). I think our community can do better than that.  The problem is not the resolution -- I can (and intend to) persist all ten service syntax forms, so they will all fit the TGAS recommendation as GUIDs.  But that doesn't do you any good if you're trying to compare cited objects in two different datasets that each happened to use different syntax for the resolution mechanism.

A little more context might be helpful here.  Those ten different mechanisms to resolve ZooBank identifiers existed before the drafting of the TGAS document.  I assumed, at the time I established them, that everyone would see as clearly as I do that the need for identification is different from the need for "resolution" (=actionability). So strong was the opposition to what seemed obvious to me, that I followed my normal pattern in such cases, which is to assume that I was wrong.  But the unsettling part is that the more carefully I thought about it, the more obvious it became that I was right, and the opposing viewpoint was wrong (despite the inherent assumption by various big-name web luminaries, who I otherwise hold enormous respect for).  So, through the early TDWG/GBIF discussions, and both TDWG/GBIF GUID workshops, and the drafting of the various TDWG and GBIF documents, I stubbornly maintained this perspective (that identification and resolution should not be conflated).  I believe that it was my stubbornness that accounts for the acknowledgement of the distinction between identification and resolution in TGAS and other documents.

Now, the easy way out would be to throw in the towel and terminate 9 of those resolution services, and make everyone happy with a single ZooBank URI that can be actioned via HTTP GET. But to do so instills in me the same sort of lack of conviction that I would feel if I confessed to a crime I did not commit just because it was the easy way out.  On this issue, I'm not ready to do that, because it is so glaringly obvious to me that we *must* maintain a distinction between identification and resolution.

> You may not think that an HTTP proxied non-HTTP URI GUID (e.g. an HTTP 
> proxied UUID) is a GUID, but anyone who is interested in describing 
> the properties of the identified resource in RDF (which should be 
> everyone, GUID A.S.
> recommendation 10) will think so.

Not everyone.  But I concede that most would.  And this is what I want to fix.

Another part of the TGAS that I quoted was this part (p 11):

"For non-self-resolving GUIDs, such as UUIDs, resolution of that GUID via the HTTP protocol’s GET method (the standard method by which a resource is retrieved on the web) must be implemented. This ensures that the data for the object being identified can be obtained from the provider of that GUID with tools that a majority of Internet users and developers already understand and use."

This, I believe, is one of the paragraphs inserted because of my insistence that the roles of identification and actionability be distinguished.  Nothing in that statement -- or anywhere else in the TGAS that I am aware of -- suggests that HTTP-proxied "non-self-resolving GUIDs" themselves represent distinct GUIDs.  Nor does it say that multiple mechanisms for establishing that HTTP-proxied actionability function represent a violation of Recommendation 4.

> The GUID A.S. does not contain any RDF examples (unfortunately) but 
> the LSID Applicability Statement talks in detail about how LSIDs should be used in RDF.
> Recommendation 29 of the LSID A.S. states that "objects must be 
> identified by an LSID in its standard form using the rdf:about 
> attribute".  You can do this with an LSID because it is a urn (subset 
> of the more generic URI) and therefore a describable thing in RDF.  
> However, a UUID cannot be used similarly in an rdf:about attribute 
> because it is not any kind of URI.  It is just a globally unique string.

Right -- which is exactly why ZooBank identifiers are presented publicly as LSIDs (with proper resolution mechanisms), rather than simply as UUIDs.  But that doesn't change the fact that the UUID is the "real" identifier, and is simply "wrapped" in LSID-compliant resolution metadata.  But I will say that I also regard the LSID as a bona-fide "identifier" in and of itself, because that's how the LSID spec is written.  So I (grudgingly) admit that our minting of LSIDs commits us to treating the full-context LSID as though it is a distinct identifier from the UUID that it encapsulates.  However, I don't think this applies to all the flavors of HTTP proxying, because there is no spec (that I am aware of) that says "all HTTP URIs should be treated as though they are GUIDs" -- even though, by some definitions, they technically are.
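To make the "not any kind of URI" point concrete, here is a minimal Python sketch. The helper name has_uri_scheme is hypothetical, and testing for a scheme is only a rough stand-in for "is a URI". It shows that the bare UUID carries no URI scheme, whereas the LSID and the HTTP proxy forms do:

```python
from urllib.parse import urlparse


def has_uri_scheme(identifier: str) -> bool:
    """Rough check: does the string start with a URI scheme such as 'urn' or 'http'?"""
    return bool(urlparse(identifier).scheme)


bare_uuid = "A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523"
lsid = "urn:lsid:zoobank.org:act:" + bare_uuid
proxy = "http://zoobank.org/" + lsid

# The bare UUID is just a string; the other two forms carry a scheme.
checks = {s: has_uri_scheme(s) for s in (bare_uuid, lsid, proxy)}
```

Only the forms for which the check is true are candidates for an rdf:about attribute; the bare UUID is not.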

> Recommendation 31 says "All references to objects identified by LSIDs 
> using the rdf:resource attribute must use a proxy version of the LSID."

Right, and this is where I think I dropped the ball on ZooBank LSID resolution.  At the moment, resolving a ZooBank LSID directly (e.g., via Rod Page's LSID tester, or TDWG's LSID resolver service) returns the proper RDF (thanks to Kevin Richards, who set that service up).  However, the HTTP proxy version returns HTML by default.  I needed to do this because I didn't (and still don't) know enough about applying style sheets to RDF to render them in a human-friendly form.  I spoke with Rob Whitton about this last week, and he will have this fixed soon.

> Recommendation 30 says that the description of all objects identified 
> by an LSID must contain an owl:sameAs, owl:equivalentProperty or 
> owl:equivalentClass statement expressing the equivalence between the object identifier in its standard form and its proxy version.

Ahh!! OK, this may be the fatal bullet to my argument.  But let me explain a bit further:

The "true" GUID for a ZooBank record is the UUID.  The standard form of presenting this UUID to the public is as an LSID.  I'm happy with saying that the LSID *is* the TDWG-context GUID for the record (calling the UUID the "true" GUID is just a semantic technicality that has no real bearing in the context of TDWG standards).  The standard http proxy for ZooBank LSIDs is "http://zoobank.org/[LSID]" -- that is, the LSID appended to a "http://zoobank.org" prefix.

I have no argument with the Recommendation 30 that says there should be an owl:sameAs, owl:equivalentProperty or owl:equivalentClass statement expressing the equivalence between the LSID and its proxy version.
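As a rough illustration of what such an equivalence statement amounts to, the following Python sketch emits a single owl:sameAs triple in N-Triples syntax. The function name same_as_triple is hypothetical, and the default prefix simply reuses the "http://zoobank.org/" proxy prefix described above; it is not a description of any actual ZooBank service output.

```python
# Standard IRI of the owl:sameAs property.
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def same_as_triple(lsid: str, proxy_prefix: str = "http://zoobank.org/") -> str:
    """Return one N-Triples line equating an LSID with its HTTP proxy form."""
    return f"<{lsid}> <{OWL_SAME_AS}> <{proxy_prefix}{lsid}> ."


triple = same_as_triple(
    "urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523"
)
```

One such statement covers the standard proxy; the disagreement is over whether every additional resolution URL would need its own.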

But I do have an argument against the notion that *any* web service that can resolve the LSID into its constituent metadata (whether HTTP, RDF, or whatever) must be treated as a distinct GUID, with a similar need for the owl:sameAs [etc.] statement.
Perhaps this, ultimately, is the crux of our argument.

> I don't think you were seriously suggesting that all 12 of the 
> identifiers on the list would actually be used in "real life".  You 
> were making a point about how a UUID could be made actionable.

In part yes.  But what I was really saying is that it's silly to think of all of those different metadata resolution services as distinct GUIDs (even though in the broad sense, all HTTP URIs are technically GUIDs).  Also, it depends on what you mean by "used in real life".  They should certainly not be used in "real life" as identifiers of the sort you gave examples for. But they may well be "used" in other real-life contexts.

> But my point is that you simply cannot meet the requirements of the 
> GUID A.S. with ONLY a UUID.

We may be quibbling about semantics here.  I never said that the TGAS was met with ONLY a UUID.  My point was, the UUID *is* the identifier, and it can meet the TGAS requirements and recommendations *provided* that there is an appropriate HTTP GET resolution service for it, and provided that the UUID is exposed externally only in the context of the relevant resolution metadata.  In other words, I *COMPLETELY* agree with you (and have tried to make this clear all along) that one would never see something like "<dc:identifier>A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523</dc:identifier>" in an RDF (or other similar) document.  But I do believe that something like "<dc:identifier>http://zoobank.org/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523</dc:identifier>" *would* be compliant.

> You MUST have an HTTP proxied version of it in order to "do the right thing"
> (i.e. GUID A.S. rec 10) and provide metadata in the form of RDF serialized as XML.

Yes, exactly.

> That HTTP proxied version isn't just going to be seen as a "resolution mechanism".

But my point is that it *should* be.  In other words, our community should rise to that level of sophistication, because it would, I am quite certain, benefit us in the long run.

>  If you and GNUB are going to participate in BiSciCol as I understand 
> it to be developing (and I believe that you are), you will HAVE to 
> have an HTTP URI version of your UUIDs and in that context the raw 
> UUID will be relatively irrelevant.

Of course!  And if you ever thought otherwise, then obviously I am not expressing myself well.  Maybe part of our argument is that you are focused on implementation, and I am speaking more on principle.  I thought I made it clear in my first post on this thread that a UUID by itself is not actionable (recall my example of walking through the park and discovering a UUID written on a slip of paper), and therefore not, by itself, functional as a persistent actionable identifier (sensu TDWG/GBIF). My only point in all of this is that identification and resolution are two separate functions, and we should be sophisticated enough to recognize the distinction.  I don't know if it's feasible, but I think one way that it could be made feasible comes back to my suggestion of a registry of resolution services.  This is not going backward; it's going forward.  However, our community may have its hands full with just implementing the things we most need to implement, and may not have the luxury of time and resources to implement a standard acknowledgement of the distinction between resolution services and object identification -- but my contention is that we ignore that distinction at our peril.

> My point is that you should decide on just one of these HTTP URIs and 
> use that as your identifier when you communicate with the outside 
> world.

That is already the case (has been the case ever since July 2007, when Kevin Richards set up our LSID resolution service).

> My preference would be "http://zoobank.org/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523"
> as the shortest and least complex one that would do everything that 
> needs to get done.

Well, for various reasons we went with the LSID version:
"http://zoobank.org/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523"

Or, as RDF in accordance with the LSID spec:

http://zoobank.org/authority/metadata/?lsid=urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523

> I guess that there isn't a problem with the other nine existing, but 
> from my point of view there is nothing but harm to be done by exposing 
> them to the outside world.

I guess that depends on what you mean by "exposing" them.  In my mind, they are already "exposed" because they work.  However, I don't think anyone would (or should) embed them in semantic documents as though they were TDWG-style GUIDs.  HOWEVER, the point I was originally making is that if we could (rightly) recognize the different roles of identification and resolution, then we wouldn't have a problem.  You could very easily use your preferred "short" version of "http://zoobank.org/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523", and a reasoning service would have no difficulty recognizing it as identifying the same object as urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523, or http://zoobank.org/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523.  I realize there is no elegant way to do this using existing RDF syntax, which is why this is *really* a much more fundamental argument than just TDWG-space.  But in my extremely naïve way of representing it, it might look something like:

<rdf:Description rdf:about="http://zoobank.org/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523">
    <dc:identifier>A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523</dc:identifier>
    <xxx:resolutionService>http://zoobank.org/</xxx:resolutionService>
</rdf:Description>

...which would have no trouble combining with a document that had something like this:

<rdf:Description rdf:about="http://zoobank.org/?uuid=A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523">
    <dc:identifier>A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523</dc:identifier>
    <xxx:resolutionService>http://zoobank.org/?uuid=</xxx:resolutionService>
</rdf:Description>
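The kind of matching described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name extract_uuid is hypothetical, the pattern simply looks for a canonical 8-4-4-4-12 UUID anywhere in the string, and no real reasoning service is implied. The input strings are the forms quoted in this message.

```python
import re
import uuid

# Matches a UUID in its canonical 8-4-4-4-12 hexadecimal form.
UUID_RE = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def extract_uuid(identifier: str) -> uuid.UUID:
    """Return the UUID embedded in a bare string, an LSID, or an HTTP proxy URI."""
    match = UUID_RE.search(identifier)
    if match is None:
        raise ValueError(f"no UUID found in {identifier!r}")
    return uuid.UUID(match.group(0))


forms = [
    "A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523",
    "urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523",
    "http://zoobank.org/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523",
    "http://zoobank.org/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523",
    "http://zoobank.org/?uuid=A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523",
]

# Every resolution syntax collapses to the same underlying identity.
identities = {extract_uuid(form) for form in forms}
```

Comparing the extracted UUIDs rather than the full strings is what lets two datasets that used different resolution syntaxes be recognized as citing the same object.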

> The other point which I was trying to make is: why would you choose to 
> expose to the outside world an identifier that only does part of the 
> desirable things that we want (i.e. my list of 8 desirable attributes 
> of a GUID), when you could use a modification of that identifier that would do everything you want?

I would *never* "choose" to do that.  However, I may very well be stuck with that due to insufficient resources and expertise.  *That* is what I intend to fix now that I (finally) have both resources and expertise.

> But with virtually no additional cost (15 minutes of time from 
> somebody who knows how to create a single 3 kB XSLT file)

Ah....if only I had 15 minutes of such a person's time before now!  :-)

> I would assert the same thing about LSIDs.  Why would you create an 
> identifier that is part of (what seems to me to be universally 
> recognized as) a dead technology when you could create a simpler HTTP 
> URI that would do the same thing and potentially more?

The answer to that is much easier, and should be self-evident when you consider what I already mentioned previously: that the service was established in the summer of 2007.  At that time, LSID was absolutely NOT dead, and indeed was actively being promoted by both TDWG and GBIF.  This was the outcome of the two GUID workshops those organizations sponsored.  There certainly were detractors to LSIDs back then, making the same arguments they are making now.  The extent to which LSIDs are currently perceived as "dead" by some is due largely to the self-fulfilling prophecy of those detractors.

But in any case, regardless of whether LSIDs really are dead or not, and regardless of why that may be so (if it is so), there were very good reasons why ZooBank went with LSIDs.  And while I realize that the four years since then are a veritable EON in IT contexts, keep in mind that ZooBank has to think in terms of centuries.  In that context, the HTTP protocol is not guaranteed to be persistent, and things like DOI are pretty-much downright ephemeral. In fact, this is exactly why I went with UUIDs in the first place.  As long as electronic data are stored in binary form, 128 bits will have mathematical stability. *That's* why I realized that UUIDs were the only defensible choice for the "real" identifier, and is the identifier that ZooBank will persist.  The choice of LSID as a resolution protocol was, as already stated, influenced by the thinking of our community at the time. *My* thinking at the time was that the only thing with any real plausibility of ICZN-scale longevity was binary data encoding (even that may not withstand more than a few decades), so I embraced UUIDs (which is to say, I embraced 128-bit identity). Everything else (LSID protocol, HTTP protocol, etc.) could be regarded as no more than the "resolution mechanism du jour".  Perhaps this starts to explain why I keep emphasizing the distinction between identity and metadata resolution.  The ZooBank registry has to think in terms of long-term identity, and assume that resolution mechanisms will continue to change as the technological wind blows.
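The "128-bit identity" idea can be illustrated with Python's standard uuid module. This is a minimal sketch, not a description of how ZooBank stores or serves its identifiers; the LSID and proxy strings are built only from the patterns already quoted in this thread.

```python
import uuid

# The persistent identity: 128 bits, independent of any resolution protocol.
identity = uuid.UUID("A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523")
raw_bytes = identity.bytes   # 16 bytes = 128 bits
as_integer = identity.int    # the same 128 bits as one integer

# Resolution wrappers are derived from the identity and can change over time.
lsid = f"urn:lsid:zoobank.org:act:{str(identity).upper()}"
http_proxy = f"http://zoobank.org/{lsid}"

# The identity survives a round trip through its raw binary form.
restored = uuid.UUID(bytes=raw_bytes)
```

The wrappers (lsid, http_proxy) are disposable strings computed from the identity; only the 16 bytes need to persist.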

> In the case of uBio and Biodiversity Collections Index, they were set 
> up when LSIDs were believed to be the "Next Big Thing".

Actually, all of us were implementing them at the same time.  I think IPNI was one of the first; BCI came later.  This all emerged from the two TDWG/GBIF GUID workshops.

> That did not turn out to be the case, so those organizations are stuck 
> with painful HTTP URIs like 
> "http://biocol.org/urn:lsid:biocol.org:col:35115" and 
> "http://www.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank:9479554"
> when they could have had "http://biocol.org/35115"
> and "http://www.ubio.org/9479554".  I would say "lesson learned" -

Ha!  Hardly!  We are only just now beginning to start learning lessons.  Let's revisit this conversation again in a couple of decades and see how many more lessons are yet in store for us.

In any case, my family just woke up from their nap, so I'll have to look at the rest of your message later, after some time with Mickey and the gang.

Aloha,
Rich

Richard L. Pyle, PhD
Database Coordinator for Natural Sciences
Associate Zoologist in Ichthyology
Dive Safety Officer
Department of Natural Sciences, Bishop Museum
1525 Bernice St., Honolulu, HI 96817
Ph: (808)848-4115, Fax: (808)847-8252
email: deepreef at bishopmuseum.org
http://hbs.bishopmuseum.org/staff/pylerichard.html

_______________________________________________
tdwg-content mailing list
tdwg-content at lists.tdwg.org
http://lists.tdwg.org/mailman/listinfo/tdwg-content
