RFC 4122 as motivation, was Re: Why UUIDs alone are not adequate as GUIDs
For better or worse, the TDWG Applicability Statement for GUID's, and most of the TDWG community, uses "GUID" in a generic sense, not conformant to RFC 4122 which declares "GUID" and "UUID" to be equivalent terms.
Also, according to http://www.rfc-editor.org (which by STD 1 always contains the current status of any RFC), it seems that RFC 4122 has never advanced past "Proposed Standard" in the 6 years since it was published. So, however one reads RFC 4122 on the question of "*ONE* GUID" (meaning in the 4122 context "*ONE* UUID"), at best "*ONE* GUID" is a proposal, not a standard in the sense of IETF. Not even a Draft Standard.
Finally, FWIW, the author of http://en.wikipedia.org/wiki/Universally_unique_identifier seems to not take "Unique" to mean at most one per resource:
"The intent of UUIDs is to enable distributed systems to uniquely identify information without significant central coordination. Thus, anyone can create a UUID and use it to identify something with reasonable confidence that the identifier will never be unintentionally used by anyone for anything else. Information labeled with UUIDs can therefore be later combined into a single database without needing to resolve name conflicts."
I find nothing in the above, nor in RFC 4122, that prohibits multiple UUIDs for the same resource, counterproductive as that might be. The TDWG GUID Applicability Statement, however, needs some cleaning up on related points, since its narrative on this issue is internally confusing. On balance, it comes out implicitly in favor of a single UUID (or any other kind of "GUID") for each resource, issued only by the resource provider, but it explicitly permits multiple schemes for identifying a resource.
Bob Morris
On Wed, Jun 8, 2011 at 10:56 PM, Paul Murray pmurray@anbg.gov.au wrote:
On 08/06/2011, at 8:05 AM, Steve Baskauf wrote:
... I think it's foolish to regard all of these different resolution mechanisms as distinct "identifiers". There is *ONE* GUID. It is: A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523. There are ten different ways to make it actionable. It therefore meets the recommendations of the applicability statement.
The problem is that when you create an HTTP URI out of a UUID, you are creating an identifier whether you think you are or not.
Jumping in again, but perhaps RFC 4122 might help a little here. A GUID (or UUID) is a set of 128 bits, 16 octets, 32 hex digits, 5 inches of punched paper tape. However you choose to write or express it, there is indeed "*ONE* GUID". A URI is not a GUID. This:
http://example.org/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
is a different URI to this:
http://my.organisation.org/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
is a different URI to this:
http://example.org/A9F435E08ED746DDBAB4EA8E5BF41523
Furthermore, these URIs have nothing whatever to do with the GUID, apart from the fact that it's obvious to us humans that they do. Fortunately, there is a standard for expressing a GUID/UUID as a URI, and it is the "uuid" URN namespace, defined in RFC 4122. Thus:
urn:uuid:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
is a URI that, according to an IETF specification, corresponds to the 128-bit GUID. This:
urn:uuid:A9F435E08ED746DDBAB4EA8E5BF41523
is *not valid*: it doesn't conform to the schema. There is one unique (case insensitive) uuid URN for any GUID, and a defined equivalence between them. These are not "cool URIs", but GUIDs are inherently uncool so that's to be expected. If you want to use GUIDs for identifiers and need equivalent URIs (for use in RDF and the semweb), then urn:uuid:<the guid> might be a good way to go.
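To make the distinction above concrete, here is a minimal Python sketch using only the standard library. The function names are hypothetical and chosen for illustration. Note that Python's `uuid.UUID` constructor is lenient and accepts the hyphenless spelling, so the strict check of the URN form is done separately with a regular expression based on the 8-4-4-4-12 textual layout in RFC 4122.

```python
import re
import uuid

# Textual layout of a UUID URN per RFC 4122: "urn:uuid:" followed by
# 8-4-4-4-12 hex digits. Matching is case-insensitive.
UUID_URN_PATTERN = re.compile(
    r"urn:uuid:[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)


def is_valid_uuid_urn(text: str) -> bool:
    """Return True only if text is a syntactically valid urn:uuid: URI."""
    return UUID_URN_PATTERN.fullmatch(text) is not None


def to_uuid_urn(text: str) -> str:
    """Normalize any spelling of a UUID that Python accepts to its URN form.

    The result is lowercase, which is how Python's uuid module prints UUIDs.
    """
    return uuid.UUID(text).urn


def same_uuid(a: str, b: str) -> bool:
    """Compare two spellings by their underlying 128-bit value."""
    return uuid.UUID(a) == uuid.UUID(b)
```

With these helpers, the hyphenated and hyphenless spellings compare equal as 128-bit values, while only the hyphenated `urn:uuid:` string passes the strict syntax check.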
tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content
While I generally accept Bob's careful research, and while I think it is imperative that multiple IDs are allowed in principle, so as to avoid monopolies, my feeling is:
* The number of IDs should be minimized.
* The http:-URI is de facto the most relevant ID in the semantic web.
* Avoid multiple alternative ways of embedding a UUID-string in a resolvable URI. All forms have to be added as sameAs information. They become ballast for future generations.
That is: take the http-ID seriously as a resource requiring long-term management and persistence.
Finally: Success is measured by the adoption by people not visiting tdwg meetings. This is a social issue.
The last point is my main reservation about the TDWG Applicability Statement for GUID's.
Gregor
Thank you Gregor for very succinctly expressing what I think is the important take-home message in this discussion! I think that one of the things that makes me want to scream and run away from TDWG is the excessive "point making" that goes on on this list. I hesitate to make the following statement because somebody is going to find an earlier email of mine and find one that I wrote "just to make a point". But I would like to believe the following is true: I am not participating in TDWG for intellectual stimulation, social networking, career advancement, or entertainment. I am participating because I believe that it will help me achieve some useful result in a reasonable amount of time.
Given that axiom, I personally don't care very much how many "correct" ways there are of creating identifiers that will "officially" work in RDF, or what clever possible future technology or URI schemes Android might or might not be creating. I will happily allow Rich to call the UUID version of his identifiers a "GUID" and the HTTP proxied version "Rumpelstiltskin" while I call the HTTP proxied version the "GUID" and the UUID "a string". That simply does not matter.
What does matter is that after the years that TDWG has spent spinning its wheels on the issue of GUIDs, we finally have a system (HTTP URIs, HTTP as a universally understood information transfer method, and RDF as a lingua franca for marking up metadata) that is implementable for creating a distributed system for unambiguously identifying and transferring information about biodiversity resources (the "dream" laid out in the old TAG roadmaps). Not only is that system implementable, but it HAS been very successfully implemented by people within our community and is increasingly being implemented more broadly outside our community.
We have people on this list (mostly silent, but I know they are there because I sometimes get off-list emails from them) who don't know how to do the things that the "experts" know how to do, and they are coming here for information and advice on how to actually make things work according to the standards TDWG has established. Given that, I consider it extremely helpful to have examples of implementations that actually "work" right now, and not particularly helpful to have examples of things that could be done but would be a bad idea, or which can't be implemented in a finite amount of time, or that might be done at some point in the future but that don't actually work in the present. I would suggest that we keep that at the forefront of our minds when we post.
I do apologize for my part in the long email exchanges with Rich which some might consider tedious, but in the end I think they have produced some useful results. From Rich's last post, I now know that he intends for http://zoobank.org/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF4... to be the widely circulated http proxied form of his zoobank UUIDs. Given that zoobank issued LSIDs, it is doing the right thing to maintain them even if nobody uses them. So from the perspective of offering a useful example, I would suggest that
<rdf:Description rdf:about="urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523">
<dcterms:identifier>A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523</dcterms:identifier>
<owl:sameAs rdf:resource="http://zoobank.org/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF4..."/>
...
</rdf:Description>
would be a way to mark up the information about his various identifiers that would be both "correct" (in the sense of not breaking any rules about RDF) and also useful as a template for people who want to give LOD a chance. I looked up dcterms:identifier to see what the definition was. It says "An unambiguous reference to the resource within a given context. ... Recommended best practice is to identify the resource by means of a string conforming to a formal identification system." The bugaboo here is "formal identification system" and whether UUID fits that definition. But I would venture to say that one could "get away with it" and it would achieve Rich's goal of letting the universe in hundreds of years know "this is the identifier that I intend for that object".
Because XML is just plain text, the RDF file would not have to be considered part of a magically actionable system. It could also just be read as a marked-up plain text file, and dcterms are about the most well-known and stable thing we have at the moment for imparting information about what we intend things to mean. However, the rdf:about and owl:sameAs statements would also make either the HTTP URI or the LSID work in the "here and now" of Linked Data. Including the HTTP URI version would allow a semantic client to "look up" information through the existing network system (i.e. using the HTTP protocol), assuming that Rich gets the zoobank system to return content of type application/rdf+xml when that is asked for by a semantic client, rather than always HTML.
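For readers wondering what "return RDF when a semantic client asks for it" could look like on the server side, here is a deliberately simplified Python sketch. It is an assumption-laden illustration and says nothing about how zoobank actually works: the function name `preferred_type` is hypothetical, wildcards such as `*/*` are ignored, and anything the function does not recognize falls back to HTML.

```python
def preferred_type(accept: str,
                   offered=("text/html", "application/rdf+xml")) -> str:
    """Pick the offered media type with the highest q-value in an Accept header.

    Simplified on purpose: wildcards are not interpreted, and the first
    offered type (HTML here) is the fallback when nothing matches.
    """
    best, best_q = offered[0], 0.0
    for part in accept.split(","):
        fields = [field.strip() for field in part.split(";")]
        media = fields[0].lower()
        quality = 1.0  # a media type without an explicit q counts as q=1
        for field in fields[1:]:
            if field.startswith("q="):
                try:
                    quality = float(field[2:])
                except ValueError:
                    quality = 0.0  # treat a malformed q-value as "not wanted"
        if media in offered and quality > best_q:
            best, best_q = media, quality
    return best
```

A client that sends `Accept: application/rdf+xml` would be routed to the RDF/XML representation, while a typical browser header that lists `text/html` first would still get HTML.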
Further, in the interest of achieving what Gregor so clearly stated, I would recommend (beg on my knees?) that when GNUB is set up (assuming that it uses UUIDs), it create a single, simple HTTP URI proxied form of the UUID (another GUID or a Rumpelstiltskin if you prefer) that can be used by those who want to give LOD a chance. The domain name should be something that is intended to persist for a very long time (purl.org would allow maintenance to be transferred, but I personally don't care). The RDF for the TNUs could then look something like this:
<rdf:Description rdf:about="http://purl.org/tnu/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41524">
<dcterms:identifier>A9F435E0-8ED7-46DD-BAB4-EA8E5BF41524</dcterms:identifier>
...
</rdf:Description>
(replace purl.org/tnu with domain name of your choice). Assuming that interest in LSIDs is as low as it seems to be, I would just skip the hassle of messing with them. All of your RDF served would then be one line shorter and do what Gregor suggested (minimize the number of IDs) as well.
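As a rough illustration of how such a description might be generated systematically, here is a minimal Python sketch using only the standard library. The function name `describe_tnu` is hypothetical, and the default base URI is just the purl.org/tnu placeholder from this message, not a real service. The RDF and Dublin Core Terms namespace URIs are the standard ones.

```python
import uuid
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DCTERMS_NS = "http://purl.org/dc/terms/"

# Ask ElementTree to emit the conventional prefixes instead of ns0, ns1.
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("dcterms", DCTERMS_NS)


def describe_tnu(identifier: str,
                 base_uri: str = "http://purl.org/tnu/") -> str:
    """Build a one-resource RDF/XML document for a UUID-identified record.

    base_uri is a placeholder; substitute the domain you intend to maintain.
    """
    # Parsing through uuid.UUID rejects malformed input; the result is
    # upper-cased only to match the spelling used in this thread.
    canonical = str(uuid.UUID(identifier)).upper()

    root = ET.Element(f"{{{RDF_NS}}}RDF")
    description = ET.SubElement(
        root,
        f"{{{RDF_NS}}}Description",
        {f"{{{RDF_NS}}}about": base_uri + canonical},
    )
    ident = ET.SubElement(description, f"{{{DCTERMS_NS}}}identifier")
    ident.text = canonical
    return ET.tostring(root, encoding="unicode")
```

Calling `describe_tnu("A9F435E0-8ED7-46DD-BAB4-EA8E5BF41524")` yields an rdf:RDF element wrapping one rdf:Description whose rdf:about is the HTTP URI and whose dcterms:identifier is the bare UUID.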
Also in the interest of providing examples, I mentioned the XSLT option for simple human-friendly content negotiation. Here is how I did it. I made a single 3 kb XSL stylesheet (XSLT) file: http://bioimages.vanderbilt.edu/taxon/taxonconcepts.xsl which is sitting in the same directory as the rdf files. Then each RDF file contains
<?xml-stylesheet type="text/xsl" href="taxonconcepts.xsl"?>
right after the
<?xml version="1.0" encoding="UTF-8"?>
line. The server is set up so that when a URI like http://bioimages.vanderbilt.edu/taxon/19422-weakley2010 is dereferenced, the client is sent the file http://bioimages.vanderbilt.edu/taxon/19422-weakley2010.rdf regardless of the content-type requested. So a semantic client gets the RDF/XML and a web browser formats the XML for humans according to the XSL stylesheet. I call this "poor man's content negotiation" because it requires virtually no maintenance or sophisticated server resources.
One does have to maintain a consistent RDF structure because the XSLT is a "dumb" static file, but if your RDF is being generated systematically, it will probably have a consistent format anyway. It also means that a human has to use "view page source" to look at the underlying RDF, but 99% of human clients won't care about that, and the 1% that does care will probably know how to view the page source.
I want to be clear here that what I am trying to show in this example is NOT anything about taxon concepts, proper RDF format, the correctness of Darwin-SW, etc., or to say that this is the only, best, or most proper way to achieve content negotiation. What I AM trying to show is that if you are going to provide RDF for computers, there is virtually no additional cost to also providing a human-readable version. I am essentially a computer dummy. I went to a bookstore and bought a book on XSLT and wrote the file myself. If I can do that, then any organization that has a "real" computer person on its staff could accomplish this as well, and I don't see any reason NOT to do it, even if the information provided is intended primarily for computer-to-computer communication.
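If the RDF/XML files are generated by a script, the stylesheet line can be added at generation time. The following Python sketch shows one way that might be done. The function name `add_stylesheet` is hypothetical, and the default `href` is simply the file name used in the example above.

```python
def add_stylesheet(rdf_xml: str, href: str = "taxonconcepts.xsl") -> str:
    """Insert an xml-stylesheet processing instruction into an XML document.

    The instruction goes right after the XML declaration if there is one,
    otherwise at the very top. Documents that already carry a stylesheet
    instruction are returned unchanged.
    """
    instruction = f'<?xml-stylesheet type="text/xsl" href="{href}"?>'
    if "<?xml-stylesheet" in rdf_xml:
        return rdf_xml
    if rdf_xml.startswith("<?xml "):
        # Split just after the closing "?>" of the XML declaration.
        end = rdf_xml.index("?>") + 2
        return rdf_xml[:end] + "\n" + instruction + rdf_xml[end:]
    return instruction + "\n" + rdf_xml
```

Because the function leaves already-styled documents alone, it can be run repeatedly over a directory of files without stacking duplicate instructions.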
Steve
I should correct what I said below to draw the distinction between RDF and RDF/XML:
One does have to maintain a consistent RDF structure because the XSLT is a "dumb" static file, but if your RDF is being generated systematically, it will probably have a consistent format anyway. It also means that a human has to use "view page source" to look at the underlying RDF, but
This should have said, "maintain a consistent XML structure ... if your RDF/XML is being ..." since XML is just one way of serializing RDF. However, it is *the* recommended serialization of RDF for GUID resolution, i.e. the GUID A.S. rec 10 says "The default metadata response format *should* be RDF serialized as XML." (emphasis in the document), so people following the rec will be generating it. What I was trying to get across is that there are easy ways to create human-friendly representations.
Steve
participants (3)
- Bob Morris
- Gregor Hagedorn
- Steve Baskauf