[tdwg-content] RFC 4122 as motivation, was Re: Why UUIDs alone are not adequate as GUIDs

Thu Jun 9 14:55:26 CEST 2011

Thank you Gregor for very succinctly expressing what I think is the 
important take-home message in this discussion!  I think that one of the 
things that makes me want to scream and run away from TDWG is the 
excessive "point making" that goes on on this list.  I hesitate to make 
the following statement because somebody is going to dig through my 
earlier emails and find one that I wrote "just to make a point".  But I 
would like to believe the following is true: I am not participating in 
TDWG for intellectual stimulation, social networking, career 
advancement, or entertainment.  I am participating because I believe 
that it will help me achieve some useful result in a reasonable amount 
of time. 

Given that axiom, I personally don't care very much how many "correct" 
ways there are of creating identifiers that will "officially" work 
in RDF or what clever possible future technology or URI schemes Android 
might or might not be creating.  I will happily allow Rich to call the 
UUID version of his identifiers a "GUID" and the HTTP proxied version 
"Rumpelstiltskin" while I call the HTTP proxied version the "GUID" and 
the UUID "a string".  That simply does not matter.  What does matter is 
that after the years of time that TDWG has spent spinning its wheels on 
the issue of GUIDs, we finally have a system (HTTP URIs, HTTP as a 
universally understood information transfer method, and RDF as a lingua 
franca for marking up metadata) that is implementable for creating a 
distributed system for unambiguously identifying and transferring 
information about biodiversity resources (the "dream" laid out in the 
old TAG roadmaps).  Not only is that system implementable, but it HAS 
been very successfully implemented by people within our community and is 
increasingly being more broadly implemented outside our community.  We 
have people on this list (mostly silent, but I know they are there 
because I sometimes get off-list emails from them) who don't know how to 
do the things that the "experts" know how to do and they are coming here 
for information and advice on how to actually make things work according 
to the standards TDWG has established.  Given that, I consider it 
extremely helpful to have examples of implementations that actually 
"work" right now, and not particularly helpful to have examples of 
things that could be done but would be a bad idea, or which can't be 
implemented in a finite amount of time, or that might be done at some 
point in the future but that don't actually work in the present. I would 
suggest that we keep that at the forefront in our minds when we post.

I do apologize for my part in the long email exchanges with Rich which 
some might consider tedious, but in the end I think they have produced 
some useful results.  From Rich's last post, I now know that he intends 
for 
http://zoobank.org/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523 
to be the widely circulated http proxied form of his zoobank UUIDs.  
Given that zoobank issued LSIDs, it is doing the right thing to maintain 
them even if nobody uses them.  So from the perspective of offering a 
useful example, I would suggest that

  <rdf:Description rdf:about="urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523">
    <dcterms:identifier>A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523</dcterms:identifier>
    <owl:sameAs rdf:resource="http://zoobank.org/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523"/>
    ...
  </rdf:Description>

would be a way to mark up the information about his various identifiers 
that would be both "correct" (in the sense of not breaking any rules 
about RDF) and also useful as a template for people who want to give LOD 
a chance.  I looked up dcterms:identifier to see what the definition 
was.  It says "An unambiguous reference to the resource within a given 
context. ... Recommended best practice is to identify the resource by 
means of a string conforming to a formal identification system."  The 
bugaboo here is "formal identification system" and whether UUID fits 
that definition.  But I would venture to say that one could "get away 
with it" and it would achieve Rich's goal of letting the universe in 
hundreds of years know "this is the identifier that I intend for that 
object".  Because XML is just plain text, the RDF file would not have to 
be considered to be part of a magically actionable system.  It could 
also just be read as a marked up plain text file and dcterms are about 
the most well-known and stable thing we have at the moment for imparting 
information about what we intend things to mean.  However, rdf:about and 
owl:sameAs statements would also make either the HTTP URI or the LSID 
work in the "here and now" of Linked Data.  Including the HTTP URI 
version would allow a semantic client to "look up" information through 
the existing network system (i.e. using HTTP protocol) assuming that 
Rich gets the zoobank system to return content of type 
application/rdf+xml when that is asked for by a semantic client rather 
than always HTML. 
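
As a rough illustration of both readings of that markup, here is a minimal 
Python sketch using only the standard library.  It is not anything zoobank 
actually runs.  The rdf:RDF wrapper and the namespace declarations are 
assumptions filled in around the snippet above (which elides them), and the 
function names read_identifiers and build_rdf_request are made up for this 
example.  The first function simply reads the three identifiers back out of 
the file; the second prepares, without sending, the kind of request a 
semantic client would make when asking for RDF/XML.

```python
import urllib.request
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DCTERMS_NS = "http://purl.org/dc/terms/"
OWL_NS = "http://www.w3.org/2002/07/owl#"

UUID = "A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523"
LSID = "urn:lsid:zoobank.org:act:" + UUID
HTTP_URI = "http://zoobank.org/" + LSID

# Assumption: the rdf:RDF wrapper and xmlns declarations are added here
# because the snippet in the message leaves them out.
SAMPLE_RDF = f"""<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="{RDF_NS}"
         xmlns:dcterms="{DCTERMS_NS}"
         xmlns:owl="{OWL_NS}">
  <rdf:Description rdf:about="{LSID}">
    <dcterms:identifier>{UUID}</dcterms:identifier>
    <owl:sameAs rdf:resource="{HTTP_URI}"/>
  </rdf:Description>
</rdf:RDF>
"""


def read_identifiers(rdf_xml):
    """Return the subject URI, the plain identifier, and any sameAs URIs."""
    root = ET.fromstring(rdf_xml.encode("utf-8"))
    desc = root.find(f"{{{RDF_NS}}}Description")
    return {
        "about": desc.get(f"{{{RDF_NS}}}about"),
        "identifier": desc.findtext(f"{{{DCTERMS_NS}}}identifier"),
        "same_as": [
            el.get(f"{{{RDF_NS}}}resource")
            for el in desc.findall(f"{{{OWL_NS}}}sameAs")
        ],
    }


def build_rdf_request(uri):
    """Prepare (but do not send) a request that asks for RDF/XML."""
    return urllib.request.Request(uri, headers={"Accept": "application/rdf+xml"})


if __name__ == "__main__":
    print(read_identifiers(SAMPLE_RDF))
    print(build_rdf_request(HTTP_URI).get_header("Accept"))
```

Passing the prepared request to urllib.request.urlopen would perform the 
actual lookup; whether RDF comes back depends on the server honoring the 
Accept header.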

Further, in the interest of achieving what Gregor so clearly stated, I 
would recommend (beg on my knees?) that when GNUB is set up (assuming 
that it uses UUIDs) that it creates a single, simple HTTP URI proxied 
form of the UUID (another GUID or a Rumpelstiltskin if you prefer) that 
can be used by those who want to give LOD a chance.  The domain name 
should be something that is intended to persist for a very long time 
(purl.org would allow maintenance to be transferred but I personally 
don't care).  The RDF for the TNUs could then look something like this:

  <rdf:Description rdf:about="http://purl.org/tnu/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41524">
    <dcterms:identifier>A9F435E0-8ED7-46DD-BAB4-EA8E5BF41524</dcterms:identifier>
    ...
  </rdf:Description>

(replace purl.org/tnu with domain name of your choice).  Assuming that 
interest in LSIDs is as low as it seems to be, I would just skip the 
hassle of messing with them.  All of your RDF served would then be one 
line shorter and do what Gregor suggested (minimize the number of IDs) 
as well.
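
To make the "single, simple HTTP URI" idea concrete, here is a hedged 
Python sketch of how such a URI and its RDF could be generated from a UUID.  
Nothing here describes how GNUB will actually work: the base 
http://purl.org/tnu/ is only the placeholder from the example above, the 
function names proxied_uri and describe are invented for illustration, and 
writing the UUID in upper case is an assumption made to match the examples 
in this message.

```python
import uuid
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DCTERMS_NS = "http://purl.org/dc/terms/"

# Placeholder base from the example above; substitute your own domain.
BASE_URI = "http://purl.org/tnu/"


def canonical_uuid(uuid_text):
    """Validate a UUID string; raises ValueError if it is malformed."""
    # Assumption: upper case is used only to match the examples in this
    # message; the uuid module itself prints lower case.
    return str(uuid.UUID(uuid_text)).upper()


def proxied_uri(uuid_text, base=BASE_URI):
    """Return the one HTTP URI form of the UUID."""
    return base + canonical_uuid(uuid_text)


def describe(uuid_text, base=BASE_URI):
    """Serialize a minimal RDF/XML description for one UUID."""
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("dcterms", DCTERMS_NS)
    ident = canonical_uuid(uuid_text)
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    desc = ET.SubElement(
        root,
        f"{{{RDF_NS}}}Description",
        {f"{{{RDF_NS}}}about": base + ident},
    )
    ET.SubElement(desc, f"{{{DCTERMS_NS}}}identifier").text = ident
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(proxied_uri("A9F435E0-8ED7-46DD-BAB4-EA8E5BF41524"))
    print(describe("A9F435E0-8ED7-46DD-BAB4-EA8E5BF41524"))
```

Because there is exactly one function that turns a UUID into a URI, there 
is exactly one HTTP form in circulation and no owl:sameAs line is needed.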

Also in the interest of providing examples, I mentioned the XSLT option 
for simple human-friendly content negotiation.  Here is how I did it.  I 
made a single 3 kb XSL stylesheet (XSLT) file:
http://bioimages.vanderbilt.edu/taxon/taxonconcepts.xsl
which is sitting in the same directory as the rdf files.  Then each RDF 
file contains

<?xml-stylesheet type="text/xsl" href="taxonconcepts.xsl"?>

right after the

<?xml version="1.0" encoding="UTF-8"?>

line.  The server is set up so that when a URI like 
http://bioimages.vanderbilt.edu/taxon/19422-weakley2010 is dereferenced, 
the client is sent the file 
http://bioimages.vanderbilt.edu/taxon/19422-weakley2010.rdf regardless 
of the content type requested (that is, whatever the client's Accept 
header says).  So a semantic client gets the RDF/XML 
and a web browser formats the XML for humans according to the XSL 
stylesheet.  I call this "poor man's content negotiation" because it 
requires virtually no maintenance or sophisticated server resources.  
One does have to maintain a consistent RDF structure because the XSLT is 
a "dumb" static file, but if your RDF is being generated systematically, 
it will probably have a consistent format anyway.  It also means that a 
human has to use "view page source" to look at the underlying RDF, but 
99% of human clients won't care about that, and the 1% that does 
care will probably know how to view the page source.  I want to 
be clear here that what I am trying to show in this example is NOT 
anything about taxon concepts, proper RDF format, the correctness of 
Darwin-SW, etc. or to say that this is the only, best, or most proper 
way to achieve content negotiation.  What I AM trying to show is that if 
you are going to provide RDF for computers, there is virtually no 
additional cost to also providing a human readable version.  I am 
essentially a computer dummy.  I went to a bookstore and bought a book 
on XSLT and wrote the file myself.  If I can do that, then any 
organization that has a "real" computer person on their staff could 
accomplish this as well and I don't see any reason NOT to do it, even if 
the information provided is intended primarily for computer to computer 
communication. 
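
For anyone generating their RDF files with a script, the two moving parts 
of this setup can be sketched in a few lines of Python.  This is only an 
illustration under stated assumptions, not the actual configuration of the 
bioimages server: the function names add_stylesheet_pi and rdf_file_for 
are invented here, and the second function merely mimics the "always send 
the .rdf file" rule described above.

```python
import re

# Matches an XML declaration at the very start of the document, plus the
# line break that follows it (if any).
XML_DECL = re.compile(r"^<\?xml\s[^>]*\?>[ \t]*\r?\n?")


def add_stylesheet_pi(rdf_xml, href="taxonconcepts.xsl"):
    """Insert the xml-stylesheet line right after the XML declaration."""
    if "<?xml-stylesheet" in rdf_xml:
        return rdf_xml  # already present, leave the file alone
    pi = f'<?xml-stylesheet type="text/xsl" href="{href}"?>\n'
    match = XML_DECL.match(rdf_xml)
    if match is None:
        return pi + rdf_xml  # no declaration, so the line goes first
    decl = rdf_xml[:match.end()]
    if not decl.endswith("\n"):
        decl += "\n"
    return decl + pi + rdf_xml[match.end():]


def rdf_file_for(uri_path):
    """Map a dereferenced path to the .rdf file sent for it."""
    return uri_path if uri_path.endswith(".rdf") else uri_path + ".rdf"


if __name__ == "__main__":
    doc = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"/>\n'
    )
    print(add_stylesheet_pi(doc))
    print(rdf_file_for("/taxon/19422-weakley2010"))
```

The relative href works only because the stylesheet sits in the same 
directory as the RDF files, as described above.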

Steve

Gregor Hagedorn wrote:
> While I generally accept Bob's careful research, and while I think it
> is imperative that multiple IDs are allowed in principle, as to avoid
> monopolies, my feeling is:
>
> * The number of IDs should be minimized.
> * The http:-URI is defacto the most relevant ID in the semantic web.
> * Avoid multiple alternative ways of embedding a UUID-string in a
> resolvable URI. All forms have to be added as sameAs information. They
> become ballast for future generations.
>
> That is: take the http-ID serious as a resource requiring long-term
> management and persistence.
>
> Finally: Success is measured by the adoption by people not visiting
> tdwg meetings. This is a social issue.
>
> The last point is my main reservation about the TDWG Applicability
> Statement for GUID's.
>
> Gregor

-- 
Steven J. Baskauf, Ph.D., Senior Lecturer
Vanderbilt University Dept. of Biological Sciences

postal mail address:
VU Station B 351634
Nashville, TN  37235-1634,  U.S.A.

delivery address:
2125 Stevenson Center
1161 21st Ave., S.
Nashville, TN 37235

office: 2128 Stevenson Center
phone: (615) 343-4582,  fax: (615) 343-6707
http://bioimages.vanderbilt.edu