[tdwg-content] RFC 4122 as motivation, was Re: Why UUIDs alone are not adequate as GUIDs
Steve Baskauf
steve.baskauf at vanderbilt.edu
Thu Jun 9 14:55:26 CEST 2011
Thank you Gregor for very succinctly expressing what I think is the
important take-home message in this discussion! I think that one of the
things that makes me want to scream and run away from TDWG is the
excessive "point making" that goes on on this list. I hesitate to make
the following statement because somebody is going to dig up an earlier
email of mine that I wrote "just to make a point". But I
would like to believe the following is true: I am not participating in
TDWG for intellectual stimulation, social networking, career
advancement, or entertainment. I am participating because I believe
that it will help me achieve some useful result in a reasonable amount
of time.
Given that axiom, I personally don't care very much how many "correct"
ways there are of creating identifiers that will "officially" work
in RDF or what clever possible future technology or URI schemes Android
might or might not be creating. I will happily allow Rich to call the
UUID version of his identifiers a "GUID" and the HTTP proxied version
"Rumpelstiltskin" while I call the HTTP proxied version the "GUID" and
the UUID "a string". That simply does not matter. What does matter is
that after the years of time that TDWG has spent spinning its wheels on
the issue of GUIDs, we finally have a system (HTTP URIs, HTTP as a
universally understood information transfer method, and RDF as a lingua
franca for marking up metadata) that is implementable for creating a
distributed system for unambiguously identifying and transferring
information about biodiversity resources (the "dream" laid out in the
old TAG roadmaps). Not only is that system implementable, but it HAS
been very successfully implemented by people within our community and is
increasingly being more broadly implemented outside our community. We
have people on this list (mostly silent, but I know they are there
because I sometimes get off-list emails from them) who don't know how to
do the things that the "experts" know how to do and they are coming here
for information and advice on how to actually make things work according
to the standards TDWG has established. Given that, I consider it
extremely helpful to have examples of implementations that actually
"work" right now, and not particularly helpful to have examples of
things that could be done but would be a bad idea, or which can't be
implemented in a finite amount of time, or that might be done at some
point in the future but that don't actually work in the present. I would
suggest that we keep that at the forefront in our minds when we post.
I do apologize for my part in the long email exchanges with Rich which
some might consider tedious, but in the end I think they have produced
some useful results. From Rich's last post, I now know that he intends
for
http://zoobank.org/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
to be the widely circulated http proxied form of his zoobank UUIDs.
Given that zoobank issued LSIDs, it is doing the right thing to maintain
them even if nobody uses them. So from the perspective of offering a
useful example, I would suggest that
<rdf:Description
    rdf:about="urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523">
  <dcterms:identifier>A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523</dcterms:identifier>
  <owl:sameAs
      rdf:resource="http://zoobank.org/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523"/>
  ...
</rdf:Description>
would be a way to mark up the information about his various identifiers
that would be both "correct" (in the sense of not breaking any rules
about RDF) and also useful as a template for people who want to give LOD
a chance. I looked up dcterms:identifier to see what the definition
was. It says "An unambiguous reference to the resource within a given
context. ... Recommended best practice is to identify the resource by
means of a string conforming to a formal identification system." The
bugaboo here is "formal identification system" and whether UUID fits
that definition. But I would venture to say that one could "get away
with it" and it would achieve Rich's goal of letting the universe in
hundreds of years know "this is the identifier that I intend for that
object". Because XML is just plain text, the RDF file would not have to
be considered to be part of a magically actionable system. It could
also just be read as a marked up plain text file and dcterms are about
the most well-known and stable thing we have at the moment for imparting
information about what we intend things to mean. However, rdf:about and
owl:sameAs statements would also make either the HTTP URI or the LSID
work in the "here and now" of Linked Data. Including the HTTP URI
version would allow a semantic client to "look up" information through
the existing network system (i.e. using the HTTP protocol), assuming that
Rich gets the zoobank system to return RDF/XML (media type
application/rdf+xml) when a semantic client asks for it, rather than
always HTML.
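
As a rough illustration of how a client might read the markup above, here
is a minimal Python sketch. It assumes the rdf:Description sits inside an
rdf:RDF wrapper with the usual namespace declarations (the snippet in this
email omits both), and the function names identifiers() and rdf_request()
are hypothetical. It only builds the HTTP request; it does not claim that
zoobank.org currently answers it with RDF.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Standard namespace URIs for the three vocabularies used in the example.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcterms": "http://purl.org/dc/terms/",
    "owl": "http://www.w3.org/2002/07/owl#",
}
RDF_ABOUT = "{%s}about" % NS["rdf"]
RDF_RESOURCE = "{%s}resource" % NS["rdf"]

UUID = "A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523"
LSID = "urn:lsid:zoobank.org:act:" + UUID
HTTP_URI = "http://zoobank.org/" + LSID

# Assumption: the rdf:RDF wrapper and xmlns declarations are added here so
# the fragment parses; the email shows only the rdf:Description element.
EXAMPLE = f"""<rdf:RDF xmlns:rdf="{NS['rdf']}"
         xmlns:dcterms="{NS['dcterms']}"
         xmlns:owl="{NS['owl']}">
  <rdf:Description rdf:about="{LSID}">
    <dcterms:identifier>{UUID}</dcterms:identifier>
    <owl:sameAs rdf:resource="{HTTP_URI}"/>
  </rdf:Description>
</rdf:RDF>"""


def identifiers(rdf_xml):
    """Collect the three identifier forms from the first rdf:Description."""
    root = ET.fromstring(rdf_xml)
    desc = root.find("rdf:Description", NS)
    return {
        "about": desc.get(RDF_ABOUT),
        "identifier": desc.findtext("dcterms:identifier", namespaces=NS),
        "sameAs": [e.get(RDF_RESOURCE) for e in desc.findall("owl:sameAs", NS)],
    }


def rdf_request(uri):
    """Build (but do not send) a GET request that asks for RDF/XML."""
    return urllib.request.Request(uri, headers={"Accept": "application/rdf+xml"})


if __name__ == "__main__":
    ids = identifiers(EXAMPLE)
    print(ids["about"])
    print(ids["identifier"])
    print(ids["sameAs"])
```

Passing the sameAs value to rdf_request() gives a request a semantic client
could send with urllib.request.urlopen(); what comes back depends entirely
on how the server is configured.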
Further, in the interest of achieving what Gregor so clearly stated, I
would recommend (beg on my knees?) that when GNUB is set up (assuming
that it uses UUIDs) that it creates a single, simple HTTP URI proxied
form of the UUID (another GUID or a Rumpelstiltskin if you prefer) that
can be used by those who want to give LOD a chance. The domain name
should be something that is intended to persist for a very long time
(purl.org would allow maintenance to be transferred but I personally
don't care). The RDF for the TNUs could then look something like this:
<rdf:Description
    rdf:about="http://purl.org/tnu/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41524">
  <dcterms:identifier>A9F435E0-8ED7-46DD-BAB4-EA8E5BF41524</dcterms:identifier>
  ...
</rdf:Description>
(replace purl.org/tnu with domain name of your choice). Assuming that
interest in LSIDs is as low as it seems to be, I would just skip the
hassle of messing with them. All of your RDF served would then be one
line shorter and do what Gregor suggested (minimize the number of IDs)
as well.
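
To make the "single, simple HTTP URI proxied form" concrete, here is a
small Python sketch that derives the URI from a UUID and emits the
two-identifier description shown above. The base http://purl.org/tnu/ is
only the placeholder used in this email, the function names are
hypothetical, and upper-casing the UUID is an assumption made to match the
example strings; nothing here reflects how GNUB will actually be built.

```python
import uuid
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DCTERMS_NS = "http://purl.org/dc/terms/"

# Placeholder base from the email; replace with the domain of your choice.
BASE = "http://purl.org/tnu/"


def canonical_uuid(text):
    """Validate a UUID string and return it hyphenated and upper-cased.

    uuid.UUID raises ValueError for anything that is not a well-formed UUID.
    """
    return str(uuid.UUID(text)).upper()


def proxied_uri(text, base=BASE):
    """Return the one HTTP URI form of the UUID."""
    return base + canonical_uuid(text)


def tnu_rdf(text, base=BASE):
    """Serialize a minimal rdf:Description carrying both identifier forms."""
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("dcterms", DCTERMS_NS)
    root = ET.Element("{%s}RDF" % RDF_NS)
    desc = ET.SubElement(
        root,
        "{%s}Description" % RDF_NS,
        {"{%s}about" % RDF_NS: proxied_uri(text, base)},
    )
    ident = ET.SubElement(desc, "{%s}identifier" % DCTERMS_NS)
    ident.text = canonical_uuid(text)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(tnu_rdf("a9f435e0-8ed7-46dd-bab4-ea8e5bf41524"))
```

Keeping the URI a pure function of the UUID means there is only one way to
embed the UUID in a resolvable URI, which is the point Gregor makes below
about avoiding alternative forms.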
Also in the interest of providing examples, I mentioned the XSLT option
for simple human-friendly content negotiation. Here is how I did it. I
made a single 3 kb XSL stylesheet (XSLT) file:
http://bioimages.vanderbilt.edu/taxon/taxonconcepts.xsl
which is sitting in the same directory as the rdf files. Then each RDF
file contains
<?xml-stylesheet type="text/xsl" href="taxonconcepts.xsl"?>
right after the
<?xml version="1.0" encoding="UTF-8"?>
line. The server is set up so that when a URI like
http://bioimages.vanderbilt.edu/taxon/19422-weakley2010 is dereferenced,
the client is sent the file
http://bioimages.vanderbilt.edu/taxon/19422-weakley2010.rdf regardless
of the content-type requested. So a semantic client gets the RDF/XML
and a web browser formats the XML for humans according to the XSL
stylesheet. I call this "poor man's content negotiation" because it
requires virtually no maintenance or sophisticated server resources.
One does have to maintain a consistent RDF structure because the XSLT is
a "dumb" static file, but if your RDF is being generated systematically,
it will probably have a consistent format anyway. It also means that a
human has to use "view page source" to look at the underlying RDF, but
99% of human clients won't care about that, and the 1% that does
care will probably know how to view the page source anyway. I want to
be clear here that what I am trying to show in this example is NOT
anything about taxon concepts, proper RDF format, the correctness of
Darwin-SW, etc. or to say that this is the only, best, or most proper
way to achieve content negotiation. What I AM trying to show is that if
you are going to provide RDF for computers, there is virtually no
additional cost to also providing a human readable version. I am
essentially a computer dummy. I went to a bookstore and bought a book
on XSLT and wrote the file myself. If I can do that, then any
organization that has a "real" computer person on their staff could
accomplish this as well and I don't see any reason NOT to do it, even if
the information provided is intended primarily for computer to computer
communication.
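
For anyone generating their RDF files from a script, the two mechanical
pieces of the "poor man's content negotiation" described above can be
sketched in a few lines of Python. This is an illustration only: the
helper names with_stylesheet() and file_for() are hypothetical, and the
email does not say how the bioimages server performs the URI-to-file
mapping (a rewrite rule would be one option), so file_for() is just one
guess at the logic.

```python
XML_DECL = '<?xml version="1.0" encoding="UTF-8"?>'


def with_stylesheet(rdf_xml, href="taxonconcepts.xsl"):
    """Place the xml-stylesheet instruction right after the XML declaration.

    If the input has no XML declaration, the UTF-8 one above is added.
    """
    pi = '<?xml-stylesheet type="text/xsl" href="%s"?>' % href
    body = rdf_xml.lstrip()
    if body.startswith("<?xml "):
        # Split the existing declaration from the rest of the document.
        end = body.index("?>") + 2
        decl, rest = body[:end], body[end:].lstrip()
    else:
        decl, rest = XML_DECL, body
    return "%s\n%s\n%s" % (decl, pi, rest)


def file_for(uri_path):
    """Map a dereferenced URI path to the file that is sent back.

    Every resource path gets ".rdf" appended regardless of what the client
    asked for; paths that already name an .rdf or .xsl file are left alone
    so that a browser can still fetch the stylesheet itself.
    """
    name = uri_path.strip("/")
    if name.endswith((".rdf", ".xsl")):
        return name
    return name + ".rdf"


if __name__ == "__main__":
    print(with_stylesheet(XML_DECL + "\n<rdf:RDF/>"))
    print(file_for("/taxon/19422-weakley2010"))
```

Because the stylesheet reference is relative, the .xsl file only needs to
sit in the same directory as the RDF files, as in the setup described
above.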
Steve
Gregor Hagedorn wrote:
> While I generally accept Bob's careful research, and while I think it
> is imperative that multiple IDs are allowed in principle, as to avoid
> monopolies, my feeling is:
>
> * The number of IDs should be minimized.
> * The http:-URI is defacto the most relevant ID in the semantic web.
> * Avoid multiple alternative ways of embedding a UUID-string in a
> resolvable URI. All forms have to be added as sameAs information. They
> become ballast for future generations.
>
> That is: take the http-ID serious as a resource requiring long-term
> management and persistence.
>
> Finally: Success is measured by the adoption by people not visiting
> tdwg meetings. This is a social issue.
>
> The last point is my main reservation about the TDWG Applicability
> Statement for GUID's.
>
> Gregor
--
Steven J. Baskauf, Ph.D., Senior Lecturer
Vanderbilt University Dept. of Biological Sciences
postal mail address:
VU Station B 351634
Nashville, TN 37235-1634, U.S.A.
delivery address:
2125 Stevenson Center
1161 21st Ave., S.
Nashville, TN 37235
office: 2128 Stevenson Center
phone: (615) 343-4582, fax: (615) 343-6707
http://bioimages.vanderbilt.edu