I apologize for not having more time to keep up with the discussions on this forum, but they are very interesting, relevant, and timely. This thread in particular has caught my attention, because it touches on things I've been thinking about a lot since around the turn of the century.
Tim already pointed out that TDWG has been offering the sort of service people are suggesting here. In fact, it was suggested and discussed during at least one of the GBIF/TDWG GUID workshops, and most people then thought it was a good idea, but it doesn't seem to have gained much traction. At least not until now (maybe). Anyway, consider me a strong (and long) supporter of an LSID hosting service, and TDWG seems like the perfect organization for this task, provided that the resources & technical expertise are available to support it.
There is so much I want to say about this issue, but neither do I want to bore everyone, nor do I want to stay awake for yet another hour tonight writing a long diatribe.
So I'll cut to the chase.
The most frustrating thing to me about all of these GUID discussions is the perpetual conflation of the needs for identification, and the needs for metadata and data resolution. We data nerds think mostly about identification, while the app developers are primarily focused on resolution. I think if we recognize these two (very different) needs, and are careful to sharpen the focus of our discussions accordingly, we'll make a lot more progress with much better efficiency.
At the pure identity end of the spectrum, most databases go with integers as local identifiers. Unfortunately, integers (especially the ones that are sequential and start with "1") are useless without some sort of context. We also have UUIDs. Wonderful, ubiquitous, locally-generated, adequately unique on a global scale, supported by most major DBMS apps, etc. And, again: by themselves, they are utterly unresolvable.
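To make the "locally-generated" point concrete, here is a minimal sketch using Python's standard `uuid` module. The helper names `new_identifier` and `is_uuid` are made up for illustration; nothing here depends on any TDWG or GBIF service.

```python
import uuid


def new_identifier() -> str:
    """Mint a random (version 4) UUID locally; no issuing authority is contacted."""
    return str(uuid.uuid4()).upper()


def is_uuid(text: str) -> bool:
    """Return True if the text parses as a UUID, False otherwise."""
    try:
        uuid.UUID(text)
    except ValueError:
        return False
    return True


ident = new_identifier()
print(ident, is_uuid(ident))
# A bare local integer such as "1" carries no context and is not a UUID.
print(is_uuid("1"))
```

Note that nothing in the generated string says where its metadata lives, which is exactly the "unresolvable by itself" property described above.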
At the resolution end of the spectrum, we see a lot of appeal for PURLs. No need for special programming to parse and resolve, no special SRV records on DNS, etc. Unfortunately -- whether deserved or not -- PURLs are tarnished by the historical impermanence of their unqualified URL brethren. Even if we assume a strong commitment to the social contract of permanency, who's to say that 50 years from now any domain name will still be intact (or, indeed, whether http will even be relevant anymore)? Even if we can commit to the "P" of PURLs for the short term, they're a bit of a gamble for the long term.
In between these two, we have LSIDs, DOIs and Handles (of course, DOIs are Handles). These all incorporate aspects of both identity and resolution, but do neither perfectly. If I type any of the following into a web browser:
doi:10.1594/PANGAEA.339110
urn:lsid:zoobank.org:act:8BDC0735-FEA4-4298-83FA-D04F67C3FBEC
10199/15417
...I get bupkis.
So, in my mind, they are *almost* self-resolving, and *not quite* identifiers. The reason I think of them as "not quite" identifiers is because the first two embed some syntax-dependent resolving information within them (leading to problems with opacity and potentially with permanence); and the third one, though lacking any resolution baggage, is also approximately 0.66154245313614840760199779464228.
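As a rough illustration of the "resolution baggage" an LSID carries, the sketch below splits one into its parts, assuming the usual `urn:lsid:authority:namespace:object[:revision]` layout. The `Lsid` type and `parse_lsid` function are hypothetical names, not part of any LSID library.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str           # e.g. a domain name, which doubles as a resolution hint
    namespace: str
    object_id: str           # the part that actually identifies the object
    revision: Optional[str]


def parse_lsid(text: str) -> Lsid:
    """Split an LSID string into its colon-separated components."""
    parts = text.split(":")
    if len(parts) not in (5, 6) or parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return Lsid(parts[2], parts[3], parts[4], revision)


lsid = parse_lsid("urn:lsid:zoobank.org:act:8BDC0735-FEA4-4298-83FA-D04F67C3FBEC")
print(lsid.authority)   # zoobank.org
print(lsid.object_id)   # 8BDC0735-FEA4-4298-83FA-D04F67C3FBEC
```

Only `object_id` is pure identity here; the authority and namespace are tied to who serves the data today.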
I have said this before, and I will say it again: I think it would be WONDERFUL if the biodiversity informatics community all agreed to incorporate UUIDs as the standard GUID for all data objects that are shared or exposed outside of a local database. The fact that they look ugly, are difficult to type, and are impossible to memorize is a red herring. If you've ever used a PC, Mac, or Linux-based computer, you have used UUIDs. Probably hundreds, or even thousands of them. You just never knew it. And that, in my mind, is the hallmark of an effectively used GUID -- i.e., the one that the end-user doesn't even know exists.
Following up on Peter DeVries' post:
> If you want to keep the branding on the identifier you could also do something like this.
> http://lod.ipni.org/ipni-org_names_783030-1 <- the entity or concept, 303 redirect to either
> http://lod.ipni.org/ipni-org_names_783030-1.html <- human readable page
> http://lod.ipni.org/ipni-org_names_783030-1.rdf <- rdf data
Why not take it a step further and go with UUIDs in place of "ipni-org_names_783030-1"?
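A minimal sketch of the redirect pattern Peter describes, with a UUID standing in for the branded local name. It assumes the server picks the target from the request's Accept header; `redirect_target` is a hypothetical helper, and a real implementation would also honor q-values and other media types.

```python
from typing import Tuple


def redirect_target(entity_uri: str, accept: str) -> Tuple[int, str]:
    """Return (status, location) for a request on the entity URI.

    Assumption: clients asking for RDF/XML get the .rdf document,
    everyone else gets the human-readable .html page.
    """
    wants_rdf = "application/rdf+xml" in accept.lower()
    suffix = ".rdf" if wants_rdf else ".html"
    return 303, entity_uri + suffix


# Hypothetical entity URI built from a UUID rather than a local name.
entity = "http://lod.ipni.org/8BDC0735-FEA4-4298-83FA-D04F67C3FBEC"
print(redirect_target(entity, "text/html"))
print(redirect_target(entity, "application/rdf+xml"))
```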
> Couldn't the free and ubiquitous Google cache provide some caching of these normal uri's
Well...sort of. The problem is, you don't always know what you'll get back from Google. For example, I get 43 links from a Google search on "8BDC0735-FEA4-4298-83FA-D04F67C3FBEC". Which one do I go to in order to get my metadata? One of the links Google provided was particularly interesting:
http://lists.tdwg.org/pipermail/tdwg-tag/2009-March/000393.html
If I'd only had time to be following this forum since March, I would have seen that Roger has already made some of the points I was about to make in this post. Now I see they've already been made, so I will just reiterate:
In my personal biodiversity informatics utopia, this would be the identifier: 8BDC0735-FEA4-4298-83FA-D04F67C3FBEC
...and these would all be legitimate ways of resolving the exact same information, formatted according to some standard set of indicators along the lines of what Peter was suggesting:
urn:lsid:zoobank.org:act:8BDC0735-FEA4-4298-83FA-D04F67C3FBEC (just because)
http://zoobank.org/urn:lsid:zoobank.org:act:8BDC0735-FEA4-4298-83FA-D04F67C3FBEC (Human readable)
http://purl.zoobank.org/urn:lsid:zoobank.org:act:8BDC0735-FEA4-4298-83FA-D04F67C3FBEC (to make Roger happy)
http://zoobank.org/authority/?lsid=urn:lsid:zoobank.org:act:8BDC0735-FEA4-4298-83FA-D04F67C3FBEC (to make supporters of LSIDs happy)
http://zoobank.org/8BDC0735-FEA4-4298-83FA-D04F67C3FBEC (to make me happy)
http://uuid.tdwg.org/8BDC0735-FEA4-4298-83FA-D04F67C3FBEC (to make Dave Vieglias happy)
http://lsid.tdwg.org/urn:lsid:zoobank.org:act:8BDC0735-FEA4-4298-83FA-D04F67C3FBEC (to make Lee Belbin happy)
http://cache.gbif.org/?uuid=8BDC0735-FEA4-4298-83FA-D04F67C3FBEC (to make Tim happy)
...and so on, and so on, and so on.
No, we shouldn't do all of them. Neither should we do only one of them. Let's figure out a few alternatives that give LSIDs a fair shake (before we abandon them altogether), and most importantly, dissociate the identifier from the resolution protocol.
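One way to picture "dissociating the identifier from the resolution protocol" is a bare UUID plus a table of interchangeable resolver templates. This is only a sketch: the templates are copied from a few of the wished-for forms in this post and are not assumed to be live services, and `RESOLVER_TEMPLATES` and `resolution_forms` are names made up for the example.

```python
import uuid
from typing import Dict

# Illustrative templates taken from the hypothetical forms listed in this post.
RESOLVER_TEMPLATES: Dict[str, str] = {
    "lsid": "urn:lsid:zoobank.org:act:{id}",
    "zoobank": "http://zoobank.org/{id}",
    "tdwg_uuid": "http://uuid.tdwg.org/{id}",
    "gbif_cache": "http://cache.gbif.org/?uuid={id}",
}


def resolution_forms(identifier: str) -> Dict[str, str]:
    """Wrap one bare UUID in every known resolution syntax."""
    # Validate and normalize the identifier; raises ValueError if it is not a UUID.
    canonical = str(uuid.UUID(identifier)).upper()
    return {name: tpl.format(id=canonical) for name, tpl in RESOLVER_TEMPLATES.items()}


for name, form in resolution_forms("8bdc0735-fea4-4298-83fa-d04f67c3fbec").items():
    print(f"{name}: {form}")
```

Adding or retiring a resolution mechanism then means editing the table, while the identifier stored in everyone's database never changes.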
On the topic of centralized GUID resolution.... Over a decade ago, before I knew what a UUID was, I tried to lobby TDWG to create a GUID issuing service that our community could adopt. Back then I was thinking in terms of simply issuing integers that individual organizations could reserve in blocks of a million at a time (or whatever large number). Now that UUIDs have attained such ubiquity, I no longer think such a service is necessary. However, I have been a strong supporter of distributed content -- not in the sense of DiGIR, where there are (theoretically) non-overlapping blocks of content that get assembled and stacked at query time; but rather in the sense of mirroring or replication of ALL content, on ALL servers.
I first started pushing for this during the days of the All-Species Foundation. I didn't (and still do not) think that this concept will go over very well for proprietary content, such as specimen data, images, and other similar sorts of content (perhaps someday, but I don't think our community is ready for that just yet). But certainly for things we all share -- taxonomy, literature, agent data, geography, etc. -- there was (and still is) a great deal of potential value in the notion of "once digitized, always available". Redundancy of effort is useful to an extent, but I think we have wasted a lot of time populating different databases belonging to different organizations with records representing the exact same objects (e.g., the citation record for Linnaeus, 1758), and even more time (specifically, *my* time) trying to cross-link different databases with overlapping content.
So....I'm going to "see" Tim's crazy idea, and "raise" him an even crazier one: rather than "a" centralized cache of LSID response content (with expiration), why don't we have dozens or hundreds of mirror copies of *all* the content? We don't even have to confine it to LSIDs -- make it compatible with several of the most commonly used GUID protocols. I'm assuming that technology allows for maintaining this via replication, etc. The only issues that need to be worked out are a security/authorization mechanism somewhere between the Wikipedia model and the ITIS model (first one that came to my head), and/or a robust audit system for tracking (and rolling back) content edits.
The GNA is already heading this way for both GNI and GNUB; and there is talk of something similar for literature citations (which would include agents), as well as for shape files for species distributions. What I'd like to see is a "Global Shared Biodiversity Data Repository" as more than just a centralized cache of metadata, but a common infrastructure that supports broad global replication (and associated automated synchronization) of anything and everything that our community is willing to share. I would propose that each object be identified by a UUID, and that any one of the dozens/hundreds of replicates could establish whatever services they want on top of that content. Different hosts might create different services for content resolution catered to different community needs. Some would wrap the UUIDs in LSID syntax, some might convert them to Handles, some would represent them as PURLs, etc., etc. The important thing is that we have a common standard for identifiers (UUIDs), built within an architecture that can support multiple and evolving resolution protocols, mechanisms, and services.
Well, dang! Not only did I just blow an hour, but I suspect I bored most of you to tears.
Sorry about that...
Aloha, Rich
Richard L. Pyle, PhD
Database Coordinator for Natural Sciences
and Associate Zoologist in Ichthyology
Department of Natural Sciences, Bishop Museum
1525 Bernice St., Honolulu, HI 96817
Ph: (808)848-4115, Fax: (808)847-8252
email: deepreef@bishopmuseum.org
http://hbs.bishopmuseum.org/staff/pylerichard.html