[tdwg-tag] LSID business model?

Fri May 1 12:58:49 CEST 2009

I apologize for not having more time to keep up with the discussions on this
forum, but they are very interesting, relevant, and timely.  This thread in
particular has caught my attention, because it touches on things I've been
thinking about a lot since around the turn of the century.

Tim already pointed out that TDWG has been offering the sort of service
people are suggesting here.  In fact, it was suggested and discussed during
at least one of the GBIF/TDWG GUID workshops, and most people then thought
it was a good idea, but it doesn't seem to have gained much traction.  At
least not until now (maybe).  Anyway, consider me a strong (and long)
supporter of an LSID hosting service, and TDWG seems like the perfect
organization for this task, provided that the resources & technical
expertise are available to support it.

There is so much I want to say about this issue, but neither do I want to
bore everyone, nor do I want to stay awake for yet another hour tonight
writing a long diatribe.

So I'll cut to the chase.

The most frustrating thing to me about all of these GUID discussions is the
perpetual conflation of the needs for identification, and the needs for
metadata and data resolution. We data nerds think mostly about
identification, while the app developers are primarily focused on
resolution.  I think if we recognize these two (very different) needs, and
are careful to sharpen the focus of our discussions accordingly, we'll make
a lot more progress with much better efficiency.

At the pure identity end of the spectrum, most databases go with integers as
local identifiers.  Unfortunately, integers (especially the ones that are
sequential and start with "1") are useless without some sort of context.  We
also have UUIDs.  Wonderful, ubiquitous, locally-generated, adequately
unique on a global scale, supported by most major DBMS apps, etc.  And,
again: by themselves, they are utterly unresolvable.
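
To make the "locally-generated" point concrete, here is a minimal Python sketch using the standard-library uuid module.  The variable names are mine, purely for illustration.  It mints a random UUID with no central issuing service, and parses the ZooBank UUID quoted later in this message.

```python
import uuid

# Mint a random (version 4) UUID locally; no registry or issuing service is contacted.
local_id = uuid.uuid4()

# The canonical text form is 36 characters: 32 hex digits plus 4 hyphens.
text_form = str(local_id)

# Parsing is case-insensitive, so the upper-case form used in this thread
# and its lower-case equivalent are the same identifier.
zoobank_id = uuid.UUID("8BDC0735-FEA4-4298-83FA-D04F67C3FBEC")
same_id = uuid.UUID("8bdc0735-fea4-4298-83fa-d04f67c3fbec")
```

Nothing in either value says where its metadata lives, which is exactly the "unresolvable" half of the trade-off.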

At the resolution end of the spectrum, we see a lot of appeal for PURLs.  No
need for special programming to parse and resolve, no special SRV records on
DNS, etc.  Unfortunately -- whether deserved or not -- PURLs are tarnished
by the historical impermanence of their unqualified URL brethren.  Even if
we assume a strong commitment to the social contract of permanency, who's to
say that 50 years from now any domain name will still be intact (or, indeed,
whether http will even be relevant anymore)?  Even if we can commit to
the "P" of PURLs for the short term, they're a bit of a gamble for the long
term.

In between these two, we have LSIDs, DOIs and Handles (of course, DOIs are
Handles). These all incorporate aspects of both identity and resolution, but
do neither perfectly.  If I type any of the following into a web browser:

doi:10.1594/PANGAEA.339110
urn:lsid:zoobank.org:act:8BDC0735-FEA4-4298-83FA-D04F67C3FBEC
10199/15417

...I get bupkis.

So, in my mind, they are *almost* self-resolving, and *not quite*
identifiers.  The reason I think of them as "not quite" identifiers is
because the first two embed some syntax-dependent resolving information
within them (leading to problems with opacity and potentially with
permanence); and the third one, though lacking any resolution baggage, is
also approximately 0.66154245313614840760199779464228.
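
As a rough illustration of that "resolution baggage", the sketch below splits an LSID into the colon-separated fields of the general LSID form (urn:lsid:authority:namespace:object, with an optional revision).  The function name split_lsid is hypothetical, and the validation is deliberately minimal.  The last line shows why a bare Handle, stripped of any scheme, can be mistaken for arithmetic.

```python
def split_lsid(lsid):
    """Split an LSID into its colon-separated fields (minimal validation only)."""
    parts = lsid.split(":")
    if len(parts) not in (5, 6) or [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
        raise ValueError(f"not an LSID: {lsid!r}")
    return {
        "authority": parts[2],   # a domain name: this is the resolution hint
        "namespace": parts[3],
        "object": parts[4],      # the part that actually identifies the object
        "revision": parts[5] if len(parts) == 6 else None,
    }

lsid_parts = split_lsid("urn:lsid:zoobank.org:act:8BDC0735-FEA4-4298-83FA-D04F67C3FBEC")

# Without a scheme, the Handle 10199/15417 is also a valid division.
handle_as_arithmetic = 10199 / 15417
```

Here only the "object" field is pure identity; the "authority" field ties the identifier to a domain name.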

I have said this before, and I will say it again:  I think it would be
WONDERFUL if the biodiversity informatics community all agreed to
incorporate UUIDs as the standard GUID for all data objects that are shared
or exposed outside of a local database.  The fact that they look ugly, are
difficult to type, and are impossible to memorize is a red herring.  If
you've ever used a PC, Mac, or Linux-based computer, you have used UUIDs.
Probably hundreds, or even thousands of them.  You just never knew it.  And
that, in my mind, is the hallmark of an effectively used GUID -- i.e., the
one that the end-user doesn't even know exists.

Following up on Peter DeVries' post:

> If you want to keep the branding on the identifier you could also do
> something like this.
>
>	http://lod.ipni.org/ipni-org_names_783030-1         <- the entity or concept, 303 redirect to either
>	http://lod.ipni.org/ipni-org_names_783030-1.html  <- human readable page
>	http://lod.ipni.org/ipni-org_names_783030-1.rdf    <- rdf data

Why not take it a step further and go with UUIDs in place of
"ipni-org_names_783030-1"?
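
A minimal sketch of how Peter's 303 pattern might look with a UUID as the path, assuming a hypothetical host (example.org) and a hypothetical helper named see_other.  The check on the Accept header is a crude substring test, not real content negotiation.

```python
def see_other(base_url, identifier, accept_header):
    """Return (status, location) for a request on the bare identifier URL."""
    # Crude stand-in for content negotiation: RDF only if explicitly requested.
    wants_rdf = "application/rdf+xml" in accept_header.lower()
    suffix = ".rdf" if wants_rdf else ".html"
    # 303 See Other: the bare URL names the thing; the target describes it.
    return 303, f"{base_url}/{identifier}{suffix}"

status, location = see_other(
    "http://example.org",  # hypothetical host
    "8BDC0735-FEA4-4298-83FA-D04F67C3FBEC",
    "application/rdf+xml",
)
```

The identifier in the path is the same no matter which representation is served.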

> Couldn't the free and ubiquitous Google cache provide some caching of
> these normal uri's

Well...sort of.  The problem is, you don't always know what you'll get back
from Google.  For example, I get 43 links from a Google search on
"8BDC0735-FEA4-4298-83FA-D04F67C3FBEC".  Which one do I go to for my
metadata?  One of the links Google provided was particularly interesting:

http://lists.tdwg.org/pipermail/tdwg-tag/2009-March/000393.html

If I'd only had time to be following this forum since March, I would have
seen that Roger has already made some of the points I was about to make in
this post.  Now I see they've already been made, so I will just reiterate:

In my personal biodiversity informatics utopia, this would be the
identifier:
8BDC0735-FEA4-4298-83FA-D04F67C3FBEC

...and these would all be legitimate ways of resolving the exact same
information, formatted according to some standard set of indicators along
the lines of what Peter was suggesting:

urn:lsid:zoobank.org:act:8BDC0735-FEA4-4298-83FA-D04F67C3FBEC (just because)
http://zoobank.org/urn:lsid:zoobank.org:act:8BDC0735-FEA4-4298-83FA-D04F67C3FBEC (Human readable)
http://purl.zoobank.org/urn:lsid:zoobank.org:act:8BDC0735-FEA4-4298-83FA-D04F67C3FBEC (to make Roger happy)
http://zoobank.org/authority/?lsid=urn:lsid:zoobank.org:act:8BDC0735-FEA4-4298-83FA-D04F67C3FBEC (to make supporters of LSIDs happy)
http://zoobank.org/8BDC0735-FEA4-4298-83FA-D04F67C3FBEC (to make me happy)
http://uuid.tdwg.org/8BDC0735-FEA4-4298-83FA-D04F67C3FBEC (to make Dave Vieglias happy)
http://lsid.tdwg.org/urn:lsid:zoobank.org:act:8BDC0735-FEA4-4298-83FA-D04F67C3FBEC (to make Lee Belbin happy)
http://cache.gbif.org/?uuid=8BDC0735-FEA4-4298-83FA-D04F67C3FBEC (to make Tim happy)
...and so on, and so on, and so on.

No, we shouldn't do all of them.  Neither should we do only one of them.
Let's figure out a few alternatives that give LSIDs a fair shake (before we
abandon them altogether), and most importantly, dissociate the identifier
from the resolution protocol.
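
To show what dissociating the identifier from the resolution protocol could mean in practice, here is a small sketch.  The URL patterns are a few of the illustrative ones from my list, not necessarily live services, and the names TEMPLATES, wrap and unwrap are hypothetical.  The same UUID is wrapped in several resolvable forms and then recovered from each of them.

```python
import re
import uuid

# Matches the canonical 8-4-4-4-12 hex layout of a UUID anywhere in a string.
UUID_PATTERN = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)

# Illustrative wrappers only; each is one possible resolution route.
TEMPLATES = {
    "lsid": "urn:lsid:zoobank.org:act:{id}",
    "zoobank": "http://zoobank.org/{id}",
    "tdwg_uuid": "http://uuid.tdwg.org/{id}",
    "gbif_cache": "http://cache.gbif.org/?uuid={id}",
}

def wrap(identifier):
    """Express one UUID in every known resolvable form."""
    text = str(identifier).upper()
    return {name: template.format(id=text) for name, template in TEMPLATES.items()}

def unwrap(resolvable):
    """Recover the bare UUID from any of the wrapped forms."""
    match = UUID_PATTERN.search(resolvable)
    if match is None:
        raise ValueError(f"no UUID found in {resolvable!r}")
    return uuid.UUID(match.group(0))

identifier = uuid.UUID("8BDC0735-FEA4-4298-83FA-D04F67C3FBEC")
forms = wrap(identifier)
recovered = {unwrap(form) for form in forms.values()}
```

If any one wrapper goes away, the identifier survives; only an entry in TEMPLATES changes.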

On the topic of centralized GUID resolution....  Over a decade ago, before I
knew what a UUID was, I tried to lobby TDWG to create a GUID issuing service
that our community could adopt.  Back then I was thinking in terms of simply
issuing integers that individual organizations could reserve in blocks of a
million at a time (or whatever large number).  Now that UUIDs have attained
such ubiquity, I no longer think such a service is necessary.  However, I
have been a strong supporter of distributed content -- not in the sense of
DiGIR, where there are (theoretically) non-overlapping blocks of content
that get assembled and stacked at query time; but rather in the sense of
mirroring or replication of ALL content, on ALL servers.

I first started pushing for this during the days of the All-Species
Foundation.  I didn't (and still do not) think that this concept will go
over very well for proprietary content, such as specimen data, images, and
other similar sorts of content (perhaps someday, but I don't think our
community is ready for that just yet).  But certainly for things we all
share -- taxonomy, literature, agent data, geography, etc. -- there was (and
still is) a great deal of potential value in the notion of "once digitized,
always available".  Redundancy of effort is useful to an extent, but I think
we have wasted a lot of time populating different databases belonging to
different organizations with records representing the exact same objects
(e.g., the citation record for Linnaeus, 1758), and even more time
(specifically, *my* time) trying to cross-link different databases with
overlapping content.

So....I'm going to "see" Tim's crazy idea, and "raise" him an even crazier
one:  rather than "a" centralized cache of LSID response content (with
expiration), why don't we have dozens or hundreds of mirror copies of *all*
the content? We don't even have to confine it to LSIDs -- make it compatible
with several of the most commonly used GUID protocols.  I'm assuming that
technology allows for maintaining this via replication, etc.  The only
issues that need to be worked out are a security/authorization mechanism
somewhere between the Wikipedia model and the ITIS model (first one that
came to my head), and/or a robust audit system for tracking (and rolling
back) content edits.
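
For what it's worth, here is a toy in-memory sketch of the audit idea, under heavy assumptions: the Edit and Replica classes are hypothetical, there is no authorization, no conflict resolution, and no network.  Each mirror keeps an append-only log of edits keyed by UUID, replays edits it has not yet seen, and rolls back by logging a new edit that restores the old value.

```python
import uuid
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Edit:
    edit_id: uuid.UUID    # identifies the edit itself
    object_id: uuid.UUID  # identifies the shared record being changed
    author: str
    new_value: object     # None means "record removed"
    old_value: object     # value before the edit, kept for rollback

@dataclass
class Replica:
    records: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def _apply(self, e):
        if e.new_value is None:
            self.records.pop(e.object_id, None)
        else:
            self.records[e.object_id] = e.new_value
        self.log.append(e)  # the log is append-only

    def edit(self, object_id, author, new_value):
        e = Edit(uuid.uuid4(), object_id, author, new_value, self.records.get(object_id))
        self._apply(e)
        return e

    def roll_back(self, e, author):
        # A rollback is itself a logged edit, so it replicates like any other.
        return self.edit(e.object_id, author, e.old_value)

    def sync_from(self, other):
        # Replay, in order, every edit this replica has not yet seen.
        seen = {e.edit_id for e in self.log}
        for e in other.log:
            if e.edit_id not in seen:
                self._apply(e)

a, b = Replica(), Replica()
citation = uuid.uuid4()
a.edit(citation, "rich", "Linnaeus, 1758")
b.sync_from(a)
bad = b.edit(citation, "someone", "garbage")
b.roll_back(bad, "curator")
a.sync_from(b)
```

After the last sync, both mirrors hold the original citation again, and the log on each records who changed what.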

The GNA is already heading this way for both GNI and GNUB; and there is talk
of something similar for literature citations (which would include agents),
as well as for shape files for species distributions.  What I'd like to see
is a "Global Shared Biodiversity Data Repository" as more than just a
centralized cache of metadata, but a common infrastructure that supports
broad global replication (and associated automated synchronization) of
anything and everything that our community is willing to share.  I would
propose that each object be identified by a UUID, and that any one of the
dozens/hundreds of replicates could establish whatever services they want on
top of that content. Different hosts might create different services for
content resolution catered to different community needs.  Some would wrap
the UUIDs in LSID syntax, some might convert them to Handles, some would
represent them as PURLs, etc., etc.  The important thing is that we have a
common standard for identifiers (UUIDs), built within an architecture that
can support multiple and evolving resolution protocols, mechanisms, and
services.

Well, dang!  Not only did I just blow an hour, but I suspect I bored most of
you to tears.

Sorry about that...

Aloha,
Rich

Richard L. Pyle, PhD
Database Coordinator for Natural Sciences
  and Associate Zoologist in Ichthyology
Department of Natural Sciences, Bishop Museum
1525 Bernice St., Honolulu, HI 96817
Ph: (808)848-4115, Fax: (808)847-8252
email: deepreef at bishopmuseum.org
http://hbs.bishopmuseum.org/staff/pylerichard.html