BioGUIDs and the Internet Analogy

Tue Sep 28 17:39:35 CEST 2004

I've been watching these messages but have not yet had time to respond to
all of the valuable discussion.

One thing I would note in relation to Richard's comments below (and would
suggest makes a lot of sense) is that we could adopt a model by which each
provider is responsible for resolving the identifiers for their own data,
but that we have a central fall-back server (e.g. one based on a central
data index) to which all requests get forwarded whenever a provider finds it
cannot resolve an id that it originally issued.  This could solve the
problem in a reasonably standard and simple way for cases in which specimens
and associated data do get moved.
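
To make the fall-back model concrete, here is a minimal sketch. It is an illustration only: the function names, the "A:123" style ids and the dictionary-based "central data index" are all hypothetical and are not part of any existing GBIF or TDWG service.

```python
# Hypothetical sketch of "provider first, central fall-back" resolution.
from typing import Callable, Dict, Optional

Record = Dict[str, str]
Resolver = Callable[[str], Optional[Record]]


def make_central_index(index: Dict[str, Record]) -> Resolver:
    """Central fall-back: a plain lookup in a central data index."""
    return index.get


def make_provider(local_records: Dict[str, Record], fallback: Resolver) -> Resolver:
    """Return a resolver that answers from local data, else forwards."""
    def resolve(identifier: str) -> Optional[Record]:
        record = local_records.get(identifier)
        if record is not None:
            return record
        # The provider issued this id but no longer holds the data,
        # so the request is forwarded to the central fall-back server.
        return fallback(identifier)
    return resolve


# A specimen originally issued by museum A has since moved to museum B.
central = make_central_index({"A:123": {"id": "A:123", "holder": "museum-b"}})
museum_a = make_provider({"A:456": {"id": "A:456", "holder": "museum-a"}}, central)

still_local = museum_a("A:456")  # answered by the provider itself
moved = museum_a("A:123")        # answered via the central index
unknown = museum_a("A:999")      # nobody knows this id, so None
```

The identifier itself never changes in this sketch; only the place that can answer for it does.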

One other thing that was probably not clear when I originally sent out a
PowerPoint presentation without a covering commentary is that I think we
need a community discussion not only about the merits of centralized or
decentralized approaches but also about the most appropriate owner of any
core central infrastructure.  I put GBIF in this position in my slides, but
I could equally see this as something which could be done in the name of
TDWG.

Donald

---------------------------------------------------------------
Donald Hobern (dhobern at gbif.org)
Programme Officer for Data Access and Database Interoperability
Global Biodiversity Information Facility Secretariat
Universitetsparken 15, DK-2100 Copenhagen, Denmark
Tel: +45-35321483   Mobile: +45-28751483   Fax: +45-35321480
---------------------------------------------------------------

-----Original Message-----
From: TDWG - Structure of Descriptive Data
[mailto:TDWG-SDD at LISTSERV.NHM.KU.EDU] On Behalf Of Richard Pyle
Sent: 28 September 2004 17:32
To: TDWG-SDD at LISTSERV.NHM.KU.EDU
Subject: Re: BioGUIDs and the Internet Analogy

> [For me, the bottom line---which however I nowhere state below---is:
> There is /so much/ existing free infrastructure source code---e.g.
> http://www-124.ibm.com/developerworks/oss/lsid/--- and (apparently)
> funding, and (manifestly) professionally designed specifications for
> LSID concerns that I am horrified at the prospect of adopting anything
> else if LSID comes even close to being what the community needs.

I can certainly understand that perspective, and that's one of the main
reasons I am still semi-supportive of the LSID approach (i.e., existing
code). My major concern has to do with the "Authority"/"Resolver" domain
portion of the LSID, and the need (or non-need) for it to be an active,
accessible domain in order to resolve the LSID. I'm also VERY concerned
about there ever being a temptation to change an ID for an object (e.g., a
specimen given from one Museum to another) -- unless it is understood that
the non-"ObjectID" portion is really thought of as metadata of sorts, and
ObjectID itself is globally unique by itself. I'll need to read the LSID
spec in more detail, and give it some more thought, before I comment further
on this.
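
A rough sketch of this concern, assuming the usual urn:lsid:authority:namespace:object[:revision] layout of an LSID. The parser, the example.org authorities and the object id below are hypothetical and only illustrate the point; this is not a spec-complete implementation.

```python
# Hypothetical sketch: split an LSID into its parts and compare two of them.
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(text: str) -> LSID:
    """Split 'urn:lsid:authority:namespace:object[:revision]' into fields."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return LSID(parts[2], parts[3], parts[4], revision)


# The same specimen, if it were re-issued under the receiving museum's domain.
original = parse_lsid("urn:lsid:museum-a.example.org:specimens:8f14e45f")
reissued = parse_lsid("urn:lsid:museum-b.example.org:specimens:8f14e45f")

same_identifier = original == reissued
# Only meaningful if the ObjectID is globally unique by itself.
same_object = original.object_id == reissued.object_id
```

Here `same_identifier` is false while `same_object` is true, which is exactly the situation in which the authority portion would have to be treated as metadata rather than as part of the identity.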

> Or,
> let's forget about LSID and instead of deploying what satisfies 98% of
> the needs in six months, we could roll our own and deploy what satisfies
> 80% of the needs in a few years...

If that were really the balance, then it would be a no-brainer.  My concern
would be adopting (effectively committing to) a scheme that satisfies 80% of
the needs in six months, instead of being patient and picking a system that
accommodates 98% of the needs a few years from now.  If I had confidence that
we could implement LSIDs in a "test-drive" mode for a couple of years,
without being fully committed to them, I'd be much more comfortable.  But as
someone who spends a considerable amount of time trying to undo the damage
of "legacy" solutions to data problems that were hastily conceived, I'm
trying to be cautious.

> The design goals of TCP/IP and DNS, and their implementation, intersect
> the requirements of Bio UUIDs only in a very small set, in fact, deep
> down perhaps not at all.
>
> These protocols and the associated address syntax were designed
> primarily for /routing/, not in any way designed to guarantee that a
> datum twice received has any connection between the two occurrences.
>
> IP addresses are in no way persistent.
>
> IP addresses are not globally unique, albeit in several small and varied
> ways:

I don't think anyone (in this thread) was suggesting actually *USING* TCP/IP
and DNS for BioGUIDs (at least I wasn't).  Rather, I was looking to it as a
source of ground-truthed schemes for reliably managing globally distributed
information.  For instance, would DNS synchronization/propagation serve as a
useful model for globally distributed, synchronized taxonomic registries? Or
would the taxonomic registry work more effectively with one or a few
centralized "masters" with which a larger set of replicates kept
synchronized?  I also think, as I explained earlier, that the hierarchical
approach with centralized block ID issuance and local application might be
instructive to a bioscheme.
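
As a sketch of what "centralized block ID issuance and local application" could look like: the class, the block size and the use of plain integers as ids are all assumptions made for illustration, not a proposal for the actual id format.

```python
# Hypothetical sketch: a central registry issues id blocks, providers
# then assign ids from their own block without contacting the centre.
from typing import Iterator, List, Tuple


class CentralRegistry:
    """Hands out non-overlapping blocks of integer ids."""

    def __init__(self, block_size: int = 1000) -> None:
        self.block_size = block_size
        self.next_start = 0
        # Log of (provider, first id, one past last id) for every block.
        self.issued: List[Tuple[str, int, int]] = []

    def issue_block(self, provider: str) -> range:
        start = self.next_start
        self.next_start += self.block_size
        self.issued.append((provider, start, self.next_start))
        return range(start, self.next_start)


def local_ids(block: range) -> Iterator[int]:
    """Local application: walk through the block one id at a time."""
    yield from block


registry = CentralRegistry(block_size=1000)
museum_a = local_ids(registry.issue_block("museum-a"))
museum_b = local_ids(registry.issue_block("museum-b"))

first_a = next(museum_a)
second_a = next(museum_a)
first_b = next(museum_b)
```

The central authority is only involved once per block, and ids from different providers cannot collide because the blocks never overlap.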

> In fact, if the UUIDs are meant to be semantically opaque it matters not
> one whit who or how these matters are settled.

Can you elaborate on what you mean by "semantically opaque"? I think I
understand -- but the last thing this thread needs is ambiguity about the
meaning of terms (i.e., "opaque semantics"....  :-)  )

> Exceptions to that are
> social, not technical. ("If you don't let me decide X, I am not going to
> use your scheme". "OK, then you won't participate in its benefits.
> That's fine with me")

...and as I said before, the real challenge in establishing universally
adopted BioGUIDs is not going to be technical; it's going to be
social/political.

> >> So, there is a hierarchy of how the "unique identifiers" are managed.
> >> There is in fact a central authority, but it delegates to decentralized
> >> authorities.
>
> But this is mainly to distribute costs and speed issuance. It has
> nothing to do with the naming scheme. The number of organizations to be
> issued Bio GUIDs surely is several orders of magnitude less than those
> to be issued IPv6 addresses. So I doubt any IPv6 issuance mechanisms are
> instructive, at least in their purpose (and hence, if well implemented,
> in their implementation).

So are you saying that, because BioGUID traffic will be orders of magnitude
smaller than internet domain traffic, there does not need to be delegation
to decentralized authorities?  If so, then we are in full agreement.

> > GBIF seems to me to be the principle contender.
>
> I enthusiastically agree. Also the /principal/ contender. [Sorry,
> couldn't resist. My fingers slip on that one sometimes too.]

Ouch.... :-)

> Not exactly. There is one scheme in case your application can't resolve
> it in a more nearly "local" facility. There are /lots/ of ways to find
> an IP address from a domain name. All those which comply fully with the
> DNS protocol, however, can make available two pieces of metadata: the
> TTL of the record it is offering, and the IP address of a machine at
> which you can find an authoritative record of the assignment of the dns
> name to the IP address. This protocol /might/, but you hope on
> performance grounds usually /doesn't/, lead you up as far as the root
> servers, and the "one scheme to bind them all". If there is any lesson
> here at all, it is that name resolution protocols matter, but resolution
> implementations don't. Yet another attribute on which DNS/IP and LSID
> are not distinguishable.

This seems to be a fundamental point of confusion (for me anyway).  Are the
domain names embedded within LSIDs information-bearing in the sense that
they are necessarily the internet domain at which the LSID is resolved?  I
guess I should read and understand Section 13.3 of the LSID spec before
commenting further.

> More often, only when the TTLs expire, there being no motivation to do
> otherwise.

O.K., there's a great analogy that may be useful if implementing a
distributed system of synchronized/mirrored biological data servers:  should
they remain in synch at fixed time intervals? In real time with each data
transaction? Or, should some sort of TTL feature be incorporated in data?
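
The TTL option could be sketched roughly as follows. The replica class, the 60-second TTL and the taxon record are hypothetical, and the clock is passed in explicitly so the behaviour is easy to follow; a real mirror would need much more than this.

```python
# Hypothetical sketch: a replica that re-fetches a record from the master
# only after the TTL attached to its cached copy has expired.
from typing import Callable, Dict, Tuple


class TTLReplica:
    def __init__(self, fetch: Callable[[str], Tuple[str, float]]) -> None:
        self.fetch = fetch  # returns (value, ttl_seconds) from the master
        self.cache: Dict[str, Tuple[str, float]] = {}  # key -> (value, expiry)
        self.fetch_count = 0

    def get(self, key: str, now: float) -> str:
        cached = self.cache.get(key)
        if cached is not None and now < cached[1]:
            return cached[0]  # still within its TTL, no contact with master
        value, ttl = self.fetch(key)
        self.fetch_count += 1
        self.cache[key] = (value, now + ttl)
        return value


master = {"taxon:1": "Aus bus"}
replica = TTLReplica(lambda key: (master[key], 60.0))

at_start = replica.get("taxon:1", now=0.0)   # nothing cached, so fetched
master["taxon:1"] = "Aus cus"                # the master record changes
mid_ttl = replica.get("taxon:1", now=30.0)   # cached copy is still served
after_ttl = replica.get("taxon:1", now=61.0) # TTL expired, so re-fetched
```

The trade-off is visible in `mid_ttl`: the replica serves the old value until the TTL runs out, in exchange for contacting the master only twice.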

Much to think about.  But time for me to get some work done....

Aloha,
Rich