I've been watching these messages but have not yet had time to respond to all of the valuable discussion.
One point in relation to Richard's comments below: a model that I think makes a lot of sense is one in which each provider is responsible for resolving the identifiers for its own data, with a central fall-back server (e.g. one based on a central data index) to which a request is forwarded whenever a provider finds it cannot resolve an ID that it originally issued. This could solve the problem in a reasonably standard and simple way for cases in which specimens and associated data do get moved.
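As a rough illustration of the fall-back model described above, here is a minimal in-memory sketch. The class names (`CentralIndex`, `Provider`), the `transfer` step, and the record fields are all hypothetical; a real deployment would use network services rather than in-process objects.

```python
from typing import Dict, Optional

Record = Dict[str, str]


class CentralIndex:
    """Hypothetical central fall-back index, keyed by identifier."""

    def __init__(self) -> None:
        self._records: Dict[str, Record] = {}

    def register(self, identifier: str, record: Record) -> None:
        self._records[identifier] = record

    def resolve(self, identifier: str) -> Optional[Record]:
        return self._records.get(identifier)


class Provider:
    """A data provider that resolves the identifiers it issued itself."""

    def __init__(self, name: str, fallback: CentralIndex) -> None:
        self.name = name
        self._fallback = fallback
        self._local: Dict[str, Record] = {}

    def issue(self, identifier: str, record: Record) -> None:
        self._local[identifier] = record

    def transfer(self, identifier: str, new_holder: "Provider") -> None:
        # The specimen moves: drop the local copy, hand the record to the
        # new holder, and tell the central index where it now lives.
        record = dict(self._local.pop(identifier), holder=new_holder.name)
        new_holder.issue(identifier, record)
        self._fallback.register(identifier, record)

    def resolve(self, identifier: str) -> Optional[Record]:
        record = self._local.get(identifier)
        if record is not None:
            return record
        # Cannot resolve locally: forward the request to the central index.
        return self._fallback.resolve(identifier)
```

In this sketch a request sent to the original issuer still succeeds after a transfer, because the issuer forwards anything it can no longer resolve to the central index.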
One other thing that was probably not clear when I originally sent out a PowerPoint presentation without a covering commentary is that I think we need a community discussion not only about the merits of centralized or decentralized approaches but also about the most appropriate owner of any core central infrastructure. I put GBIF in this position in my slides, but I could equally see this as something which could be done in the name of TDWG.
Donald
---------------------------------------------------------------
Donald Hobern (dhobern@gbif.org)
Programme Officer for Data Access and Database Interoperability
Global Biodiversity Information Facility Secretariat
Universitetsparken 15, DK-2100 Copenhagen, Denmark
Tel: +45-35321483 Mobile: +45-28751483 Fax: +45-35321480
---------------------------------------------------------------
-----Original Message-----
From: TDWG - Structure of Descriptive Data [mailto:TDWG-SDD@LISTSERV.NHM.KU.EDU] On Behalf Of Richard Pyle
Sent: 28 September 2004 17:32
To: TDWG-SDD@LISTSERV.NHM.KU.EDU
Subject: Re: BioGUIDs and the Internet Analogy
[For me, the bottom line---which however I nowhere state below---is: There is /so much/ existing free infrastructure source code---e.g. http://www-124.ibm.com/developerworks/oss/lsid/--- and (apparently) funding, and (manifestly) professionally designed specifications for LSID concerns that I am horrified at the prospect of adopting anything else if LSID comes even close to being what the community needs.
I can certainly understand that perspective, and that's one of the main reasons I am still semi-supportive of the LSID approach (i.e., existing code). My major concern has to do with the "Authority"/"Resolver" domain portion of the LSID, and the need (or non-need) for it to be an active, accessible domain in order to resolve the LSID. I'm also VERY concerned about there ever being a temptation to change an ID for an object (e.g., a specimen given from one Museum to another) -- unless it is understood that the non-"ObjectID" portion is really thought of as metadata of sorts, and ObjectID itself is globally unique by itself. I'll need to read the LSID spec in more detail, and give it some more thought, before I comment further on this.
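To make the parts under discussion concrete, here is a minimal parsing sketch. It assumes the LSID layout `urn:lsid:<authority>:<namespace>:<object>[:<revision>]`; the `LSID` tuple, the `parse_lsid` helper, and the example domains are hypothetical and are not taken from any existing LSID library.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str                  # the domain-like "Authority" portion
    namespace: str
    object_id: str                  # the "ObjectID" portion
    revision: Optional[str] = None


def parse_lsid(text: str) -> LSID:
    """Split an LSID string into its components (assumed layout, see above)."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    if any(part == "" for part in parts[2:5]):
        raise ValueError(f"empty LSID component in {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return LSID(parts[2], parts[3], parts[4], revision)


# The same specimen number under two authorities gives two different LSIDs,
# which is the concern raised above about specimens changing hands.
before = parse_lsid("urn:lsid:museum-a.example.org:specimens:12345")
after = parse_lsid("urn:lsid:museum-b.example.org:specimens:12345")
same_object = before.object_id == after.object_id   # True
same_lsid = before == after                          # False
```

If the ObjectID were globally unique on its own, comparing `object_id` alone would be enough to recognise the specimen after a transfer; as written, the full identifiers differ.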
Or, let's forget about LSID and instead of deploying what satisfies 98% of the needs in six months, we could roll our own and deploy what satisfies 80% of the needs in a few years...
If that were really the balance, then it would be a no-brainer. My concern would be adopting (effectively committing to) a scheme that satisfies 80% of the needs in six months, instead of being patient and picking a system that accommodates 98% of the needs a few years from now. If I had confidence that we could implement LSIDs in a "test-drive" mode for a couple of years, without being fully committed to them, I'd be much more comfortable. But as someone who spends a considerable amount of time trying to undo the damage of "legacy" solutions to data problems that were hastily conceived, I'm trying to be cautious.
The design goals of TCP/IP and DNS, and their implementation, intersect the requirements of Bio UUIDs only in a very small set and, deep down, perhaps not at all.
These protocols and the associated address syntax were designed primarily for /routing/, not in any way designed to guarantee that a datum twice received has any connection between the two occurrences.
IP addresses are in no way persistent.
IP addresses are not globally unique, albeit in several small and varied ways:
I don't think anyone (in this thread) was suggesting actually *USING* TCP/IP and DNS for BioGUIDs (at least I wasn't). Rather, I was looking to it as a source of ground-truthed schemes for reliably managing globally distributed information. For instance, would DNS synchronization/propagation serve as a useful model for globally distributed, synchronized taxonomic registries? Or would the taxonomic registry work more effectively with one or a few centralized "masters" with which a larger set of replicates is kept synchronized? I also think, as I explained earlier, that the hierarchical approach with centralized block ID issuance and local application might be instructive to a bioscheme.
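One way to picture centralized block issuance with local application is the sketch below. Everything in it is an assumption made for illustration: the `BlockRegistry` class, integer identifiers, and the fixed block size are hypothetical and not part of any proposal in this thread.

```python
from typing import Dict


class BlockRegistry:
    """Hypothetical central authority that hands out non-overlapping ID blocks."""

    def __init__(self, block_size: int = 1000) -> None:
        self._block_size = block_size
        self._next_start = 0
        self._owners: Dict[int, str] = {}   # block start -> institution

    def allocate(self, institution: str) -> range:
        # The only central step: reserve the next block for this institution.
        start = self._next_start
        self._next_start += self._block_size
        self._owners[start] = institution
        return range(start, start + self._block_size)

    def issuer_of(self, identifier: int) -> str:
        # Recover the issuing institution from the block an ID falls in.
        start = identifier - identifier % self._block_size
        return self._owners[start]


registry = BlockRegistry(block_size=1000)
block_a = registry.allocate("Museum A")   # IDs 0..999, assigned locally
block_b = registry.allocate("Museum B")   # IDs 1000..1999, assigned locally
```

Each institution assigns identifiers from its own block without further contact with the central registry, and uniqueness follows from the blocks never overlapping.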
In fact, if the UUIDs are meant to be semantically opaque it matters not one whit who or how these matters are settled.
Can you elaborate on what you mean by "semantically opaque"? I think I understand -- but the last thing this thread needs is ambiguity about the meaning of terms (i.e., "opaque semantics".... :-) )
Exceptions to that are social, not technical. ("If you don't let me decide X, I am not going to use your scheme". "OK, then you won't participate in its benefits. That's fine with me")
...and as I said before, the real challenge in establishing universally adopted BioGUIDs is not going to be technical; it's going to be social/political.
So, there is a hierarchy of how the "unique identifiers" are managed. There is in fact a central authority, but it delegates to decentralized authorities.
But this is mainly to distribute costs and speed issuance. It has nothing to do with the naming scheme. The number of organizations to be issued Bio GUIDs surely is several orders of magnitude smaller than the number to be issued IPv6 addresses. So I doubt any IPv6 issuance mechanisms are instructive, at least in their purpose (and hence, if well implemented, in their implementation).
So are you saying that, because BioGUID traffic will be orders of magnitude smaller than internet domain traffic, there does not need to be delegation to decentralized authorities? If so, then we are in full agreement.
GBIF seems to me to be the principle contender.
I enthusiastically agree. Also the /principal/ contender. [Sorry, couldn't resist. My fingers slip on that one sometimes too.]
Ouch.... :-)
Not exactly. There is one scheme in case your application can't resolve it in a more nearly "local" facility. There are /lots/ of ways to find an IP address from a domain name. All those which comply fully with the DNS protocol, however, can make available two pieces of metadata: the TTL of the record it is offering, and the IP address of a machine at which you can find an authoritative record of the assignment of the DNS name to the IP address. This protocol /might/, but you hope on performance grounds usually /doesn't/, lead you up as far as the root servers, and the "one scheme to bind them all". If there is any lesson here at all, it is that name resolution protocols matter, but resolution implementations don't. Yet another attribute on which DNS/IP and LSID are not distinguishable.
This seems to be a fundamental point of confusion (for me anyway). Are the domain names embedded within LSIDs information-bearing in the sense that they are necessarily the internet domain at which the LSID is resolved? I guess I should read and understand Section 13.3 of the LSID spec before commenting further.
More often, only when the TTLs expire, there being no motivation to do otherwise.
O.K., there's a great analogy that may be useful if implementing a distributed system of synchronized/mirrored biological data servers: should they remain in synch at fixed time intervals? In real time with each data transaction? Or, should some sort of TTL feature be incorporated in data?
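The TTL option could look something like the following sketch, in which a mirror re-fetches a record from a master only once the record's time-to-live has expired. The `TTLMirror` class, the shape of the fetch callback, and the injectable clock are assumptions made for illustration.

```python
import time
from typing import Callable, Dict, Tuple

# A fetch callback returns (record, ttl_in_seconds) for an identifier.
Fetch = Callable[[str], Tuple[str, float]]


class TTLMirror:
    """Hypothetical mirror that serves cached records until their TTL expires."""

    def __init__(
        self,
        fetch_from_master: Fetch,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch_from_master
        self._clock = clock
        # identifier -> (record, expiry time on the clock)
        self._cache: Dict[str, Tuple[str, float]] = {}

    def get(self, identifier: str) -> str:
        now = self._clock()
        cached = self._cache.get(identifier)
        if cached is not None and now < cached[1]:
            return cached[0]            # still fresh: no contact with the master
        record, ttl_seconds = self._fetch(identifier)
        self._cache[identifier] = (record, now + ttl_seconds)
        return record
```

Compared with fixed-interval or per-transaction synchronization, this lets the master decide, record by record, how stale a mirror is allowed to become.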
Much to think about. But time for me to get some work done....
Aloha, Rich