Re: BioGUIDs and the Internet Analogy
[For me, the bottom line (which, however, I nowhere state below) is: There is /so much/ existing free infrastructure source code (e.g. http://www-124.ibm.com/developerworks/oss/lsid/), (apparently) funding, and (manifestly) professionally designed specifications for LSID concerns that I am horrified at the prospect of adopting anything else if LSID comes even close to being what the community needs. Or, let's forget about LSID and, instead of deploying what satisfies 98% of the needs in six months, roll our own and deploy what satisfies 80% of the needs in a few years...
To my mind, the only question is this: is LSID good enough? So: first make the requirements. Then examine existing solutions. Arguing by analogy can lead to the "hammer" solution, usually attributed to Mark Twain: When all you own is a hammer, every problem begins to look like a nail. (Can someone give me an actual /reference/ to that quote????)]
-------------------------
The idea of modeling taxonomic uuids on the internet has been around for about 10 years, and has been written about explicitly for at least 3-4 years by Hannu Saarenmaa and probably others. See his article in http://reports.eea.eu.int/technical_report_2001_70/en/Technical%20Report%207...
The Tree of Life http://www.tolweb.org is de facto such a model, though without name authorities or persistence.
The design goals of TCP/IP and DNS, and their implementation, intersect the requirements of Bio UUIDs only in a very small set, in fact, deep down perhaps not at all.
These protocols and the associated address syntax were designed primarily for /routing/. They were not designed in any way to guarantee any connection between two occurrences of a datum that is received twice.
IP addresses are in no way persistent.
IP addresses are not globally unique, albeit in several small and varied ways:
- the "private address blocks" 10.X.Y.Z, 172.16.0.0-172.31.255.255, and 192.168.X.Y may be assigned to any machine that your network administrator and mine care to, as long as neither machine is served on "the Internet" and neither address is duplicated within "an internet";
- every machine implementing IP, besides whatever other addresses it may have, is always known to itself as 127.0.0.1;
- the addresses in some IP address ranges are reserved to designate /many/ machines (multicast IP).
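The three cases in the list above can be checked mechanically. What follows is only a minimal sketch using Python's standard ipaddress module; the function name non_unique_reasons and the sample addresses are mine, chosen for illustration.

```python
import ipaddress

# The three private blocks named in the list above, in CIDR notation.
PRIVATE_BLOCKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def non_unique_reasons(addr):
    """Return the reasons (if any) why this IPv4 address is not globally unique."""
    ip = ipaddress.ip_address(addr)
    reasons = []
    if any(ip in block for block in PRIVATE_BLOCKS):
        reasons.append("private")    # reusable on any number of separate networks
    if ip.is_loopback:
        reasons.append("loopback")   # every machine's name for itself
    if ip.is_multicast:
        reasons.append("multicast")  # designates many machines at once
    return reasons


for sample in ("10.1.2.3", "172.31.255.255", "127.0.0.1", "224.0.0.1", "8.8.8.8"):
    print(sample, non_unique_reasons(sample))
```

An empty list only means that none of these three exceptions applies. It says nothing about persistence.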
In general, the IP "nomenclature" (i.e. the IP addresses) is designed to solve routing problems, not identification problems. IP address syntax is intentionally not "semantically opaque", contrary to a requirement (well, a "should be") of LSIDs, and well it should be. See Section 8 of http://www.omg.org/cgi-bin/doc?dtc/04-05-01
[If you don't see why IP addressing is not semantically opaque, try this exercise: With a single machine instruction (on most machines), how can you determine whether or not an IP address is in the Class B private address space (the 172 stuff above)? Sysadmins, C programmers, and their families and employees are not eligible for this competition.]
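For readers who would rather not enter the competition, here is one possible answer, sketched in Python rather than in machine code: treat the address as a 32-bit integer and AND it with a mask that keeps the top 12 bits, then compare the result with the prefix. The constant and function names are my own.

```python
import ipaddress

MASK_12_BITS = 0xFFF00000  # keeps only the top 12 bits of a 32-bit address
PREFIX_172_16 = 0xAC100000  # 172.16.0.0 as an integer (172 = 0xAC, 16 = 0x10)


def in_172_private_block(addr):
    """True if addr lies in 172.16.0.0-172.31.255.255."""
    value = int(ipaddress.IPv4Address(addr))
    # The AND is the whole test; the comparison just reads off the result.
    return (value & MASK_12_BITS) == PREFIX_172_16


print(in_172_private_block("172.20.1.1"))
print(in_172_private_block("172.32.0.1"))
```

That a fixed bit pattern answers the question at all is the point: the bits of the address carry meaning.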
In turn, DNS protocols are designed only to aid discovery of IP addresses. DNS name assignments are also far from persistent: DNS records held anywhere have a "Time To Live" field which must be counted down until expiration, at which time the holder must acquire a new instance of the assignment of an IP address to a domain name.
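To make the "Time To Live" behaviour concrete, here is a toy cache, not a DNS implementation. The class name TtlCache, the injectable clock, and the example name and address are all assumptions made for illustration.

```python
import time


class TtlCache:
    """Toy name-to-address cache whose entries expire after their TTL."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock   # injectable so expiry can be demonstrated
        self._records = {}    # name -> (address, expiry time)

    def put(self, name, address, ttl_seconds):
        self._records[name] = (address, self._clock() + ttl_seconds)

    def get(self, name):
        """Return the cached address, or None if absent or expired."""
        entry = self._records.get(name)
        if entry is None:
            return None
        address, expires_at = entry
        if self._clock() >= expires_at:
            # The holder no longer knows the assignment and must ask again.
            del self._records[name]
            return None
        return address


# Demonstration with a fake clock that we advance by hand.
now = [0.0]
cache = TtlCache(clock=lambda: now[0])
cache.put("www.example.org", "192.0.2.10", ttl_seconds=300)
print(cache.get("www.example.org"))  # still within the TTL
now[0] = 300.0
print(cache.get("www.example.org"))  # expired: the assignment must be re-acquired
```

Nothing in such a record promises that the next lookup will return the same address, which is the sense in which the assignment is not persistent.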
[More below, interspersed]
Richard Pyle wrote:
Perhaps it would be useful to look at the issues being discussed about a bio identifier/locator/GUID in comparison to the same things that are needed for Internet communications.
I've long thought that parts of the DNS system would be extremely useful to emulate in some aspects of bioinformatics data management (particularly taxonomic names; see below).
IP addresses have to be unique world-wide to make the Internet work. The Internet Corporation for Assigned Names and Numbers (ICANN - www.icann.org) provides that uniqueness by assigning all the IP numbers in unique blocks or ranges of numbers to "Internet Registries".
...exactly the way that I envision an organization like GBIF would be charged with the task of issuing UIDs for certain biological objects.
There are Regional, National and Local Internet Registries that subdivide and "license" IP addresses to ISPs, who in turn license IP addresses to organizations.
Ah, this is accurate mainly for IPv6, which is much less chaotic than IPv4 (most of the current Internet), which in turn is much less chaotic than the nearly formless void that was IPv2. But again, ultimate "licensees" do not get persistent IP addresses. In the US, virtually all dialup users get a different IP address every time they connect, and most home broadband users keep their IP addresses only accidentally, and only if they don't disconnect from the network for very long.
It's well worth comparing the design goals of IPv6 as articulated in http://www.apnic.net/docs/policy/ipv6-address-policy.html with those of LSID as articulated in Section 8 of the Draft Final Specification http://www.omg.org/cgi-bin/doc?dtc/04-05-01
There could be a useful analog for this in bioinformatics (particularly in terms of individual institutions serving as regional registries for specimen UIDs, or IC_N Commissions serving as "regional" registries for taxon name UID assignment) -- but there doesn't necessarily have to be.
In fact, if the UUIDs are meant to be semantically opaque, it matters not one whit who settles these matters or how. Exceptions to that are social, not technical. ("If you don't let me decide X, I am not going to use your scheme". "OK, then you won't participate in its benefits. That's fine with me".)
If you want to see another example of lack of semantic opacity, read the ISBN standard ISO/TC 46/SC 9 N 326. Part of certain ISBNs can help you determine an allegedly common publishing-germane attribute of the US, Zimbabwe, Puerto Rico, Ireland, Swaziland, part of Canada, and a few other "regions".
So, there is a hierarchy of how the "unique identifiers" are managed. There is in fact a central authority, but it delegates to decentralized authorities.
But this is mainly to distribute costs and speed issuance. It has nothing to do with the naming scheme. The number of organizations to be issued Bio GUIDs is surely several orders of magnitude smaller than the number to be issued IPv6 addresses. So I doubt any IPv6 issuance mechanisms are instructive, at least in their purpose (and hence, if well implemented, in their implementation).
To emulate this in bioinformatics, the "hierarchy" would be achieved simply by allowing block-assignment of UIDs to various players -- but the important point here is that only *one* organization ensures uniqueness (in the case of Internet, of ISPs). The data to which those UIDs apply would be, for the most part, the responsibility of the UID recipient, not the UID issuer (in my world view). Thus: centralized issuance; delegated application.
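"Centralized issuance; delegated application" can be sketched in a few lines. This is only an illustration under assumptions of mine: the class name CentralIssuer, integer identifiers, a fixed block size of 1000, and the recipient names are all hypothetical and say nothing about how GBIF or anyone else would actually do it.

```python
class CentralIssuer:
    """Single authority that hands out non-overlapping blocks of integer UIDs."""

    def __init__(self, block_size=1000):
        self._block_size = block_size
        self._next_start = 0
        self.assignments = {}  # recipient -> list of blocks issued to it

    def issue_block(self, recipient):
        """Issue the next unused block; what each UID denotes is the recipient's business."""
        block = range(self._next_start, self._next_start + self._block_size)
        self._next_start += self._block_size
        self.assignments.setdefault(recipient, []).append(block)
        return block


issuer = CentralIssuer()
museum_block = issuer.issue_block("Museum A")        # hypothetical recipient
herbarium_block = issuer.issue_block("Herbarium B")  # hypothetical recipient

# Uniqueness is guaranteed by the one issuer, not by the recipients.
print(set(museum_block).isdisjoint(herbarium_block))
```

The issuer records only who holds which block. It never needs to see the data that the recipients attach to their UIDs.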
Is there an analogy for BioGUIDs to have a central body who divvies out the unique numbers (like IP addresses) to decentralized bodies or large organizations?
The International ISBN Organization http://www.isbn-international.org/ is roughly the IPv6 model.
GBIF seems to me to be the principle contender.
I enthusiastically agree. Also the /principal/ contender. [Sorry, couldn't resist. My fingers slip on that one sometimes too.]
Since IP addresses are hard to memorize (and so too would be a BioGUID), "domain names" are used. Starting with a domain name, you can first find the name and/or IP address of a device, called the Domain Name Server, that can locate the IP address of other computers. This is a form of indirect addressing. ICANN also manages the top-level namespace for the Internet. They decide what the valid domain "extensions" are (like .com, .uk) so that everybody, everywhere knows where to look them up. Then, the domain name extensions are separated among the Regional, National, and Local Internet Registries around the world. There is a scheme for where to find the IP addresses for every domain extension (e.g. .com is on the ARIN registry, .com.uk is on the ).
Not exactly. There is one scheme in case your application can't resolve it in a more nearly "local" facility. There are /lots/ of ways to find an IP address from a domain name. All those which comply fully with the DNS protocol, however, can make available two pieces of metadata: the TTL of the record they are offering, and the IP address of a machine at which you can find an authoritative record of the assignment of the DNS name to the IP address. This protocol /might/, but you hope on performance grounds usually /doesn't/, lead you up as far as the root servers, and the "one scheme to bind them all". If there is any lesson here at all, it is that name resolution protocols matter, but resolution implementations don't. Yet another attribute on which DNS/IP and LSID are not distinguishable.
Then there is a layer of Domain Registrars who have been accredited by ICANN to assign domain names for the domain extensions - e.g. tdwg.org. The domain name registrars are told by the owner of the domain where to find their particular Domain Name Servers, which may be many to enable redundancy - Primary, Secondary, Tertiary, etc.
Not quite. Normally, the registrant tells the registrar which hosts have agreed to act as the name servers. The other case sometimes happens with "retailers" who are selling individuals domain names and ISP services at the same time.
These redundant Domain Name Servers synchronize with each other at particular times of day and may be located all around the world.
More often, only when the TTLs expire, there being no motivation to do otherwise.
They are the main "switchboard" for a particular organization's computer names and associated IP addresses. Then the individual organization can create multiple computers for the domain name - e.g. www.tdwg.org - and add them to the Domain Name Server listing. There can be many computers for a domain, for instance: info.tdwg.org, www2.tdwg.org, myname.tdwg.org. Each of these can be a different computer with a different IP address. The redundant Domain Name Servers all contain the list of all these names and what IP addresses they are.
Not usually. The primary and secondary name servers would normally only cache tdwg.org permanently. They might /acquire/ a record for www.tdwg.org in response to a request, but they would not in general renew it after it expired and maybe not even keep it that long. To do so would be hideously unscalable. If I put 10,000 machines in my domain, my primary and secondary would be mighty unhappy if they had to keep them all cached.
This is analogous in many ways to how I would envision a global taxonomic name service. UIDs are assigned by a centralized body (e.g., GBIF; or by the IC_N Commissions) to individual names. Analogous to multiple redundant Domain Name Servers (DNS) would be Taxon Name Servers (TNS). Rather than administered by one organization (e.g., GBIF, ITIS, Species 2000, uBio, etc.) these TNSs would be replicated on dozens or hundreds of servers all over the world, and maintained as synchronized within some reasonable time unit. Changes to any one replicate would be automatically propagated to all replicates (either chaotically, or more strictly through one or a few defined "hubs"). Instead of Domain names as surrogates for IP addresses, there would be fully qualified "Basionyms" (e.g., "OriginalGenusName.OriginalSpeciesName.OriginalAuthor.DescriptionYear.Page.OtherOriginalCitationDetailsAsNeeded") as representations of the less-human-friendly GUIDs (analogues to IP addresses). Ideally, this system wouldn't be limited to just taxonomic names, but extended to all taxonomic concepts, so that the "Domain Name" analogue would be extended to something like:
"OriginalGenusName.OriginalSpeciesName.OriginalAuthor.DescriptionYear.Page.OtherOriginalCitationDetailsAsNeeded_AppliedGenusName.AppliedSpeciesSpelling.ConceptAuthor.ConceptYear.Page.OtherConceptCitationDetailsAsNeeded"
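The name-to-GUID indirection described above might look something like the following toy sketch. Everything here is hypothetical: the function qualified_basionym, the class TaxonNameServer, the use of random UUIDs as stand-ins for issued GUIDs, and the placeholder field values (which are not real taxonomic data).

```python
import uuid


def qualified_basionym(genus, species, author, year, page, *other_details):
    """Join the citation fields with dots, in the order given in the text above."""
    parts = [genus, species, author, str(year), str(page), *other_details]
    return ".".join(parts)


class TaxonNameServer:
    """Toy lookup table from a human-readable name to an opaque identifier."""

    def __init__(self):
        self._by_name = {}

    def register(self, name):
        # A random UUID stands in for whatever GUID a central body would issue.
        if name not in self._by_name:
            self._by_name[name] = uuid.uuid4()
        return self._by_name[name]

    def resolve(self, name):
        """Return the identifier for a name, or None if it is unknown."""
        return self._by_name.get(name)


# Placeholder values only.
name = qualified_basionym("Genusname", "speciesname", "Author", 1900, 12)
server = TaxonNameServer()
guid = server.register(name)
print(name)
print(server.resolve(name) == guid)
```

One thing the sketch leaves open is what to do about dots inside a field (an abbreviated author name, say), which would make the dotted string ambiguous to parse.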
The players in the Internet networking fabric all now play by these layered rules. They all know them and follow them in order to keep the Internet running. This stuff happens out of sight to everyone but the networking people and we all take it for granted and assume it is simple. But, it's invisible not because it's simple, but rather because it's disciplined.
Agreed. Yet one more attribute where IP/DNS and LSID are not distinguishable.
Excellent synopsis, and (in my opinion) an excellent model to follow for at least taxonomic names/concepts data. Perhaps also for specimen data (but it seems less intuitive for that). This comes back to my earlier question about whether it is vital that all bioinformatics GUIDs be of the same scheme, or whether different schemes might be optimal for different classes of objects.
Aloha, Rich
Richard L. Pyle, PhD
Natural Sciences Database Coordinator, Bishop Museum
1525 Bernice St., Honolulu, HI 96817
Ph: (808)848-4115, Fax: (808)847-8252
email: deepreef@bishopmuseum.org
http://www.bishopmuseum.org/bishop/HBS/pylerichard.html
participants (1)
-
Bob Morris