BioGUIDs and the Internet Analogy

Mon Sep 27 23:33:48 CEST 2004

[For me, the bottom line---which however I nowhere state below---is:
There is /so much/ existing free infrastructure source code---e.g.
http://www-124.ibm.com/developerworks/oss/lsid/--- and (apparently)
funding, and (manifestly) professionally designed specifications for
LSID concerns that I am horrified at the prospect of adopting anything
else if LSID comes even close to being what the community needs. Or,
let's forget about LSID and instead of deploying what satisfies 98% of
the needs in six months, we could roll our own and deploy what satisfies
80% of the needs in a few years...

To my mind, the only question is this: is LSID good enough? So: first
make the requirements. Then examine existing solutions. Arguing by
analogy can lead to the "hammer" solution, usually attributed to Mark
Twain: When all you own is a hammer, every problem begins to look like a
nail. (Can someone give me an actual /reference/ to that quote?)]

-------------------------

The idea of modeling taxonomic UUIDs on the Internet has been around for
about 10 years, and has been written about explicitly for at least 3-4
years by Hannu Saarenmaa and probably others. See his article in
http://reports.eea.eu.int/technical_report_2001_70/en/Technical%20Report%2070%20web.pdf

The Tree of Life http://www.tolweb.org is a de facto such model, though
without name authorities or persistence.

The design goals of TCP/IP and DNS, and their implementations, overlap
the requirements of Bio UUIDs only slightly; deep down, perhaps not at
all.

These protocols and the associated address syntax were designed
primarily for /routing/. They were in no way designed to guarantee that
two receipts of the same datum can be connected to one another.

IP addresses are in no way persistent.

IP addresses are not globally unique, albeit in several small and varied
ways:

- The "private address blocks" 10.X.Y.Z, 172.16.0.0-172.31.255.255, and
  192.168.X.Y may be assigned to any machine that your network
  administrator and mine care to, as long as neither machine is served
  on "the Internet" and neither has a duplicate on "an internet";

- every machine implementing IP, besides whatever other addresses it may
  have, is always known to itself as 127.0.0.1;

- the addresses in some IP address ranges are reserved to designate
  /many/ machines (multicast IP).
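
The three exceptions above can be checked mechanically. What follows is a minimal sketch using Python's standard ipaddress module; the function name and the labels it returns are hypothetical, chosen only for illustration.

```python
import ipaddress

# The three private blocks named above, in CIDR notation.
PRIVATE_BLOCKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),   # 172.16.0.0-172.31.255.255
    ipaddress.ip_network("192.168.0.0/16"),
]


def uniqueness_exception(address):
    """Return which of the three exceptions (if any) an IPv4 address falls under."""
    ip = ipaddress.ip_address(address)
    if any(ip in block for block in PRIVATE_BLOCKS):
        return "private"      # reusable on any number of separate networks
    if ip.is_loopback:
        return "loopback"     # every host's address for itself
    if ip.is_multicast:
        return "multicast"    # designates a group of machines
    return "none"             # none of the three exceptions listed above


for sample in ("10.1.2.3", "172.20.0.7", "127.0.0.1", "224.0.0.1", "192.0.2.1"):
    print(sample, uniqueness_exception(sample))
```

The point of the sketch is that the address itself tells you which case applies, which is exactly what an opaque identifier would not do.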

In general, the IP "nomenclature" (i.e. the IP addresses) is designed to
solve routing problems, not identification problems.  IP address syntax
is intentionally not "semantically opaque", whereas semantic opacity is
a requirement (well, a "should be") of LSIDs, and well it should be. See
Section 8 of http://www.omg.org/cgi-bin/doc?dtc/04-05-01

[If you don't see why IP addressing is not semantically opaque, try this
exercise: With a single machine instruction (on most machines), how can
you determine whether or not an IP address is in the Class B private
address space (the 172 stuff above)? Sysadmins, C programmers, and their
families and employees are not eligible for this competition.]
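
For those who give up on the exercise, here is one possible answer, sketched in Python rather than machine code. A single bitwise AND against a 12-bit prefix mask isolates the part of the address that carries the meaning; the comparison that follows is, strictly speaking, a second instruction. The constant and function names are hypothetical.

```python
import ipaddress

PREFIX_MASK = 0xFFF00000      # the top 12 bits of a 32-bit IPv4 address
PRIVATE_172 = 0xAC100000      # 172.16.0.0 written as a 32-bit integer


def in_172_private_block(address):
    """True if the address lies in 172.16.0.0-172.31.255.255."""
    as_int = int(ipaddress.IPv4Address(address))
    # One AND strips the host bits; what remains is pure routing semantics.
    return (as_int & PREFIX_MASK) == PRIVATE_172


print(in_172_private_block("172.20.1.1"))
print(in_172_private_block("172.32.0.1"))
```

That a mask can answer the question at all is the lack of semantic opacity in a nutshell.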

In turn, DNS protocols are designed only to aid discovery of IP
addresses. DNS records are also far from persistent. In fact, DNS
records held anywhere have a "Time To Live" field which must be counted
down until expiration, at which time the holder must acquire a new
instance of the assignment of an IP address to a domain name.
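
The Time To Live countdown can be sketched as a toy cache. This illustrates the expiry rule only and is not a DNS implementation; the class name, the injectable clock, and the example name and address are all assumptions made for the sketch.

```python
import time


class TtlCache:
    """Toy cache of name -> (address, expiry time); not a real DNS resolver."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock        # injectable so expiry can be demonstrated
        self._records = {}

    def put(self, name, address, ttl_seconds):
        """Store an assignment together with the moment it stops being usable."""
        self._records[name] = (address, self._clock() + ttl_seconds)

    def get(self, name):
        """Return the cached address, or None once the TTL has run out."""
        entry = self._records.get(name)
        if entry is None:
            return None
        address, expires_at = entry
        if self._clock() >= expires_at:
            # Expired: the holder must go and acquire the assignment afresh.
            del self._records[name]
            return None
        return address


cache = TtlCache()
cache.put("www.example.org", "192.0.2.10", ttl_seconds=300)
print(cache.get("www.example.org"))
```

Nothing in the record promises that the same name will map to the same address after the next refresh, which is why persistence is not on offer here.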

[More below, interspersed]

Richard Pyle wrote:

>>Perhaps it would be useful to look at the issues being discussed about
>>a bio identifier/locator/GUID in comparison to the same things that are
>>needed for Internet communications.
>
>
> I've long thought that parts of the DNS system would be extremely useful to
> emulate in some aspects of bioinformatics data management (particularly
> taxonomic names; see below).
>
>
>>IP addresses have to be unique world-wide to make the Internet work.
>>The Internet Corporation for Assigned Names and Numbers (ICANN -
>>www.icann.org) provides that uniqueness by assigning all the IP numbers
>>in unique blocks or ranges of numbers to "Internet Registries".
>
>
> ...exactly the way that I envision an organization like GBIF would be
> charged with the task of issuing UIDs for certain biological objects.
>
>
>>There are Regional, National and Local Internet Registries that subdivide
>>and "license" IP addresses to ISPs, who in turn license IP addresses to
>>organizations.

Ah, this is accurate mainly for IPv6, which is much less chaotic than
IPv4 (most of the current Internet), which in turn is much less chaotic
than the nearly formless void that was IPv2. But again, ultimate
"licensees" do not get persistent IP addresses. In the US, virtually all
dialup users get a different IP address every time they connect, and
most home broadband users keep their IP addresses only by accident, and
only if they don't disconnect from the network for very long.

It's well worth comparing the design goals of IPv6 as articulated in
http://www.apnic.net/docs/policy/ipv6-address-policy.html
with those of LSID as articulated in Section 8 of the Draft Final
Specification http://www.omg.org/cgi-bin/doc?dtc/04-05-01

>
> There could be a useful analog for this in bioinformatics (particularly in
> terms of individual institutions serving as regional registries for specimen
> UIDs, or IC_N Commissions serving as "regional" registries for taxon name
> UID assignment) -- but there doesn't necessarily have to be.
>

In fact, if the UUIDs are meant to be semantically opaque, it matters not
one whit by whom or how these matters are settled. Exceptions to that are
social, not technical. ("If you don't let me decide X, I am not going to
use your scheme." "OK, then you won't participate in its benefits.
That's fine with me.")

If you want to see another example of lack of semantic opacity, read the
ISBN standard ISO/TC 46/SC 9 N 326. Part of certain ISBNs can help you
determine an allegedly common publishing-germane attribute of the US,
Zimbabwe, Puerto Rico, Ireland, Swaziland, part of Canada, and a few
other "regions".
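
To make the ISBN point concrete: the leading "registration group" element of an ISBN encodes a language area or country. Below is a minimal sketch for 10-digit ISBNs, assuming a deliberately partial table of single-digit groups. The table and the function name are illustrative and not quoted from the standard.

```python
# Partial table of single-digit ISBN registration groups (illustrative only).
REGISTRATION_GROUPS = {
    "0": "English-language area",
    "1": "English-language area",
    "2": "French-language area",
    "3": "German-language area",
    "4": "Japan",
}


def isbn10_group(isbn):
    """Describe the registration group of a 10-digit ISBN, if it is in the table."""
    digits = isbn.replace("-", "").replace(" ", "")
    if len(digits) != 10:
        raise ValueError("expected a 10-digit ISBN")
    # The first digit alone is enough for the single-digit groups listed here.
    return REGISTRATION_GROUPS.get(digits[0], "not in this partial table")


print(isbn10_group("0-306-40615-2"))
print(isbn10_group("3-16-148410-0"))
```

As with IP addresses, part of the identifier carries meaning about who issued it, so the identifier is not opaque.
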
>
>>So, there is a hierarchy of how the "unique identifiers" are managed.
>>There is in fact a central authority, but it delegates to decentralized
>>authorities.

But this is mainly to distribute costs and speed issuance. It has
nothing to do with the naming scheme. The number of organizations to be
issued Bio GUIDs is surely several orders of magnitude smaller than the
number to be issued IPv6 addresses. So I doubt any IPv6 issuance
mechanisms are instructive, at least in their purpose (and hence, if
well implemented, in their implementation).

>
> To emulate this in bioinformatics, the "hierarchy" would be achieved simply
> by allowing block-assignment of UIDs to various players -- but the important
> point here is that only *one* organization ensures uniqueness (in the case
> of Internet, of ISPs).  The data to which those UIDs apply would be, for the
> most part, the responsibility of the UID recipient, not the UID issuer (in
> my world view). Thus: centralized issuance; delegated application.
>
>
>>Is there an analogy for BioGUIDs to have a central body who divvies out
>>the unique numbers (like IP addresses) to decentralized bodies or large
>>organizations?

The International ISBN Agency http://www.isbn-international.org/
follows roughly the IPv6 model.

>
> GBIF seems to me to be the principle contender.

I enthusiastically agree. Also the /principal/ contender. [Sorry,
couldn't resist. My fingers slip on that one sometimes too.]
>
>
>>Since IP addresses are hard to memorize (and so too would be a BioGUID),
>>"domain names" are used. Starting with a domain name, you can first find
>>the name and/or IP address of a device, called the Domain Name Server,
>>that can locate the IP address of other computers.  This is a form of
>>indirect addressing.  ICANN also manages the top-level namespace for the
>>Internet. They decide what the valid domain "extensions" are (like .com,
>>.uk) so that everybody, everywhere knows where to look them up. Then, the
>>domain name extensions are separated among the Regional, National, and
>>Local Internet Registries around the world.  There is a scheme for where
>>to find the IP addresses for every domain extension (e.g. .com is on the
>>ARIN registry, .com.uk is on the ).

Not exactly. There is one scheme in case your application can't resolve
it in a more nearly "local" facility. There are /lots/ of ways to find
an IP address from a domain name. All those which comply fully with the
DNS protocol, however, can make available two pieces of metadata: the
TTL of the record being offered, and the IP address of a machine at
which you can find an authoritative record of the assignment of the DNS
name to the IP address. This protocol /might/, but you hope on
performance grounds usually /doesn't/, lead you up as far as the root
servers, and the "one scheme to bind them all". If there is any lesson
here at all, it is that name resolution protocols matter, but resolution
implementations don't. Yet another attribute on which DNS/IP and LSID
are not distinguishable.
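
A rough sketch of "protocols matter, implementations don't": any source that hands back an answer of the agreed shape (address, TTL, pointer to an authority) is interchangeable with any other. Real DNS involves referrals and a wire protocol, none of which is modelled here; the record type, the two sources, and every name and address in them are hypothetical.

```python
from typing import Callable, NamedTuple, Optional, Sequence


class Answer(NamedTuple):
    address: str      # the IP address assigned to the name
    ttl: int          # seconds for which this record may be reused
    authority: str    # where an authoritative record can be found


# Any callable mapping a name to an Answer (or None) counts as a source.
Source = Callable[[str], Optional[Answer]]


def resolve(name: str, sources: Sequence[Source]) -> Optional[Answer]:
    """Ask each source in turn, nearest first; stop at the first answer."""
    for lookup in sources:
        answer = lookup(name)
        if answer is not None:
            return answer
    return None


# Two hypothetical sources holding made-up records.
NEARBY = {"www.example.org": Answer("192.0.2.10", 300, "ns1.example.org")}
DISTANT = {
    "www.example.org": Answer("192.0.2.10", 3600, "ns1.example.org"),
    "info.example.org": Answer("192.0.2.11", 3600, "ns1.example.org"),
}

print(resolve("www.example.org", [NEARBY.get, DISTANT.get]))
print(resolve("info.example.org", [NEARBY.get, DISTANT.get]))
```

The caller never learns, and need not care, whether the answer came from next door or from the far end of the chain.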

>>Then there is a layer of Domain Registrars who have been accredited by
>>ICANN to assign domain names for the domain extensions - e.g. tdwg.org.
>>The domain name registrars are told by the owner of the domain where to
>>find their particular Domain Name Servers which may be many to enable
>>redundancy - Primary, Secondary, Tertiary, etc.

Not quite. Normally, the registrant tells the registrar which machines
have agreed to be the name servers. The other case sometimes happens
with "retailers" who are selling individuals domain names and ISP
services at the same time.

>>These redundant Domain Name Servers synchronize with each other at
>>particular times of day and may be located all around the world.

More often, only when the TTLs expire, there being no motivation to do
otherwise.

>>They are the main "switchboard" for a particular organization's computer
>>names and associated IP addresses. Then the individual organization can
>>create multiple computers for the domain name - e.g. www.tdwg.org - and
>>add them to the Domain Name Server listing.  There can be many computers
>>for a domain, for instance: info.tdwg.org, www2.tdwg.org,
>>myname.tdwg.org. Each of these can be a different computer with a
>>different IP address.  The redundant Domain Name Servers all contain the
>>list of all these names and what IP addresses they are.

Not usually. The primary and secondary name servers would normally only
cache tdwg.org permanently. They might /acquire/ a record for
www.tdwg.org in response to a request, but they would not in general
renew it after it expired and maybe not even keep it that long. To do so
would be hideously unscalable. If I put 10,000 machines in my domain, my
primary and secondary would be mighty unhappy if they had to keep them
all cached.

>
>
> This is analogous in many ways to how I would envision a global taxonomic
> name service.  UIDs are assigned by a centralized body (e.g., GBIF; or by
> the IC_N Commissions) to individual names.  Analogous to multiple redundant
> Domain Name Servers (DNS) would be Taxon Name Servers (TNS).  Rather than
> administered by one organization (e.g., GBIF, ITIS, Species 2000, uBio,
> etc.) these TSNs would be replicated on dozens or hundreds of servers all
> over the world, and maintained as synchronized within some reasonable time
> unit.  Changes to any one replicate would be automatically propagated to all
> replicates (either chaotically, or more strictly through one or a few
> defined "hubs").  Instead of Domain names as surrogates for IP addresses,
> there would be fully qualified "Basionyms" (e.g.,
> "OriginalGenusName.OriginalSpeciesName.OriginalAuthor.DescriptionYear.Page.OtherOriginalCitationDetailsAsNeeded")
> representations of the less-human-friendly GUIDs (analogues to IP
> addresses).  Ideally, this system wouldn't be limited to just taxonomic
> names, but extended to all taxonomic concepts, so that the "Domain Name"
> analogue would be extended to something like:
>
> "OriginalGenusName.OriginalSpeciesName.OriginalAuthor.DescriptionYear.Page.OtherOriginalCitationDetailsAsNeeded_AppliedGenusName.AppliedSpeciesSpelling.ConceptAuthor.ConceptYear.Page.OtherConceptCitationDetailsAsNeeded"
>
>
>>The players in the Internet networking fabric all now play by these
>>layered rules. They all know them and follow them in order to keep the
>>Internet running. This stuff happens out of sight to everyone but the
>>networking people and we all take it for granted and assume it is simple.
>>But, it's invisible not because it's simple, but rather because it's
>>disciplined.

Agreed. Yet one more attribute where IP/DNS and LSID are not
distinguishable.

>
>
> Excellent synopsis, and (in my opinion) an excellent model to follow for
> at least taxonomic names/concepts data.  Perhaps also for specimen data (but
> seems less intuitive for that.) This comes back to my earlier question about
> whether it is vital that all bioinformatics GUIDs be of the same scheme; or
> whether different schemes might be optimal for different classes of objects.
>
> Aloha,
> Rich
>
> Richard L. Pyle, PhD
> Natural Sciences Database Coordinator, Bishop Museum
> 1525 Bernice St., Honolulu, HI 96817
> Ph: (808)848-4115, Fax: (808)847-8252
> email: deepreef at bishopmuseum.org
> http://www.bishopmuseum.org/bishop/HBS/pylerichard.html