RE: BioGUIDs and the Internet Analogy

I think we just need to be clear that there are, on the one hand, the BioGUIDs themselves and how they are created/assigned and, on the other hand, the method by which someone else finds them later. On the Internet there are IP addresses and how they are assigned (albeit potentially dynamically), and there is the method for someone else to find them. Internet references are rarely given as an IP address (http://157.140.2.10), certainly in part because addresses can be dynamic, but rather as URLs (http://www.tdwg.org) that are then resolved through the domain resolution process.

This approach enables the two things (number and name) to be managed separately, yet linked together.

Don't we need the same two mechanisms for BioGUIDs? Does the LSID spec address both parts of this? I guess I had better get over there and read the whole thing.

Regarding:
>>IP addresses are in no way persistent.
>>IP addresses are not globally unique. [But they are unique on the Internet]

Making analogies is tough because the details of the example can get in the way, as in this case. IP addresses have several attributes. The point is not whether IP addresses are analogous to BioGUIDs on the attributes of static/changeable or universality. The point is that the IP address is analogous because it is a cryptic, non-intelligible and unique number used to locate another network device by a machine that starts out with no idea where it is. That's the analogy: IP addresses and BioGUIDs are unique, nonintelligible (i.e. semantically opaque) strings of characters/numbers.

The issue is: how in the world do you find the BioGUID you are trying to get to? Do you just use the nonintelligible string? Or do we want somewhat intelligible names as pointers to the nonintelligible?
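
A minimal sketch of the "intelligible names as pointers" option, in Python. The Registry class, the example names, and the record fields are all hypothetical and invented for illustration; nothing here is taken from the LSID spec. It only shows the number and the name being managed separately yet linked together.

```python
import uuid


class Registry:
    """Toy two-level lookup: readable name -> opaque identifier -> record."""

    def __init__(self):
        self._id_to_record = {}  # opaque identifier -> record
        self._name_to_id = {}    # human-readable name -> opaque identifier

    def register(self, name, record):
        """Mint an opaque identifier for the record and point a name at it."""
        guid = str(uuid.uuid4())  # carries no meaning about the record
        self._id_to_record[guid] = record
        self._name_to_id[name] = guid
        return guid

    def add_alias(self, alias, guid):
        """Point an additional readable name at an existing identifier."""
        if guid not in self._id_to_record:
            raise KeyError(guid)
        self._name_to_id[alias] = guid

    def resolve(self, name):
        """Follow name -> identifier -> record."""
        return self._id_to_record[self._name_to_id[name]]


# Hypothetical usage: two names, one identifier, one record.
registry = Registry()
guid = registry.register(
    "specimens.example.org/MO-0001",
    {"institution": "MO", "catalog": "0001"},
)
registry.add_alias("mo-0001", guid)
```

Names can be added or changed without touching the identifier, which is the property the DNS/IP split provides.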

Regarding:
>>The number of organizations to be issued Bio GUIDs surely is several orders of magnitude less than those to be issued IPv6 addresses.

True. But, there are potentially billions of BioGUIDs to be created, which could exceed the number of IPv6 addresses. Some institutions would be the equivalent of a large Internet Registry managing millions of BioGUIDs.

This seems like a big issue to me. Would GBIF/TDWG or whoever issue millions/billions of BioGUIDs directly to institution records and not decentralize by assigning blocks of BioGUIDs to institutions?
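
To make the block-assignment alternative concrete, here is a rough Python sketch under stated assumptions: identifiers are plain integers, a single central issuer hands out contiguous blocks, and each institution mints from its own blocks. CentralIssuer, Institution, and the block size are hypothetical names and values chosen for illustration, not a proposal for how GBIF/TDWG would do it.

```python
class CentralIssuer:
    """Hands out non-overlapping blocks of numbers; the only party that
    has to guarantee uniqueness."""

    def __init__(self, block_size):
        self.block_size = block_size
        self._next_start = 0
        self.blocks = {}  # institution name -> list of assigned ranges

    def assign_block(self, institution):
        block = range(self._next_start, self._next_start + self.block_size)
        self._next_start += self.block_size
        self.blocks.setdefault(institution, []).append(block)
        return block


class Institution:
    """Mints identifiers locally, asking for a new block only when needed."""

    def __init__(self, name, issuer):
        self.name = name
        self._issuer = issuer
        self._unused = iter(())  # no block assigned yet

    def mint(self):
        # Take the next unused number from the current block, if any.
        for number in self._unused:
            return number
        # Current block exhausted: request a fresh one from the issuer.
        self._unused = iter(self._issuer.assign_block(self.name))
        return next(self._unused)


issuer = CentralIssuer(block_size=3)
garden = Institution("garden", issuer)
museum = Institution("museum", issuer)

garden_ids = [garden.mint() for _ in range(4)]  # spills into a second block
museum_ids = [museum.mint() for _ in range(2)]
```

The issuer is contacted once per block rather than once per record, so the central load scales with the number of blocks, not the number of BioGUIDs.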

Also, when there are billions of something, efficiency becomes an issue. IP addresses follow a binary number approach that leads to efficient processing. Is there an efficiency issue lurking ahead for the BioGUID?

Regarding:
>>...name resolution protocols matter, but resolution implementations don't...
This is true when you have working resolution implementations as the Internet does. But does LSID have a working resolution implementation now? If not, creating a new one that works is not that simple I think. That would be a good reason for piggy-backing on the existing and working Internet DNS/domain name resolution system as I've heard mentioned.
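
For what it is worth, here is a small Python sketch of how piggy-backing on DNS could look, assuming the LSID syntax urn:lsid:authority:namespace:object[:revision] and assuming, as I understand the draft, that the authority is normally a DNS domain name located through a DNS SRV record named _lsid._tcp.<authority>. The example identifier, the Lsid type, and the function names are hypothetical; the sketch parses the string and builds the query name only, it does not contact any server.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(text: str) -> Lsid:
    """Split an LSID of the form urn:lsid:authority:namespace:object[:revision]."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError("expected 5 or 6 colon-separated parts: %r" % text)
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError("not an LSID: %r" % text)
    authority, namespace, object_id = parts[2:5]
    if not (authority and namespace and object_id):
        raise ValueError("empty component in %r" % text)
    revision = parts[5] if len(parts) == 6 else None
    return Lsid(authority, namespace, object_id, revision)


def srv_query_name(lsid: Lsid) -> str:
    """DNS name one would query for an SRV record (assumed convention)."""
    return "_lsid._tcp." + lsid.authority


# Hypothetical identifier, for illustration only.
example = parse_lsid("urn:lsid:example.org:specimens:12345")
```

Under those assumptions, finding the server for an LSID reduces to an ordinary DNS lookup on the authority part, so no new global resolution infrastructure has to be built.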

Chuck Miller
CIO
Missouri Botanical Garden

-----Original Message-----
From: Bob Morris [mailto:ram@CS.UMB.EDU]
Sent: Monday, September 27, 2004 10:34 PM
To: TDWG-SDD@LISTSERV.NHM.KU.EDU
Subject: Re: BioGUIDs and the Internet Analogy

[For me, the bottom line---which however I nowhere state below---is: There is /so much/ existing free infrastructure source code---e.g. http://www-124.ibm.com/developerworks/oss/lsid/---and (apparently) funding, and (manifestly) professionally designed specifications for LSID concerns that I am horrified at the prospect of adopting anything else if LSID comes even close to being what the community needs. Or, let's forget about LSID and instead of deploying what satisfies 98% of the needs in six months, we could roll our own and deploy what satisfies 80% of the needs in a few years...

To my mind, the only question is this: is LSID good enough? So: first make the requirements. Then examine existing solutions. Arguing by analogy can lead to the "hammer" solution, usually attributed to Mark Twain: When all you own is a hammer, every problem begins to look like a nail. (Can someone give me an actual /reference/ to that quote?)]

-------------------------

The idea of modeling taxonomic uuids on the internet has been around for about 10 years, and has been written about explicitly for at least 3-4 by Hannu Saarenmaa and probably others. See his article in http://reports.eea.eu.int/technical_report_2001_70/en/Technical%20Report%2070%20web.pdf

The Tree of Life http://www.tolweb.org is a de facto such model, though without name authorities or persistence.

The design goals of TCP/IP and DNS, and their implementation, intersect the requirements of Bio UUIDs only in a very small set, in fact, deep down perhaps not at all.

These protocols and the associated address syntax were designed primarily for /routing/, not in any way designed to guarantee that a datum twice received has any connection between the two occurrences.

IP addresses are in no way persistent.

IP addresses are not globally unique, albeit in several small and varied ways:

- The "private address blocks" 10.X.Y.Z, 172.16.0.0-172.31.255.255, and 192.168.X.Y may be assigned to any machine your network administrator or mine cares to, as long as neither machine is served on "the Internet", nor has a duplicate on "an internet".

- Every machine implementing IP, besides whatever other addresses it may have, is always known to itself as 127.0.0.1.

- The addresses in some IP address ranges are reserved to designate /many/ machines (multicast IP).

In general the IP "nomenclature"---i.e. the IP addresses---is designed to solve routing problems, not identification problems. IP address syntax is intentionally not "semantically opaque", contrary to a requirement (well, a "should be") of LSIDs, and well it should be. See Section 8 of http://www.omg.org/cgi-bin/doc?dtc/04-05-01

[If you don't see why IP addressing is not semantically opaque, try this exercise: With a single machine instruction (on most machines) how can you determine whether or not an IP address is in the Class B private address space (the 172 stuff above)? Sysadmins, C programmers, and their families and employees are not eligible for this competition]
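
For anyone who would rather see the answer than compete: a short Python sketch of the check, assuming the block in question is 172.16.0.0 through 172.31.255.255 (a 12-bit prefix). The function name is made up for illustration. The membership test is one bitwise AND against a mask, followed by a comparison, which is only possible because the address bits carry meaning.

```python
import ipaddress

# 172.16.0.0 as a 32-bit integer, and the mask that keeps the top 12 bits.
PRIVATE_172_PREFIX = int(ipaddress.IPv4Address("172.16.0.0"))  # 0xAC100000
PREFIX_MASK = int(ipaddress.IPv4Address("255.240.0.0"))        # 0xFFF00000


def in_172_private_block(address: str) -> bool:
    """True if the dotted-quad address lies in 172.16.0.0-172.31.255.255."""
    value = int(ipaddress.IPv4Address(address))
    # One AND strips the host bits; what is left must equal the prefix.
    return (value & PREFIX_MASK) == PRIVATE_172_PREFIX
```

A semantically opaque identifier would admit no such shortcut: no part of the string could be masked off to learn anything about the thing it names.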

In turn, DNS protocols are designed only to aid discovery of IP addresses. DNS addresses are also far from persistent (in fact, DNS records held anywhere have a "Time To Live" field which must be counted down until expiration, at which time the holder must acquire a new instance of the assignment of an IP address to a domain name).
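
The Time To Live behaviour can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how any real resolver is written: TtlCache is a hypothetical class, and the lookup function standing in for the authoritative source is supplied by the caller and is assumed to return an (address, ttl_seconds) pair.

```python
import time


class TtlCache:
    """Holds name -> address assignments only until their TTL runs out."""

    def __init__(self, lookup, clock=time.monotonic):
        self._lookup = lookup  # callable: name -> (address, ttl_seconds)
        self._clock = clock    # injectable so the behaviour can be tested
        self._entries = {}     # name -> (address, expires_at)

    def resolve(self, name):
        now = self._clock()
        entry = self._entries.get(name)
        if entry is not None and now < entry[1]:
            return entry[0]  # record still live: answer from the cache
        # Record missing or expired: acquire a new instance of the assignment.
        address, ttl = self._lookup(name)
        self._entries[name] = (address, now + ttl)
        return address
```

Nothing in the cache promises that the address returned after expiry is the one returned before it, which is the sense in which these records are not persistent.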

[More below, interspersed]

Richard Pyle wrote:

>>Perhaps it would be useful to look at the issues being discussed about
>>a bio identifier/locator/GUID in comparison to the same things that
>>are needed for Internet communications.
>
>
> I've long thought that parts of the DNS system would be extremely
> useful to emulate in some aspects of bioinformatics data management
> (particularly taxonomic names; see below).
>
>
>>IP addresses have to be unique world-wide to make the Internet work.
>>The Internet Corporation for Assigned Names and Numbers (ICANN -
>>www.icann.org) provides that uniqueness by assigning all the IP
>>numbers in unique blocks or ranges of numbers to "Internet
>>Registries".
>
>
> ...exactly the way that I envision an organization like GBIF would be
> charged with the task of issuing UIDs for certain biological objects.
>
>
>>There are Regional, National and Local Internet Registries that
>>subdivide and "license" IP addresses to ISPs, who in turn license IP
>>addresses to organizations.

Ah, this is accurate mainly for IPv6, which is much less chaotic than IPv4 (most of the current Internet), itself an improvement on the nearly formless void that was IPv2. But again, ultimate "licensees" do not get persistent IP addresses. In the US, virtually all dialup users get a different IP every time they connect, and most home broadband users only accidentally keep their IP addresses, and only if they don't disconnect from the network for very long.

It's well worth comparing the design goals of IPv6 as articulated in http://www.apnic.net/docs/policy/ipv6-address-policy.html

with those of LSID as articulated in Section 8 of the Draft Final Specification http://www.omg.org/cgi-bin/doc?dtc/04-05-01

>
> There could be a useful analog for this in bioinformatics
> (particularly in terms of individual institutions serving as regional
> registries for specimen UIDs, or IC_N Commissions serving as
> "regional" registries for taxon name UID assignment) -- but there
> doesn't necessarily have to be.
>

In fact, if the UUIDs are meant to be semantically opaque it matters not one whit who or how these matters are settled. Exceptions to that are social, not technical. ("If you don't let me decide X, I am not going to use your scheme". "OK, then you won't participate in its benefits. That's fine with me")

If you want to see another example of lack of semantic opacity, read the ISBN standard ISO/TC 46/SC 9 N 326. Part of certain ISBNs can help you determine an allegedly common publishing-germane attribute of the US, Zimbabwe, Puerto Rico, Ireland, Swaziland, part of Canada, and a few other "regions".

>
>>So, there is a hierarchy of how the "unique identifiers" are managed.
>>There is in fact a central authority, but it delegates to
>>decentralized authorities.

But this is mainly to distribute costs and speed issuance. It has nothing to do with the naming scheme. The number of organizations to be issued Bio GUIDs surely is several orders of magnitude less than those to be issued IPv6 addresses. So I doubt any IPv6 issuance mechanisms are instructive, at least in their purpose (and hence, if well implemented, in their implementation).

>
> To emulate this in bioinformatics, the "hierarchy" would be achieved
> simply by allowing block-assignment of UIDs to various players -- but
> the important point here is that only *one* organization ensures
> uniqueness (in the case of Internet, of ISPs). The data to which
> those UIDs apply would be, for the most part, the responsibility of
> the UID recipient, not the UID issuer (in my world view). Thus:
> centralized issuance; delegated application.
>
>
>>Is there an analogy for BioGUIDs to have a central body who divvies
>>out the unique numbers (like IP addresses) to decentralized bodies or
>>large organizations?

The International ISBN Organization http://www.isbn-international.org/
is roughly the IPv6 model.

>
> GBIF seems to me to be the principle contender.

I enthusiastically agree. Also the /principal/ contender. [Sorry, couldn't resist. My fingers slip on that one sometimes too.]

>
>
>>Since IP addresses are hard to memorize (and so too would be a
>>BioGUID), "domain names" are used. Starting with a domain name, you
>>can first find the name and/or IP address of a device, called the
>>Domain Name Server, that can locate the IP address of other
>>computers. This is a form of indirect addressing. ICANN also manages
>>the top-level namespace for the Internet. They decide what the valid
>>domain "extensions" are (like .com, .uk) so that everybody,
>>everywhere knows where to look them up. Then, the domain name
>>extensions are separated among the Regional, National, and Local
>>Internet Registries around the world. There is a scheme for where to
>>find the IP addresses for every domain extension (e.g. .com is on the
>>ARIN registry, .com.uk is on the ).

Not exactly. There is one scheme in case your application can't resolve it in a more nearly "local" facility. There are /lots/ of ways to find an IP address from a domain name. All those which comply fully with the DNS protocol, however, can make available two pieces of metadata: the TTL of the record it is offering, and the IP address of a machine at which you can find an authoritative record of the assignment of the DNS name to the IP address. This protocol /might/, but you hope on performance grounds usually /doesn't/, lead you up as far as the root servers, and the "one scheme to bind them all". If there is any lesson here at all, it is that name resolution protocols matter, but resolution implementations don't. Yet another attribute on which DNS/IP and LSID are not distinguishable.

>>Then there is a layer of Domain Registrars who have been accredited by
>>ICANN to assign domain names for the domain extensions - e.g.
>>tdwg.org. The domain name registrars are told by the owner of the
>>domain where to find their particular Domain Name Servers which may
>>be many to enable redundancy - Primary, Secondary, Tertiary, etc.

Not quite. Normally, the registrant tells the registrar who has agreed to be the servers. The other case sometimes happens with "retailers" who are selling individuals domain names and ISP services at the same time.

>>These redundant Domain Name Servers synchronize with each other at
>>particular times of day and may be located all around the world.

More often, only when the TTLs expire, there being no motivation to do otherwise.

>>They are the main "switchboard" for a particular organization's
>>computer names and associated IP addresses. Then the individual
>>organization can create multiple computers for the domain name - e.g.
>>www.tdwg.org - and add them to the Domain Name Server listing. There
>>can be many computers for a domain, for instance: info.tdwg.org,
>>www2.tdwg.org, myname.tdwg.org. Each of these can be a different
>>computer with a different IP address. The redundant Domain Name
>>Servers all contain the list of all these names and what IP addresses
>>they are.

Not usually. The primary and secondary name servers would normally only cache tdwg.org permanently. They might /acquire/ a record for www.tdwg.org in response to a request, but they would not in general renew it after it expired and maybe not even keep it that long. To do so would be hideously unscalable. If I put 10,000 machines in my domain, my primary and secondary would be mighty unhappy if they had to keep them all cached.

>
>
> This is analogous in many ways to how I would envision a global
> taxonomic name service. UIDs are assigned by a centralized body
> (e.g., GBIF; or by the IC_N Commissions) to individual names.
> Analogous to multiple redundant Domain Name Servers (DNS) would be
> Taxon Name Servers (TNS). Rather than being administered by one
> organization (e.g., GBIF, ITIS, Species 2000, uBio, etc.) these TNSs
> would be replicated on dozens or hundreds of servers all over the
> world, and maintained as synchronized within some reasonable time
> unit. Changes to any one replicate would be automatically propagated
> to all replicates (either chaotically, or more strictly through one
> or a few defined "hubs"). Instead of Domain names as surrogates for
> IP addresses, there would be fully qualified "Basionyms" (e.g.,
> "OriginalGenusName.OriginalSpeciesName.OriginalAuthor.DescriptionYear.Page.OtherOriginalCitationDetailsAsNeeded")
> representations of the less-human-friendly GUIDs (analogues to IP
> addresses). Ideally, this system wouldn't be limited to just
> taxonomic names, but extended to all taxonomic concepts, so that the
> "Domain Name" analogue would be extended to something like:
>
> "OriginalGenusName.OriginalSpeciesName.OriginalAuthor.DescriptionYear.Page.OtherOriginalCitationDetailsAsNeeded_AppliedGenusName.AppliedSpeciesSpelling.ConceptAuthor.ConceptYear.Page.OtherConceptCitationDetailsAsNeeded"
>
>
>>The players in the Internet networking fabric all now play by these
>>layered rules. They all know them and follow them in order to keep
>>the Internet running. This stuff happens out of sight to everyone but
>>the networking people and we all take it for granted and assume it is
>>simple. But, it's invisible not because it's simple, but rather
>>because it's disciplined.

Agreed. Yet one more attribute where IP/DNS and LSID are not distinguishable.

>
>
> Excellent synopsis, and (in my opinion) an excellent model to follow
> for at least taxonomic names/concepts data. Perhaps also for specimen
> data (but seems less intuitive for that.) This comes back to my
> earlier question about whether it is vital that all bioinformatics
> GUIDs be of the same scheme; or whether different schemes might be
> optimal for different classes of objects.
>
> Aloha,
> Rich
>
> Richard L. Pyle, PhD
> Natural Sciences Database Coordinator, Bishop Museum
> 1525 Bernice St., Honolulu, HI 96817
> Ph: (808)848-4115, Fax: (808)847-8252
> email: deepreef@bishopmuseum.org
> http://www.bishopmuseum.org/bishop/HBS/pylerichard.html