BioGUIDs and the Internet Analogy

Perhaps it would be useful to look at the issues being discussed about a bio identifier/locator/GUID in comparison to the same things that are needed for Internet communications. How do you find a web page in a directory on a server somewhere in the world? The solution used by our Internet forefathers was to create layers and have standards and standard-handling methods at each layer.

Connecting to a web page across the Internet involves multiple standards, standards bodies, handlers, forwarders, duplicators, etc. The entire chain of events works because there are systems and people all working their part of the network every day. It's not just a single authority running everything and it's not just every institution for itself. It's a coordinated combination approach.

This is going to be an overly simplistic description of the Internet (and probably inaccurate in some details) but I hope it conveys the analogy.

The most basic unique identifier on an Ethernet network is the MAC address, which is assigned to the NIC at the lowest layer. It is just numbers and letters. Only the computers and network hardware deal with this number.

Under the TCP/IP protocol for Ethernet communication, an IP address is assigned to the NIC/MAC address. Just numbers with no meaning. Sometimes humans actually use the IP address, but it is mostly used by the computers.

IP addresses have to be unique world-wide to make the Internet work. The Internet Corporation for Assigned Names and Numbers (ICANN, www.icann.org) provides that uniqueness by assigning all the IP numbers in unique blocks or ranges of numbers to "Internet Registries". There are Regional, National and Local Internet Registries that subdivide and "license" IP addresses to ISPs, who in turn license IP addresses to organizations. Some organizations are so big that they bypass any ISP. So, there is a hierarchy of how the "unique identifiers" are managed. There is in fact a central authority, but it delegates to decentralized authorities. Is there an analogy for BioGUIDs to have a central body who divvies out the unique numbers (like IP addresses) to decentralized bodies or large organizations?
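As a rough illustration of that delegation idea, here is a minimal Python sketch in which a central body hands out non-overlapping blocks of numbers to registries, which subdivide them further. The class name `Registry`, the block sizes, and the registry names are all hypothetical; this is not how ICANN or any real registry is implemented, only a way to see why delegated blocks stay unique without every assignment going through the center.

```python
class Registry:
    """Hypothetical authority that owns a contiguous block of identifier numbers."""

    def __init__(self, name, start, end):
        self.name = name
        self.start = start      # first number in this registry's block (inclusive)
        self.end = end          # last number in this registry's block (inclusive)
        self._next = start      # next number not yet assigned or delegated

    def delegate(self, child_name, size):
        """Carve a sub-block of `size` numbers out of this block for a child registry."""
        if self._next + size - 1 > self.end:
            raise ValueError(f"{self.name} has no room for a block of {size}")
        child = Registry(child_name, self._next, self._next + size - 1)
        self._next += size
        return child

    def assign(self):
        """Assign a single identifier from this block."""
        if self._next > self.end:
            raise ValueError(f"{self.name} is exhausted")
        number = self._next
        self._next += 1
        return number


# Central body -> regional registries -> institution (all names are made up).
central = Registry("BioICANN", 0, 999_999)
regional_a = central.delegate("RegionA", 100_000)
regional_b = central.delegate("RegionB", 100_000)
institution = regional_a.delegate("SomeHerbarium", 1_000)

first_id = institution.assign()   # drawn from the institution's own block
other_id = regional_b.assign()    # drawn from a different, non-overlapping block
print(first_id, other_id)
```

Because each child only ever draws from its own block, two registries can assign numbers independently and still never collide.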

Since IP addresses are hard to memorize (and so too would be a BioGUID), "domain names" are used. Starting with a domain name, you can first find the name and/or IP address of a device, called the Domain Name Server, that can locate the IP address of other computers. This is a form of indirect addressing. ICANN also manages the top-level namespace for the Internet. They decide what the valid domain "extensions" are (like .com, .uk) so that everybody, everywhere knows where to look them up. Then, the domain name extensions are separated among the Regional, National, and Local Internet Registries around the world. There is a scheme for where to find the IP addresses for every domain extension (e.g. .com is on the ARIN registry, .com.uk is on the ).

Then there is a layer of Domain Registrars who have been accredited by ICANN to assign domain names for the domain extensions - e.g. tdwg.org.

The domain name registrars are told by the owner of the domain where to find their particular Domain Name Servers, of which there may be several to enable redundancy - Primary, Secondary, Tertiary, etc. These redundant Domain Name Servers synchronize with each other at particular times of day and may be located all around the world. They are the main "switchboard" for a particular organization's computer names and associated IP addresses.

Then the individual organization can create multiple computers for the domain name - e.g. www.tdwg.org - and add them to the Domain Name Server listing. There can be many computers for a domain, for instance: info.tdwg.org, www2.tdwg.org, myname.tdwg.org. Each of these can be a different computer with a different IP address. The redundant Domain Name Servers all contain the list of all these names and the IP addresses they map to.

So, it all works through a series of layers, each connected to the other with indirect references. At the bottom there are the unique and cryptic IP and MAC numbers. In between them and the humans is the layering of names, and the methods for changing names. You can start with an IP address and find the domain it is in and the computer name assigned to it. Or, you can start with a name like www.tdwg.org and find its IP address.
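The two directions of lookup, and the way a name insulates its users from a change of number, can be shown with a toy name table. This is only a sketch of the idea of indirection, not of how DNS is implemented; the host names are the examples used in this message, and the addresses are placeholders from the 192.0.2.0/24 documentation range, not real assignments.

```python
# Toy name table mapping names to addresses (placeholder addresses only).
forward = {
    "www.tdwg.org": "192.0.2.10",
    "info.tdwg.org": "192.0.2.11",
    "www2.tdwg.org": "192.0.2.12",
}

# Reverse table built from the forward one (assumes one name per address).
reverse = {address: name for name, address in forward.items()}


def resolve(name):
    """Name -> address, or None if the name is unknown."""
    return forward.get(name)


def reverse_lookup(address):
    """Address -> name, or None if the address is unknown."""
    return reverse.get(address)


def renumber(name, new_address):
    """Move a name to a new address; anyone using the name is unaffected."""
    old = forward.get(name)
    if old is not None:
        del reverse[old]
    forward[name] = new_address
    reverse[new_address] = name


print(resolve("www.tdwg.org"))
print(reverse_lookup("192.0.2.11"))
```

A BioGUID scheme built the same way would let the human-facing name stay stable while the underlying identifier or location changes behind it.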

The players in the Internet networking fabric all now play by these layered rules. They all know them and follow them in order to keep the Internet running. This stuff happens out of sight to everyone but the networking people and we all take it for granted and assume it is simple. But, it's invisible not because it's simple, but rather because it's disciplined. And a lot of hardware devices have been constructed to follow and enforce the rules.

In our discussions of how a BioGUID would be implemented - how assigned, how managed, how identified and located, how made resilient to failure - we need to be mindful that there is probably not going to be a simple this-way-or-that-way solution. It probably needs to be organized into layers of abstraction, but it will also need to be disciplined.

I think we will need something like BioICANN with BioDomainRegistries, BioDomainExtensions, BioDomains and BioDNSs that provide the access paths to the BioGUIDs.

Chuck Miller
CIO
Missouri Botanical Garden

-----Original Message-----
From: Wouter Addink [mailto:wouter@ETI.UVA.NL]
Sent: Monday, September 27, 2004 5:42 AM
To: TDWG-SDD@LISTSERV.NHM.KU.EDU
Subject: Re: Globally Unique Identifier & Donald Hobern's PPT

Hi,
I found the PowerPoint document from Donald really helpful on some of the discussed issues. Unfortunately referring to it is a little difficult because the pages do not contain a unique identifier :)

But first some real-world experience with the 'GUID' used today in the GBIF specimen network. The combination InstitutionCode, CollectionCode and CatalogNumber was chosen for that. Problems experienced last year:
-The 'GUID combination' is not enforced and therefore not always used.
-Some collections belong to 2 or more Institutions or to none.
-If part of the collection moves to another institute, the GUID combination is changed for that part.
-The InstitutionCode should be unique, and providers were asking what to do if the code they wanted to use was already chosen, and who decides which institute may use an InstitutionCode if two institutes want to use it. There is no body responsible for that and there are no rules: can the first Institute claim a code, or the biggest, or the most well known?
-In different science areas, different InstitutionCodes within one Organisation were in use; which one to choose?
-This 'GUID' can only be used for specimens, not for other life science objects.
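The problem of a collection moving between institutes can be made concrete with a small Python sketch. The field values are invented for illustration; only the three field names come from the combination described above.

```python
from collections import namedtuple

# The three-part key described above; the values used below are invented.
SpecimenKey = namedtuple(
    "SpecimenKey", ["institution_code", "collection_code", "catalog_number"]
)

before_move = SpecimenKey("INST1", "HERB", "12345")

# The same physical specimen after part of the collection moves to another institute.
after_move = before_move._replace(institution_code="INST2")

# A reference stored under the old key no longer matches the specimen's new key.
citations = {before_move: "cited in some publication"}
print(before_move == after_move)
print(after_move in citations)
```

The object has not changed, but because ownership is encoded in the key, every existing reference to it is now stale.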

Now let's look at LSID syntax:
urn:lsid:authority:namespace:object_identifier (:revision_number)
About the first part; authority: It is natural to want this to be unique. Therefore we can expect the same problems as mentioned above, plus a lack of clarity about the difference between issuing_authority and current_authority for the data. The problems with authority are important for the involved authorities only, not for the rest of the life science community. So discussions about it and establishing an authority that takes decisions in political conflicts are a waste of time. We can solve it by using a unique number only and maintaining a list that gives information about each number. It should be clear that these are only the initial issuing authority/authorities.
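To make the parts of that syntax concrete, here is a minimal Python sketch that splits an LSID string on its colons. It follows only the five-or-six-part layout quoted above and treats the `urn:lsid` prefix as case-insensitive, which is an assumption here; it does not attempt to enforce the full LSID specification, and the example identifier is invented.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text):
    """Split an LSID string into its parts, following the syntax quoted above."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"expected 5 or 6 colon-separated parts, got {len(parts)}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError("an LSID must start with 'urn:lsid:'")
    if any(part == "" for part in parts[2:5]):
        raise ValueError("authority, namespace and object identifier must be non-empty")
    # The revision is optional; an empty trailing part is treated as absent.
    revision = parts[5] if len(parts) == 6 and parts[5] != "" else None
    return LSID(parts[2], parts[3], parts[4], revision)


# Example value is invented for illustration.
example = parse_lsid("urn:lsid:example.org:specimen:12345:2")
print(example)
```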

About the second part; namespace:
Things like 'Specimen' or 'Experiment'. In contrast with the first part, problems with this part are interesting for the whole life science community because applications will want to use this to decide whether the data can be used for a specific application. Standardisation of namespaces is necessary. I think it should be divided in two parts (not currently present in LSID) like a MIME type (image/jpeg etc.): 'observation/abcd' or 'specimen/darwincore' for example. If we look at Donald's PPT we see that in model 1 (LSID assigned centrally) the namespace is chosen centrally, and in model 2 (LSID assigned by each provider) the provider is free to choose one. Even language variations in naming a namespace can already give problems, so this is why I strongly favor a central mechanism here for assigning LSIDs, unless the provider is somehow forced to use a certain namespace class. The potential bottleneck problem is not really an issue I think (see also DNS mechanisms). If we choose a central mechanism the issuing authority will always be GBIF (or do we need different authorities for different parts of Life Science?) so there are no problems with that in this case either.
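A small Python sketch of how an application might use such a two-part namespace to decide whether it can handle a record. The 'class/schema' form is the proposal made above and is not part of LSID; the function names and the set of supported combinations are hypothetical.

```python
def split_namespace(namespace):
    """Split 'class/schema' into its two parts, by analogy with MIME types."""
    object_class, sep, schema = namespace.partition("/")
    if not sep or not object_class or not schema:
        raise ValueError(f"expected 'class/schema', got {namespace!r}")
    # Lower-casing avoids trivial spelling variants such as 'Specimen' vs 'specimen'.
    return object_class.lower(), schema.lower()


def can_handle(namespace, supported):
    """True if the application supports this (class, schema) combination."""
    return split_namespace(namespace) in supported


# Hypothetical application that only understands specimen records in Darwin Core.
supported = {("specimen", "darwincore")}
print(can_handle("specimen/darwincore", supported))
print(can_handle("observation/abcd", supported))
```

This only works if the class and schema names come from an agreed list, which is the argument for choosing them centrally.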

The third part; object id: no problems there.

The last part; revision id: whether you need it depends: do you give the physical objects a GUID or the data records? With the first choice you do not need a revision number because the physical object will not change (or do they with living collections?). At first I thought that a GUID should be put on the physical object: if you are looking for data, you are looking for data about a certain physical object, and the source of the data is not (very) important. The same data elements in different sources about the same object should be equal, else there are errors. Donald's PPT gives the example that someone wants to refer to an LSID in a publication as a source. In that case you want to refer to a data source with a certain version. Then you need to give a GUID to the data and you also need revisions. Data is not persistent, it changes all the time. Giving a persistent identifier to it is very difficult and not many data systems have full revision support. If a GUID for a 'physical object' is chosen, a thing like a species name or author name or country should not get a GUID. These are more a kind of attribute: most data will use one or more species names as 'metadata'. There needs to be a central datasource for each of these 'metadata', like a NameBank for species names (with its own ID). I am not sure whether LSID was designed for a GUID to data or to physical objects. The use of namespace and object id instead of databasename and recordid seems to indicate that it was designed for physical objects, but why then the optional revision id? Instead of a revision id you can also assign a new GUID with every change, but then how do you point to a new version from an old version of the data (if you have the GUID of the old version, how do you get the GUID of the new one)?
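One possible answer to that last question is to keep a record of which identifier supersedes which, and follow the links forward. The Python sketch below assumes such a successor table exists somewhere (centrally or at the provider); the table, the function names, and the identifiers are all hypothetical.

```python
# Hypothetical table: each superseded identifier points to its successor.
superseded_by = {}


def publish_revision(old_id, new_id):
    """Record that new_id replaces old_id."""
    superseded_by[old_id] = new_id


def latest(identifier):
    """Follow successor links until reaching an identifier with no successor."""
    seen = set()
    while identifier in superseded_by:
        if identifier in seen:
            raise ValueError("cycle in revision links")
        seen.add(identifier)
        identifier = superseded_by[identifier]
    return identifier


# Invented identifiers: guid-1 was revised to guid-2, then to guid-3.
publish_revision("guid-1", "guid-2")
publish_revision("guid-2", "guid-3")
print(latest("guid-1"))
```

The old identifier stays valid for citing the exact version that was used, while a reader can still reach the current data.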

Requirements in Donald's PPT:
-If a GUID is on a physical object, the GUID need not refer uniquely to a single data element; it must only be unique itself. It is also not a requirement in the LSID specification. There will be overlap between the objects, so an object can belong to more than one ID. For instance a researcher can have his own ID and also belong to the ID of the Institute he is working for. The data for overlapping elements like researcher name must be equal.
-I would restrict the identifiers to life science objects.

Issues to be resolved in Donald's PPT:
It would be beneficial to maintain the GUID in the datasource itself (at least for the owner of the datasource), but not absolutely necessary. I see GUID in data records as a 'tightly coupled' model (which requires some work for existing databases). I can imagine also a 'loosely coupled' model where provider software is modified to get the identifier from a central server (or mirror).
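A minimal Python sketch of what the 'loosely coupled' model might look like: the provider's database is left untouched, and the provider software asks a central service for the identifier when it serves a record. The class below is an in-memory stand-in for such a server, and the use of random UUIDs as identifiers is an assumption made only for illustration.

```python
import uuid


class CentralIdService:
    """In-memory stand-in for a central identifier server (hypothetical)."""

    def __init__(self):
        self._by_local_key = {}

    def get_or_create(self, provider, local_record_id):
        """Return the identifier for a provider's record, issuing one on first request."""
        key = (provider, local_record_id)
        if key not in self._by_local_key:
            self._by_local_key[key] = str(uuid.uuid4())
        return self._by_local_key[key]


def serve_record(service, provider, record):
    """Provider-side wrapper: attach the identifier without altering the stored record."""
    guid = service.get_or_create(provider, record["id"])
    return {**record, "guid": guid}


# Invented provider name and record.
service = CentralIdService()
stored = {"id": 42, "name": "example specimen"}
first = serve_record(service, "provider-a", stored)
second = serve_record(service, "provider-a", stored)
print(first["guid"] == second["guid"])
```

The trade-off is that the provider depends on the central server (or a mirror) being reachable, whereas the 'tightly coupled' model keeps the identifier with the data.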

Wouter Addink