BioGUIDs and the Internet Analogy

Mon Sep 27 10:52:45 CEST 2004

Perhaps it would be useful to look at the issues being discussed about a bio
identifier/locator/GUID in comparison to the same things that are needed for
Internet communications. How do you find a web page in a directory on a
server somewhere in the world? The solution used by our Internet forefathers
was to create layers and have standards and standard-handling methods at
each layer.

Connecting to a web page across the Internet involves multiple standards,
standards bodies, handlers, forwarders, duplicators, etc.  The entire chain
of events works because there are systems and people all working their part
of the network every day.  It's not just a single authority running
everything and it's not just every institution for itself.  It's a
coordinated combination approach.

This is going to be an overly simplistic description of the Internet (and
probably inaccurate in some details) but I hope it conveys the analogy.

The most basic unique identifier on an Ethernet network is the MAC address,
which is assigned to the NIC at the lowest layer. Just numbers and letters.
Only the computers and network hardware deal with this number.

Under the TCP/IP protocol for Ethernet communication, an IP address is
assigned to the NIC/MAC address. Just numbers with no meaning.  Sometimes
humans actually use the IP address, but it is mostly used by the computers.

IP addresses have to be unique world-wide to make the Internet work. The
Internet Corporation for Assigned Names and Numbers (ICANN, www.icann.org)
provides that uniqueness by assigning all the IP numbers in unique blocks or
ranges of numbers to "Internet Registries".  There are Regional, National
and Local Internet Registries that subdivide and "license" IP addresses to
ISPs, who in turn license IP addresses to organizations. Some organizations
are so big that they bypass any ISP.  So, there is a hierarchy of how the
"unique identifiers" are managed.  There is in fact a central authority, but
it delegates to decentralized authorities. Is there an analogy for BioGUIDs
to have a central body that divvies out the unique numbers (like IP
addresses) to decentralized bodies or large organizations?
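
To make that question concrete, here is a minimal sketch of a central body delegating non-overlapping blocks of numbers to registries, which subdivide them further. The `Allocator` class, the block sizes and the registry names are all hypothetical and only illustrate the idea; none of them come from an existing BioGUID or ICANN scheme.

```python
class Allocator:
    """Owns a half-open range [start, end) of identifier numbers."""

    def __init__(self, name, start, end):
        self.name = name
        self.start = start
        self.end = end
        self._next = start  # first number not yet delegated or issued

    def delegate(self, name, size):
        """Carve a non-overlapping sub-block out of this range for a child."""
        if self._next + size > self.end:
            raise ValueError(f"{self.name} has no room for {size} more ids")
        child = Allocator(name, self._next, self._next + size)
        self._next += size
        return child

    def issue(self):
        """Issue a single identifier from this block."""
        if self._next >= self.end:
            raise ValueError(f"{self.name} is exhausted")
        value = self._next
        self._next += 1
        return value


# Hypothetical hierarchy: central body -> regional registries -> institute.
central = Allocator("central-body", 0, 1_000_000)
registry_a = central.delegate("regional-registry-a", 100_000)
registry_b = central.delegate("regional-registry-b", 100_000)
institute = registry_a.delegate("institute-x", 1_000)

first_id = institute.issue()
second_id = institute.issue()
other_id = registry_b.issue()
```

Because every block is carved out of its parent's range, two bodies can never issue the same number, and no one has to ask the central body for each individual identifier.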

Since IP addresses are hard to memorize (and so too would be a BioGUID),
"domain names" are used. Starting with a domain name, you can first find the
name and/or IP address of a device, called the Domain Name Server, that can
locate the IP address of other computers.  This is a form of indirect
addressing.  ICANN also manages the top-level namespace for the Internet.
They decide what the valid domain "extensions" are (like .com, .uk) so that
everybody, everywhere knows where to look them up.  Then, the domain name
extensions are separated among the Regional, National, and Local Internet
Registries around the world.  There is a scheme for where to find the IP
addresses for every domain extension (e.g. .com is on the ARIN registry,
.com.uk is on the ).

Then there is a layer of Domain Registrars who have been accredited by ICANN
to assign domain names for the domain extensions - e.g. tdwg.org.

The owner of a domain tells the domain name registrar where to find the
domain's particular Domain Name Servers, of which there may be several to
provide redundancy - Primary, Secondary, Tertiary, etc.  These redundant
Domain Name Servers synchronize with each other at particular times of day
and may be located all around the world.  They are the main "switchboard"
for a particular organization's computer names and associated IP addresses.

Then the individual organization can create multiple computers for the
domain name - e.g. www.tdwg.org - and add them to the Domain Name Server
listing.  There can be many computers for a domain, for instance:
info.tdwg.org, www2.tdwg.org, myname.tdwg.org.  Each of these can be a
different computer with a different IP address.  The redundant Domain Name
Servers all contain the list of all these names and the IP addresses they
map to.

So, it all works through a series of layers, each connected to the other
with indirect references.  At the bottom there are the unique and cryptic IP
and MAC numbers.  Between them and the humans is the layering of names,
along with the methods for changing names.  You can start with an IP address
and find the domain it is in and the computer name assigned to it.  Or, you
can start with a name like www.tdwg.org and find its IP address.
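
The layered, indirect lookup can be sketched with a few in-memory tables. This is a toy model only: the table names, the `resolve` and `reverse` helpers and the addresses (taken from the 192.0.2.0/24 documentation range) are assumptions for illustration, and real DNS handles reverse lookups differently.

```python
# Hypothetical data standing in for the three layers described above.
ROOT = {"org": "org-registry"}                               # extension -> registry
REGISTRIES = {"org-registry": {"tdwg.org": "ns.tdwg.org"}}   # domain -> name server
NAME_SERVERS = {
    "ns.tdwg.org": {                                         # host name -> IP address
        "www.tdwg.org": "192.0.2.10",
        "info.tdwg.org": "192.0.2.11",
    }
}


def resolve(host):
    """Follow the indirections: extension -> registry -> name server -> IP."""
    labels = host.split(".")
    extension = labels[-1]
    domain = ".".join(labels[-2:])
    registry = ROOT[extension]
    name_server = REGISTRIES[registry][domain]
    return NAME_SERVERS[name_server][host]


def reverse(ip):
    """Start from an IP address and find the host name assigned to it."""
    for table in NAME_SERVERS.values():
        for host, address in table.items():
            if address == ip:
                return host
    return None


forward_result = resolve("www.tdwg.org")
reverse_result = reverse("192.0.2.11")
```

Each table can be maintained by a different party, which is the point of the analogy: changing an IP address only touches the lowest table, and the names people use stay the same.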

The players in the Internet networking fabric all now play by these layered
rules.  They all know them and follow them in order to keep the Internet
running.  This stuff happens out of sight to everyone but the networking
people and we all take it for granted and assume it is simple.  But, it's
invisible not because it's simple, but rather because it's disciplined.  And
a lot of hardware devices have been constructed to follow and enforce the
rules.

In our discussions of how a BioGUID would be implemented - how assigned, how
managed, how identified and located, how made resilient to failure - we need
to be mindful that there is probably not going to be a simple
this-way-or-that-way solution.  It probably needs to be organized into
layers of abstraction, but it will also need to be disciplined.

I think we will need something like BioICANN with BioDomainRegistries,
BioDomainExtensions, BioDomains and BioDNSs that provide the access paths to
the BioGUIDs.

Chuck Miller
CIO
Missouri Botanical Garden

-----Original Message-----
From: Wouter Addink [mailto:wouter at ETI.UVA.NL]
Sent: Monday, September 27, 2004 5:42 AM
To: TDWG-SDD at LISTSERV.NHM.KU.EDU
Subject: Re: Globally Unique Identifier & Donald Hobern's PPT

Hi,
I found the PowerPoint document from Donald really helpful on some of the
discussed issues. Unfortunately referring to it is a little difficult
because the pages do not contain a unique identifier :)

But first some real-world experience with the 'GUID' used today in the GBIF
specimen network. The combination of InstitutionCode, CollectionCode and
CatalogNumber was chosen for that. Problems experienced last year:
- The 'GUID combination' is not enforced and therefore not always used.
- Some collections belong to two or more institutions, or to none.
- If part of a collection moves to another institute, the GUID combination
  changes for that part.
- The InstitutionCode should be unique, and providers were asking what to do
  if the code they wanted to use was already taken, and who decides which
  institute may use an InstitutionCode if two institutes want to use it.
  There is no body responsible for that and there are no rules: can the
  first institute claim a code, or the biggest, or the most well known?
- In different science areas, different InstitutionCodes were in use within
  one organisation; which one should be chosen?
- This 'GUID' can only be used for specimens, not for other life science
  objects.
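
Two of these problems are easy to show in a few lines. The sketch below assumes the three fields are simply joined into one string; the field values, the `as_identifier` and `claim` helpers and the codes `INST-A`, `INST-B` and `XYZ` are made up for illustration and are not GBIF code or data.

```python
from collections import namedtuple

SpecimenKey = namedtuple(
    "SpecimenKey", ["institution_code", "collection_code", "catalog_number"]
)


def as_identifier(key):
    """Join the three parts into one string, the way a composite 'GUID' would."""
    return ":".join(key)


claimed_codes = {}  # institution code -> institute that claimed it first


def claim(code, institute):
    """First come, first served: return whoever holds the code after the call."""
    return claimed_codes.setdefault(code, institute)


original = SpecimenKey("INST-A", "Herbarium", "12345")
moved = original._replace(institution_code="INST-B")  # collection changes hands

holder_1 = claim("XYZ", "first institute")
holder_2 = claim("XYZ", "second institute")  # loses: the code is already taken
```

The same physical specimen ends up with two different identifiers after the move, and the second institute has no way to obtain the code it wanted unless someone sets rules for resolving the conflict.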

Now let's look at LSID syntax:
urn:lsid:authority:namespace:object_identifier (:revision_number)

About the first part; authority: It is natural to want this to be unique.
Therefore we can expect the same problems as mentioned above, plus
uncertainty about the difference between issuing_authority and
current_authority for the data. The problems with authority are important
for the involved authorities only, not for the rest of the life science
community. So discussions about it, and establishing an authority that takes
decisions in political conflicts, are a waste of time. We can solve it by
using a unique number only and maintaining a list that gives information
about each number. It should be clear that these are only the initial
issuing authority/authorities.
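
As an illustration of the syntax, here is a minimal parser sketch. It only splits on colons and checks the `urn:lsid` prefix; it does not implement the full LSID specification. The numeric authority `1001` and the `AUTHORITY_LIST` lookup table are hypothetical, showing the "unique number plus a list" idea suggested above.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None  # the optional fourth part


# Hypothetical list giving information about each authority number.
AUTHORITY_LIST = {"1001": "initial issuing authority (example entry)"}


def parse_lsid(text):
    """Split 'urn:lsid:authority:namespace:object[:revision]' into its parts."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    if any(part == "" for part in parts[2:]):
        raise ValueError(f"empty component in LSID: {text!r}")
    return LSID(*parts[2:])


plain = parse_lsid("urn:lsid:1001:specimen:12345")
revised = parse_lsid("urn:lsid:1001:specimen:12345:2")
issuer_info = AUTHORITY_LIST[plain.authority]
```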

About the second part; namespace:
Things like 'Specimen' or 'Experiment'. In contrast with the first part,
problems with this part are interesting for the whole life science community
because applications will want to use this to decide whether the data can be
used for a specific application. Standardisation of namespaces is necessary.
I think it should be divided into two parts (not currently present in LSID)
like a MIME type (image/jpeg etc.): 'observation/abcd' or
'specimen/darwincore' for example. If we look at Donald's PPT we see that in
model 1 (LSID assigned centrally) the namespace is chosen centrally, and in
model 2 (LSID assigned by each provider) the provider is free to choose one.
Even language variations in naming a namespace can already cause problems,
so this is why I strongly favor a central mechanism here for assigning
LSIDs, unless the provider is somehow forced to use a certain namespace
class. The potential bottleneck problem is not really an issue I think (see
also DNS mechanisms). If we choose a central mechanism the issuing authority
will always be GBIF (or do we need different authorities for different parts
of life science?), so there are no problems with authority in this case
either.
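
A short sketch of the two-part namespace idea, assuming a centrally maintained list of allowed classes and schemas. The `REGISTERED` table and both helper functions are hypothetical, and whether a '/' would be legal inside an LSID namespace component is not checked here.

```python
# Hypothetical central registry: object class -> data schemas allowed for it.
REGISTERED = {
    "specimen": {"darwincore", "abcd"},
    "observation": {"abcd"},
}


def split_namespace(namespace):
    """Split 'class/schema' the way a MIME type splits into type/subtype."""
    object_class, separator, schema = namespace.partition("/")
    if not separator or not object_class or not schema:
        raise ValueError(f"expected 'class/schema', got {namespace!r}")
    return object_class.lower(), schema.lower()


def is_registered(namespace):
    """True only if both parts appear in the central registry."""
    object_class, schema = split_namespace(namespace)
    return schema in REGISTERED.get(object_class, set())


parts = split_namespace("specimen/darwincore")
standard_ok = is_registered("Specimen/DarwinCore")
variant_ok = is_registered("exemplaar/darwincore")  # a language variation
```

An application can decide from the first part whether it handles that kind of object at all, and from the second part whether it can read the data format. The last line shows how a provider's own-language name for the same class would simply fail to match.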

The third part; object id: no problems there.

The last part; revision id: whether you need it depends: do you give the
GUID to the physical objects or to the data records? With the first choice
you do not need a revision number because the physical object will not
change (or does it, with living collections?). At first I thought that a
GUID should be put on the physical object: if you are looking for data, you
are looking for data about a certain physical object, and the source of the
data is not (very) important. The same data elements in different sources
about the same object should be equal, else there are errors. Donald's PPT
gives the example that someone wants to refer to an LSID in a publication as
a source. In that case you want to refer to a data source with a certain
version. Then you need to give a GUID to the data and you also need
revisions. Data is not persistent, it changes all the time. Giving a
persistent identifier to it is very difficult and not many data systems have
full revision support. If a GUID for a 'physical object' is chosen, a thing
like a species name or author name or country should not get a GUID. These
are more a kind of attribute: most data will use one or more species names
as 'metadata'. There needs to be a central data source for each of these
'metadata', like a NameBank for species names (with its own ID). I am not
sure whether LSID was designed for a GUID to data or to physical objects.
The use of namespace and object id instead of database name and record id
seems to indicate that it was designed for physical objects, but why then
the optional revision id? Instead of a revision id you can also assign a new
GUID with every change, but then how do you point to a new version from an
old version of the data (if you have the GUID of the old version, how do you
get the GUID of the new one)?
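
One possible answer to that last question, sketched under the assumption that somebody keeps a table of which GUID replaced which. The `successors` table, the helper names and the `guid-1` style identifiers are all hypothetical.

```python
successors = {}  # old GUID -> GUID that replaced it


def record_revision(old_guid, new_guid):
    """Remember that new_guid supersedes old_guid."""
    successors[old_guid] = new_guid


def latest(guid):
    """Follow the chain of replacements until the newest GUID is reached."""
    seen = {guid}
    while guid in successors:
        guid = successors[guid]
        if guid in seen:
            raise ValueError("revision chain contains a cycle")
        seen.add(guid)
    return guid


record_revision("guid-1", "guid-2")
record_revision("guid-2", "guid-3")
newest = latest("guid-1")
```

This moves the problem rather than removing it: the table itself has to live somewhere that stays reachable, which brings back the question of who maintains it.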

Requirements in Donald's PPT:
- If a GUID is on a physical object, the GUID need not refer uniquely to a
  single data element; it must only be unique itself. Unique reference is
  also not a requirement in the LSID specification. There will be overlap
  between the objects, so an object can belong to more than one ID. For
  instance a researcher can have his own ID and also belong to the ID of the
  institute he is working for. The data for overlapping elements like
  researcher name must be equal.
- I would restrict the identifiers to life science objects.

Issues to be resolved in Donald's PPT:
It would be beneficial to maintain the GUID in the data source itself (at
least for the owner of the data source), but not absolutely necessary. I see
a GUID stored in the data records as a 'tightly coupled' model (which
requires some work for existing databases). I can also imagine a 'loosely
coupled' model where provider software is modified to get the identifier
from a central server (or mirror).
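
A rough sketch of the 'loosely coupled' model: the provider's database is left untouched and the provider software asks a central service for the identifier that belongs to a local record. The `CentralIdService` class, its method name and the `bioguid-N` format are hypothetical, and an in-memory dictionary stands in for the central server.

```python
import itertools


class CentralIdService:
    """Stand-in for a central server (or mirror) that hands out identifiers."""

    def __init__(self):
        self._counter = itertools.count(1)
        self._by_local_key = {}  # (provider, local record id) -> identifier

    def identifier_for(self, provider, local_record_id):
        """Return the identifier for a record, issuing one on first request."""
        key = (provider, local_record_id)
        if key not in self._by_local_key:
            self._by_local_key[key] = f"bioguid-{next(self._counter)}"
        return self._by_local_key[key]


service = CentralIdService()
first = service.identifier_for("provider-a", "rec-1")
second = service.identifier_for("provider-a", "rec-2")
repeat = service.identifier_for("provider-a", "rec-1")  # same record again
```

The provider never stores the identifier, so existing databases need no schema change, at the price of depending on the central service for every lookup.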

Wouter Addink
