Re: Globally Unique Identifier & Donald Hobern's PPT

27 Sep 2004

Hi,
I found the PowerPoint document from Donald really helpful on some of the
discussed issues. Unfortunately, referring to it is a little difficult
because the pages do not contain a unique identifier :)

But first some real-world experience with the 'GUID' used today in the GBIF
specimen network.
The combination of InstitutionCode, CollectionCode and CatalogNumber was chosen
for that. Problems experienced last year:
-The 'GUID combination' is not enforced and therefore not always used.
-Some collections belong to two or more institutions, or to none.
-If part of a collection moves to another institute, the GUID combination
changes for that part.
-The InstitutionCode should be unique, and providers were asking what to do
if the code they wanted to use was already taken, and who decides which
institute may use an InstitutionCode if two institutes want it. There
is no body responsible for that and there are no rules: can the first
institute claim a code, or the biggest, or the most well known?
-In different science areas, different InstitutionCodes were in use within
one organisation, so which one should be chosen?
-This 'GUID' can only be used for specimens, not for other life science
objects.
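
To make the third problem concrete, here is a small Python sketch. It is not
GBIF code: the institution and collection codes are made up, and joining the
three fields with colons is only my assumption for the example.

```python
from typing import NamedTuple


class SpecimenKey(NamedTuple):
    """The three fields used as a de facto GUID (field names as in the text)."""
    institution_code: str
    collection_code: str
    catalog_number: str

    def as_guid(self) -> str:
        # The colon-joined form is an assumption made for this example only.
        return ":".join(self)


# Hypothetical specimen, before and after its collection moves to another institute.
before = SpecimenKey("INST-A", "BIRDS", "12345")
after = before._replace(institution_code="INST-B")

print(before.as_guid())  # INST-A:BIRDS:12345
print(after.as_guid())   # INST-B:BIRDS:12345
print(before.as_guid() == after.as_guid())  # False: same specimen, different 'GUID'
```

Anything that stored the first value as a reference no longer finds the
specimen after the move.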

Now let's look at the LSID syntax:
urn:lsid:authority:namespace:object_identifier (:revision_number)
About the first part; authority:
It is natural to want this to be unique. Therefore we can expect the same
problems as mentioned above, plus a lack of clarity about the difference
between issuing_authority and current_authority for the data.
The problems with authority are important for the involved authorities only,
not for the rest of the life science community. So discussions about it and
establishing an authority that takes decisions in political conflicts are a
waste of time.
We can solve it by using a unique number only and maintaining a list that
gives information about each number. It should be clear that these are only
the initial issuing authority/authorities.
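
As a rough sketch of what that could look like: the Python below splits an
identifier along the syntax line above and looks the authority number up in a
list. The number 1001, the registry entry and the function name parse_lsid
are all invented for illustration; this is not an existing LSID library.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text: str) -> Lsid:
    """Split 'urn:lsid:authority:namespace:object_identifier[:revision]'."""
    parts = text.split(":")
    if len(parts) not in (5, 6) or [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
        raise ValueError(f"not an LSID: {text!r}")
    if any(p == "" for p in parts[2:]):
        raise ValueError(f"empty component in LSID: {text!r}")
    return Lsid(*parts[2:])


# Hypothetical registry: an opaque number maps to information about the
# initial issuing authority. The entry is invented for illustration.
AUTHORITY_REGISTRY = {
    "1001": "Example Museum (initial issuing authority)",
}

lsid = parse_lsid("urn:lsid:1001:specimen:12345")
print(lsid.namespace)                      # specimen
print(AUTHORITY_REGISTRY[lsid.authority])  # Example Museum (initial issuing authority)
```

Because the number carries no meaning, nobody has a reason to fight over it;
the list can be corrected without touching any identifier.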

About the second part; namespace:
Things like 'Specimen' or 'Experiment'. In contrast with the first part,
problems with this part are of interest to the whole life science community,
because applications will want to use it to decide whether the data can be
used for a specific application. Standardisation of namespaces is necessary.
I think it should be divided into two parts (not currently present in LSID),
like a MIME type (image/jpeg etc.): 'observation/abcd' or
'specimen/darwincore', for example.
If we look at Donald's PPT we see that in model 1 (LSID assigned centrally)
the namespace is chosen centrally, and in model 2 (LSID assigned by each
provider) the provider is free to choose one. Even language variations in
naming a namespace can already cause problems, which is why I strongly
favor a central mechanism here for assigning LSIDs, unless the provider is
somehow forced to use a certain namespace class. The potential bottleneck
problem is not really an issue, I think (see also DNS mechanisms). If we
choose a central mechanism, the issuing authority will always be GBIF (or do
we need different authorities for different parts of life science?), so there
are no problems with the authority in this case either.
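
A minimal sketch of the two-part namespace idea, assuming a centrally agreed
vocabulary. The table holds only the two examples I gave above, and the
function name split_namespace is hypothetical.

```python
from typing import Tuple

# Hypothetical controlled vocabulary; only the two examples from the text are listed.
ALLOWED_NAMESPACES = {
    "observation": {"abcd"},
    "specimen": {"darwincore"},
}


def split_namespace(namespace: str) -> Tuple[str, str]:
    """Split 'class/schema' and reject anything outside the agreed vocabulary."""
    object_class, sep, schema = namespace.partition("/")
    if not sep or schema not in ALLOWED_NAMESPACES.get(object_class, set()):
        raise ValueError(f"unknown namespace: {namespace!r}")
    return object_class, schema


print(split_namespace("specimen/darwincore"))  # ('specimen', 'darwincore')

# A made-up language variation such as 'exemplaar/darwincore' raises ValueError,
# which is the kind of check a provider-chosen namespace would not get.
```

An application can then decide on the first part whether the object is of
interest at all, and on the second part whether it can read the data.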

The third part; object id: no problems there.

The last part; revision id: whether you need it depends: do you give a GUID
to the physical objects or to the data records? With the first choice you do
not need a revision number because the physical object will not change (or
does it, with living collections?).
At first I thought that a GUID should be put on the physical object: if you
are looking for data, you are looking for data about a certain physical
object, and the source of the data is not (very) important. The same data
elements in different sources about the same object should be equal,
otherwise there are errors. Donald's PPT gives the example of someone who
wants to refer to an LSID in a publication as a source. In that case you want
to refer to a data source at a certain version. Then you need to give a GUID
to the data, and you also need revisions. Data is not persistent; it changes
all the time. Giving a persistent identifier to it is very difficult, and not
many data systems have full revision support. If a GUID for a 'physical
object' is chosen, a thing like a species name, author name or country should
not get a GUID. These are more a kind of attribute: most data will use one or
more species names as 'metadata'. There needs to be a central datasource for
each of these 'metadata', like a NameBank for species names (with its own
ID). I am not sure whether LSID was designed as a GUID for data or for
physical objects. The use of namespace and object id instead of database name
and record id seems to indicate that it was designed for physical objects,
but why then the optional revision id? Instead of a revision id you can also
assign a new GUID with every change, but then how do you point to a new
version from an old version of the data (if you have the GUID of the old
version, how do you get the GUID of the new one)?
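
One possible answer to that last question, as a sketch only: somebody keeps a
table that records which GUID replaced which. The table, the function names
and the GUID strings below are all hypothetical; where such a table would
live (central server or provider) is left open here.

```python
from typing import Dict

# Hypothetical lookup table: old GUID -> GUID of the version that replaced it.
superseded_by: Dict[str, str] = {}


def register_new_version(old_guid: str, new_guid: str) -> None:
    if old_guid in superseded_by:
        raise ValueError(f"{old_guid} already has a successor")
    superseded_by[old_guid] = new_guid


def latest_version(guid: str) -> str:
    """Follow successor links until a GUID with no successor is reached."""
    seen = {guid}
    while guid in superseded_by:
        guid = superseded_by[guid]
        if guid in seen:
            raise ValueError("cycle in successor links")
        seen.add(guid)
    return guid


register_new_version("guid-0001", "guid-0002")
register_new_version("guid-0002", "guid-0003")
print(latest_version("guid-0001"))  # guid-0003
```

The old GUID stays valid for the publication that cited it, and the table
gives the way forward. The price is that someone has to maintain it, which a
revision id avoids.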

Requirements in Donald's PPT:
-If a GUID is on a physical object, the GUID need not refer uniquely to a
single data element; it only has to be unique itself. That is also not a
requirement in the LSID specification. There will be overlap between the
objects, so an object can belong to more than one ID. For instance a
researcher can have his own ID and also belong to the ID of the institute he
is working for. The data for overlapping elements like researcher name must
be equal.
-I would restrict the identifiers to life science objects.

Issues to be resolved in Donald's PPT:
It would be beneficial to maintain the GUID in the datasource itself (at
least for the owner of the datasource), but not absolutely necessary. I see
GUID in data records as a 'tightly coupled' model (which requires some work
for existing databases). I can also imagine a 'loosely coupled' model where
the provider software is modified to get the identifier from a central server
(or mirror).
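
A sketch of how the 'loosely coupled' model might work, with an in-memory
class standing in for the central server. The class, its get_or_assign
method and the use of a random UUID as the identifier are assumptions for
the example only, not an existing service or the LSID mechanism.

```python
import uuid
from typing import Dict, Tuple


class CentralGuidServer:
    """In-memory stand-in for a central server (or mirror); hypothetical API."""

    def __init__(self) -> None:
        self._guids: Dict[Tuple[str, str], str] = {}

    def get_or_assign(self, provider: str, local_id: str) -> str:
        # The same (provider, local_id) pair always yields the same identifier.
        key = (provider, local_id)
        if key not in self._guids:
            self._guids[key] = str(uuid.uuid4())
        return self._guids[key]


def serve_record(server: CentralGuidServer, provider: str, record: dict) -> dict:
    """Provider-side wrapper: attach the GUID on the way out; the stored record is untouched."""
    guid = server.get_or_assign(provider, record["id"])
    return {**record, "guid": guid}


server = CentralGuidServer()
stored = {"id": "42", "name": "example specimen"}  # made-up local record
first = serve_record(server, "provider-x", stored)
second = serve_record(server, "provider-x", stored)
print(first["guid"] == second["guid"])  # True
print("guid" in stored)                 # False
```

The existing database needs no new column; only the provider software
changes. The weak point is that the identifier now depends on the local
record id staying stable.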

Wouter Addink