Re: Globally Unique Identifier

23 Sep 2004

      ...
I have to disagree - kind of.  A non-information-bearing GUID such as
one generated by a MAC, e.g.,
{92AB5B37-70E9-4f05-9E97-CBABD08513ED}
is completely useless unless it only appears within the context of a
system that provides more information about what it actually is.
Yes, that would be an assumption.  But not an unreasonable one.  I'm trying
to imagine a scenario where I am presented with a series of MAC IDs where I
don't inherently understand the context.  I suppose if I came in to work and
found such a number scribbled on a piece of paper, with no other
information, I'd be in a fix to figure out what the number refers to.  But
obviously that's not a realistic scenario.  I suspect that such IDs would be
used by computers (not humans), and would only be exchanged among computers
in some sort of semantic context; e.g., within the context of a DwC2 XML
file, nestled between appropriate tags:

<GlobalUniqueIdentifier>92AB5B37-70E9-4f05-9E97-CBABD08513ED</GlobalUniqueIdentifier>

...these themselves nestled within further context tags.
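
To illustrate how the enclosing tags supply the context for an otherwise opaque identifier, here is a minimal Python sketch. Only the GlobalUniqueIdentifier element comes from the example above; the wrapper elements (record, Specimen) and the helper name are hypothetical, and real DwC2 documents will be structured differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: only GlobalUniqueIdentifier appears in the post;
# the <record> and <Specimen> wrappers are invented for illustration.
RECORD = """
<record>
  <Specimen>
    <GlobalUniqueIdentifier>92AB5B37-70E9-4f05-9E97-CBABD08513ED</GlobalUniqueIdentifier>
  </Specimen>
</record>
"""

def guid_with_context(xml_text):
    """Return (path of enclosing tags, GUID text) for the first GUID element."""
    root = ET.fromstring(xml_text.strip())
    # Map each child to its parent so the enclosing tags can be walked upward.
    parents = {child: parent for parent in root.iter() for child in parent}
    node = root.find(".//GlobalUniqueIdentifier")
    if node is None:
        raise ValueError("no GlobalUniqueIdentifier element found")
    path = []
    current = node
    while current in parents:
        current = parents[current]
        path.append(current.tag)
    return "/".join(reversed(path)), node.text.strip()

context, guid = guid_with_context(RECORD)
print(context, guid)
```

The GUID alone says nothing about what it identifies; the tag path recovered alongside it is what tells a consuming program that the value refers to a specimen.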
...
That's
the point of the LSID or DOI, they provide GUIDs that identify what
system can be used to resolve them.  If GUIDs for names or specimens or
whatever are to be used in other systems, then it is essential that the
GUID can be associated with a resolving system.
I tend to agree -- which is why I preferred DOIs (and increasingly, LSIDs)
to MAC IDs (which show up all over the place in all sorts of contexts).
Even so, I think we'll find that all electronic exchanges involving the
GUIDs we're discussing will take place within an evident context.
...
Both the DOI and LSID approaches are structured and provide context.
The DOI system uses the NISO Z39.84-2000 standard for categorization,
the LSID uses the domain name system.  Both provide a context essential
for reuse of an identifier outside its original context.
Yes, but I initially preferred DOIs to LSIDs because there tends to be less
"context baggage" associated with them. My sense of DOIs is that each
institution would not create its own DOI category; but rather there would be
a single agreed-upon DOI category that is independent of any particular
institution (with all the potential for political baggage an
institution-specified context might afford).
...
This was one of the first recommendations to GBIF - to provide a
registry of institution codes for exactly this purpose.  Having a tool
that verified the uniqueness of records within a collection as exposed
by its provider (either BioCASE or DiGIR) would help with this uniqueness
problem.  Now that the UDDI registry is available, we could in theory
use the institution identifiers in there.
More power to you (and GBIF, and the future of DiGIR)!  But in my view, it
should still be seen only as a temporary solution, until we can get our acts
together with more specific (and less information-contingent) ID systems.
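
A tool that verifies uniqueness of records within a collection, as suggested above, could start out as something like the following Python sketch. The field names are the DwC ones mentioned later in this message; the sample records and the function name are hypothetical, and a real tool would read records from a provider rather than from a list.

```python
from collections import Counter

def duplicate_keys(records):
    """Return the (InstitutionCode, CollectionCode, CatalogNumber) keys used more than once."""
    counts = Counter(
        (r["InstitutionCode"].strip(), r["CollectionCode"].strip(), r["CatalogNumber"].strip())
        for r in records
    )
    return sorted(key for key, n in counts.items() if n > 1)

# Hypothetical sample data, not taken from any real provider.
sample = [
    {"InstitutionCode": "INST", "CollectionCode": "Fish", "CatalogNumber": "1001"},
    {"InstitutionCode": "INST", "CollectionCode": "Fish", "CatalogNumber": "1002"},
    {"InstitutionCode": "INST", "CollectionCode": "Fish", "CatalogNumber": "1001"},
]
dupes = duplicate_keys(sample)
print(dupes)
```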
...
I strongly disagree that there should be a single GUID issuer or
resolver.
I believe you are in the majority on this.  But when I think it all through,
I still feel that consolidation of GUID issuance will be more advantageous
in the long term.
...
What we really need is an organization that operates kind of
like a certificate authority- GBIF could act as the root from which
other trusted GUID issuers may be created.  In this way we can avoid the
arbitrary creation of GUIDs yet still provide considerable flexibility
and de-centralization in the community.
If I read you correctly, I gather you are saying that the issuance of
numbers would be distributed and isolated, but the issuers would fall under
a centralized authority.  I'm not sure I understand why this system is
necessarily advantageous over a centralized issuer.
...
It would be a relatively simple task to include a LSID resolver service
along with a DiGIR provider.  I have prototyped such a system a while
back, but other issues prevented deployment.  With such an
implementation, it would be trivial to assign unique identifiers to
specimens - but first the problems institutions seem to have even
providing unique identifiers within a collection must be resolved.
AGREED!
...
...
As you've outlined in subsequent slides, I see two alternative paths:  A)
Get the biological world to rally around GBIF as the centralized provider of
GUIDs for specimens for all collections; or B) Have each
collection/institution issue its own set of LSIDs for its own specimens, and
have GBIF adopt those LSIDs for its own internal purposes.  I could get
behind either approach, but I see danger in the adoption of a mixture of
these two approaches. I'll defer elaboration, but a lot of it has to do with
potential confusion about whether the GUID applies fundamentally to the
physical specimen, or the electronic conglomeration of data associated with
the specimen. Also, I think we should avoid the risk of assigning two
separate GUIDs for the same "single data element" (sensu your Slide 5).
A mixture would still work, provided there was appropriate coordination
between the efforts.
With the level of coordination required, you might as well go for the "brass
ring" (in my opinion).  But maybe what I see as the "brass ring" is seen as
a dud to others.
...
...
Thus, when it comes to assigning GUIDs for names (not concepts), I would
propose the following:
urn:lsid:ICZN.org:TaxonName:XXXXXX (all zoological names)
urn:lsid:ICBN.org:TaxonName:XXXXXX (all botanical names)
urn:lsid:ICNB[or LBSN??].org:TaxonName:XXXXXX (all bacteriological names)
urn:lsid:ICTV[or ICVCN??].org:TaxonName:XXXXXX (all virus names)
In an ideal world, we'd get to the point where there would be a need for
only one registrar of nomenclature, e.g.:
urn:lsid:BioCode.org:TaxonName:XXXXXXX
Or, perhaps:
urn:lsid:gbif.net:TaxonName:XXXXXXX
It is quite likely that there will be multiple LSID generators and
issuers.  There is no real reason why this should be prevented, except
to ensure that appropriate measures are taken to avoid duplication of
GUIDs for the same object (taxonomic concept in this case).
Actually, I was talking about Taxonomic Names, specifically -- but if Names
are considered as represented by a subset of Concepts (as I hope they will
be), then it's the same GUID pool.
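
For readers less familiar with the shape of the identifiers proposed above, here is a minimal Python sketch that splits an LSID into its parts (authority/domain, namespace, object ID, optional revision). The class and function names are hypothetical, the object ID 123456 is a made-up stand-in for the XXXXXX placeholder, and lower-casing the authority is an assumption based on domain names being case-insensitive.

```python
from typing import NamedTuple, Optional

class LSID(NamedTuple):
    authority: str            # the domain-name part, e.g. "iczn.org"
    namespace: str            # e.g. "TaxonName"
    object_id: str
    revision: Optional[str] = None

def parse_lsid(text):
    """Split 'urn:lsid:<authority>:<namespace>:<object>[:<revision>]' into parts."""
    parts = text.split(":")
    if len(parts) not in (5, 6) or parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    if any(part == "" for part in parts[2:]):
        raise ValueError(f"empty component in LSID: {text!r}")
    authority, namespace, object_id = parts[2:5]
    revision = parts[5] if len(parts) == 6 else None
    # Assumption: treat the authority as case-insensitive, like a domain name.
    return LSID(authority.lower(), namespace, object_id, revision)

name_id = parse_lsid("urn:lsid:ICZN.org:TaxonName:123456")
print(name_id)
```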
...
So a
critical piece of infrastructure for a name service that was intending
to assign GUIDs would be a mechanism for determining if the object they
are about to assign the GUID to is not already present in the system,
held at some other location.  There needs to be something like a global
"findThisObject(taxon_object)" that absolutely guarantees that the
instance doesn't exist some other place.  And if duplicates were to
occur, then there must also be a mechanism for indicating equivalence
between GUIDs, or perhaps a way of deleting the duplicate (how to decide
which is the duplicate?).
I agree with all of this, but it seems that the infrastructure you describe
would yield a higher total cost than the single GUID provider approach
would.
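
To make the moving parts of that infrastructure concrete, here is a small in-memory Python sketch of a "findThisObject" check combined with an equivalence table for duplicates that slip through. Everything here is hypothetical: the class, the method names, the sample GUIDs, and especially the matching key (name, author, year), which is far cruder than real name matching would need to be. A single in-memory dictionary also sidesteps the hard part, which is making the check global across many issuers.

```python
class GuidRegistry:
    """Toy registry: duplicate check before assignment, plus GUID equivalence."""

    def __init__(self):
        self._by_key = {}    # normalized object key -> GUID first assigned
        self._same_as = {}   # duplicate GUID -> GUID it is declared equivalent to

    @staticmethod
    def _key(name, author, year):
        # Assumption: objects match if name, author and year agree after normalization.
        return (" ".join(name.lower().split()), author.strip().lower(), int(year))

    def find_this_object(self, name, author, year):
        """Return the GUID already assigned to this object, or None."""
        return self._by_key.get(self._key(name, author, year))

    def assign(self, guid, name, author, year):
        """Assign guid unless the object is already known; return the GUID in force."""
        existing = self.find_this_object(name, author, year)
        if existing is not None:
            return existing
        self._by_key[self._key(name, author, year)] = guid
        return guid

    def canonical(self, guid):
        """Follow equivalence links to the GUID that should be used."""
        seen = set()
        while guid in self._same_as and guid not in seen:
            seen.add(guid)
            guid = self._same_as[guid]
        return guid

    def mark_equivalent(self, duplicate, keep):
        """Record that 'duplicate' refers to the same object as 'keep'."""
        target = self.canonical(keep)
        if target != duplicate:
            self._same_as[duplicate] = target

registry = GuidRegistry()
first = registry.assign("guid-1", "Aus bus", "Smith", 1900)
second = registry.assign("guid-2", "aus  bus", "smith", 1900)  # same object, caught
registry.mark_equivalent("guid-9", "guid-1")                    # duplicate found later
print(first, second, registry.canonical("guid-9"))
```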
...
Forcing the use of a single DN such as BioCode.org for all names would
seem to be a mistake, since that implies a single resolver service for
all names- with obvious implications in case of failure.  Perhaps there
can be multiple resolver services with a single DN?  That would probably
work fine then.
Hmmm...I'm not sure I follow.  If I interpret your word "resolver"
correctly, then I see no reason why BioCode.org LSIDs could only be resolved
by one server.  Is that what the DomainName component of a LSID is
specifically for?  That is, "go to this domain to resolve the meaning of
this LSID"?  I thought the DomainName component was simply to give
uniqueness to an LSID in the form of representing the issuer (analogous to
the function of InstitutionCode in DwC).  I see no reason why there couldn't
be dozens, or hundreds of mirrored caches of the complete dataset all over
the world, maintained automatically in synchrony with the "master" set
(which would presumably, but not necessarily, reside at BioCode.org). Any
one of the mirrors could resolve any BioCode.org LSID.  With such a system,
resolving an LSID would require only that *any one* of potentially dozens of
mirrored servers be functional.

If I understand you correctly, and an LSID is resolved only by the server at
the Domain embedded within the LSID, then a dataset containing a
heterogeneous assortment of LSIDs would need *all* of potentially dozens of
distributed servers to be functional.
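
The difference between "any one of" and "all of" can be put in rough numbers. The Python sketch below assumes, purely for illustration, that each server is up independently with the same probability; real outages are neither independent nor uniform, and the function names and figures are hypothetical.

```python
import math

def any_one_up(p_up, n_servers):
    """Chance that at least one of n mirrors is up (mirrored, single-issuer case)."""
    return 1.0 - (1.0 - p_up) ** n_servers

def all_up(p_up, n_servers):
    """Chance that every one of n distinct issuer-run resolvers is up."""
    return p_up ** n_servers

# Hypothetical figures: 3 servers, each up 90% of the time.
mirrored = any_one_up(0.9, 3)     # 1 - 0.1**3
distributed = all_up(0.9, 3)      # 0.9**3
print(round(mirrored, 3), round(distributed, 3))
```

Under these assumptions, adding mirrors pushes the first figure toward 1, while adding issuer-bound resolvers pushes the second toward 0.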
...
The LSID service must be able to resolve the object.  When the object
moves some other place, then there will need to be a mechanism for the
LSID service to forward the resolution to the appropriate service.  The
really big problem is when an institution no longer exists - so the
hypothetical example of Bishop museum consuming all the Smithsonian fish
collections - the Smithsonian LSID resolver would perhaps no longer
exist, and so those LSIDs become meaningless.
In that case, I would vehemently oppose the use of LSIDs -- especially ones
issued from multiple sources, which rely on the issuer existing into
perpetuity.  It seems MUCH more feasible to me that the GUIDs only be used
within a prescribed context, than it would be to require that all LSID issuers
exist into perpetuity, and be functional at all times that someone needs to
resolve the information associated with any particular ID value.

Embedding issuer context in a GUID makes sense to me.  Restricting
resolution of a GUID to the embedded issuer *only* seems like a very
dangerous system to me.
...
Perhaps there's a
delegation mechanism that can be used?  So when a DN can't be resolved,
the system backs down to a default DN, such as gbif.org that would then
indicate that smithsonian.org is now bishop.org?
But it's not that simple, is it?  If there is an LSID:

urn:lsid:bishopmuseum.org:Specimen:1234567

and another LSID, to a completely different specimen:

urn:lsid:smithsonian.gov:Specimen:1234567

...then simply re-directing all bishopmuseum.org requests to Smithsonian
wouldn't work....would it?  Or would Smithsonian recognize the domain and
deal with it accordingly?
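
One way the second possibility could work is sketched below in Python: the forwarding table only decides *which server* to ask, while the receiving server stores records under the complete LSID, so the two specimens numbered 1234567 never collide. This is an assumption about how such a delegation scheme might be built, not a description of how LSID resolution actually behaves; the class, the function, and the record contents are hypothetical.

```python
class Resolver:
    """Toy resolver that keys its records on the complete LSID string."""

    def __init__(self):
        self.records = {}

    def add(self, lsid, data):
        self.records[lsid] = data

    def resolve(self, lsid):
        return self.records.get(lsid)

def resolve(lsid, resolvers, forwarding):
    """Pick a resolver from the LSID's authority, honoring any forwarding entry."""
    authority = lsid.split(":")[2].lower()
    host = forwarding.get(authority, authority)
    resolver = resolvers.get(host)
    if resolver is None:
        raise LookupError(f"no resolver available for {authority}")
    # The full LSID, including the original authority, is passed along unchanged.
    return resolver.resolve(lsid)

# Hypothetical setup following the direction used in the example above.
smithsonian = Resolver()
smithsonian.add("urn:lsid:smithsonian.gov:Specimen:1234567", "specimen B")
smithsonian.add("urn:lsid:bishopmuseum.org:Specimen:1234567", "specimen A")
resolvers = {"smithsonian.gov": smithsonian}
forwarding = {"bishopmuseum.org": "smithsonian.gov"}

a = resolve("urn:lsid:bishopmuseum.org:Specimen:1234567", resolvers, forwarding)
b = resolve("urn:lsid:smithsonian.gov:Specimen:1234567", resolvers, forwarding)
print(a, b)
```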

It seems to me that a lot of complexity would disappear if we could all get
behind a single issuer of GUIDs, and mirror the capability to resolve those
GUIDs on dozens or hundreds of servers around the world, and only use the
GUIDs in a semantic context that is self-evident.

Re-reading something I wrote:
...
...
I would go further to suggest (as I did above) that "Name" GUIDs should also
be a subtype of Name-Reference instances (non-exclusive of Concept subtype
instances), using the Name-Reference instance that represents the
Code-recognized original description of the name as the "handle" to the Name.
Actually, it's probably safe to say that all "name-bearing" Name+Reference
instances (i.e., original descriptions) are, virtually by definition,
also "concept-bearing" Name+Reference instances.  So, not only would
name-bearing and concept-bearing Name+Reference instances be non-exclusive
of each other, it would probably be safe to think of name-bearing instances
as a subset (Subtype) of Concept-bearing instances, which themselves are a
subset (Subtype) of all Name+Reference instances.
...
...
My own answers to your questions:
1) Are LSIDs the most appropriate technology?
I'm increasingly coming to that conclusion.
I agree.  The LSID system is easy to implement, stable, scalable and
does everything we need.  The DOI system is good as well, but the fee
scheme bothers me (though I understand there are ways around that).
My understanding is that it would be easy to develop a DOI-like system that
is not part of the fee-based DOI system, and I still find it appealing
because it could be as simple as an integer ID and a very basic context tag.

As for LSIDs -- If I understand correctly that the purpose of the
<DomainName> portion of the LSID is to point to the one (and only?) server
that can resolve the ID, then all of a sudden I don't like them at all.  If
it's true that the embedded Domain portion of an LSID *requires* that the
domain exist for as long as the GUID exists in order for the GUID to be
useful, then I definitely have reservations.  If, on the other hand, the
Domain portion can be seen as representing the issuer (somewhat analogous to
the function of "InstitutionCode" in DwC), and could be resolved by any
server set up to deal with the <namespace> part of the LSID, then I'm much
less concerned.
...
...
I think the best option would be central.  The next
option would be full
...
distributed.  Leaving it as an option would, in my opinion, be a BIG
mistake.
I disagree- the assignment of identifiers should be by the curators of
the data.  However, I do strongly consider that there should be some
sort of trust scheme in place, where identifiers are issued only by
entities trusted by the rest of the system.  A scheme similar to that
used by certificate authorities and delegates should be adequate.
Maybe I'm misunderstanding the use of the word "issuers", but in my mind,
the issuer's job is only to provide a guaranteed-unique set of ID's.  It
would not, necessarily, be the location where the ID is applied to its
associated data.

In Donald's PowerPoint file, he made reference to "mechanisms for data
providers to request and use blocks of LSIDs from central service".  Here's
how I imagine a system would work:

GBIF (or some other central entity) establishes a service that can generate
unique <objectID> numbers within its own LSID context.  The same service
also maintains a complete set of data associated with each <objectID>.
Major (and minor) institutions (essentially your set of "Trusted" entities)
would establish mirrored copies of the complete set of all data (or,
perhaps, only a filtered subset of the complete data), but would not be able
to issue new GUIDs directly.  However, the mirrored sites could serve as
real-time "pass-through" to the central site so as to be functionally able
to provide new GUIDs in real time, by retrieving them directly (in real
time) from the central server.  Also, the mirrored sites would all maintain
synchrony of their copies of the data with the central "master" copy, on a
realistic time frame (e.g., every 24 hours, or on-demand if a data provider
chose to initiate a synchronization command).

If a curator of a local institution's data needed to assign a new batch of
numbers for a new set of specimens, the curator would issue a request to the
central server (or via one of mirrored sites as a pass-through request) for
a block of N numbers.  The central server would never re-issue those same
numbers again to anyone else.  But those numbers remain "empty" until the
curator assigns them to data, and uploads that data either to the central
server or to one of the mirrors.  In other words, even though the numbers
are "issued" by a central server, they are applied to real data only by
local curators.
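
The block-request workflow described above can be sketched in a few lines of Python. This is an in-memory toy under stated assumptions: the class and method names are hypothetical, numbers are plain integers, and persistence, mirroring, and authentication are all left out.

```python
class CentralIssuer:
    """Toy central server: issues number blocks once, data is attached later."""

    def __init__(self, start=1):
        self._next = start
        self._owner = {}   # issued number -> requester it was issued to
        self._data = {}    # issued number -> data applied by the curator

    def request_block(self, requester, n):
        """Issue n consecutive numbers that will never be issued again."""
        if n <= 0:
            raise ValueError("block size must be positive")
        block = range(self._next, self._next + n)
        self._next += n
        for number in block:
            self._owner[number] = requester
        return block

    def is_empty(self, number):
        """True if the number has been issued but no data is attached yet."""
        return number in self._owner and number not in self._data

    def assign(self, requester, number, data):
        """Attach data to a number; only the original requester may do so."""
        if self._owner.get(number) != requester:
            raise PermissionError("number was not issued to this requester")
        if number in self._data:
            raise ValueError("number already has data assigned")
        self._data[number] = data

issuer = CentralIssuer()
block_a = issuer.request_block("curator_a", 3)   # hypothetical requesters
block_b = issuer.request_block("curator_b", 2)
issuer.assign("curator_a", block_a[0], {"note": "new specimen"})
print(list(block_a), list(block_b), issuer.is_empty(block_a[1]))
```

The numbers are "issued" centrally, but nothing is known about them until the curator who requested the block applies them to data.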

A big issue, of course, is control over editing of data associated with a
given GUID.  In the case of specimens, the central server and mirrored sites
could (perhaps at the discretion of the data curator who initially requested
the number) restrict subsequent editing of those data to a defined set of
password-protected user accounts.  In the case of more public data, such as
taxon names and publications, the control of data editing would be less
restrictive (e.g., either fully accessible by the public, or accessible to
anyone who goes to the trouble to register themselves as a taxonomist with
the central server or with any of the mirrored sites).

Maybe this approach would not be practical for specimen data -- but I think
it would be the optimal approach for taxon data.  Perhaps those two
fundamentally different kinds of data (owned, vs. public domain) need
fundamentally different approaches to GUID issuance and assignment?
...
...
3) Which objects should receive identifiers?
Specimens, References, Name-Reference intersections (Assertions), and
perhaps Agents.  [TaxonNames and Concepts can be subsets of Name-Reference
intersections].
Any object. It doesn't matter what it is, just that it can be resolved,
and when you find it, you can figure out what it is.  Sensible use of
the NameSpace portion of the LSID will help a lot with this.  A trusted
organization should issue the NameSpace portion to avoid NS conflicts.
I'd have to think this through some more.  Leaving it too open might lead to
a plethora of (potentially overlapping, but not quite equivalent)
NameSpaces, which seems like it could turn into a real mess, really quickly.
Centralized ID systems such as social security numbers in the U.S.,
telephone numbers, etc. definitely have some advantage over totally open
systems.  I suppose that the pool of NS's would be self-cleaning simply by
use or non-use....but I still wonder how much better this approach would be
over the status quo.
...
...
3a) Should we develop a set of object classes for biodiversity informatics
and assign identifiers to instances of all of these?
I think so, yes. Of course, it depends a bit on who you mean by "we".  I'm
thinking sensu lato.
Sure, and these could be a core from which others can be built.  But we
should absolutely not restrict the capability of the "system" to accept
new classes - even classes that represent the same information in a
different way that may be appropriate to a group of users.
Again, I'll have to think about this some more.  I certainly don't think
that the "system" should be incapable of dealing with new classes -- sort of
like how anyone can develop their own Federation Schema and use DiGIR to
establish specific information networks.  But I'd hate to see a breakdown in
the global transmission of biodiversity information simply because different
subgroups establish their own special-needs, non-mutually-compatible classes
for dealing with essentially the same kinds of information (especially if
they do not also conform to a generalized international standard).
...
...
4) What should be done about existing records without identifiers?
As far as I know, ALL records are currently without identifiers (unless
someone established a widely accepted GUID system and I missed the
announcement...)
All records currently have some sort of identifier, the problem is their
uniqueness is not rigorously enforced or even evaluated, so their
usefulness is probably limited.
O.K., in that case I misunderstood the meaning of "identifiers".  All
historical identifiers (e.g., catalog numbers for specimens) should be
maintained, preserved, and cross-referenced to GUIDs just like any other
metadata about the physical object.  I think of catalog numbers not so much
as unique identifiers, but as "labels" -- not altogether unlike taxonomic
names. In the databases I manage, I do not use catalog numbers as
identifiers -- the computer generates the UID, which is never seen, read,
written, or typed by a human.  That's how I'd like to see the sorts of GUIDs
we're discussing be implemented -- i.e., for the benefit of
computer-computer data exchange; not human-human data exchange or
human-computer/computer-human data exchange.
...
...
4c) Should the provider software be modified to generate "soft" identifiers
(ones which we cannot guarantee in all cases to be unique) based e.g. on the
combination of InstitutionCode, CollectionCode and CatalogNumber?
As an interim solution, perhaps.  See my comments under "Slide 2" above.
Yes, but not soft.  The providers should assign their own identifiers,
but there must be a mechanism to ensure that identifiers are being
properly assigned.
Agreed -- but I still think of these as "soft" identifiers, because
CatalogNumber values can change over time, in certain circumstances.  GUIDs
should *never* need to be changed (even if the institution that issued them
vanishes without a trace from the face of the Earth).
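
The distinction can be shown with a short Python sketch: an identifier derived from InstitutionCode, CollectionCode and CatalogNumber changes when the catalog number changes, while an opaque GUID stored alongside the record does not. The record values, the separator, and the use of a random UUID as the GUID are all assumptions made for illustration.

```python
import uuid

def soft_id(record):
    """Derive a 'soft' identifier from the three DwC fields (separator is arbitrary)."""
    return ":".join(
        record[field].strip()
        for field in ("InstitutionCode", "CollectionCode", "CatalogNumber")
    )

# Hypothetical specimen record with an opaque GUID assigned once.
record = {
    "guid": str(uuid.uuid4()),
    "InstitutionCode": "INST",
    "CollectionCode": "Fish",
    "CatalogNumber": "1001",
}
guid_before = record["guid"]
soft_before = soft_id(record)

record["CatalogNumber"] = "2004-0017"   # the specimen is re-cataloged
soft_after = soft_id(record)
print(soft_before, soft_after, record["guid"] == guid_before)
```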
...
Revision information is very helpful in dealing with errors such as
keystroke errors or other such details that do not change the object.
I agree the revision information *can* be helpful in dealing with errors;
but I don't see that function as being integral to the assignment of GUID
values.
...
Not many.  It seems most collections don't record any history in their
record edits, so without a major alteration in the way the data are
stored, it will be a significant undertaking to provide useful revision
information.
For what it's worth, the databases I have developed for my institution are
designed to log every change made to every field (except
performance-enhancing, purely derivative fields), including what the
previous value was, who made the change, and when the change was made.  When
records are deleted, a "snapshot" of the value of every non-null field is
logged, including the time the record was deleted, and by whom.  The reason
I say all of this is to underscore that my stance on not including
versioning IDs as part of a GUID system is NOT from lack of appreciation for
the value of preserving edit histories (something I clearly value very, very
much -- given that the total diskspace occupied by my edit logs exceeds the
total diskspace occupied by the "real" data!)
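
As a rough illustration of that kind of logging (previous value, who, when, plus a snapshot on delete), here is a minimal Python sketch. It is not the actual database design described above; the class, the field names, and the log layout are hypothetical, and a real system would do this inside the database rather than in application code.

```python
from datetime import datetime, timezone

class LoggedRecord:
    """Toy record that logs every field change and a snapshot on deletion."""

    def __init__(self, fields, log):
        self._fields = dict(fields)
        self._log = log

    def set(self, field, value, user):
        old = self._fields.get(field)
        if old == value:
            return  # nothing changed, nothing to log
        self._log.append({
            "action": "update", "field": field, "old": old, "new": value,
            "user": user, "when": datetime.now(timezone.utc),
        })
        self._fields[field] = value

    def delete(self, user):
        # Snapshot only the non-null fields at the moment of deletion.
        snapshot = {k: v for k, v in self._fields.items() if v is not None}
        self._log.append({
            "action": "delete", "snapshot": snapshot,
            "user": user, "when": datetime.now(timezone.utc),
        })
        self._fields.clear()

log = []
rec = LoggedRecord({"locality": "unknown", "depth_m": None}, log)
rec.set("locality", "Oahu", "user_a")
rec.delete("user_b")
print(len(log), log[0]["old"], log[1]["snapshot"])
```

Note that the log is kept entirely separate from the identifier: the record's GUID would not need a version component for this history to be preserved.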

In closing, I apologize to those who find my overly-long posts on this topic
to be an annoyance.  I also am starting to wonder:  is this the appropriate
email forum to have this discussion?

Aloha,
Rich

Richard Pyle