Globally Unique Identifier & Donald Hobern's PPT

Mon Sep 27 10:55:51 CEST 2004

> -the 'GUID combination' is not enforced and therefore not always used
> -Some collections belong to 2 or more Institutions or to none
> -If part of the collection moves to another institute, the GUID
> combination is changed for that part.
> -The InstitutionCode should be unique, and providers were asking what to
> do if the code they wanted to use was already chosen, and who decides
> which institute may use an InstitutionCode if two institutes want to use
> it. There is no body responsible for that and there are no rules: the
> first Institute can claim a code, or the biggest or the most well known??
> -In different science areas different InstitutionCodes within one
> Organisation were in use, which one to choose.
> -This 'GUID' can only be used for specimens, not for other life science
> objects.

Wholeheartedly agree on all counts!!!  That's why I still see it as a "soft"
ID (even with enforcement of unique registered Institution Codes, and
enforced uniqueness of CollectionCode+CatalogNumber within a single
InstitutionCode).  It's a stop-gap to solve some of the problems, until a
real GUID system is up, running, and broadly adopted.
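
To make the "soft" ID concrete, here is a minimal Python sketch of the InstitutionCode + CollectionCode + CatalogNumber combination. The class name, the separator, and the code values are all hypothetical, chosen only for illustration. It shows the weakness noted above: the key changes as soon as the specimen moves to another institution.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SoftId:
    """Hypothetical model of the three-part 'GUID combination'."""
    institution_code: str
    collection_code: str
    catalog_number: str

    def as_key(self) -> str:
        # The ':' separator is an assumption, not part of any standard.
        return f"{self.institution_code}:{self.collection_code}:{self.catalog_number}"


# Made-up codes: the same physical specimen before and after a transfer.
before = SoftId("INST-A", "FISH", "12345")
after = replace(before, institution_code="INST-B")

print(before.as_key())
print(after.as_key())
print(before.as_key() == after.as_key())  # the identity did not survive the move
```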

> Now let's look at LSID syntax:
> urn:lsid:authority:namespace:object_identifier (:revision_number)
> About  the first part; authority:
> It is natural to want this to be unique. Therefore we can expect the same
> problems as mentioned above, plus a lack of clarity about the difference
> between issuing_authority and current_authority for the data.

As to uniqueness, I think that's (part of) the point of using a URL, instead
of just an institution name or abbreviation.  URLs seem to be effectively
unique.  As to the confusion about "issuing_authority" vs.
"current_authority", count me among the befuddled.  My interpretation of
Dave Vieglais' posts was that the "Authority" URL was assumed to be the URL
where the GUID is resolved to the data it represents.  But Bob Morris' posts
suggest otherwise ("The authority name is the /issuing/ authority. It's an
authority for the LSID, not for its resolution or the underlying data.").
Perhaps I misunderstood Dave's post?  My primary concern about LSIDs is that
(I thought) the URL used for the "authority" portion of the LSID must be
live, online, active, and perpetual in order to resolve the data.  If this
is not the case, (i.e., if, as Bob says, it is only intended to indicate the
*issuer*, not the current authority), then my concerns about LSIDs are
greatly reduced.
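
For reference, the syntax quoted above can be split into its parts with a few lines of Python. This is only a sketch based on the colon-separated form given in this thread, not a full implementation of the LSID specification; the function name and the example authority "example.org" are assumptions. Note that parsing tells you nothing about whether the authority is the issuer or the current resolver, which is exactly the open question.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]  # None when no revision part is present


def parse_lsid(text: str) -> Lsid:
    """Split urn:lsid:authority:namespace:object_identifier[:revision]."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"expected 5 or 6 colon-separated parts: {text!r}")
    # Treat the 'urn:lsid' prefix as case-insensitive.
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    authority, namespace, object_id = parts[2:5]
    if not (authority and namespace and object_id):
        raise ValueError(f"empty component in: {text!r}")
    revision = parts[5] if len(parts) == 6 and parts[5] else None
    return Lsid(authority, namespace, object_id, revision)


# Hypothetical identifier, for illustration only.
lsid = parse_lsid("urn:lsid:example.org:specimens:12345:2")
print(lsid.authority, lsid.namespace, lsid.object_id, lsid.revision)
```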

> The problems with authority are important for the involved
> authorities only,
> not for the rest of the life science community.

Agreed!!  And further, the authority makes sense for "local" or "owned" data
(e.g., specimens, and attributes thereof), but not for "public" data (e.g.,
taxa, and attributes thereof).

> So discussions about it and establishing an authority that takes
> decisions in political conflicts are a waste of time.
> We can solve it by using a unique number only and maintaining a list that
> gives information about each number. It should be clear that these are only
> the initial issuing authority/authorities.

Agreed for sure on the last sentence!  And if I interpret the penultimate
sentence correctly, then full agreement there as well.  I had started
writing a response to Bob Morris' post last night, but it got too late so I
didn't finish it.  It included the following:

***************************************************
Bob Morris wrote:
> I find nothing in the LSID current(?) proposed recommendation
> http://www.omg.org/cgi-bin/doc?lifesci/2003-12-02 that specifies that
> the "object identification" part of an LSID may not be a string of hex
> digits and dashes. (Though I continue to fail to see why people are so
> in love with these).

I'm not particularly enamored of MACs or hex strings over other forms of
unique identifiers, but I WOULD like to see a protocol for bioinformatics
LSID generation that decreased the possibility of duplicate <ObjectID>
portions of the LSIDs from different issuers to effectively zero.  Maybe
there is no meaningful technical reason to do this -- but I can't help but
feel that if LSIDs are later determined not to be ideal for bioinformatics
purposes, it may prove to be DAMN useful that the <ObjectID> portion alone
serves effectively as its own gUid (emphasis on "U" intentional), even
stripped of the remaining LSID components.
***************************************************

Wouter: Is this what you meant by using a unique number?
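
As an illustration of what I have in mind, here is a rough Python sketch in which the <ObjectID> portion is a random UUID, so that it remains effectively unique even when stripped of the authority and namespace. Using a version 4 UUID is my assumption for the example; nothing in this thread or in the LSID proposal mandates it, and the authority names are invented.

```python
import uuid


def mint_lsid(authority: str, namespace: str) -> str:
    """Build an LSID whose object part is a random (version 4) UUID."""
    object_id = str(uuid.uuid4())  # hex digits and dashes, no colons
    return f"urn:lsid:{authority}:{namespace}:{object_id}"


# Two hypothetical issuers minting independently, with no coordination.
a = mint_lsid("museum-a.example", "specimens")
b = mint_lsid("museum-b.example", "specimens")

# The object part alone still works as an identifier.
object_id_a = a.split(":")[4]
object_id_b = b.split(":")[4]
print(object_id_a != object_id_b)
```

The collision risk between issuers is not literally zero, but with 122 random bits per identifier it is small enough to ignore in practice.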

I also agree with Wouter (in response to Kevin Richards' concern about
single-server bottleneck) that this should not be a concern.  I seriously
doubt that the sum total of all Life Sciences data calls would ever even
approach the level of calls that Google receives.  True, there will likely
not be the same sort of money backing a life sciences GUID system that is
behind Google's server farm -- but even still, given the fundamental
importance such a system would have to such a wide variety of people
receiving such a large chunk of grant money, I can certainly imagine
justifying money on the order of 6 or 7 figures ($/euro) for such a server
system, and I imagine that would be ample money by today's technological
benchmarks (or, more likely, the benchmark a few more cycles of Moore's Law
hence, which is probably when a system of this sort would start to receive
really high volume requests) to get a system capable of meeting the demand.
I think the real bottlenecks are sociological (i.e., getting these damned
fickle life science practitioners to agree on anything), not technological.

> The last part; revision id: whether you need it depends: do you give the
> physical objects a GUID or the data records?

Agreed!  Another excerpt from my attempted reply to Bob Morris:

***************************************************
Bob Morris wrote:
> This brings up  a point I may have missed in this discussion: LSIDs are
> designed to give an identifier to /data/ not to physical objects. This
> probably means that a fully compliant use of LSIDs for "specimens" will
> be assigned to specimen records, not to specimen objects. This is good,
> because if a physical object is moved it presumably gets a new specimen
> record, which record, perhaps, has some metadata that tells the curation
> history of the underlying object, including its previous LSIDs.
> Seems like the analog of taxonomic synonymy to me....

I disagree -- as I stated before, I strongly feel that the number should be
applied to the physical *object* (or virtual representation thereof -- such
as an original description), not the electronic data record.  If Bishop
Museum sends a specimen to Smithsonian, the specimen now curated at
Smithsonian is not, to my mind, a synonym of its earlier life at Bishop
Museum.  Nor is its data.  Both the specimen, and its associated data,
should be considered as fixed into perpetuity.  As more and more data are
transferred from the physical to an electronic form, those data should be
associated to the same GUID for the object -- not multiple versions of the
electronic representation of data related to that object.  If the ID *must*
be tied to the data record, instead of the object, then I take back
everything I said earlier about versioning.  In the data-centric scenario,
versioning becomes absolutely *vital*.  Personally, I think trying to manage
GUIDs as record identifiers, rather than object identifiers, would introduce
unnecessary and excessive complexity.  Biologists are interested in the
objects. The data records are just a convenient mechanism of information
conveyance -- not important entities unto themselves.  This may not apply to
all aspects of Life Sciences, but I think it should apply to objects we're
discussing here (specimens, taxa, references, etc.).
***************************************************

> With the first choice you do
> not need a revision number because the physical object will not change (or
> do they with living collections?).

Living collections do change, and so do unvouchered observations, and so do
records representing populations (rather than specific physical organisms).
And even preserved specimens change over time (tracking condition,
preservation status, etc.).  But I think the GUID should be fixed to the
physical object, and that whatever dynamic properties of that object are
worth recording over time should be associated back to the physical object.

Where things get more complicated is how to define the "object".  To some
collections, the unit of "Object" may be multi-taxon/multi-specimen (e.g.,
fossil), or single-taxon/multi-specimen (e.g., lot), or
single-taxon/single-specimen (individual specimen), or
single-taxon/partial-specimen (a part of a specimen, like a skeleton vs. a
skin). Single "objects" of any of these sorts may be fractioned (e.g.,
Isotypes, or simply splitting up a multi-specimen lot to send out to
different institutions).  So, one important question in such cases is
whether one of the "child" objects retains the GUID of the "parent" object,
and new GUIDs are assigned only to the remaining "child" objects (the way
Linnaean taxonomy works for taxonomic concepts, and the way most
institutions deal with catalog numbers for specimens).  Or, do *all* child
objects receive new GUIDs, each referring back to an historical "parent"
object that no longer exists?  The temptation is to support the latter, but
in this case, what of a specimen that partially deteriorates, so that a portion
of it is destroyed rather than sent to a different institution?  Logically,
the remaining specimen should be treated no differently than it would have
been had the deteriorated portion instead been sent to a different
institution, and hence receive a new GUID.  But that's starting to
sound an awful lot like condition monitoring.  Perhaps this distinction
should be left to the discretion of the GUID issuer/Object owner on a
case-by-case basis? (Yikes! Inconsistency!) Or, perhaps this is where
versioning comes in (where the versions are actual object versions, not
electronic data versions)?  This seems like a more complicated problem than
the ones we have been discussing so far.
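
The two splitting policies described above can be sketched side by side in Python. Everything here (the record layout, the function names, the use of random UUIDs) is a hypothetical illustration of the alternatives, not a proposal for how any existing system works.

```python
import uuid
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ObjectRecord:
    guid: str
    label: str
    parent_guid: Optional[str] = None  # link back to the object it was split from


def new_guid() -> str:
    return str(uuid.uuid4())


def split_retain(parent: ObjectRecord, labels: List[str]) -> List[ObjectRecord]:
    """Policy A: one child keeps the parent's GUID; the others get new GUIDs."""
    first = ObjectRecord(parent.guid, labels[0], parent.parent_guid)
    rest = [ObjectRecord(new_guid(), lbl, parent.guid) for lbl in labels[1:]]
    return [first] + rest


def split_retire(parent: ObjectRecord, labels: List[str]) -> List[ObjectRecord]:
    """Policy B: every child gets a new GUID; the parent GUID becomes historical."""
    return [ObjectRecord(new_guid(), lbl, parent.guid) for lbl in labels]


lot = ObjectRecord(new_guid(), "multi-specimen lot")
kept = split_retain(lot, ["retained portion", "portion sent elsewhere"])
retired = split_retire(lot, ["retained portion", "portion sent elsewhere"])
```

Under Policy B, a specimen that merely loses a deteriorated portion would also have to be re-issued a GUID, which is the awkward case raised above.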

> If a GUID for a 'physical object' is
> chosen, a thing like a species name or author name or country
> should not get a GUID.

I disagree.  For species names, the GUID would apply to the name's original
description/creation event. Metadata for such never change -- they can only
be corrected.  For author names, I would argue that the object to which the
GUID is applied should be thought of as the *name* of the author as a
virtual physical object; not the author as a physical object.  Multiple
AuthorName objects could be linked to each other via an Alias scheme, and/or
tied to a common "Person" (which could either be a separate GUID namespace,
or be defined as a set of linked AuthorName objects).  In the context of
biological objects, place descriptors of all sorts are really just
surrogates for defined two(three?)-dimensional physical spaces. The GUIDs for
such should primarily be established for the physical space, not the name or
other descriptors applied to that space.  The GEOnet Names Server (GNS;
http://earth-info.nima.mil/gns/html/index.html) seems to me to be a useful
model to follow.  They have two ID numbers for each record.  One is the
"Unique Feature Identifier" (UFI): "A number which uniquely identifies the
feature [=place].", and the other is the "Unique Name Identifier" (UNI): "A
number which uniquely identifies a name.".
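
A small Python sketch of that two-identifier idea: one identifier for the place itself and a separate identifier for each name applied to it. The field names follow the UFI/UNI terms quoted above, but the record layout, the numbers, and the place names are invented for illustration and are not the actual GNS schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Feature:
    ufi: int           # identifies the physical place
    description: str


@dataclass(frozen=True)
class Name:
    uni: int           # identifies one name
    ufi: int           # the place this name is applied to
    text: str


# Made-up data: one place with two names, another place with one.
features = [Feature(100, "place A"), Feature(200, "place B")]
names = [
    Name(1, 100, "Name One"),
    Name(2, 100, "Name One (older spelling)"),
    Name(3, 200, "Name Two"),
]


def names_for(ufi: int) -> List[str]:
    """All names attached to a single physical place."""
    return [n.text for n in names if n.ufi == ufi]


print(names_for(100))
```

The same shape would fit the AuthorName/Person case: several AuthorName identifiers pointing at one Person identifier.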

The point is, these representations to which I think GUIDs should be applied
are effectively permanent/persistent.

Aloha,
Rich

Richard L. Pyle, PhD
Natural Sciences Database Coordinator, Bishop Museum
1525 Bernice St., Honolulu, HI 96817
Ph: (808)848-4115, Fax: (808)847-8252
email: deepreef at bishopmuseum.org
http://www.bishopmuseum.org/bishop/HBS/pylerichard.html