Globally Unique Identifier - part II

Fri Sep 24 00:13:11 CEST 2004

> > I believe you are in the majority on this.  But when I think it all
> > through, I still feel that consolidation of GUID issuance will be more
> > advantageous in the long term.
> >
>
> Nope.  You'll have to try harder to convince me :-)

Wasn't trying to convince you....just signaling that you had yet to convince
me! :-)

I can see the discussions will be lively in Christchurch... :-)

> > If I read you correctly, I gather you are saying that the issuance of
> > numbers would be distributed and isolated, but the issuers would fall
> > under a centralized authority.  I'm not sure I understand why this
> > system is necessarily advantageous over a centralized issuer.
> >
>
> Because there's no single point of failure, it is more scalable, and in
> the (unlikely) event the centralized authority no longer exists, it
> would be a fairly trivial matter to delegate root authority to another
> trusted party.

O.K., I can see that.  But in the world I described, the only effect of a
failure would be an inability to retrieve new IDs -- all existing IDs would
still be resolvable at any of the dozens or hundreds of mirror sites.
Moreover, I would assume that the server that issued the numbers would be
designed to be as reliable as possible (think NYSE).  In the event that the
centralized authority no longer exists, it would be, as you say, a trivial
matter to delegate number issuance to another "trusted" mirror site.

Also, what of my point that a dataset containing IDs with heterogeneous
sources would allow *ALL* sources to be single points of failure?  Is it not
true that an LSID would require that the issuing domain be online in order
to retrieve the data associated with the ID?  So if my dataset included IDs
from 15 issuing domains, all 15 would need to be active at the time I run my
query in order to get a complete return of data?  This seems like an even
less robust system.
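
To put a rough number on that worry, here is a minimal sketch.  It assumes
outages at the issuing domains are independent of one another, and the 99%
availability figure is purely hypothetical (nothing in this thread measures
it).  The function name is likewise made up for illustration.

```python
def chance_all_online(per_domain_availability: float, domain_count: int) -> float:
    """Probability that every issuing domain is reachable at the same moment.

    Assumes each domain fails independently of the others, which is a
    simplification: real outages can be correlated.
    """
    return per_domain_availability ** domain_count


# Hypothetical figures: 15 issuing domains, each reachable 99% of the time.
print(round(chance_all_online(0.99, 15), 3))
```

Under those assumptions a query spanning 15 domains gets a complete return
only about 86% of the time, even though each individual domain looks quite
reliable on its own.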

> >>It is quite likely that there will be multiple LSID generators and
> >>issuers.  There is no real reason why this should be prevented, except
> >>to ensure that appropriate measures are taken to avoid duplication of
> >>GUIDs for the same object (taxonomic concept in this case).
> >
> > Actually, I was talking about Taxonomic Names, specifically -- but if
> > Names are considered as represented by a subset of Concepts (as I hope
> > they will be), then it's the same GUID pool.
> >
>
> Not sure what you mean here- If Joe enters a citation someplace and Rich
> uses its LSID within a Taxonomic Object he entered, why does it have to
> be in the same pool?  As long as the LSID resolved to the appropriate
> object, all would be good.

No....Reference IDs (I assume by "citation", you mean what I mean when I say
"Reference"?) would be a different pool from Names.

What I meant was, in my world "Name" IDs would be drawn from the same pool
of Name+Reference IDs that Concept IDs would be.  In other words, there
would be one pool of IDs for Name+Reference instances. A subset of these
would be Concept-bearing Name+Reference instances.  And a subset of the
Concept-bearing instances (sub-subset of Name+Reference instances) would be
Name-bearing instances. Those Name-bearing instances would be the "Name"
component of other Name+Reference instances.  It's a recursive relationship,
via well-defined subtypes. But this is an entirely separate topic of
discussion, having more to do with the question of "what is the essence of a
taxonomic concept", and "what is the essence of a taxonomic name" -- that is
tangential to the more general discussion at hand.

> Yeah, but it really concerns me having a single point of failure for
> such a critical system.

As I described above, it doesn't have to be a single point of failure, and
the only "failure" would be a delay in receiving new IDs.  And there could
be a defined "chain of command" such that if the primary server (ID issuer)
goes down, the calls are automatically re-routed to the next mirror in the
chain of command, which then automatically assumes authority for issuing
numbers....and so on down the chain.
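
A minimal sketch of that "chain of command" idea follows.  Every name in it
(Issuer, IssuerChain, IssuerDown) is hypothetical, and it glosses over the
hard part: in a real deployment each mirror would need an up-to-date copy of
the last number issued before it could safely take over.  Here a single
in-memory counter stands in for that replicated state.

```python
class IssuerDown(Exception):
    """Raised when a (hypothetical) issuer cannot be reached."""


class Issuer:
    def __init__(self, name, online=True):
        self.name = name
        self.online = online

    def issue(self, next_id):
        # A real issuer would be a network service; this one just checks a flag.
        if not self.online:
            raise IssuerDown(self.name)
        return next_id


class IssuerChain:
    """Tries the primary first, then each mirror in the defined order."""

    def __init__(self, issuers):
        self.issuers = list(issuers)
        # Stand-in for state that would have to be replicated to every mirror.
        self.last_issued = 0

    def issue(self):
        for issuer in self.issuers:
            try:
                new_id = issuer.issue(self.last_issued + 1)
            except IssuerDown:
                continue  # fall through to the next mirror in the chain
            self.last_issued = new_id
            return issuer.name, new_id
        raise IssuerDown("no issuer in the chain is reachable")


chain = IssuerChain([Issuer("primary", online=False), Issuer("mirror-1")])
print(chain.issue())
```

With the primary marked offline, the request falls through to the first
mirror, which issues the next number in sequence.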

I *think* I understand the fundamental differences in our perspective.  I
see the bioinformatics world operating more smoothly within a very specific,
well-understood and universally agreed-upon context.  You seem to prefer a
completely generic, self-describing system, that is not necessarily
restricted in scope to biology (correct?)  I understand your perspective,
and I certainly see its appeal -- I just think that the case-specific
implementation is even more appealing (to me, anyway).

> How an LSID is resolved is described in detail in the document:
>
> http://www.omg.org/cgi-bin/doc?lifesci/2003-12-02
>
> Section 8.3 describes the use of DNS for resolution.
>
> Basically, the LSID client:
>
> 1. Parses the LSID urn:lsid:DN:NS:ID[:Rev]
> 2. Using DNS, locate the SRV record for DN, which points to the service
> 3. Using DNS again, resolve the location of the service
> 4. Once you have the service endpoint, basically ask it for the object
> with NS:ID:Rev
>
> That's a gross simplification, and it appears that the LSID definition
> now treats DNS resolution as one resolution mechanism, rather than the
> only one.
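
For what it's worth, step 1 of that list can be sketched in a few lines.
This only splits the urn:lsid:DN:NS:ID[:Rev] form quoted above; it does not
perform the DNS lookups of steps 2 and 3, and it does not check the finer
syntax rules in the specification.  The names Lsid and parse_lsid, and the
example identifier, are made up for illustration.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str           # DN: the issuing domain
    namespace: str           # NS
    object_id: str           # ID
    revision: Optional[str]  # Rev, when present


def parse_lsid(text: str) -> Lsid:
    """Split urn:lsid:DN:NS:ID[:Rev] into its parts."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    authority, namespace, object_id = parts[2:5]
    if not (authority and namespace and object_id):
        raise ValueError(f"empty component in LSID: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return Lsid(authority, namespace, object_id, revision)


# Hypothetical identifier, built from the examples used in this thread.
print(parse_lsid("urn:lsid:gbif.org:Specimen:9876543"))
```

The authority field is what a client would then hand to DNS in step 2.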

O.K., this clears up a lot in my mind.  But what intrigues me now is: What
are the other (potential?) resolution mechanisms?

Obviously, my grasp of LSIDs was fundamentally flawed.  As much as the
"single-point failure" problem concerns you, the "must have all DNs within a
dataset containing multiple LSIDs active & online" concerns me.  In a sense,
it means there are many single-point failures that can impede data flow.

> > Embedding issuer context in a GUID makes sense to me.  Restricting
> > resolution of GUID to the embedded issuer *only*, seems like a very
> > dangerous system to me.
> >
>
> Yeah, but once again - if the single issuer no longer exists, then
> everything is gone.  That would be a real drag.

The specimens, presumably, wouldn't be gone -- they would simply move to a
new location.  And what happens when one specific specimen is transferred
from Museum "A" to Museum "B"?  Does it keep the original LSID (meaning
that the original Museum must continue to maintain and update data for a
specimen it no longer owns)?  Or is it issued a new LSID by the receiving
institution, in which case the specimen now has more than one ID?  What
happens to a legacy dataset that still has a pointer to the old LSID?  Yes,
I can see work-arounds to all of these (mechanisms for auto-forwarding,
maintenance of indexes of duplicate LSIDs, etc.)  But when you add up all of
that sort of baggage, the
redundant-mirrored-central-ID-issuer-with-defined-chain-of-command-cascade
system seems easier to manage, more flexible, and more reliable.

> > Well...that's partly why I emphasized that I think GUIDs should be for
> > computer-computer data exchange only.  But even if printed for a pair of
> > human eyes to read, surely there would be *some* stated context.  E.g.,
> > "ITIS TSN 1234567"; "BPBM 123456"; "GBIF Specimen ID 9876543";
> > "ICZN NameID 92AB5B37-70E9-4f05-9E97-CBABD08513ED"; etc....
> >
>
> So formalize that a little and you might have something more
> consistently machine parsable like: ITIS.ORG:TSN:1234567;
> BPBM.EDU:something:123456;GBIF.ORG:Specimen:9876543, ...
>
> Add in the system identifier for resolution (urn:lsid:...) and you have
> LSIDs.  The result is a far more consistent, legible and widely useful
> mechanism for referencing objects.  Allowing an author to arbitrarily
> provide the context for identifiers gets us little further along.

Yes, but the difference would be that in my world, any one of many mirrored
sites could resolve GBIF.ORG:Specimen:9876543; whereas the LSID protocol you
described above requires the issuer to resolve it.

> > How hard would it be in such cases to include within the Methods
> > section of the document, something to the effect of "All taxon IDs
> > listed in this paper refer to GBIF Specimen IDs, which can be resolved
> > at gbif.net".  If the problem is one involving a pair of human eyes
> > reading a number, then the problem can be solved in the context of a
> > pair of human eyes reading the context.
> >
>
> Sure, but do that consistently, by all authors?  And do it in a way that
> is without ambiguity?  Machine parsable (for electronic publication)?
> Easily reusable in other documents?

If it's human-human data exchange, it doesn't need to be consistent by all
authors.  ICZN requires new species-group names to be represented by a
holotype specimen. There are no rules about how an author indicates the
Holotype specimen -- only that it is done more or less unambiguously.
Perhaps ICZN rules should be strengthened -- but the point is, human-human
communication of this sort works fairly well even without rigid rules for
consistency, and even tolerates a fair amount of ambiguity.

Machine parsable is another issue.  I see documents, as you describe them,
as a medium of human->human (or computer->human) information exchange.  If
the data exist electronically already, why pass them from machine to machine
via "dumbed-down" human-readable documents that need to be subsequently
re-interpreted by a machine?  Better to start teaching Kindergarteners to
read XML as though it were prose! :-)

It's getting late, but I'll try to send off a "Part II" before I call it a
night.

Aloha,
Rich