Globally Unique Identifier - part II

Fri Sep 24 20:51:43 CEST 2004

I sent the other response before going through the whole document. Treat
this as part II.  This is getting so huge, there might even be a part III...

Richard Pyle wrote:

...

>
>
>
>>That's
>>the point of the LSID or DOI, they provide GUIDs that identify what
>>system can be used to resolve them.  If GUIDs for names or specimens or
>>whatever are to be used in other systems, then it is essential that the
>>GUID can be associated with a resolving system.
>
>
> I tend to agree -- which is why I preferred DOIs (and increasingly, LSIDs)
> to MAC ID's (which show up all over the place in all sorts of contexts).
> Even still, though, I think we'll find that all electronic exchanges
> involving GUIDs of which we speak, will do so within an evident context.

Maybe.  Perhaps for individual records there is no need for a resolvable
identifier to a single object, and using a MAC-type GUID there makes
some sense.  But if we go to the trouble of making GUIDs, why not make
them useful as well?

>
>>Both the DOI and LSID approaches are structured and provide context.
>>The DOI system uses the NISO Z39.84-2000 standard for categorization,
>>the LSID uses the domain name system.  Both provide a context essential
>>for reuse of an identifier outside its original context.
>
>
> Yes, but I initially preferred DOIs to LSIDs because there tends to be less
> "context baggage" associated with them. My sense of DOIs is that each
> institution would not create its own DOI category; but rather there would be
> a single agreed-upon DOI category that is independent of any particular
> institution (with all the potential for political baggage an
> institution-specified context might afford).
>

Yes, that would be the way they are created - the DOI category would be
assigned by the governing agency (probably DOI.org).  Then the baggage,
the unique part, would be up to the data providers or some other authority.

>
>>This was one of the first recommendations to GBIF - to provide a
>>registry of institution codes for exactly this purpose.  Having a tool
>>that verified the uniqueness of records within a collection as exposed
>>by its provider (either biocase or digir) would help this uniqueness
>>problem.  Now that the UDDI registry is available, we could in theory
>>use the institution identifiers in there.
>
>
> More power to you (and GBIF, and the future of DiGIR)!  But in my view, it
> should still be seen only as a temporary solution, until we can get our acts
> together with more specific (and less information-contingent) ID systems.
>

Yeah, I think the UDDI registry can really be leveraged to help with this.

>
>>I strongly disagree that there should be a single GUID issuer or
>>resolver.
>
>
> I believe you are in the majority on this.  But when I think it all through,
> I still feel that consolidation of GUID issuance will be more advantageous
> in the long term.
>

Nope.  You'll have to try harder to convince me :-)

>
>>What we really need is an organization that operates kind of
>>like a certificate authority- GBIF could act as the root from which
>>other trusted GUID issuers may be created.  In this way we can avoid the
>>arbitrary creation of GUIDs yet still provide considerable flexibility
>>and de-centralization in the community.
>
>
> If I read you correctly, I gather you are saying that the issuance of
> numbers would be distributed and isolated, but the issuers would fall under
> a centralized authority.  I'm not sure I understand why this system is
> necessarily advantageous over a centralized issuer.
>

Because there's no single point of failure, it is more scalable, and in
the (unlikely) event the centralized authority no longer exists, it
would be a fairly trivial matter to delegate root authority to another
trusted party.

>
>>It would be a relatively simple task to include a LSID resolver service
>>along with a DiGIR provider.  I have prototyped such a system a while
>>back, but other issues prevented deployment.  With such an
>>implementation, it would be trivial to assign unique identifiers to
>>specimens - but first the problems institutions seem to have even
>>providing unique identifiers within a collection must be resolved.
>
>
> AGREED!
>
>
>>>As you've outlined in subsequent slides, I see two alternative paths:  A)
>>>Get the biological world to rally around GBIF as the centralized provider of
>>>GUIDs for specimens for all collections; or B) Have each
>>>collection/institution issue its own set of LSIDs for its own specimens, and
>>>have GBIF adopt those LSIDs for its own internal purposes.  I could get
>>>behind either approach, but I see danger in the adoption of a mixture of
>>>these two approaches. I'll defer elaboration, but a lot of it has to do with
>>>potential confusion about whether the GUID applies fundamentally to the
>>>physical specimen, or the electronic conglomeration of data associated with
>>>the specimen. Also, I think we should avoid the risk of assigning two
>>>separate GUIDs for the same "single data element" (sensu your Slide 5).
>>>
>>
>>A mixture would still work, provided there was appropriate coordination
>>between the efforts.
>
>
> With the level of coordination required, you might as well go for the "brass
> ring" (in my opinion).  But maybe what I see as the "brass ring" is seen as
> a dud to others.
>
>
>>>Thus, when it comes to assigning GUIDs for names (not
>>>concepts), I would propose the following:
>>>
>>>urn:lsid:ICZN.org:TaxonName:XXXXXX (all zoological names)
>>>urn:lsid:ICBN.org:TaxonName:XXXXXX (all botanical names)
>>>urn:lsid:ICNB[or LBSN??].org:TaxonName:XXXXXX (all bacteriological names)
>>>urn:lsid:ICTV[or ICVCN??].org:TaxonName:XXXXXX (all virus names)
>>>
>>>In an ideal world, we'd get to the point where there would be a need for
>>>only one registrar of nomenclature, e.g.:
>>>urn:lsid:BioCode.org:TaxonName:XXXXXXX
>>>
>>>Or, perhaps:
>>>urn:lsid:gbif.net:TaxonName:XXXXXXX
>>
>>It is quite likely that there will be multiple LSID generators and
>>issuers.  There is no real reason why this should be prevented, except
>>to ensure that appropriate measures are taken to avoid duplication of
>>GUIDs for the same object (taxonomic concept in this case).
>
>
> Actually, I was talking about Taxonomic Names, specifically -- but if Names
> are considered as represented by a subset of Concepts (as I hope they will
> be), then it's the same GUID pool.
>

Not sure what you mean here.  If Joe enters a citation someplace and Rich
uses its LSID within a Taxonomic Object he entered, why does it have to
be in the same pool?  As long as the LSID resolves to the appropriate
object, all would be good.

>
>>So a
>>critical piece of infrastructure for a name service that was intending
>>to assign GUIDs would be a mechanism for determining if the object they
>>are about to assign the GUID to is not already present in the system,
>>held at some other location.  There needs to be something like a global
>>"findThisObject(taxon_object)" that absolutely guarantees that the
>>instance doesn't exist some other place.  And if duplicates were to
>>occur, then there must also be a mechanism for indicating equivalence
>>between GUIDs, or perhaps a way of deleting the duplicate (how to decide
>>which is the duplicate?).
>
>
> I agree with all of this, but it seems that the infrastructure you describe
> would yield a higher total cost than the single GUID provider approach
> would.

Yeah, but it really concerns me having a single point of failure for
such a critical system.
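
To make the "findThisObject" idea and the GUID equivalence mechanism
quoted above a bit more concrete, here is a minimal sketch in Python.
Everything in it is an assumption for illustration: the class and method
names, the use of a normalized name string as the lookup key, and the
"first registered wins" rule for deciding which GUID is the duplicate.
It is a single in-memory table, so it says nothing about how such a
lookup would be distributed across issuers.

```python
from typing import Dict, Optional


def normalize(key: str) -> str:
    """Collapse whitespace and case so trivially different keys match."""
    return " ".join(key.split()).casefold()


class GuidRegistry:
    """Hypothetical registry: one canonical GUID per object, plus aliases."""

    def __init__(self) -> None:
        self._by_key: Dict[str, str] = {}    # object key -> canonical GUID
        self._same_as: Dict[str, str] = {}   # duplicate GUID -> canonical GUID

    def find_this_object(self, key: str) -> Optional[str]:
        """Return the GUID already assigned to this object, if any."""
        return self._by_key.get(normalize(key))

    def register(self, key: str, guid: str) -> str:
        """Assign guid to the object, or record it as an equivalent GUID.

        Returns the canonical GUID for the object. The first GUID
        registered is treated as canonical (an arbitrary choice).
        """
        existing = self.find_this_object(key)
        if existing is None:
            self._by_key[normalize(key)] = guid
            return guid
        if existing != guid:
            self._same_as[guid] = existing
        return existing

    def canonical(self, guid: str) -> str:
        """Follow equivalence links to the canonical GUID."""
        while guid in self._same_as:
            guid = self._same_as[guid]
        return guid
```

An issuer would call find_this_object() before minting a new GUID, and
canonical() lets a client treat two GUIDs for the same object as
equivalent without deleting either one.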

>
>
>>Forcing the use of a single DN such as BioCode.org for all names would
>>seem to be a mistake, since that implies a single resolver service for
>>all names- with obvious implications in case of failure.  Perhaps there
>>can be multiple resolver services with a single DN?  That would probably
>>work fine then.
>
>
> Hmmm...I'm not sure I follow.  If I interpret your word "resolver"
> correctly, then I see no reason why BioCode.org LSIDs could only be resolved
> by one server.  Is that what the DomainName component of a LSID is
> specifically for?  That is, "go to this domain to resolve the meaning of
> this LSID"?  I thought the DomainName component was simply to give
> uniqueness to an LSID in the form of representing the issuer (analogous to
> the function of InstitutionCode in DwC).  I see no reason why there couldn't
> be dozens, or hundreds of mirrored caches of the complete dataset all over
> the world, maintained automatically in synchrony with the "master" set
> (which would presumably, but not necessarily, reside at BioCode.org). Any
> one of the mirrors could resolve any BioCode.org LSID.  With such a system,
> resolving an LSID would require *any one* of potentially dozens of
> mirrored servers to be functional.
>
> If I understand you correctly, and an LSID is resolved only by the server at
> the Domain embedded within the LSID, then a dataset containing a
> heterogeneous assortment of LSIDs would need *all* of potentially dozens of
> distributed servers to be functional.
>

How an LSID is resolved is described in detail in the document:

http://www.omg.org/cgi-bin/doc?lifesci/2003-12-02

Section 8.3 describes the use of DNS for resolution.

Basically, the LSID client:

1. Parses the LSID urn:lsid:DN:NS:ID[:Rev]
2. Uses DNS to locate the SRV record for DN, which points to the service
3. Uses DNS again to resolve the location of the service
4. Asks the service endpoint for the object identified by NS:ID[:Rev]

That's a gross simplification, and it appears that the LSID definition
now treats DNS resolution as one resolution mechanism, rather than the
only one.
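
As a rough illustration of the first two steps, the sketch below parses
an LSID and builds the DNS name a client would query for the SRV record.
The function and class names are made up for this example, and the
"_lsid._tcp" service label is an assumption that should be checked
against section 8.3 of the specification. No DNS query is actually
performed here.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str            # DN: domain name of the issuing authority
    namespace: str            # NS
    object_id: str            # ID
    revision: Optional[str]   # Rev, if present


def parse_lsid(text: str) -> Lsid:
    """Split urn:lsid:DN:NS:ID[:Rev] into its components."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    if any(p == "" for p in parts[2:5]):
        raise ValueError(f"empty LSID component in {text!r}")
    revision = (parts[5] or None) if len(parts) == 6 else None
    return Lsid(parts[2], parts[3], parts[4], revision)


def srv_query_name(lsid: Lsid) -> str:
    """DNS name to query for the SRV record of the LSID's authority.

    Assumption: the service is advertised under the _lsid._tcp label.
    """
    return f"_lsid._tcp.{lsid.authority.lower()}"
```

For example, parse_lsid("urn:lsid:gbif.net:TaxonName:12345") yields the
authority "gbif.net", and srv_query_name() on that result gives the name
a resolver library would then look up. The point relevant to this thread
is that the DN selects where to start resolving, not necessarily a
single machine that must answer.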

>
>>The LSID service must be able to resolve the object.  When the object
>>moves some other place, then there will need to be a mechanism for the
>>LSID service to forward the resolution to the appropriate service.  The
>>really big problem is when an institution no longer exists - so the
>>hypothetical example of Bishop museum consuming all the Smithsonian fish
>>collections - the Smithsonian LSID resolver would perhaps no longer
>>exist, and so those LSIDs become meaningless.
>
>
> In that case, I would vehemently oppose the use of LSIDs -- especially ones
> issued from multiple sources, which rely on the issuer existing into
> perpetuity.  It seems MUCH more feasible to me that the GUIDs only be used
> within a prescribed context, than it would to require that all LSID issuers
> exist into perpetuity, and be functional at all times that someone needs to
> resolve the information associated with any particular ID value.
>
> Embedding issuer context in a GUID makes sense to me.  Restricting
> resolution of GUID to the embedded issuer *only*, seems like a very
> dangerous system to me.
>

Yeah, but once again - if the single issuer no longer exists, then
everything is gone.  That would be a real drag.

Part III to follow!

Dave V.