Re: Globally Unique Identifier - part II - tdwg-content

24 Sep 2004

      I sent the other response before going through the whole document. Treat
this as part II.  This is getting so huge, there might even be a part III...

Richard Pyle wrote:

...
...
...
That's
the point of the LSID or DOI, they provide GUIDs that identify what
system can be used to resolve them.  If GUIDs for names or specimens or
whatever are to be used in other systems, then it is essential that the
GUID can be associated with a resolving system.
I tend to agree -- which is why I preferred DOIs (and increasingly, LSIDs)
to MAC IDs (which show up all over the place in all sorts of contexts).
Even still, though, I think we'll find that all electronic exchanges
involving GUIDs of which we speak, will do so within an evident context.
Maybe.  Perhaps for individual records there is no need for a resolvable
identifier to a single object, and using a MAC-type GUID there makes
some sense.  But if we go to the trouble of making GUIDs, why not make
them useful as well?
...
...
Both the DOI and LSID approaches are structured and provide context.
The DOI system uses the NISO Z39.84-2000 standard for categorization,
the LSID uses the domain name system.  Both provide a context essential
for reuse of an identifier outside its original context.
Yes, but I initially preferred DOIs to LSIDs because there tends to be less
"context baggage" associated with them. My sense of DOIs is that each
institution would not create its own DOI category; but rather there would be
a single agreed-upon DOI category that is independent of any particular
institution (with all the potential for political baggage an
institution-specified context might afford).
Yes, that would be the way they are created - the DOI category would be
assigned by the governing agency (probably DOI.org).  Then the baggage,
the unique part, would be up to the data providers or some other authority.
...
...
This was one of the first recommendations to GBIF - to provide a
registry of institution codes for exactly this purpose.  Having a tool
that verified the uniqueness of records within a collection as exposed
by its provider (either BioCASe or DiGIR) would help this uniqueness
problem.  Now that the UDDI registry is available, we could in theory
use the institution identifiers in there.
More power to you (and GBIF, and the future of DiGIR)!  But in my view, it
should still be seen only as a temporary solution, until we can get our acts
together with more specific (and less information-contingent) ID systems.
Yeah, I think the UDDI registry can really be leveraged to help with this.
...
...
I strongly disagree that there should be a single GUID issuer or
resolver.
I believe you are in the majority on this.  But when I think it all through,
I still feel that consolidation of GUID issuance will be more advantageous
in the long term.
Nope.  You'll have to try harder to convince me :-)
...
...
What we really need is an organization that operates kind of
like a certificate authority: GBIF could act as the root from which
other trusted GUID issuers may be created.  In this way we can avoid the
arbitrary creation of GUIDs yet still provide considerable flexibility
and de-centralization in the community.
If I read you correctly, I gather you are saying that the issuance of
numbers would be distributed and isolated, but the issuers would fall under
a centralized authority.  I'm not sure I understand why this system is
necessarily advantageous over a centralized issuer.
Because there's no single point of failure, it is more scalable, and in
the (unlikely) event the centralized authority no longer exists, it
would be a fairly trivial matter to delegate root authority to another
trusted party.
...
...
It would be a relatively simple task to include an LSID resolver service
along with a DiGIR provider.  I prototyped such a system a while
back, but other issues prevented deployment.  With such an
implementation, it would be trivial to assign unique identifiers to
specimens - but first the problems institutions seem to have even
providing unique identifiers within a collection must be resolved.
AGREED!
...
...
As you've outlined in subsequent slides, I see two alternative paths:  A)
Get the biological world to rally around GBIF as the centralized provider of
GUIDs for specimens for all collections; or B) Have each
collection/institution issue its own set of LSIDs for its own specimens, and
have GBIF adopt those LSIDs for its own internal purposes.  I could get
behind either approach, but I see danger in the adoption of a mixture of
these two approaches. I'll defer elaboration, but a lot of it has to do with
potential confusion about whether the GUID applies fundamentally to the
physical specimen, or the electronic conglomeration of data associated with
the specimen. Also, I think we should avoid the risk of assigning two
separate GUIDs for the same "single data element" (sensu your Slide 5).
A mixture would still work, provided there was appropriate coordination
between the efforts.
With the level of coordination required, you might as well go for the "brass
ring" (in my opinion).  But maybe what I see as the "brass ring" is seen as
a dud to others.
...
...
Thus, when it comes to assigning GUIDs for names (not
concepts), I would propose the following:
urn:lsid:ICZN.org:TaxonName:XXXXXX (all zoological names)
urn:lsid:ICBN.org:TaxonName:XXXXXX (all botanical names)
urn:lsid:ICNB[or LBSN??].org:TaxonName:XXXXXX (all bacteriological names)
urn:lsid:ICTV[or ICVCN??].org:TaxonName:XXXXXX (all virus names)
In an ideal world, we'd get to the point where there would be a need for
only one registrar of nomenclature, e.g.:
urn:lsid:BioCode.org:TaxonName:XXXXXXX
Or, perhaps:
urn:lsid:gbif.net:TaxonName:XXXXXXX
It is quite likely that there will be multiple LSID generators and
issuers.  There is no real reason why this should be prevented, except
to ensure that appropriate measures are taken to avoid duplication of
GUIDs for the same object (taxonomic concept in this case).
Actually, I was talking about Taxonomic Names, specifically -- but if Names
are considered as represented by a subset of Concepts (as I hope they will
be), then it's the same GUID pool.
Not sure what you mean here.  If Joe enters a citation someplace and Rich
uses its LSID within a Taxonomic Object he entered, why does it have to
be in the same pool?  As long as the LSID resolves to the appropriate
object, all would be good.
...
...
So a
critical piece of infrastructure for a name service that was intending
to assign GUIDs would be a mechanism for determining if the object they
are about to assign the GUID to is not already present in the system,
held at some other location.  There needs to be something like a global
"findThisObject(taxon_object)" that absolutely guarantees that the
instance doesn't exist some other place.  And if duplicates were to
occur, then there must also be a mechanism for indicating equivalence
between GUIDs, or perhaps a way of deleting the duplicate (how to decide
which is the duplicate?).
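To make the lookup-before-assign idea concrete, here is a minimal sketch in
Python.  Everything in it is an assumption for illustration: the GuidRegistry
class, the normalize() matching rule, and the example LSIDs are hypothetical,
and a single in-memory dictionary stands in for what would really be a
distributed search.  It does not "absolutely guarantee" anything; it only
shows the shape of the check and of the equivalence record.

```python
from typing import Dict, Optional


def normalize(description: str) -> str:
    """Collapse case and whitespace so trivially different spellings match."""
    return " ".join(description.lower().split())


class GuidRegistry:
    """Hypothetical in-memory stand-in for a global 'findThisObject' service."""

    def __init__(self) -> None:
        self._guid_by_object: Dict[str, str] = {}  # normalized object -> GUID
        self._same_as: Dict[str, str] = {}         # duplicate GUID -> kept GUID

    def find_this_object(self, description: str) -> Optional[str]:
        """Return the GUID already assigned to this object, if any."""
        return self._guid_by_object.get(normalize(description))

    def assign(self, description: str, proposed_guid: str) -> str:
        """Assign proposed_guid only if the object has no GUID yet."""
        existing = self.find_this_object(description)
        if existing is not None:
            return existing
        self._guid_by_object[normalize(description)] = proposed_guid
        return proposed_guid

    def canonical(self, guid: str) -> str:
        """Follow equivalence links to the GUID that was kept."""
        while guid in self._same_as:
            guid = self._same_as[guid]
        return guid

    def mark_equivalent(self, duplicate: str, keep: str) -> None:
        """Record that 'duplicate' identifies the same object as 'keep'."""
        duplicate_root = self.canonical(duplicate)
        keep_root = self.canonical(keep)
        if duplicate_root != keep_root:
            self._same_as[duplicate_root] = keep_root


registry = GuidRegistry()
first = registry.assign("Aus bus Smith, 1900", "urn:lsid:a.example:TaxonName:1")
second = registry.assign("aus  bus smith, 1900", "urn:lsid:b.example:TaxonName:9")
```

Here the second assign() call returns the GUID issued by the first, because
the two descriptions normalize to the same key.  Deciding which GUID is "the
duplicate" is left to the caller of mark_equivalent(), which is exactly the
policy question raised above.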
I agree with all of this, but it seems that the infrastructure you describe
would yield a higher total cost than the single GUID provider approach
would.
Yeah, but it really concerns me having a single point of failure for
such a critical system.
...
...
Forcing the use of a single DN such as BioCode.org for all names would
seem to be a mistake, since that implies a single resolver service for
all names- with obvious implications in case of failure.  Perhaps there
can be multiple resolver services with a single DN?  That would probably
work fine then.
Hmmm...I'm not sure I follow.  If I interpret your word "resolver"
correctly, then I see no reason why BioCode.org LSIDs could only be resolved
by one server.  Is that what the DomainName component of an LSID is
specifically for?  That is, "go to this domain to resolve the meaning of
this LSID"?  I thought the DomainName component was simply to give
uniqueness to an LSID in the form of representing the issuer (analogous to
the function of InstitutionCode in DwC).  I see no reason why there couldn't
be dozens, or hundreds of mirrored caches of the complete dataset all over
the world, maintained automatically in synchrony with the "master" set
(which would presumably, but not necessarily, reside at BioCode.org). Any
one of the mirrors could resolve any BioCode.org LSID.  With such a system,
resolving an LSID would require only *any one* of potentially dozens of
mirrored servers to be functional.
If I understand you correctly, and an LSID is resolved only by the server at
the Domain embedded within the LSID, then a dataset containing a
heterogeneous assortment of LSIDs would need *all* of potentially dozens of
distributed servers to be functional.
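The "any one" versus "all" contrast can be put into numbers with a small
sketch.  It assumes every server has the same uptime and that failures are
independent, which is a simplification and not something claimed in this
thread; the function names are made up for illustration.

```python
def any_one_up(uptime: float, servers: int) -> float:
    """Chance that at least one of several mirrors is reachable."""
    return 1.0 - (1.0 - uptime) ** servers


def all_up(uptime: float, servers: int) -> float:
    """Chance that every one of several distinct issuers is reachable."""
    return uptime ** servers


mirrored = any_one_up(0.9, 3)
distributed = all_up(0.9, 3)
```

Under those assumptions, three servers that are each up 90% of the time give
0.999 for the mirrored case and 0.729 for the case where all three must
answer.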
How an LSID is resolved is described in detail in the document:

http://www.omg.org/cgi-bin/doc?lifesci/2003-12-02

Section 8.3 describes the use of DNS for resolution.

Basically, the LSID client:

1. Parses the LSID urn:lsid:DN:NS:ID[:Rev]
2. Uses DNS to locate the SRV record for DN, which points to the service
3. Uses DNS again to resolve the location of the service
4. Asks the service endpoint for the object identified by NS:ID[:Rev]

That's a gross simplification, and it appears that the LSID definition
now treats DNS resolution as one resolution mechanism, rather than the
only one.
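The steps above can be sketched roughly as follows.  This is not the
specification's API: parse_lsid(), resolve(), and the two dictionaries are
hypothetical, the dictionaries stand in for the DNS lookups and for the remote
service, and the host name and record contents are invented for the example.

```python
from typing import Dict, NamedTuple, Optional, Tuple


class Lsid(NamedTuple):
    authority: str            # DN: the domain name of the issuer
    namespace: str            # NS
    object_id: str            # ID
    revision: Optional[str]   # Rev, optional


def parse_lsid(text: str) -> Lsid:
    """Step 1: split urn:lsid:DN:NS:ID[:Rev] into its parts."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    authority, namespace, object_id = parts[2:5]
    if not (authority and namespace and object_id):
        raise ValueError(f"empty LSID component: {text!r}")
    revision = parts[5] if len(parts) == 6 and parts[5] else None
    return Lsid(authority, namespace, object_id, revision)


ObjectKey = Tuple[str, str, Optional[str]]


def resolve(text: str,
            service_for_domain: Dict[str, str],
            objects_at_service: Dict[str, Dict[ObjectKey, str]]) -> str:
    """Steps 2-4, with dictionaries standing in for DNS and the service."""
    lsid = parse_lsid(text)
    # Steps 2 and 3: a real client would query DNS here (assumption: one
    # dictionary lookup replaces both the SRV and the address lookup).
    endpoint = service_for_domain.get(lsid.authority)
    if endpoint is None:
        raise LookupError(f"no resolver known for {lsid.authority}")
    # Step 4: ask the service for the object named by NS, ID and Rev.
    key = (lsid.namespace, lsid.object_id, lsid.revision)
    return objects_at_service[endpoint][key]


# Hypothetical data: one issuer domain and one record.
services = {"gbif.net": "lsid.example.org:8080"}
objects = {"lsid.example.org:8080": {("TaxonName", "12345", None): "a name record"}}
record = resolve("urn:lsid:gbif.net:TaxonName:12345", services, objects)
```

Note that the failure case in resolve() is the one under discussion in this
thread: if nothing answers for the domain embedded in the LSID, the client has
nowhere else to go unless some other resolution mechanism is available.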
...
...
The LSID service must be able to resolve the object.  When the object
moves some other place, then there will need to be a mechanism for the
LSID service to forward the resolution to the appropriate service.  The
really big problem is when an institution no longer exists - so the
hypothetical example of the Bishop Museum consuming all the Smithsonian fish
collections - the Smithsonian LSID resolver would perhaps no longer
exist, and so those LSIDs become meaningless.
In that case, I would vehemently oppose the use of LSIDs -- especially ones
issued from multiple sources, which rely on the issuer existing into
perpetuity.  It seems MUCH more feasible to me that the GUIDs only be used
within a prescribed context, than it would to require that all LSID issuers
exist into perpetuity, and be functional at all times that someone needs to
resolve the information associated with any particular ID value.
Embedding issuer context in a GUID makes sense to me.  Restricting
resolution of a GUID to the embedded issuer *only* seems like a very
dangerous system to me.
Yeah, but once again - if the single issuer no longer exists, then
everything is gone.  That would be a real drag.

Part III to follow!

Dave V.


Dave Vieglais
