Re: Globally Unique Identifier - part II
I sent the other response before going through the whole document. Treat this as part II. This is getting so huge, there might even be a part III...
Richard Pyle wrote:
...
That's the point of the LSID or DOI, they provide GUIDs that identify what system can be used to resolve them. If GUIDs for names or specimens or whatever are to be used in other systems, then it is essential that the GUID can be associated with a resolving system.
I tend to agree -- which is why I preferred DOIs (and increasingly, LSIDs) to MAC IDs (which show up all over the place in all sorts of contexts). Even so, I think we'll find that all electronic exchanges involving the GUIDs of which we speak will take place within an evident context.
Maybe. Perhaps for individual records there is no need for a resolvable identifier to a single object, and using a MAC-type GUID there makes some sense. But if we go to the trouble of making GUIDs, why not make them useful as well?
Both the DOI and LSID approaches are structured and provide context. The DOI system uses the NISO Z39.84-2000 standard for categorization; the LSID uses the domain name system. Both provide the context essential for reusing an identifier outside its original context.
Yes, but I initially preferred DOIs to LSIDs because there tends to be less "context baggage" associated with them. My sense of DOIs is that each institution would not create its own DOI category; but rather there would be a single agreed-upon DOI category that is independent of any particular institution (with all the potential for political baggage an institution-specified context might afford).
Yes, that would be the way they are created - the DOI category would be assigned by the governing agency (probably DOI.org). Then the baggage, the unique part, would be up to the data providers or some other authority.
This was one of the first recommendations to GBIF - to provide a registry of institution codes for exactly this purpose. Having a tool that verified the uniqueness of records within a collection as exposed by its provider (either BioCASE or DiGIR) would help with this uniqueness problem. Now that the UDDI registry is available, we could in theory use the institution identifiers in there.
More power to you (and GBIF, and the future of DiGIR)! But in my view, it should still be seen only as a temporary solution, until we can get our acts together with more specific (and less information-contingent) ID systems.
Yeah, I think the UDDI registry can really be leveraged to help with this.
I strongly disagree that there should be a single GUID issuer or resolver.
I believe you are in the majority on this. But when I think it all through, I still feel that consolidation of GUID issuance will be more advantageous in the long term.
Nope. You'll have to try harder to convince me :-)
What we really need is an organization that operates kind of like a certificate authority: GBIF could act as the root from which other trusted GUID issuers may be created. In this way we can avoid the arbitrary creation of GUIDs yet still provide considerable flexibility and decentralization in the community.
If I read you correctly, I gather you are saying that the issuance of numbers would be distributed and isolated, but the issuers would fall under a centralized authority. I'm not sure I understand why this system is necessarily advantageous over a centralized issuer.
Because there's no single point of failure, it is more scalable, and in the (unlikely) event the centralized authority no longer exists, it would be a fairly trivial matter to delegate root authority to another trusted party.
It would be a relatively simple task to include an LSID resolver service along with a DiGIR provider. I prototyped such a system a while back, but other issues prevented deployment. With such an implementation, it would be trivial to assign unique identifiers to specimens - but first the problems institutions seem to have even providing unique identifiers within a collection must be resolved.
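To make the certificate-authority analogy a bit more concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the class name IssuerRegistry, its methods, and the example.org issuer names are made up for illustration, and a real system would rely on signed delegations rather than an in-memory table. It only shows the two properties claimed above: trust flows down from a root, and the root can be handed to another trusted party without invalidating existing issuers.

```python
class IssuerRegistry:
    """Hypothetical table of GUID issuers trusted via a chain to one root."""

    def __init__(self, root):
        self.root = root
        self.delegated_by = {}  # issuer -> the issuer that vouched for it

    def is_trusted(self, issuer):
        """Walk the delegation chain upward until the root is reached."""
        seen = set()
        while issuer != self.root:
            if issuer in seen or issuer not in self.delegated_by:
                return False  # broken chain or cycle: not trusted
            seen.add(issuer)
            issuer = self.delegated_by[issuer]
        return True

    def delegate(self, parent, child):
        """Let an already-trusted issuer vouch for a new one."""
        if not self.is_trusted(parent):
            raise ValueError(f"{parent} is not a trusted issuer")
        self.delegated_by[child] = parent

    def replace_root(self, new_root):
        """Hand root authority to another party (assumed already agreed on)."""
        old_root = self.root
        for child, parent in list(self.delegated_by.items()):
            if parent == old_root:
                self.delegated_by[child] = new_root
        self.delegated_by.pop(new_root, None)  # the new root needs no parent
        self.root = new_root


registry = IssuerRegistry("gbif.net")
registry.delegate("gbif.net", "museum-a.example.org")
registry.delegate("museum-a.example.org", "herbarium-b.example.org")
```

Calling registry.replace_root("museum-a.example.org") leaves herbarium-b.example.org trusted, which is the "fairly trivial" hand-over in miniature.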
AGREED!
As you've outlined in subsequent slides, I see two alternative paths: A) Get the biological world to rally around GBIF as the centralized provider of GUIDs for specimens for all collections; or B) Have each collection/institution issue its own set of LSIDs for its own specimens, and have GBIF adopt those LSIDs for its own internal purposes. I could get behind either approach, but I see danger in the adoption of a mixture of these two approaches. I'll defer elaboration, but a lot of it has to do with potential confusion about whether the GUID applies fundamentally to the physical specimen, or the electronic conglomeration of data associated with the specimen. Also, I think we should avoid the risk of assigning two separate GUIDs for the same "single data element" (sensu your Slide 5).
A mixture would still work, provided there was appropriate coordination between the efforts.
With the level of coordination required, you might as well go for the "brass ring" (in my opinion). But maybe what I see as the "brass ring" is seen as a dud by others.
Thus, when it comes to assigning GUIDs for names (not concepts), I would propose the following:
urn:lsid:ICZN.org:TaxonName:XXXXXX (all zoological names)
urn:lsid:ICBN.org:TaxonName:XXXXXX (all botanical names)
urn:lsid:ICNB[or LBSN??].org:TaxonName:XXXXXX (all bacteriological names)
urn:lsid:ICTV[or ICVCN??].org:TaxonName:XXXXXX (all virus names)
In an ideal world, we'd get to the point where there would be a need for only one registrar of nomenclature, e.g.: urn:lsid:BioCode.org:TaxonName:XXXXXXX
Or, perhaps: urn:lsid:gbif.net:TaxonName:XXXXXXX
It is quite likely that there will be multiple LSID generators and issuers. There is no real reason why this should be prevented, except to ensure that appropriate measures are taken to avoid duplication of GUIDs for the same object (taxonomic concept in this case).
Actually, I was talking about Taxonomic Names, specifically -- but if Names are considered as represented by a subset of Concepts (as I hope they will be), then it's the same GUID pool.
Not sure what you mean here. If Joe enters a citation someplace and Rich uses its LSID within a Taxonomic Object he entered, why does it have to be in the same pool? As long as the LSID resolves to the appropriate object, all would be good.
So a critical piece of infrastructure for a name service that was intending to assign GUIDs would be a mechanism for determining if the object they are about to assign the GUID to is not already present in the system, held at some other location. There needs to be something like a global "findThisObject(taxon_object)" that absolutely guarantees that the instance doesn't exist some other place. And if duplicates were to occur, then there must also be a mechanism for indicating equivalence between GUIDs, or perhaps a way of deleting the duplicate (how to decide which is the duplicate?).
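A rough sketch of what such a check might look like, in Python. The class NameRegistry and its methods are hypothetical, and so is the matching rule: it treats two entries as the same object when the normalized name, author and year agree exactly, which is a big simplification of real taxon matching and certainly not an absolute guarantee. The arbitrary answer it gives to "which is the duplicate?" is that the first GUID registered wins and later ones are recorded as equivalent to it.

```python
class NameRegistry:
    """Hypothetical shared index used to spot duplicates before issuing GUIDs."""

    def __init__(self):
        self._guid_by_key = {}  # normalized (name, author, year) -> first GUID
        self._same_as = {}      # duplicate GUID -> GUID it is equivalent to

    @staticmethod
    def _key(name, author, year):
        # Assumption: exact match after trivial normalization is "the same object".
        return (" ".join(name.lower().split()), author.lower().strip(), year)

    def find_this_object(self, name, author, year):
        """Return the GUID already assigned to this object, or None."""
        return self._guid_by_key.get(self._key(name, author, year))

    def register(self, guid, name, author, year):
        """Record a GUID; if the object already exists, record an equivalence."""
        existing = self.find_this_object(name, author, year)
        if existing is None:
            self._guid_by_key[self._key(name, author, year)] = guid
            return guid
        if existing != guid:
            self._same_as[guid] = existing  # keep the duplicate resolvable
        return existing

    def canonical(self, guid):
        """Follow equivalence links to the GUID that was registered first."""
        while guid in self._same_as:
            guid = self._same_as[guid]
        return guid


names = NameRegistry()
first = names.register("urn:lsid:example.org:TaxonName:1", "Aus bus", "Smith", 1900)
again = names.register("urn:lsid:example.net:TaxonName:77", "aus  BUS", " smith", 1900)
```

Here the second registration returns the first GUID, and canonical() maps the example.net identifier back to it rather than deleting it.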
I agree with all of this, but it seems that the infrastructure you describe would yield a higher total cost than the single GUID provider approach would.
Yeah, but it really concerns me having a single point of failure for such a critical system.
Forcing the use of a single DN such as BioCode.org for all names would seem to be a mistake, since that implies a single resolver service for all names, with obvious implications in case of failure. Perhaps there can be multiple resolver services with a single DN? That would probably work fine then.
Hmmm...I'm not sure I follow. If I interpret your word "resolver" correctly, then I see no reason why BioCode.org LSIDs could only be resolved by one server. Is that what the DomainName component of an LSID is specifically for? That is, "go to this domain to resolve the meaning of this LSID"? I thought the DomainName component was simply to make an LSID unique by identifying the issuer (analogous to the function of InstitutionCode in DwC). I see no reason why there couldn't be dozens, or hundreds, of mirrored caches of the complete dataset all over the world, maintained automatically in synchrony with the "master" set (which would presumably, but not necessarily, reside at BioCode.org). Any one of the mirrors could resolve any BioCode.org LSID. With such a system, resolving an LSID would require only that *any one* of potentially dozens of mirrored servers be functional.
If I understand you correctly, and an LSID is resolved only by the server at the Domain embedded within the LSID, then a dataset containing a heterogeneous assortment of LSIDs would need *all* of potentially dozens of distributed servers to be functional.
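The "any one" versus "all" distinction can be put in rough numbers. The sketch below assumes every server is up independently with the same probability p, which is an idealization (real outages are often correlated), and the function names are made up for illustration.

```python
def availability_any(p, n):
    """Chance that at least one of n mirrors is up (mirrored resolution)."""
    return 1 - (1 - p) ** n


def availability_all(p, n):
    """Chance that all n servers are up (each LSID tied to its own issuer)."""
    return p ** n


# Two dozen servers, each assumed to be up 99% of the time.
mirrored = availability_any(0.99, 24)
distributed = availability_all(0.99, 24)
```

Under these assumptions the mirrored case is effectively always available, while a dataset needing all two dozen issuer-bound servers at once gets a complete answer only about 79% of the time.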
How an LSID is resolved is described in detail in the document:
http://www.omg.org/cgi-bin/doc?lifesci/2003-12-02
Section 8.3 describes the use of DNS for resolution.
Basically, the LSID client:
1. Parses the LSID urn:lsid:DN:NS:ID[:Rev]
2. Using DNS, locates the SRV record for DN, which points to the service
3. Using DNS again, resolves the location of the service
4. Once it has the service endpoint, asks it for the object with NS:ID:Rev
That's a gross simplification, and it appears that the LSID definition now treats DNS resolution as one resolution mechanism, rather than the only one.
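The four steps above can be sketched in Python as follows. This is only an illustration of the flow, not the procedure from the OMG document: the names LSID, parse_lsid and resolve are hypothetical, and the DNS lookup and the service call are passed in as plain functions (find_service and fetch) so the sketch stays self-contained and makes no network calls.

```python
from typing import Callable, NamedTuple, Optional


class LSID(NamedTuple):
    authority: str                  # DN
    namespace: str                  # NS
    object_id: str                  # ID
    revision: Optional[str] = None  # Rev (optional)


def parse_lsid(text: str) -> LSID:
    """Step 1: split urn:lsid:DN:NS:ID[:Rev] into its parts."""
    parts = text.split(":")
    if len(parts) not in (5, 6) or [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
        raise ValueError(f"not an LSID: {text!r}")
    if any(part == "" for part in parts[2:]):
        raise ValueError(f"empty component in LSID: {text!r}")
    return LSID(*parts[2:])


def resolve(text: str,
            find_service: Callable[[str], str],
            fetch: Callable[[str, LSID], object]) -> object:
    """Steps 2-4, with the DNS lookup and the service call supplied by the caller."""
    lsid = parse_lsid(text)
    endpoint = find_service(lsid.authority)  # steps 2 and 3: DN -> service location
    return fetch(endpoint, lsid)             # step 4: ask the service for the object


# Stand-ins for the real lookups; the host name and record are invented.
def fake_find_service(authority: str) -> str:
    return f"lsid-service.{authority}"


def fake_fetch(endpoint: str, lsid: LSID) -> dict:
    return {"served_by": endpoint, "namespace": lsid.namespace, "id": lsid.object_id}


record = resolve("urn:lsid:gbif.net:TaxonName:12345", fake_find_service, fake_fetch)
```

Because find_service is pluggable, the same flow would also cover a resolver that picks one of several mirrors for a DN instead of a single host.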
The LSID service must be able to resolve the object. When the object moves somewhere else, there will need to be a mechanism for the LSID service to forward the resolution to the appropriate service. The really big problem is when an institution no longer exists. Take the hypothetical example of the Bishop Museum taking over all the Smithsonian fish collections: the Smithsonian LSID resolver would perhaps no longer exist, and so those LSIDs become meaningless.
In that case, I would vehemently oppose the use of LSIDs -- especially ones issued from multiple sources, which rely on the issuer existing into perpetuity. It seems MUCH more feasible to me that the GUIDs only be used within a prescribed context, than it would to require that all LSID issuers exist into perpetuity, and be functional at all times that someone needs to resolve the information associated with any particular ID value.
Embedding issuer context in a GUID makes sense to me. Restricting resolution of a GUID to the embedded issuer *only* seems like a very dangerous system to me.
Yeah, but once again - if the single issuer no longer exists, then everything is gone. That would be a real drag.
Part III to follow!
Dave V.
Dave Vieglais