Re: Globally Unique Identifier
I have to disagree - kind of. A non-information-bearing GUID such as one generated by a MAC, e.g.,
{92AB5B37-70E9-4f05-9E97-CBABD08513ED}
is completely useless unless it only appears within the context of a system that provides more information about what it actually is.
Yes, that would be an assumption. But not an unreasonable one. I'm trying to imagine a scenario where I am presented with a series of MAC IDs where I don't inherently understand the context. I suppose if I came in to work and found such a number scribbled on a piece of paper, with no other information, I'd be in a fix to figure out what the number refers to. But obviously that's not a realistic scenario. I suspect that such IDs would be used by computers (not humans), and would only be exchanged among computers in some sort of semantic context; e.g., within the context of a DwC2 XML file, nestled between appropriate tags:
<GlobalUniqueIdentifier>92AB5B37-70E9-4f05-9E97-CBABD08513ED</GlobalUniqueIdentifier>
...these themselves nestled within further context tags.
That's the point of the LSID or DOI: they provide GUIDs that identify which system can be used to resolve them. If GUIDs for names or specimens or whatever are to be used in other systems, then it is essential that the GUID can be associated with a resolving system.
I tend to agree -- which is why I preferred DOIs (and increasingly, LSIDs) to MAC IDs (which show up all over the place in all sorts of contexts). Even so, I think we'll find that all electronic exchanges involving the GUIDs of which we speak will take place within an evident context.
Both the DOI and LSID approaches are structured and provide context. The DOI system uses the NISO Z39.84-2000 standard for categorization; the LSID uses the domain name system. Both provide a context essential for reuse of an identifier outside its original context.
Yes, but I initially preferred DOIs to LSIDs because there tends to be less "context baggage" associated with them. My sense of DOIs is that each institution would not create its own DOI category; but rather there would be a single agreed-upon DOI category that is independent of any particular institution (with all the potential for political baggage an institution-specified context might afford).
This was one of the first recommendations to GBIF - to provide a registry of institution codes for exactly this purpose. Having a tool that verified the uniqueness of records within a collection as exposed by its provider (either BioCASE or DiGIR) would help with this uniqueness problem. Now that the UDDI registry is available, we could in theory use the institution identifiers in there.
More power to you (and GBIF, and the future of DiGIR)! But in my view, it should still be seen only as a temporary solution, until we can get our acts together with more specific (and less information-contingent) ID systems.
I strongly disagree that there should be a single GUID issuer or resolver.
I believe you are in the majority on this. But when I think it all through, I still feel that consolidation of GUID issuance will be more advantageous in the long term.
What we really need is an organization that operates kind of like a certificate authority: GBIF could act as the root from which other trusted GUID issuers may be created. In this way we can avoid the arbitrary creation of GUIDs yet still provide considerable flexibility and de-centralization in the community.
If I read you correctly, I gather you are saying that the issuance of numbers would be distributed and isolated, but the issuers would fall under a centralized authority. I'm not sure I understand why this system is necessarily advantageous over a centralized issuer.
It would be a relatively simple task to include an LSID resolver service along with a DiGIR provider. I prototyped such a system a while back, but other issues prevented deployment. With such an implementation, it would be trivial to assign unique identifiers to specimens - but first the problems institutions seem to have even providing unique identifiers within a collection must be resolved.
AGREED!
As you've outlined in subsequent slides, I see two alternative paths: A) Get the biological world to rally around GBIF as the centralized provider of GUIDs for specimens for all collections; or B) Have each collection/institution issue its own set of LSIDs for its own specimens, and have GBIF adopt those LSIDs for its own internal purposes. I could get behind either approach, but I see danger in the adoption of a mixture of these two approaches. I'll defer elaboration, but a lot of it has to do with potential confusion about whether the GUID applies fundamentally to the physical specimen, or the electronic conglomeration of data associated with the specimen. Also, I think we should avoid the risk of assigning two separate GUIDs for the same "single data element" (sensu your Slide 5).
A mixture would still work, provided there was appropriate coordination between the efforts.
With the level of coordination required, you might as well go for the "brass ring" (in my opinion). But maybe what I see as the "brass ring" is seen as a dud to others.
Thus, when it comes to assigning GUIDs for names (not concepts), I would propose the following:
urn:lsid:ICZN.org:TaxonName:XXXXXX (all zoological names)
urn:lsid:ICBN.org:TaxonName:XXXXXX (all botanical names)
urn:lsid:ICNB[or LBSN??].org:TaxonName:XXXXXX (all bacteriological names)
urn:lsid:ICTV[or ICVCN??].org:TaxonName:XXXXXX (all virus names)
In an ideal world, we'd get to the point where there would be a need for only one registrar of nomenclature, e.g.: urn:lsid:BioCode.org:TaxonName:XXXXXXX
Or, perhaps: urn:lsid:gbif.net:TaxonName:XXXXXXX
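To make the parts of such an identifier concrete, here is a minimal Python sketch that splits an LSID into its components (urn:lsid:authority:namespace:object, with an optional trailing revision). The names `Lsid` and `parse_lsid`, and the object number 123456 standing in for XXXXXX, are illustrative assumptions, not part of any existing library or registry.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str            # e.g. the issuer's domain name
    namespace: str            # e.g. "TaxonName" or "Specimen"
    object_id: str            # the identifier within that namespace
    revision: Optional[str] = None


def parse_lsid(text: str) -> Lsid:
    """Split 'urn:lsid:authority:namespace:object[:revision]' into parts."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    if any(part == "" for part in parts[2:5]):
        raise ValueError(f"empty LSID component in {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return Lsid(parts[2], parts[3], parts[4], revision)


# Hypothetical identifier; 123456 is a placeholder object number.
example = parse_lsid("urn:lsid:ICZN.org:TaxonName:123456")
print(example.authority, example.namespace, example.object_id)
```

Whether the authority component also dictates where the identifier must be resolved is a separate question from how it is parsed; the sketch only shows the structure.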
It is quite likely that there will be multiple LSID generators and issuers. There is no real reason why this should be prevented, except to ensure that appropriate measures are taken to avoid duplication of GUIDs for the same object (taxonomic concept in this case).
Actually, I was talking about Taxonomic Names, specifically -- but if Names are considered as represented by a subset of Concepts (as I hope they will be), then it's the same GUID pool.
So a critical piece of infrastructure for a name service that was intending to assign GUIDs would be a mechanism for determining if the object they are about to assign the GUID to is not already present in the system, held at some other location. There needs to be something like a global "findThisObject(taxon_object)" that absolutely guarantees that the instance doesn't exist some other place. And if duplicates were to occur, then there must also be a mechanism for indicating equivalence between GUIDs, or perhaps a way of deleting the duplicate (how to decide which is the duplicate?).
I agree with all of this, but it seems that the infrastructure you describe would yield a higher total cost than the single GUID provider approach would.
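As a rough illustration of the lookup-and-equivalence infrastructure being discussed, the Python sketch below keeps one canonical GUID per name and records any later duplicate as equivalent to it. Everything here is an assumption made for illustration: the `NameRegistry` class, matching on a normalized (name, author, year) key, and the example.org/example.net identifiers. A real service would need far more careful matching, and a single in-memory table cannot "absolutely guarantee" anything across distributed issuers.

```python
class NameRegistry:
    """Toy registry: one canonical GUID per normalized name key."""

    def __init__(self):
        self._by_key = {}    # normalized key -> canonical GUID
        self._same_as = {}   # duplicate GUID -> canonical GUID

    @staticmethod
    def _key(name, author, year):
        # Assumed matching rule: case- and whitespace-insensitive.
        return (" ".join(name.lower().split()), author.strip().lower(), int(year))

    def find_this_object(self, name, author, year):
        """Return the GUID already assigned to this name, or None."""
        return self._by_key.get(self._key(name, author, year))

    def register(self, guid, name, author, year):
        """Register a GUID; if the name exists, record an equivalence."""
        key = self._key(name, author, year)
        existing = self._by_key.get(key)
        if existing is None:
            self._by_key[key] = guid
            return guid
        if existing != guid:
            self._same_as[guid] = existing   # first registration wins
        return existing

    def canonical(self, guid):
        """Follow equivalence links to the canonical GUID."""
        while guid in self._same_as:
            guid = self._same_as[guid]
        return guid


registry = NameRegistry()
first = registry.register("urn:lsid:example.org:TaxonName:1", "Aus bus", "Smith", 1900)
second = registry.register("urn:lsid:example.net:TaxonName:9", " aus  BUS ", "smith", 1900)
print(first == second, registry.canonical("urn:lsid:example.net:TaxonName:9"))
```

The "first registration wins" rule is only one possible answer to the question of which GUID counts as the duplicate.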
Forcing the use of a single DN such as BioCode.org for all names would seem to be a mistake, since that implies a single resolver service for all names, with obvious implications in case of failure. Perhaps there can be multiple resolver services with a single DN? That would probably work fine then.
Hmmm...I'm not sure I follow. If I interpret your word "resolver" correctly, then I see no reason why BioCode.org LSIDs could only be resolved by one server. Is that what the DomainName component of an LSID is specifically for? That is, "go to this domain to resolve the meaning of this LSID"? I thought the DomainName component was simply to give uniqueness to an LSID in the form of representing the issuer (analogous to the function of InstitutionCode in DwC). I see no reason why there couldn't be dozens, or hundreds of mirrored caches of the complete dataset all over the world, maintained automatically in synchrony with the "master" set (which would presumably, but not necessarily, reside at BioCode.org). Any one of the mirrors could resolve any BioCode.org LSID. With such a system, resolving an LSID would require only that *any one* of potentially dozens of mirrored servers be functional.
If I understand you correctly, and an LSID is resolved only by the server at the Domain embedded within the LSID, then a dataset containing a heterogeneous assortment of LSIDs would need *all* of potentially dozens of distributed servers to be functional.
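The difference between "any one of" and "all of" can be put in rough numbers. The Python sketch below assumes, purely for illustration, that each server is independently reachable with the same probability p; real outages are neither independent nor uniform, and the figures p = 0.99 and n = 12 are made up.

```python
def any_one_up(p: float, n: int) -> float:
    """Chance that at least one of n mirrors is reachable."""
    return 1 - (1 - p) ** n


def all_up(p: float, n: int) -> float:
    """Chance that every one of n distinct issuer servers is reachable."""
    return p ** n


p, n = 0.99, 12   # assumed per-server availability and server count
print(f"mirrored, any one of {n}: {any_one_up(p, n):.6f}")
print(f"distributed, all of {n}:  {all_up(p, n):.6f}")
```

Under these assumptions the mirrored arrangement is reachable essentially all the time, while a dataset that needs all twelve independent resolvers at once succeeds only about 89% of the time.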
The LSID service must be able to resolve the object. When the object moves some other place, then there will need to be a mechanism for the LSID service to forward the resolution to the appropriate service. The really big problem is when an institution no longer exists - so the hypothetical example of Bishop museum consuming all the Smithsonian fish collections - the Smithsonian LSID resolver would perhaps no longer exist, and so those LSIDs become meaningless.
In that case, I would vehemently oppose the use of LSIDs -- especially ones issued from multiple sources, which rely on the issuer existing into perpetuity. It seems MUCH more feasible to me that the GUIDs only be used within a prescribed context, than it would to require that all LSID issuers exist into perpetuity, and be functional at all times that someone needs to resolve the information associated with any particular ID value.
Embedding issuer context in a GUID makes sense to me. Restricting resolution of GUID to the embedded issuer *only*, seems like a very dangerous system to me.
Perhaps there's a delegation mechanism that can be used? So when a DN can't be resolved, the system backs down to a default DN, such as gbif.org that would then indicate that smithsonian.org is now bishop.org?
But it's not that simple, is it? If there is an LSID:
urn:lsid:bishopmuseum.org:Specimen:1234567
and another LSID, to a completely different specimen:
urn:lsid:smithsonian.gov:Specimen:1234567
...then simply re-directing all bishopmuseum.org requests to Smithsonian wouldn't work....would it? Or would Smithsonian recognize the domain and deal with it accordingly?
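One way to picture the delegation idea is a small table, kept by some default authority, that maps a defunct issuer domain to its successor. The Python sketch below is an assumption-laden illustration, not a description of how LSID resolution actually works: the `successors` table, the `resolve` function, and the in-memory `stores` are all hypothetical. It follows the earlier hypothetical in which the Smithsonian records end up at Bishop Museum, and it only avoids the collision described above if the successor stores records under the full LSID rather than under the bare object number.

```python
def resolving_authority(lsid, live_authorities, successors):
    """Follow successor links until a live authority is found."""
    authority = lsid.split(":")[2]
    seen = set()
    while authority not in live_authorities:
        if authority in seen or authority not in successors:
            raise LookupError(f"no live resolver for {lsid}")
        seen.add(authority)
        authority = successors[authority]
    return authority


def resolve(lsid, stores, successors):
    """Look the record up at whichever authority now answers for it."""
    host = resolving_authority(lsid, stores.keys(), successors)
    return stores[host][lsid]   # keyed by the FULL LSID, not just 1234567


# Hypothetical state after smithsonian.gov stops resolving.
successors = {"smithsonian.gov": "bishopmuseum.org"}
stores = {
    "bishopmuseum.org": {
        "urn:lsid:bishopmuseum.org:Specimen:1234567": "Bishop specimen record",
        "urn:lsid:smithsonian.gov:Specimen:1234567": "former Smithsonian specimen record",
    }
}

print(resolve("urn:lsid:smithsonian.gov:Specimen:1234567", stores, successors))
print(resolve("urn:lsid:bishopmuseum.org:Specimen:1234567", stores, successors))
```

The sketch still depends on the default authority itself staying available, which is the same perpetuity problem moved up one level.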
It seems to me that a lot of complexity would disappear if we could all get behind a single issuer of GUIDs, and mirror the capability to resolve those GUIDs on dozens or hundreds of servers around the world, and only use the GUIDs in a semantic context that is self-evident.
Re-reading something I wrote:
I would go further to suggest (as I did above) that "Name" GUIDs should also be a subtype of Name-Reference instances (non-exclusive of Concept subtype instances), using the Name-Reference instance that represents the Code-recognized original description of the name as the "handle" to the Name.
Actually, it's probably safe to say that all "name-bearing" Name+Reference instances (i.e., original descriptions) are, virtually by definition, also "concept-bearing" Name+Reference instances. So, not only would name-bearing and concept-bearing Name+Reference instances be non-exclusive of each other, it would probably be safe to think of name-bearing instances as a subset (Subtype) of Concept-bearing instances, which themselves are a subset (Subtype) of all Name+Reference instances.
My own answers to your questions:
Are LSIDs the most appropriate technology?
I'm increasingly coming to that conclusion.
I agree. The LSID system is easy to implement, stable, scalable and does everything we need. The DOI system is good as well, but the fee scheme bothers me (though I understand there are ways around that).
My understanding is that it would be easy to develop a DOI-like system that is not part of the fee-based DOI system, and I still find it appealing because it could be as simple as an integer ID and a very basic context tag.
As for LSIDs -- If I understand correctly that the purpose of the <DomainName> portion of the LSID is to point to the one (and only?) server that can resolve the ID, then all of a sudden I don't like them at all. If it's true that the embedded Domain portion of an LSID *requires* that the domain exist for as long as the GUID exists in order for the GUID to be useful, then I definitely have reservations. If, on the other hand, the Domain portion can be seen as representing the issuer (somewhat analogous to the function of "InstitutionCode" in DwC), and could be resolved by any server set up to deal with the <namespace> part of the LSID, then I'm much less concerned.
I think the best option would be central. The next option would be fully distributed. Leaving it as an option would, in my opinion, be a BIG mistake.
I disagree: the assignment of identifiers should be by the curators of the data. However, I do strongly consider that there should be some sort of trust scheme in place, where identifiers are issued only by entities trusted by the rest of the system. A scheme similar to that used by certificate authorities and delegates should be adequate.
Maybe I'm misunderstanding the use of the word "issuers", but in my mind, the issuer's job is only to provide a guaranteed-unique set of IDs. It would not, necessarily, be the location where the ID is applied to its associated data.
In Donald's PowerPoint file, he made reference to "mechanisms for data providers to request and use blocks of LSIDs from central service". Here's how I imagine a system would work:
GBIF (or some other central entity) establishes a service that can generate unique <objectID> numbers within its own LSID context. The same service also maintains a complete set of data associated with each <objectID>. Major (and minor) institutions (essentially your set of "Trusted" entities) would establish mirrored copies of the complete set of all data (or, perhaps, only a filtered subset of the complete data), but would not be able to issue new GUIDs directly. However, the mirrored sites could serve as real-time "pass-through" to the central site so as to be functionally able to provide new GUIDs in real time, by retrieving them directly from the central server. Also, the mirrored sites would all maintain synchrony of their copies of the data with the central "master" copy, on a realistic time frame (e.g., every 24 hours, or on-demand if a data provider chose to initiate a synchronization command).
If a curator of a local institution's data needed to assign a new batch of numbers for a new set of specimens, the curator would issue a request to the central server (or via one of the mirrored sites as a pass-through request) for a block of N numbers. The central server would never re-issue those same numbers again to anyone else. But those numbers remain "empty" until the curator assigns them to data, and uploads that data either to the central server or to one of the mirrors. In other words, even though the numbers are "issued" by a central server, they are applied to real data only by local curators.
A big issue, of course, is control over editing of data associated with a given GUID. In the case of specimens, the central server and mirrored sites could (perhaps at the discretion of the data curator who initially requested the number) restrict subsequent editing of those data to a defined set of password-protected user accounts. In the case of more public data, such as taxon names and publications, the control of data editing would be less restrictive (e.g., either fully accessible by the public, or accessible to anyone who goes to the trouble to register themselves as a taxonomist with the central server or with any of the mirrored sites).
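The block-request scheme just described can be sketched in a few lines of Python. This is a single-process illustration under stated assumptions: the `CentralIssuer` class, its method names, and the example.org authority are hypothetical, and mirrors, synchronization, and persistence are left out entirely.

```python
import threading


class CentralIssuer:
    """Toy central service that hands out never-reused blocks of numbers."""

    def __init__(self, authority="example.org", namespace="Specimen"):
        self.authority = authority
        self.namespace = namespace
        self._next = 1
        self._lock = threading.Lock()
        self.issued_to = {}   # number -> requester it was issued to
        self.data = {}        # number -> data applied by the curator

    def request_block(self, requester, n):
        """Issue n consecutive numbers that will never be issued again."""
        if n <= 0:
            raise ValueError("block size must be positive")
        with self._lock:
            block = range(self._next, self._next + n)
            self._next += n
            for number in block:
                self.issued_to[number] = requester
        return block

    def assign(self, requester, number, record):
        """Apply data to an issued number; only its requester may do so."""
        if self.issued_to.get(number) != requester:
            raise PermissionError(f"{number} was not issued to {requester}")
        self.data[number] = record

    def lsid(self, number):
        return f"urn:lsid:{self.authority}:{self.namespace}:{number}"


issuer = CentralIssuer()
block_a = issuer.request_block("curator_a", 3)
block_b = issuer.request_block("curator_b", 2)
issuer.assign("curator_a", block_a[0], {"CatalogNumber": "12345"})
print(list(block_a), list(block_b), issuer.lsid(block_a[0]))
```

Numbers in a block stay "empty" (absent from `data`) until the curator who requested them applies them to a record.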
Maybe this approach would not be practical for specimen data -- but I think it would be the optimal approach for taxon data. Perhaps those two fundamentally different kinds of data (owned, vs. public domain) need fundamentally different approaches to GUID issuance and assignment?
Which objects should receive identifiers?
Specimens, References, Name-Reference intersections (Assertions), and perhaps Agents. [TaxonNames and Concepts can be subsets of Name-Reference intersections].
Any object. It doesn't matter what it is, just that it can be resolved, and when you find it, you can figure out what it is. Sensible use of the NameSpace portion of the LSID will help a lot with this. A trusted organization should issue the NameSpace portion to avoid NS conflicts.
I'd have to think this through some more. Leaving it too open might lead to a plethora of (potentially overlapping, but not quite equivalent) NameSpaces, which seems like it could turn into a real mess, really quickly. Centralized ID systems such as social security numbers in the U.S., telephone numbers, etc. definitely have some advantage over totally open systems. I suppose that the pool of NS's would be self-cleaning simply by use or non-use....but I still wonder how much better this approach would be over the status quo.
3a) Should we develop a set of object classes for biodiversity informatics and assign identifiers to instances of all of these?
I think so, yes. Of course, it depends a bit on who you mean by "we". I'm thinking sensu lato.
Sure, and these could be a core from which others can be built. But we should absolutely not restrict the capability of the "system" to accept new classes - even classes that represent the same information in a different way that may be appropriate to a group of users.
Again, I'll have to think about this some more. I certainly don't think that the "system" should be incapable of dealing with new classes -- sort of like how anyone can develop their own Federation Schema and use DiGIR to establish specific information networks. But I'd hate to see a breakdown in the global transmission of biodiversity information simply because different subgroups establish their own special-needs, non-mutually-compatible classes for dealing with essentially the same kinds of information (especially if they do not also conform to a generalized international standard).
What should be done about existing records without identifiers?
As far as I know, ALL records are currently without identifiers (unless someone established a widely accepted GUID system and I missed the announcement...)
All records currently have some sort of identifier; the problem is that their uniqueness is not rigorously enforced or even evaluated, so their usefulness is probably limited.
O.K., in that case I misunderstood the meaning of "identifiers". All historical identifiers (e.g., catalog numbers for specimens) should be maintained, preserved, and cross-referenced to GUIDs just like any other metadata about the physical object. I think of catalog numbers not so much as unique identifiers, but as "labels" -- not altogether unlike taxonomic names. In the databases I manage, I do not use catalog numbers as identifiers -- the computer generates the UID, which is never seen, read, written, or typed by a human. That's how I'd like to see the sorts of GUIDs we're discussing be implemented -- i.e., for the benefit of computer-computer data exchange; not human-human data exchange or human-computer/computer-human data exchange.
4c) Should the provider software be modified to generate "soft" identifiers (ones which we cannot guarantee in all cases to be unique) based e.g. on the combination of InstitutionCode, CollectionCode and CatalogNumber?
As an interim solution, perhaps. See my comments under "Slide 2" above.
Yes, but not soft. The providers should assign their own identifiers, but there must be a mechanism to ensure that identifiers are being properly assigned.
Agreed -- but I still think of these as "soft" identifiers, because CatalogNumber values can change over time, in certain circumstances. GUIDs should *never* need to be changed (even if the institution that issued them vanishes without a trace from the face of the Earth).
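The distinction between a "soft" identifier and an opaque GUID can be shown with a small Python sketch. The field values (XYZ, Fish, 12345) and the colon-joined format are made up for illustration, and a random UUID is used only as a stand-in for whatever GUID scheme is eventually adopted.

```python
import uuid


def soft_identifier(record):
    """Derive an identifier from the record's own (changeable) fields."""
    return ":".join(
        [record["InstitutionCode"], record["CollectionCode"], record["CatalogNumber"]]
    )


# Hypothetical specimen record; the GUID is assigned once and never derived.
record = {
    "guid": str(uuid.uuid4()),
    "InstitutionCode": "XYZ",
    "CollectionCode": "Fish",
    "CatalogNumber": "12345",
}

soft_before, guid_before = soft_identifier(record), record["guid"]

# The specimen is re-cataloged: the soft identifier changes with it.
record["CatalogNumber"] = "12345a"
soft_after, guid_after = soft_identifier(record), record["guid"]

print(soft_before, "->", soft_after)
print("GUID unchanged:", guid_before == guid_after)
```

Because the soft identifier is computed from metadata, any correction to that metadata silently produces a different identifier, while the opaque GUID is unaffected.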
Revision information is very helpful in dealing with errors such as keystroke errors or other such details that do not change the object.
I agree the revision information *can* be helpful in dealing with errors; but I don't see that function as being integral to the assignment of GUID values.
Not many. It seems most collections don't record any history in their record edits, so without a major alteration in the way the data are stored, it will be a significant undertaking to provide useful revision information.
For what it's worth, the databases I have developed for my institution are designed to log every change made to every field (except performance-enhancing, purely derivative fields), including what the previous value was, who made the change, and when the change was made. When records are deleted, a "snapshot" of the value of every non-null field is logged, including the time the record was deleted, and by whom. The reason I say all of this is to underscore that my stance on not including versioning IDs as part of a GUID system is NOT from lack of appreciation for the value of preserving edit histories (something I clearly value very, very much -- given that the total diskspace occupied by my edit logs exceeds the total diskspace occupied by the "real" data!)
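A minimal Python sketch of that kind of field-level logging follows. It is not the actual database design, only an in-memory illustration; the `LoggedTable` class, its method names, and the sample values are assumptions.

```python
from datetime import datetime, timezone


class LoggedTable:
    """Toy table that logs every field change and every deletion."""

    def __init__(self):
        self.rows = {}   # row id -> {field: value}
        self.log = []    # append-only edit history

    def _stamp(self, user, action, row_id, **details):
        entry = {"when": datetime.now(timezone.utc), "who": user,
                 "action": action, "row": row_id}
        entry.update(details)
        self.log.append(entry)

    def set_field(self, user, row_id, field, new_value):
        """Change one field, recording the previous value, who, and when."""
        row = self.rows.setdefault(row_id, {})
        previous = row.get(field)
        if previous == new_value:
            return   # nothing changed, nothing to log
        row[field] = new_value
        self._stamp(user, "update", row_id, field=field,
                    previous=previous, new=new_value)

    def delete_row(self, user, row_id):
        """Delete a row, logging a snapshot of its non-null fields."""
        row = self.rows.pop(row_id)
        snapshot = {k: v for k, v in row.items() if v is not None}
        self._stamp(user, "delete", row_id, snapshot=snapshot)


table = LoggedTable()
table.set_field("rich", 1, "CatalogNumber", "12345")
table.set_field("rich", 1, "CatalogNumber", "12345a")
table.set_field("rich", 1, "Notes", None)
table.delete_row("rich", 1)
for entry in table.log:
    print(entry["action"], entry.get("field"), entry.get("previous"))
```

Note that the log is keyed by the row's own identifier and sits beside the data, so no version component needs to be embedded in the identifier itself.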
In closing, I apologize to those who find my overly-long posts on this topic to be an annoyance. I also am starting to wonder: is this the appropriate email forum to have this discussion?
Aloha, Rich
participants (1)
-
Richard Pyle