I believe you are in the majority on this. But when I think it all through, I still feel that consolidation of GUID issuance will be more advantageous in the long term.
Nope. You'll have to try harder to convince me :-)
Wasn't trying to convince you....just signaling that you had yet to convince me! :-)
I can see the discussions will be lively in Christchurch... :-)
If I read you correctly, I gather you are saying that the issuance of numbers would be distributed and isolated, but the issuers would fall under a centralized authority. I'm not sure I understand why this system is necessarily advantageous over a centralized issuer.
Because there's no single point of failure, it is more scalable, and in the (unlikely) event the centralized authority no longer exists, it would be a fairly trivial matter to delegate root authority to another trusted party.
O.K., I can see that. But in the world I described, the only effect of a failure would be an inability to retrieve new IDs -- all existing IDs would still be resolvable at any of the dozens or hundreds of mirror sites. Moreover, I would assume that the server that issued the numbers would be designed to be as reliable as possible (think NYSE). In the event that the centralized authority no longer exists, it would be, as you say, a trivial matter to delegate number issuance to another "trusted" mirror site.
Also, what of my point that a dataset containing IDs from heterogeneous sources would allow *ALL* sources to be single points of failure? Is it not true that an LSID would require that the issuing domain be online in order to retrieve the data associated with the ID? So if my dataset included IDs from 15 issuing domains, all 15 would need to be active at the time I run my query in order to get a complete return of data? This seems like an even less robust system.
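To put a rough number on that worry, here is a back-of-the-envelope sketch. It assumes each issuing domain is up independently of the others, and the 99% availability figure is purely illustrative (it is not a measurement of any real LSID authority); the function name is hypothetical.

```python
def complete_return_probability(availability: float, domains: int) -> float:
    """Chance that every issuing domain is online at query time.

    Assumes the domains fail independently and share the same availability.
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if domains < 0:
        raise ValueError("domains must be non-negative")
    return availability ** domains


# Illustrative assumption: 15 issuing domains, each online 99% of the time.
p_all_online = complete_return_probability(0.99, 15)
print(f"Chance of a complete return: {p_all_online:.2f}")  # roughly 0.86
```

Under those assumptions, about one query in seven would come back incomplete, even though each individual domain looks quite reliable.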
It is quite likely that there will be multiple LSID generators and issuers. There is no real reason why this should be prevented, except to ensure that appropriate measures are taken to avoid duplication of GUIDs for the same object (taxonomic concept in this case).
Actually, I was talking about Taxonomic Names, specifically -- but if Names are considered as represented by a subset of Concepts (as I hope they will be), then it's the same GUID pool.
Not sure what you mean here. If Joe enters a citation someplace and Rich uses its LSID within a Taxonomic Object he entered, why does it have to be in the same pool? As long as the LSID resolved to the appropriate object, all would be good.
No....Reference IDs (I assume by "citation", you mean what I mean when I say "Reference"?) would be a different pool from Names.
What I meant was, in my world "Name" IDs would be drawn from the same pool of Name+Reference IDs that Concept IDs would be. In other words, there would be one pool of IDs for Name+Reference instances. A subset of these would be Concept-bearing Name+Reference instances. And a subset of the Concept-bearing instances (sub-subset of Name+Reference instances) would be Name-bearing instances. Those Name-bearing instances would be the "Name" component of other Name+Reference instances. It's a recursive relationship, via well-defined subtypes. But this is an entirely separate topic of discussion, having more to do with the question of "what is the essence of a taxonomic concept", and "what is the essence of a taxonomic name" -- that is tangential to the more general discussion at hand.
Yeah, but it really concerns me having a single point of failure for such a critical system.
As I described above, it doesn't have to be a single point of failure, and the only "failure" would be a delay in receiving new IDs. And there could be a defined "chain of command" such that if the primary server (ID issuer) goes down, the calls are automatically re-routed to the next mirror in the chain of command, which then automatically assumes authority for issuing numbers....and so on down the chain.
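A minimal sketch of that "chain of command" idea, to make it concrete. All of the class and issuer names are hypothetical, and the shared counter is only a stand-in: real mirrors would need some way to replicate issuance state so that a successor never reissues a number.

```python
from dataclasses import dataclass
from itertools import count
from typing import List, Tuple


@dataclass
class Issuer:
    name: str
    online: bool = True


class IssuerChain:
    """Ordered chain of command: the first reachable issuer hands out the ID."""

    def __init__(self, issuers: List[Issuer]) -> None:
        self.issuers = issuers
        # Stand-in for counter state that real mirrors would have to replicate.
        self._next_id = count(1)

    def issue(self) -> Tuple[str, int]:
        for issuer in self.issuers:
            if issuer.online:
                return issuer.name, next(self._next_id)
        raise RuntimeError("no issuer in the chain is reachable")


chain = IssuerChain([Issuer("primary"), Issuer("mirror-1"), Issuer("mirror-2")])
first = chain.issue()            # served by the primary
chain.issuers[0].online = False  # primary goes down
second = chain.issue()           # next mirror in the chain assumes authority
```

The hard part in practice is the piece this sketch waves away, namely keeping every mirror's view of the last issued number in sync.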
I *think* I understand the fundamental differences in our perspective. I see the bioinformatics world operating more smoothly within a very specific, well-understood and universally agreed-upon context. You seem to prefer a completely generic, self-describing system, that is not necessarily restricted in scope to biology (correct?) I understand your perspective, and I certainly see its appeal -- I just think that the case-specific implementation is even more appealing (to me, anyway).
How an LSID is resolved is described in detail in the document:
http://www.omg.org/cgi-bin/doc?lifesci/2003-12-02
Section 8.3 describes the use of DNS for resolution.
Basically, the LSID client:
- Parses the LSID urn:lsid:DN:NS:ID[:Rev]
- Using DNS, locates the SRV record for DN, which points to the service
- Using DNS again, resolves the location of the service
- Once it has the service endpoint, asks it for the object with NS:ID:Rev
That's a gross simplification, and it appears that the LSID definition now treats DNS resolution as one resolution mechanism, rather than the only one.
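The parsing step in the list above can be sketched roughly as follows. This only splits the urn:lsid:DN:NS:ID[:Rev] form and builds the name a client might query for the SRV record; it performs no DNS lookup. The `_lsid._tcp.` prefix is my reading of the usual SRV naming convention and should be checked against Section 8.3, and the example LSID is hypothetical.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str           # DN
    namespace: str           # NS
    object_id: str           # ID
    revision: Optional[str]  # Rev, if present


def parse_lsid(text: str) -> LSID:
    """Split urn:lsid:DN:NS:ID[:Rev] into its components."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    if not all(parts[2:]):
        raise ValueError(f"empty component in LSID: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return LSID(parts[2], parts[3], parts[4], revision)


def srv_query_name(lsid: LSID) -> str:
    """DNS name to query for the SRV record (prefix is an assumption)."""
    return f"_lsid._tcp.{lsid.authority}"


# Hypothetical LSID, used only for illustration.
example = parse_lsid("urn:lsid:gbif.org:Specimen:9876543")
print(example.authority, example.namespace, example.object_id)
print(srv_query_name(example))
```

Everything after this point (the actual SRV lookup and the call to the service endpoint) is where the issuing domain has to be online.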
O.K., this clears up a lot in my mind. But what intrigues me now is: What are the other (potential?) resolution mechanisms?
Obviously, my grasp of LSIDs was fundamentally flawed. As much as the "single-point failure" problem concerns you, the "must have all DNs within a dataset containing multiple LSIDs active & online" concerns me. In a sense, it means there are many single-point failures that can impede data flow.
Embedding issuer context in a GUID makes sense to me. Restricting resolution of a GUID to the embedded issuer *only* seems like a very dangerous system to me.
Yeah, but once again - if the single issuer no longer exists, then everything is gone. That would be a real drag.
The specimens, presumably, wouldn't be gone -- they would simply move to a new location. And what happens when one specific specimen is transferred from Museum "A" to Museum "B"? Does it keep the original LSID (meaning that the original Museum must continue to maintain and update data for a specimen it no longer owns)? Or is it issued a new LSID by the receiving institution, in which case the specimen now has more than one ID? What happens to a legacy dataset that still has a pointer to the old LSID? Yes, I can see work-arounds to all of these (mechanisms for auto-forwarding, maintenance of indexes of duplicate LSIDs, etc.). But when you add up all of that sort of baggage, the redundant-mirrored-central-ID-issuer-with-defined-chain-of-command-cascade system seems easier to manage, more flexible, and more reliable.
Well...that's partly why I emphasized that I think GUIDs should be for computer-computer data exchange only. But even if printed for a pair of human eyes to read, surely there would be *some* stated context. E.g., "ITIS TSN 1234567"; "BPBM 123456"; "GBIF Specimen ID 9876543"; "ICZN NameID 92AB5B37-70E9-4f05-9E97-CBABD08513ED"; etc....
So formalize that a little and you might have something more consistently machine parsable like: ITIS.ORG:TSN:1234567; BPBM.EDU:something:123456; GBIF.ORG:Specimen:9876543, ...
Add in the system identifier for resolution (urn:lsid:...) and you have LSIDs. The result is a far more consistent, legible and widely useful mechanism for referencing objects. Allowing an author to arbitrarily provide the context for identifiers gets us little further along.
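As a small illustration of that last step, the formalized authority:namespace:id strings only need the urn:lsid: prefix to take the LSID shape. The function name below is hypothetical, and the inputs are just the examples from this thread rather than identifiers any of those organizations has actually issued.

```python
def contextual_to_lsid(identifier: str) -> str:
    """Turn 'AUTHORITY:NAMESPACE:ID' into 'urn:lsid:AUTHORITY:NAMESPACE:ID'."""
    parts = identifier.strip().split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected AUTHORITY:NAMESPACE:ID, got {identifier!r}")
    return "urn:lsid:" + ":".join(parts)


# Examples taken from the discussion; they are illustrative only.
for raw in ("ITIS.ORG:TSN:1234567", "GBIF.ORG:Specimen:9876543"):
    print(contextual_to_lsid(raw))
```

The string manipulation is trivial; the real commitment is that the authority part then names who is expected to resolve the identifier.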
Yes, but the difference would be that in my world, any one of many mirrored sites could resolve GBIF.ORG:Specimen:9876543; whereas the LSID protocol you described above requires the issuer to resolve it.
How hard would it be in such cases to include within the Methods section of the document, something to the effect of "All taxon IDs listed in this paper refer to GBIF Specimen IDs, which can be resolved at gbif.net". If the problem is one involving a pair of human eyes reading a number, then the problem can be solved in the context of a pair of human eyes reading the context.
Sure, but do that consistently, by all authors? And do it in a way that is without ambiguity? Machine parsable (for electronic publication)? Easily reusable in other documents?
If it's human-human data exchange, it doesn't need to be consistent by all authors. ICZN requires new species-group names to be represented by a holotype specimen. There are no rules about how an author indicates the Holotype specimen -- only that it is done so more or less unambiguously. Perhaps ICZN rules should be strengthened -- but the point is, human-human communication of this sort works fairly well even without rigid rules for consistency, and even tolerates a fair amount of ambiguity.
Machine parsable is another issue. I see documents, as you describe them, as a medium of human->human (or computer->human) information exchange. If the data exist electronically already, why pass them from machine to machine via "dumbed-down" human-readable documents that need to be subsequently re-interpreted by a machine? Better to start teaching Kindergarteners to read XML as though it were prose! :-)
It's getting late, but I'll try to send off a "Part II" before I call it a night.
Aloha, Rich