Re: Globally Unique Identifier - Part III
It seems to me that a lot of complexity would disappear if we could all get behind a single issuer of GUIDs, and mirror the capability to resolve those GUIDs on dozens or hundreds of servers around the world, and only use the GUIDs in a semantic context that is self-evident.
Perhaps. But what if an institution wants to provide IDs for more than just specimen or name objects? Should we always rely on a single authority to provide a mechanism for doing that? I don't think that would go very far.
No -- we come together as a community to agree upon mechanisms for GUID assignment and exchange among well-defined classes that we routinely wish to share and aggregate information about (Specimen/Observations, Names/Concepts/Assertions, References, Agents, and maybe a couple of others). If Bishop Museum wanted to provide IDs for, say, Baseball cards, or Star Wars memorabilia, or some other object type, then it would coordinate with other holders of data relating to baseball cards [or perhaps trading cards in general] or Star Wars memorabilia [or perhaps movie memorabilia in general], and they would decide amongst themselves whether DOIs, or LSIDs, or UUIDs, or MACs, or whatever makes the most sense for their particular needs.
If Bishop Museum wanted to provide IDs for something like condition reports of specimens over time, or changing dynamics of wild populations, or DNA sequences, or Images, or any number of other objects that are relevant to bioinformatics, then they would propose a new class of objects to TDWG, along with an appropriate schema and recommended standards for implementation, and other institutions with like data would discuss the optimal approach to dealing with those sorts of data. The "community" would debate the various options in the context of integration with existing data exchange protocols, etc., and a standard would emerge.
Don't get me wrong... the letter "G" is my favorite letter in DiGIR. And the more I think about it, the more I understand why you want the IDs to have embedded context that can be resolved automatically, without any custom tweaking to accommodate new classes of IDs. But the costs (or at least my perception of the costs) do tend to frighten me a bit. It makes a lot more sense for specimen data. If Bishop Museum's server is down -- tough luck, you can't have our data. But I would *HATE* to rely on any one particular server to get public-domain information like taxon names whenever I needed it. Yes, the single-point failure problem is real. But there are technological solutions to minimize those sorts of concerns (on an intermittent basis) down to something indistinguishable from zero. I don't think I have ever gone to Google and found it down. And there can be redundancy built into the system.
So now let me ask you this: Are the advantages of serving the GUID needs of both specimens and taxon names via an identical, generalized GUID scheme greater than the advantages of custom-tailoring the GUID scheme for each different sort of data (owned, vs. public-domain)? I think I already know the short answer is "Yes", but I'm interested in the reasons (e.g., common set of software tools to deal with both, etc.)
The DN portion is meant to be resolvable by the DNS system. So yes, there is a dependency on the continued existence of the DN, but it can be set up to be resolved by any LSID service endpoint.
I think I need to learn more about LSIDs before forming too strong an opinion either for or against them.
If we use the MAC approach + a context such as an LSID or DOI form, then there is absolutely no need for a central issuing agency.
Agreed! And that might be the better approach (for a lot of reasons). My only concern would be to what extent desktop database applications can use MAC values as primary keys (compared to using something like long integers) efficiently and effectively, when manipulating large datasets in real time. I guess this may be a trivial point in the grand scheme of things -- but ultimately taxonomists will want to be able to work with large datasets on their personal computers.
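To make the primary-key question concrete, here is a minimal sketch in Python, assuming a MAC-based GUID is stored in its compact 16-byte binary form rather than as text. The table name, column names, and label are hypothetical, and SQLite merely stands in for whatever desktop database is actually used. It shows only that such a key works for inserts and lookups; it says nothing about speed relative to long integers.

```python
import sqlite3
import uuid

# In-memory SQLite database standing in for a desktop database (assumption).
conn = sqlite3.connect(":memory:")

# Hypothetical table: the GUID is the primary key, stored as a 16-byte BLOB.
conn.execute("CREATE TABLE specimen (guid BLOB PRIMARY KEY, label TEXT)")

# uuid1() builds a version-1 UUID from a timestamp and the host's node ID
# (normally the MAC address).
guid = uuid.uuid1()
conn.execute(
    "INSERT INTO specimen (guid, label) VALUES (?, ?)",
    (guid.bytes, "hypothetical specimen"),
)

# Look the record up again by its binary key.
row = conn.execute(
    "SELECT label FROM specimen WHERE guid = ?", (guid.bytes,)
).fetchone()

key_size = len(guid.bytes)  # size of the stored key in bytes
```

A key of this kind is twice the width of a 64-bit long integer, so whether that matters for large datasets would have to be measured on the database in question.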
Again, just use a MAC-based GUID inside an LSID context. If you have any MS dev tools on your machine, type "guidgen" at the command prompt. Voila! Globally unique identifiers. No matter how many times you push the "New GUID" button.
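For anyone without guidgen to hand, the same idea can be sketched in Python, assuming a version-1 UUID is an acceptable stand-in for a "MAC-based GUID". The function name mint_lsid and the authority and namespace values are hypothetical, and the "urn:lsid:" prefix follows the usual LSID URN form rather than anything prescribed in this thread.

```python
import uuid


def mint_lsid(authority: str, namespace: str) -> str:
    """Return an LSID-style string whose object part is a version-1 UUID."""
    # uuid1() combines a timestamp, a clock sequence, and the host's node ID.
    # The node ID is normally the MAC address; Python falls back to a random
    # value if no hardware address can be found.
    object_id = uuid.uuid1()
    return f"urn:lsid:{authority}:{namespace}:{object_id}"


# Mint a batch locally, with no central issuing agency involved.
ids = {mint_lsid("mymuseum.org", "specimen") for _ in range(1000)}
```

Each call yields a new identifier, so the set above should hold 1000 distinct strings, all minted on one machine without contacting any registry.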
Yes, and this may be the best approach. MACs are scary for people to read, but as I said before, people really shouldn't be reading them.
But in order to avoid having a bottleneck for data resolution at the point of MAC issuance, you would need to get the data mirrored efficiently across many servers. This could be done like DNS (as I understand it), where the changes propagate through a haphazard web of servers. But I think I'd be more comfortable having some sort of centralized hub or coordinator (like GBIF) to ensure data mirroring is done efficiently and completely. On this point, I could very well be persuaded to change my view.
Again, I'll have to think about this some more. I certainly don't think that the "system" should be incapable of dealing with new classes -- sort of like how anyone can develop their own Federation Schema and use DiGIR to establish specific information networks. But I'd hate to see a breakdown in the global transmission of biodiversity information simply because different subgroups establish their own special-needs, non-mutually-compatible classes for dealing with essentially the same kinds of information (especially if they do not also conform to a generalized international standard).
Bah. That's the whole point of this - to facilitate data exchange. If a small subgroup wants to start exchanging data in an abbreviated format, so what?
No problem if they also continue to provide the relevant bits via the "conventional" means. But BIG problem if everyone in Europe gravitates towards one conventional standard, and everyone in the U.S. gravitates towards a different flavor, and everyone in Asia gravitates towards yet a different flavor. I'm thinking Cell Phone standards, NTSC vs. PAL, competing HDTV standards, DVD+R vs. DVD-R, etc., etc.
Already we have some problems in dealing with the fact that ICZN and ICBN do not have perfectly compatible Codes of nomenclature. Imagine if different geographic regions adopted their own versions of nomenclatural Codes; or if the Fish people got together and decided they wanted slightly different rules to apply to their names. Such freedom would not likely contribute favorably to scientific progress. Taxonomists conform to respective Codes of nomenclature not because they are perfect in how they establish names, but because the community has converged on a single standard.
As long as the identifiers being used are able to resolve the type of object being passed around,
...which they presumably wouldn't be if the providing server were offline....no?
and the objects conform to their definitions, it shouldn't be a problem. By initially establishing a robust framework for Scientific Names and perhaps specimen data / collections, then there will be little need for others to recreate new ways to represent that data.
...so why compromise the optimality of the system in order to accommodate those who might prefer to define objects in a slightly different way from everyone else?
The benefits of a robust reliable representation and provision of cheap, effective software tools will hopefully overcome the steep learning curve needed to even understand what's in some of these schemas.
Yes, but this is something we both agree on!
But if I want to say to you, "hey, look at this specimen xxx" while we're chatting from around the world using an instant messenger while collaborating on some project, wouldn't it be nice to just be able to type in lsid:mymuseum.org:specimen:1234 and have your client retrieve that exact data and associated metadata directly?
But such a chat would certainly provide an opportunity to suggest context. Wouldn't it be easier if I said "Hey, go to GBIF and look up SpecimenID 1234"? This scenario seems to apply equally to both of our world perspectives. The only advantage I would see for LSIDs in this case is if I forgot to mention to my colleague that I wanted her to look up a specimen, rather than a taxon name, and just told her to "look up 1234". Not likely in a human-human conversation.
A trivial example but one that can form the foundation of some cool stuff for data exchange and interaction. I thought that was the whole point of these GUID things. But maybe I'm mistaken?
Yes, I certainly agree that this is the whole point of GUIDs! What I don't understand is why LSIDs (with domain-active requirements) fulfill this more effectively (all costs and benefits considered) than other GUID systems.
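To show what "embedded context" buys a client in the instant-messenger scenario, here is a minimal sketch in Python that splits an identifier such as lsid:mymuseum.org:specimen:1234 into its parts. The names Lsid and parse_lsid are hypothetical, and the parser assumes the colon-separated layout authority:namespace:object with an optional trailing revision and an optional "urn:" prefix. It performs no network resolution at all.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str            # the domain-name portion, e.g. "mymuseum.org"
    namespace: str            # the class of object, e.g. "specimen"
    object_id: str            # the identifier within that namespace
    revision: Optional[str] = None  # optional version of the object


def parse_lsid(text: str) -> Lsid:
    """Split an LSID-style string into its colon-separated components."""
    parts = text.strip().split(":")
    # Accept both "urn:lsid:..." and the shorter "lsid:..." used in this thread.
    if parts[0].lower() == "urn":
        parts = parts[1:]
    if len(parts) not in (4, 5) or parts[0].lower() != "lsid":
        raise ValueError(f"not an LSID-style identifier: {text!r}")
    if any(part == "" for part in parts[1:4]):
        raise ValueError(f"empty component in identifier: {text!r}")
    revision = parts[4] if len(parts) == 5 else None
    return Lsid(parts[1], parts[2], parts[3], revision)


parsed = parse_lsid("lsid:mymuseum.org:specimen:1234")
```

Here the namespace alone tells the client that 1234 is a specimen rather than a taxon name, which is the context a bare "look up 1234" lacks; the authority is the part that carries the domain dependency questioned above.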
Except in the somewhat bizarre case when you need the old version of the object.
Yeah, but as you say, this is a bizarre case. In those bizarre cases, you can send an email to the data manager and ask for a report on the edit history of the record. If they got it, they got it. But I don't see this as being such a routine need that it needs to be accommodated by the GUID system (unless, as I suggested earlier, it was otherwise completely transparent and could be ignored without consequence).
Yeah, good question. Maybe this should be on the GBIF DADI list or TDWG general? Or even the LSID list?
If it moves, let me know where it goes. Right now, it's time for bed....
Cheers, Rich
participants (1)
-
Richard Pyle