Lots of good discussion points on GUIDs -- thanks, Rod. I need to get two little people to two different soccer (football) games soon, so I have no time for an elaborate response. But I do want to comment on one point, which I have been thinking a great deal about lately:
- I think the first priority for assigning GUIDs is museum specimens.
For taxon names (if not concepts) this is trivial, given that most name databases have their own, internally unique ids (but not all -- those databases that use names as primary keys, or which don't expose integer identifiers will need to rethink their design).
I think it's critical that, whatever GUID system we establish for taxon names (and concepts), we do it in the context of the next several decades of informatic landscape; not just in the context of immediate needs or current political climate.
As you said at the start of your message, GUIDs by themselves are trivial. So the only real difference between establishing a system that is intuitive for the current needs and a system that will serve longer-term future needs, is a little bit of careful forethought.
Official taxon name registration already exists for one of the major Codes of Nomenclature (Bacterial), and within the next fortnight we will see a public announcement of a plan for registration in another of the major Codes. I predict that all Codes of nomenclature will implement mandatory registration for all new names by about 2010, and for all "available" names (i.e., since Linnaeus) within five to ten years thereafter. So the medium-term future landscape in this case will be one in which all names are issued a GUID through their respective Commission of Nomenclature.
Further, it's not unreasonable to predict that sometime within the next few decades we will converge on a unified "BioCode" for all organism names, meaning that the longer-term landscape has a single set of taxon names. Wouldn't it be nice, after that time, if we didn't have to forever maintain legacy GUIDs? In other words, wouldn't it be nice if the established GUID system for all taxon names were the same *now*, at the outset, so it's a non-issue to combine them all as one set of GUIDs later on?
I'm not entirely sold on LSIDs, but it does seem that a lot of smart and knowledgeable people are leaning that way. My hesitation is that one of the main reasons for leaning that way is that all sorts of software already exists for resolving them, so there is less overhead in initial implementation. As long as LSIDs meet long-term needs, that shouldn't be a problem. But 50 years from now, I'm not sure how wise it will seem that the universal GUID system adopted for biological data was influenced strongly by the available software of the time. Imagine being locked in now to a universal system that was designed based on software that was available in 1955!
But since we can't predict which GUID system will be best in the context of 2055, we really have no choice but to go with something that makes a lot of sense now. That is justifiable: the delicate transition from no universal GUIDs to widespread universal GUIDs is best supported by keeping it as painless as possible during that transition time.
But I still suggest we do things in a way that maximally keeps our options open. For example, in the context of LSIDs, consider different paradigms for registering the fish name, Mygenus myspecies Hyam (Hi, Roger! :-) )
One paradigm might have each major database create its own LSID:
urn:lsid:catalogoffishes.org:SPNO:123456
urn:lsid:gbif.org:ECAT:876543
urn:lsid:itis.gov:TSN:567890
But then we're burdened with the task of cross-mapping each of these, and also preserving the legacy IDs in perpetuity after we've eventually converged on a single taxon name GUID system.
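To make the cross-mapping burden concrete, here is a rough Python sketch. It assumes only the general LSID layout (urn:lsid:authority:namespace:object, with an optional revision); the Lsid type, parse_lsid, and build_crosswalk are hypothetical names made up for illustration, and the three identifiers are the made-up examples from this message, not real records.

```python
from typing import Dict, List, NamedTuple, Optional


class Lsid(NamedTuple):
    """Components of an LSID (hypothetical helper type for this sketch)."""
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text: str) -> Lsid:
    """Split 'urn:lsid:authority:namespace:object[:revision]' into its parts."""
    parts = text.split(":")
    if (len(parts) not in (5, 6)
            or parts[0].lower() != "urn"
            or parts[1].lower() != "lsid"):
        raise ValueError(f"not an LSID: {text!r}")
    authority, namespace, object_id = parts[2:5]
    revision = parts[5] if len(parts) == 6 else None
    return Lsid(authority, namespace, object_id, revision)


# Illustrative only: three database-specific LSIDs for the same taxon name.
SAME_NAME = [
    "urn:lsid:catalogoffishes.org:SPNO:123456",
    "urn:lsid:gbif.org:ECAT:876543",
    "urn:lsid:itis.gov:TSN:567890",
]


def build_crosswalk(equivalents: List[str]) -> Dict[str, List[str]]:
    """Map each LSID to every other LSID issued for the same name."""
    return {
        lsid: [other for other in equivalents if other != lsid]
        for lsid in equivalents
    }


crosswalk = build_crosswalk(SAME_NAME)
for lsid, others in crosswalk.items():
    print(parse_lsid(lsid).authority, "->", others)
```

With n databases each minting its own identifier, every name carries n identifiers and n*(n-1) directed mappings, all of which would have to be maintained for as long as the legacy IDs stay in circulation.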
I was going to illustrate several other paradigms, but soccer departure time approaches, so I'll cut to the chase. In the LSID paradigm, I would propose the following system:
urn:lsid:bioregistry.org:[Data Domain]:[randomly generated 64-bit integer]
The "bioregistry.org" part represents the decoupling of the GUID from the institution that initially created the GUID. It encompasses all domains of biological data (taxon names, concepts, specimens, etc.). It could be "tdwg.org" or "gbif.org", but we can't be sure those organizations will be around 50 or 100 years from now. I imagine that GBIF would create and manage the bioregistry.org domain for the near-term.
The "Data Domain" represents a tag for the main domain of data (e.g. "Specimens", or "TaxonNames", or whatever the major information domains end up being).
The randomly generated 64-bit integer would be unique across all data domains, so that it, by itself, is unique within bioregistry.org (no time now to explain the rationale for this...)
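As a minimal sketch of how such identifiers might be minted (an illustration under stated assumptions, not a worked-out design): the Registry class and its methods below are hypothetical, "bioregistry.org" is the placeholder authority from the proposal, and the data domain tags are just the examples mentioned above. The point it shows is that the integer is checked for uniqueness against one shared pool covering all data domains, not per domain.

```python
import secrets

AUTHORITY = "bioregistry.org"  # placeholder authority from the proposal


class Registry:
    """Hypothetical in-memory registry; a real one would need durable storage."""

    def __init__(self) -> None:
        # One shared pool: integer -> data domain it was issued under.
        self._issued = {}

    def mint(self, data_domain: str) -> str:
        """Issue a new LSID whose integer is unique across ALL data domains."""
        while True:
            candidate = secrets.randbits(64)  # random integer in [0, 2**64)
            if candidate not in self._issued:
                break
        self._issued[candidate] = data_domain
        return f"urn:lsid:{AUTHORITY}:{data_domain}:{candidate}"

    def domain_of(self, number: int) -> str:
        """Look up the data domain from the integer alone."""
        return self._issued[number]


registry = Registry()
name_id = registry.mint("TaxonNames")
specimen_id = registry.mint("Specimens")
print(name_id)
print(specimen_id)
```

Because the integer is unique on its own, the data domain tag stays informational: the record can be found from the number by itself, as domain_of does here.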
Gotta run....more later.
Aloha, Rich
participants (1)
-
Richard Pyle