Re: GUIDs, LSIDs, and metadata
Kevin Richards wrote:
somewhere. The assumption that an LSID will refer to, for example, a global 'taxon concept' that all other taxon records should point to, is not correct. This relies on a system being in place that provides the functionality for this global repository.
Rod Page replied:
I hope people don't make this assumption, because it's obviously erroneous. LSIDs just provide a mechanism to assign GUIDs to metadata and data. Whether there will be a global taxon concept is a completely different question. I also doubt that deferring to some central authority is the best way forward. I suspect the very nature and scale of the task makes a distributed approach inevitable.
I'm not sure I understand, in the context of this discussion, what the phrase "global 'taxon concept' that all other taxon records should point to" is intended to mean.
If it means a single global taxonomy with a single "officially sanctioned" taxon concept circumscription for each taxon name that everyone must conform to, then certainly I agree that this should not (and cannot) be implemented.
However, I DO see value in designating a 'global' GUID to each defined Taxonomic Concept (e.g., "Mygenus myspecies Hyam 2004 SEC Pyle 2005"), and having all other taxon records in databases around the world point to this same GUID whenever they specifically want to reference that particular defined concept ("Mygenus myspecies Hyam 2004 SEC Pyle 2005"). Whether that is a dedicated GUID, or a union of a Taxon Name GUID ("Mygenus myspecies Hyam 2004") and a Literature GUID ("Pyle 2005"), is an issue that needs to be discussed.
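Just to make those two alternatives concrete, here's a rough sketch in Python (with LSIDs invented purely for illustration -- none of these are real identifiers) of a dedicated concept GUID versus a concept reference composed from a Taxon Name GUID and a Literature GUID:

    from dataclasses import dataclass

    # Hypothetical LSIDs, invented purely for illustration.
    NAME_GUID = "urn:lsid:example.org:names:12345"        # "Mygenus myspecies Hyam 2004"
    LITERATURE_GUID = "urn:lsid:example.org:pubs:67890"   # "Pyle 2005"

    # Option 1: a dedicated GUID minted for the concept itself.
    DEDICATED_CONCEPT_GUID = "urn:lsid:example.org:concepts:424242"

    # Option 2: identify the concept by the union of a Taxon Name GUID
    # and a Literature GUID (the "SEC" reference).
    @dataclass(frozen=True)
    class TaxonConceptRef:
        name_guid: str   # GUID of the taxon name
        sec_guid: str    # GUID of the publication circumscribing the concept

    concept = TaxonConceptRef(NAME_GUID, LITERATURE_GUID)
    print(concept)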
The problem of relying on a (central) system being in place can be dramatically mitigated if that system is mirrored around the world with robust synchronization protocols. The effort and maintenance should definitely be distributed among the taxonomic experts of the world.
Lastly, while GBIF and/or the commissions for the various codes of nomenclature may feel they are the obvious authorities for serving information on taxonomic names, it's not obvious to me that they will, in fact, be so. Are we really to expect that the commissions will be issuing GUIDs for all names within 10-15 years? Are we expected to wait for them, when technically there's no reason why they couldn't start doing this tomorrow?
The reason (I think) we don't want to start issuing them tomorrow is that if there is no effort in place to make sure the same object instance isn't receiving multiple GUIDs from multiple issuers, then there's no real point in assigning the GUIDs in the first place. Each major database has its own internal LUIDs (locally unique identifiers) already. The work is in cross-mapping all of these LUIDs to each other, so we can more easily exchange information. If each database holder were to assign LSIDs to all of their records, what problem have we solved? OK, so the IDs attached to each record are guaranteed to be globally unique, and in some way embed resolving metadata (in the case of LSIDs), and all of the database holders at least conform to a common system of IDs. But these gains are trivial compared to the monumental task of cross-linking all of the object instances that exist redundantly in dozens of databases around the world.
My feeling is that the "brass ring" of one GUID per object instance (i.e., a single common "flag pole" around which all data holders can rally, and cross-link their own LUIDs to) is very much within reach, and is what will be needed in the long run anyway.
No single organization can be relied upon for safekeeping of "the" master database into perpetuity. That's why "the" database needs to be mirrored all over the world. All that's needed is a robust synchronization protocol. The role of the Code Commissions, and GBIF, and TDWG would be to define the protocols and standards, and establish the initial implementations. They should not, in my opinion, be put into a position where they must be relied upon over the next decades/centuries in order to facilitate the perpetual exchange of data.
The notion that we should wait for these bodies to get their act together, and that we should defer to them, strikes me as a recipe for disaster (or at least inertia). There are various efforts already underway out there, and perhaps we need a little healthy competition and exploration of alternatives. I suspect this area will be driven by users and data providers addressing their actual needs, rather than from "on high". I take Richard's point that it would be nice to get this right, but not at the cost of not actually doing something. And regarding legacy GUIDs, in the case of LSIDs this can be handled fairly easily via the DNS. It's rather like the case when company a.com buys company b.com: the DNS record for b.com is changed to map to a.com.
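To make that DNS point concrete, here's a rough sketch (the LSID and domains are invented for illustration, not real authorities): the authority portion of an LSID is just a DNS name, so remapping that name in the DNS leaves every previously issued identifier string untouched -- only the resolution path changes.

    def parse_lsid(lsid: str) -> dict:
        # urn:lsid:<authority>:<namespace>:<object>[:<revision>]
        parts = lsid.split(":")
        if parts[:2] != ["urn", "lsid"]:
            raise ValueError("not an LSID")
        return {
            "authority": parts[2],
            "namespace": parts[3],
            "object": parts[4],
            "revision": parts[5] if len(parts) > 5 else None,
        }

    # Invented example: if b.com's DNS is remapped to a.com, this identifier
    # string never changes -- only where the authority resolves to does.
    print(parse_lsid("urn:lsid:b.com:specimens:98765")["authority"])  # 'b.com'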
I also see your point about the inertia problem. But I've always thought that the paradigm of independent solutions to this issue by competitive and disconnected efforts, as has been ongoing for the past few decades, is exactly the sort of chaos that has led to the data exchange problems we are now trying to solve. My feeling is that we are now ready to move past that phase and into the next one. GBIF now has nearly $1.5 million specifically to solve this problem, and at least one of the historically "inertia-laden" Commissions is about to take a dramatic step forward. It feels to me like we're rapidly approaching critical mass, and personally I'd like to see how far we can push it forward, and capitalize on the new paradigm by simultaneously solving as many problems as we can all at once.
I think we also need to be careful about the idea of a central registry of GUIDs if this means that a single body will be responsible for issuing them. There are a range of alternatives, such as the DOI model. DOIs have two parts, one generated centrally, the other by the data provider. There is a central repository of metadata associated with DOIs (http://www.crossref.org), rather like GBIF keeps a local copy of data provided by DiGIR servers. However, local providers are responsible for providing the content that corresponds to a DOI, and for constructing the second part of the DOI. In a sense this is pretty much what my Taxonomic Search Engine does -- it generates LSIDs for the databases that it queries, but retrieves the metadata on the fly from the data providers.
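To spell out that two-part structure, here's a minimal sketch (the DOI string is invented, not a real registration): the prefix is allocated centrally to a registrant, while the suffix is constructed locally by the provider, so the central registry never has to mint whole identifiers itself.

    def split_doi(doi: str):
        # The "10.NNNN" prefix is allocated centrally; the suffix is chosen locally.
        prefix, suffix = doi.split("/", 1)
        return prefix, suffix

    prefix, suffix = split_doi("10.9999/mygenus.myspecies.2005")  # invented DOI
    print(prefix)   # '10.9999' -> assigned by the central registration agency
    print(suffix)   # 'mygenus.myspecies.2005' -> constructed by the data provider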
My reason for looking at 64-bit integers is that there are ~10^19 of them to go around. I can see them being issued to any institution that wants them at, say, a billion numbers at a time (which allows for ~10 billion such blocks of 1 billion numbers each). Each institution/individual would then assign them to data objects however they want, whenever they want. If they've assigned the numbers to objects in conformance with TDWG standards (yet to be developed), then the associated TDWG-compliant data/metadata for each number can be uploaded to any one of the mirror servers, at which time the link between the number and the data/metadata gets automatically propagated to all of the mirror servers. The point is, the only time a single entity needs to be relied upon is the initial issuance of blocks of numbers. And even this could be distributed (e.g., by pre-distributing blocks of ~10^17 numbers to each of ~100 different issuers). My reason for thinking in terms of simple integers is that they allow flexibility for embedding within different GUID schemes, if the TDWG standard for the GUID "package" needed to change in the future (i.e., the numbers could remain the same while the resolving metadata packaging changes).
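For what it's worth, the arithmetic works out; here's a rough sketch (block sizes and issuer counts taken from the paragraph above, everything else purely illustrative) of what block allocation would look like:

    TOTAL_IDS = 2 ** 64        # ~1.8e19 sixty-four-bit integers
    BLOCK_SIZE = 10 ** 9       # one billion numbers per block
    print(TOTAL_IDS // BLOCK_SIZE)   # 18446744073 blocks -- the "~10 billion" order of magnitude

    SUPER_BLOCK = 10 ** 17     # a pre-distributed super-block for one issuer
    ISSUERS = 100
    print(ISSUERS * SUPER_BLOCK <= TOTAL_IDS)   # True: 1e19 still fits within 2**64

    def block_range(k: int, block_size: int = BLOCK_SIZE) -> range:
        # Block k is just a contiguous run of integers handed to one institution.
        start = k * block_size
        return range(start, start + block_size)

    print(block_range(0))   # range(0, 1000000000)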
Maybe I'm off my rocker here (very possible). But it seems so simple and straightforward, and seems (to me, anyway) to leave options open for future GUID packaging schemes.
This note is starting to lack whatever coherence it might have had at the start. Perhaps it's time to have some real examples to play with...
Ditto!! My apologies to all for the rambling (it's a slow Sunday afternoon here). I'll shut up now.
Aloha, Rich