Re: GUIDs, LSIDs, and metadata
Kevin Richards wrote:
somewhere. The assumption that an LSID will refer to, for example, a global 'taxon concept' that all other taxon records should point to, is not correct. This relies on a system being in place that provides the functionality for this global repository.
Rod Page replied:
I hope people don't make this assumption, because it's obviously erroneous. LSIDs just provide a mechanism to assign GUIDs to metadata and data. Whether there will be a global taxon concept is a completely different question. I also doubt that deferring to some central authority is the best way forward. I suspect the very nature and scale of the task makes a distributed approach inevitable.
I'm not sure I understand, in the context of this discussion, what the phrase "global 'taxon concept' that all other taxon records should point to" is intended to mean.
If it means a single global taxonomy with a single "officially sanctioned" taxon concept circumscription for each taxon name that everyone must conform to, then certainly I agree that this should not (and cannot) be implemented.
However, I DO see value in designating a 'global' GUID to each defined Taxonomic Concept (e.g., "Mygenus myspecies Hyam 2004 SEC Pyle 2005"), and having all other taxon records in databases around the world point to this same GUID whenever they specifically want to reference that particular defined concept ("Mygenus myspecies Hyam 2004 SEC Pyle 2005"). Whether that is a dedicated GUID, or a union of a Taxon Name GUID ("Mygenus myspecies Hyam 2004") and a Literature GUID ("Pyle 2005"), is an issue that needs to be discussed.
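Just to make those two alternatives concrete, here's a rough sketch in Python (with LSIDs invented purely for illustration -- none of these are real identifiers) of a dedicated concept GUID versus a concept reference composed from a Taxon Name GUID and a Literature GUID:

    from dataclasses import dataclass

    # Hypothetical LSIDs, invented purely for illustration.
    NAME_GUID = "urn:lsid:example.org:names:12345"        # "Mygenus myspecies Hyam 2004"
    LITERATURE_GUID = "urn:lsid:example.org:pubs:67890"   # "Pyle 2005"

    # Option 1: a dedicated GUID minted for the concept itself.
    DEDICATED_CONCEPT_GUID = "urn:lsid:example.org:concepts:424242"

    # Option 2: identify the concept by the union of a Taxon Name GUID
    # and a Literature GUID (the "SEC" reference).
    @dataclass(frozen=True)
    class TaxonConceptRef:
        name_guid: str   # GUID of the taxon name
        sec_guid: str    # GUID of the publication circumscribing the concept

    concept = TaxonConceptRef(NAME_GUID, LITERATURE_GUID)
    print(concept)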
The problem of relying on a (central) system being in place can be dramatically mitigated if that system is mirrored around the world with robust synchronization protocols. The effort and maintenance should definitely be distributed among the taxonomic experts of the world.
Lastly, while GBIF and/or the commissions for the various codes of nomenclature may feel they are the obvious authorities for serving information on taxonomic names, it's not obvious to me that they will, in fact, be so. Are we really to expect that the commissions will be issuing GUIDs for all names within 10-15 years? Are we expected to wait for them, when technically there's no reason why they couldn't start doing this tomorrow?
The reason (I think) we don't want to start issuing them tomorrow is that if there is no effort in place to make sure the same object instance isn't receiving multiple GUIDs from multiple issuers, then there's no real point in assigning the GUIDs in the first place. Each major database has its own internal LUIDs (locally unique identifiers) already. The work is in cross-mapping all of these LUIDs to each other, so we can more easily exchange information. If each database holder were to assign LSIDs to all of their records, what problem have we solved? OK, so the IDs attached to each record are guaranteed to be globally unique, and in some way embed resolving metadata (in the case of LSIDs), and all of the database holders at least conform to a common system of IDs. But these gains are trivial compared to the monumental task of cross-linking all of the object instances that exist redundantly in dozens of databases around the world.
My feeling is that the "brass ring" of one GUID per object instance (i.e., a single common "flag pole" around which all data holders can rally, and cross-link their own LUIDs to) is very much within reach, and is what will be needed in the long run anyway.
No single organization can be relied upon for safekeeping of "the" master database into perpetuity. That's why "the" database needs to be mirrored all over the world. All that's needed is a robust synchronization protocol. The role of the Code Commissions, and GBIF, and TDWG would be to define the protocols and standards, and establish the initial implementations. They should not, in my opinion, be put into a position where they must be relied upon over the next decades/centuries in order to facilitate the perpetual exchange of data.
The notion that we should wait for these bodies to get their act together, and that we should defer to them, strikes me as a recipe for disaster (or at least inertia). There are various efforts already underway out there, and perhaps we need a little healthy competition and exploration of alternatives. I suspect this area will be driven by users and data providers addressing their actual needs, rather than from "on high". I take Richard's point that it would be nice to get this right, but not at the cost of not actually doing something. And regarding legacy GUIDs, in the case of LSIDs this can be handled fairly easily via the DNS. It's rather like the case when company a.com buys company b.com: the DNS record for b.com is changed to map to a.com.
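To make that DNS point concrete, here's a rough sketch (the LSID and domains are invented for illustration, not real authorities): the authority portion of an LSID is just a DNS name, so remapping that name in the DNS leaves every previously issued identifier string untouched -- only the resolution path changes.

    def parse_lsid(lsid: str) -> dict:
        # urn:lsid:<authority>:<namespace>:<object>[:<revision>]
        parts = lsid.split(":")
        if parts[:2] != ["urn", "lsid"]:
            raise ValueError("not an LSID")
        return {
            "authority": parts[2],
            "namespace": parts[3],
            "object": parts[4],
            "revision": parts[5] if len(parts) > 5 else None,
        }

    # Invented example: if b.com's DNS is remapped to a.com, this identifier
    # string never changes -- only where the authority resolves to does.
    print(parse_lsid("urn:lsid:b.com:specimens:98765")["authority"])  # 'b.com'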
I also see your point about the inertia problem. But I've always thought that the paradigm of independent solutions to this issue by competitive and disconnected efforts, as has been ongoing for the past few decades, is exactly the sort of chaos that has led to the data exchange problems we are now trying to solve. My feeling is that we are now ready to move past that phase and into the next one. GBIF now has nearly $1.5 million specifically to solve this problem, and at least one of the historically "inertia-laden" Commissions is about to take a dramatic step forward. It feels to me like we're rapidly approaching critical mass, and personally I'd like to see how far we can push it forward, and capitalize on the new paradigm by simultaneously solving as many problems as we can all at once.
I think we also need to be careful about the idea of a central registry of GUIDs if this means that a single body will be responsible for issuing them. There are a range of alternatives, such as the DOI model. DOIs have two parts, one generated centrally, the other by the data provider. There is a central repository of metadata associated with DOIs (http://www.crossref.org), rather like GBIF keeps a local copy of data provided by DiGIR servers. However, local providers are responsible for providing the content that corresponds to a DOI, and for constructing the second part of the DOI. In a sense this is pretty much what my Taxonomic Search Engine does -- it generates LSIDs for the databases that it queries, but retrieves the metadata on the fly from the data providers.
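To spell out that two-part structure, here's a minimal sketch (the DOI string is invented, not a real registration): the prefix is allocated centrally to a registrant, while the suffix is constructed locally by the provider, so the central registry never has to mint whole identifiers itself.

    def split_doi(doi: str):
        # The "10.NNNN" prefix is allocated centrally; the suffix is chosen locally.
        prefix, suffix = doi.split("/", 1)
        return prefix, suffix

    prefix, suffix = split_doi("10.9999/mygenus.myspecies.2005")  # invented DOI
    print(prefix)   # '10.9999' -> assigned by the central registration agency
    print(suffix)   # 'mygenus.myspecies.2005' -> constructed by the data provider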
My reason for looking at 64-bit integers is that there are ~10^19 of them to go around. I can see them being issued to any institution that wants them at, say, a billion numbers at a time (which allows for ~10 billion such blocks of 1 billion numbers each). Each institution/individual would then assign them to data objects however they want, whenever they want. If they've assigned the numbers to objects in conformance with TDWG standards (yet to be developed), then the associated TDWG-compliant data/metadata for each number can be uploaded to any one of the mirror servers, at which time the link between the number and the data/metadata gets automatically propagated to all of the mirror servers. The point is, the only time a single entity needs to be relied upon is the initial issuance of blocks of numbers. And even this could be distributed (e.g., by pre-distributing blocks of ~10^17 numbers to each of ~100 different issuers). My reason for thinking in terms of simple integers is that they allow flexibility for embedding within different GUID schemes, if the TDWG standard for the GUID "package" needed to change in the future (i.e., the numbers could remain the same while the resolving metadata packaging changes).
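For what it's worth, the arithmetic works out; here's a rough sketch (block sizes and issuer counts taken from the paragraph above, everything else purely illustrative) of what block allocation would look like:

    TOTAL_IDS = 2 ** 64        # ~1.8e19 sixty-four-bit integers
    BLOCK_SIZE = 10 ** 9       # one billion numbers per block
    print(TOTAL_IDS // BLOCK_SIZE)   # 18446744073 blocks -- the "~10 billion" order of magnitude

    SUPER_BLOCK = 10 ** 17     # a pre-distributed super-block for one issuer
    ISSUERS = 100
    print(ISSUERS * SUPER_BLOCK <= TOTAL_IDS)   # True: 1e19 still fits within 2**64

    def block_range(k: int, block_size: int = BLOCK_SIZE) -> range:
        # Block k is just a contiguous run of integers handed to one institution.
        start = k * block_size
        return range(start, start + block_size)

    print(block_range(0))   # range(0, 1000000000)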
Maybe I'm off my rocker here (very possible). But it seems so simple and straightforward, and seems (to me, anyway) to leave options open for future GUID packaging schemes.
This note is starting to lack whatever coherence it might have had at the start. Perhaps it's time to have some real examples to play with...
Ditto!! My apologies to all for the rambling (it's a slow Sunday afternoon here). I'll shut up now.
Aloha, Rich