GUIDs, LSIDs, and metadata

Richard Pyle deepreef at BISHOPMUSEUM.ORG
Sun Sep 11 13:59:36 CEST 2005


Kevin Richards wrote:

> > somewhere.  The assumption that an LSID will refer to, for example, a
> > global 'taxon concept' that all other taxon records should point to,
> > is not correct.  This relies on a system to be in place that provides
> > the functionality for this global repository.

Rod Page replied:

> I hope people don't make this assumption, because it's obviously
> erroneous. LSIDs just provide a mechanism to assign GUIDs to metadata
> and data. Whether there will be a global taxon concept is a completely
> different question. I also doubt that deferring to some central
> authority is the best way forward. I suspect the very nature and scale
> of the task makes a distributed approach inevitable.

I'm not sure I understand, in the context of this discussion, what the phrase
"global 'taxon concept' that all other taxon records should point to" is
intended to mean.

If it means a single global taxonomy with a single "officially sanctioned"
taxon concept circumscription for each taxon name that everyone must conform
to, then certainly I agree that this should not (and cannot) be implemented.

However, I DO see value in designating a 'global' GUID to each defined
Taxonomic Concept (e.g., "Mygenus myspecies Hyam 2004 SEC Pyle 2005"), and
having all other taxon records in databases around the world point to this
same GUID whenever they specifically want to reference that particular
defined concept ("Mygenus myspecies Hyam 2004 SEC Pyle 2005").  Whether that
is a dedicated GUID, or a union of a Taxon Name GUID ("Mygenus myspecies
Hyam 2004") and a Literature GUID ("Pyle 2005"), is an issue that needs to
be discussed.
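
To make the two options concrete, here is a minimal sketch in Python. The
class name, field names, the "|SEC|" separator, and the identifier strings are
all hypothetical, chosen only for illustration; nothing here is a TDWG
standard.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ConceptRef:
    """Reference to a defined taxon concept ("name SEC publication")."""
    name_guid: str                        # GUID of the taxon name
    literature_guid: str                  # GUID of the defining publication
    dedicated_guid: Optional[str] = None  # set only if a concept GUID was issued

    def key(self) -> str:
        # Option 1: a dedicated GUID for the concept itself.
        if self.dedicated_guid is not None:
            return self.dedicated_guid
        # Option 2: the union of the name GUID and the literature GUID.
        return f"{self.name_guid}|SEC|{self.literature_guid}"


# Two databases that cite the same name and the same publication
# arrive at the same key without consulting each other.
record_in_db_a = ConceptRef("name:1", "lit:9")
record_in_db_b = ConceptRef("name:1", "lit:9")
same_concept = record_in_db_a.key() == record_in_db_b.key()
```

With the union approach no extra identifier has to be issued, but every
consumer must agree on how the two parts are combined; with a dedicated GUID
the concept can be referenced as a single opaque token.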

The problem of relying on a (central) system to be in place can be
dramatically mitigated if that system is mirrored with robust
synchronization protocols around the world. The effort and maintenance
should definitely be distributed among the taxonomic experts of the world.

> Lastly, while GBIF and/or the commissions for the various codes of
> nomenclature may feel they are the obvious authorities for serving
> information on taxonomic names, it's not obvious to me that they will,
> in fact, be so. Are we really to expect that the commissions will be
> issuing GUIDs for all names within 10-15 years? Are we expected to wait
> for them, when technically there's no reason why they couldn't start
> doing this tomorrow?

The reason (I think) we don't want to start issuing them tomorrow is that if
there is no effort in place to make sure the same object instance isn't
receiving multiple GUIDs from multiple issuers, then there's no real point
in assigning the GUIDs in the first place.  Each major database has its own
internal LUIDs already.  The work is in cross-mapping all of these LUIDs to
each other, so we can more easily exchange information.  If each database
holder were to assign LSIDs to all of their records, what problem have we
solved?  OK, so the IDs attached to each record are guaranteed to be
globally unique, and in some way embed resolving metadata (in the case of
LSIDs), and all of the database holders at least conform to a common system
of IDs.  But these gains are trivial compared to the monumental task of
cross-linking all of the object instances that exist redundantly in dozens of
databases around the world.

My feeling is that the "brass ring" of one GUID per object instance (i.e., a
single common "flag pole" around which all data holders can rally, and
cross-link their own LUIDs to) is very much within reach, and is what will
be needed in the long run anyway.
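
A minimal sketch of the "flag pole" idea, assuming each data holder registers
its own LUID against one shared GUID. The class, method names, database
labels, and identifier strings are hypothetical.

```python
class CrossMap:
    """Maps each (database, LUID) pair to exactly one shared GUID."""

    def __init__(self):
        self._guid_of = {}   # (database, luid) -> guid
        self._members = {}   # guid -> set of (database, luid)

    def link(self, guid, database, luid):
        key = (database, luid)
        existing = self._guid_of.get(key)
        # Refuse to give one object instance a second GUID.
        if existing is not None and existing != guid:
            raise ValueError(f"{key} is already linked to {existing}")
        self._guid_of[key] = guid
        self._members.setdefault(guid, set()).add(key)

    def translate(self, database, luid, target_db):
        """Return the LUIDs in target_db that share a GUID with this record."""
        guid = self._guid_of[(database, luid)]
        return sorted(l for db, l in self._members[guid] if db == target_db)


crossmap = CrossMap()
crossmap.link("guid-1", "db_a", "17")
crossmap.link("guid-1", "db_b", "X9")
matches = crossmap.translate("db_a", "17", "db_b")  # ["X9"]
```

Each holder only has to map its records to the shared GUID once; pairwise
mappings between any two databases then fall out of the lookup.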

No single organization can be relied upon for safekeeping of "the" master
database into perpetuity.  That's why "the" database needs to be mirrored
all over the world.  All that's needed is a robust synchronization protocol.
The role of the Code Commissions, and GBIF, and TDWG would be to define the
protocols and standards, and establish the initial implementations.  They
should not, in my opinion, be put into a position where they must be relied
upon over the next decades/centuries in order to facilitate the perpetual
exchange of data.
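
As a rough illustration of mirroring only (not a proposal for the actual
protocol), the sketch below assumes that a record uploaded to any one mirror
is pushed to every other mirror, and that a mirror rejects conflicting data
for an identifier it already holds. All names are hypothetical, and a real
protocol would also have to cope with mirrors that are offline, out of date,
or in disagreement.

```python
class Mirror:
    """One copy of "the" database; every mirror should hold the same records."""

    def __init__(self, name):
        self.name = name
        self.records = {}   # identifier -> metadata
        self.peers = []     # the other mirrors

    def upload(self, identifier, metadata):
        # Accept locally, then propagate to all peers.
        self._accept(identifier, metadata)
        for peer in self.peers:
            peer._accept(identifier, metadata)

    def _accept(self, identifier, metadata):
        current = self.records.get(identifier)
        if current is not None and current != metadata:
            raise ValueError(f"{self.name}: conflicting data for {identifier}")
        self.records[identifier] = metadata


def connect(mirrors):
    """Make every mirror a peer of every other mirror."""
    for mirror in mirrors:
        mirror.peers = [p for p in mirrors if p is not mirror]


mirrors = [Mirror("mirror_1"), Mirror("mirror_2"), Mirror("mirror_3")]
connect(mirrors)
mirrors[1].upload("guid-1", {"name": "Mygenus myspecies"})
```

After the upload, all three mirrors hold the same record, so no single host
has to survive for the data to remain resolvable.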

> The notion that we should wait for these bodies to get their act
> together, and that we should defer to them strikes me as a recipe for
> disaster (or at least inertia). There are various efforts already
> underway out there, and perhaps we need a little healthy competition
> and exploration of alternatives.  I suspect this area will be driven by
> users and data providers addressing their actual needs, rather than
> from "on high".  I take Richard's point that it would be nice to get
> this right, but not at the cost of not actually doing something. And
> regarding legacy GUIDs, in the case of LSIDs this can be handled fairly
> easily via the DNS. It's rather like the case when company a.com buys
> company b.com, the DNS record for b.com is changed to map to a.com

I also see your point about the inertia problem.  But I've always thought
that the paradigm of independent solutions to this issue by competitive and
disconnected efforts, as has been ongoing for the past few decades, is
exactly the sort of chaos that has led to our current data exchange
problems that we are now trying to solve. My feeling is that we are now
ready to move past that phase and into the next phase.  GBIF now has nearly
$1.5 million specifically to solve this problem, and at least one of the
historically "inertia-laden" Commissions is about to take a dramatic step
forward.  It feels to me like we're rapidly approaching critical mass, and
personally I'd like to see how far we can push it forward, and capitalize on
the new paradigm by simultaneously solving as many problems as we can all at
once.

> I think we also need to be careful about the idea of a central registry
> of GUIDs if this means that a single body will be responsible for
> issuing them. There are a range of alternatives, such as the DOI model.
> DOIs have two parts, one generated centrally, the other by the data
> provider. There is a central repository of metadata associated with
> DOIs (http://www.crossref.org), rather like GBIF has a local copy of
> data provided by DiGIR server. However, local providers are responsible
> for providing the content that corresponds to a DOI, and for
> constructing the second part of the DOI. In a sense this is pretty much
> what my Taxonomic Search Engine does -- it generates LSIDs for the
> databases that it queries, but retrieves the metadata on the fly from
> the data providers.

My reason for looking at 64-bit integers is that there are ~10^19 of them
to go around. I can see them being issued to any institution that wants them
at, say, a billion numbers at a time (that allows for ~10 billion such
blocks of 1 billion numbers/block).  Each institution/individual would then
assign them to data objects however they want, whenever they want. If
they've assigned the numbers to objects in conformance with TDWG standards
(yet to be developed), then the associated TDWG-compliant data/metadata for
each number can be uploaded to any one of the mirror servers, at which time
the link between the number and the data/metadata gets automatically
propagated to all of the mirror servers.  The point is, the only time when a
single entity needs to be relied upon is the initial issuance of blocks of
numbers.  And even this could be distributed (e.g., by pre-distributing
blocks of ~10^17 numbers to each of ~100 different issuers). My reason for
thinking in terms of simple integers is that they allow flexibility for
embedding within different GUID schemes, if the TDWG standard for the GUID
"package" needed to change in the future (i.e., the numbers could remain the
same, and the resolving metadata packaging can change).
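
The arithmetic and the block-issuing idea can be sketched as follows. This is
only an illustration under stated assumptions: the class and function names
are hypothetical, and the LSID authority and namespace ("example.org",
"taxonconcepts") are placeholders. The only thing taken from the LSID
specification is the general "urn:lsid:authority:namespace:object" layout.

```python
TOTAL_IDS = 2 ** 64        # about 1.8 x 10^19 unsigned 64-bit integers
BLOCK_SIZE = 10 ** 9       # one billion numbers per block


class BlockIssuer:
    """Hands out consecutive, non-overlapping blocks from its own range."""

    def __init__(self, start=0, stop=TOTAL_IDS, block_size=BLOCK_SIZE):
        self._next = start
        self._stop = stop
        self._block_size = block_size
        self.issued = {}   # first number of block -> holder

    def issue(self, holder):
        start = self._next
        end = start + self._block_size
        if end > self._stop:
            raise RuntimeError("no complete block left in this range")
        self._next = end
        self.issued[start] = holder
        return range(start, end)


def split_among_issuers(n_issuers, total=TOTAL_IDS):
    """Pre-distribute the number space so that no single issuer is needed."""
    share = total // n_issuers
    return [BlockIssuer(i * share, (i + 1) * share) for i in range(n_issuers)]


def as_lsid(number, authority, namespace):
    """Wrap a bare integer in one possible GUID packaging (an LSID)."""
    return f"urn:lsid:{authority}:{namespace}:{number}"


blocks_available = TOTAL_IDS // BLOCK_SIZE   # 18,446,744,073 blocks
issuers = split_among_issuers(100)           # each holds about 1.8 x 10^17 numbers
block = issuers[0].issue("some_institution")
example = as_lsid(block[0], "example.org", "taxonconcepts")
```

Because the integer itself carries no packaging, the same number could later
be wrapped in a different GUID scheme by replacing only as_lsid.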

Maybe I'm off my rocker here (very possible).  But it seems so simple and
straightforward, and seems (to me, anyway) to leave options open for future
GUID packaging schemes.

> This note is starting to lack whatever coherence it might have had at
> the start. Perhaps it's time to have some real examples to play with...

Ditto!!  My apologies to all for the rambling (it's a slow Sunday afternoon
here).  I'll shut up now.

Aloha,
Rich



