Re: GUIDs, LSIDs, and metadata

11 Sep 2005

Thanks, Kevin.

I didn't realize that the LSID infrastructure was large compared to other
GUID systems that have been suggested. Whenever I've been involved in
discussions about GUIDs with people who understand the implications much
better than I do, it always seems like the availability of open-source
software tools is one of the reasons people tend to favor LSIDs.

My vision of the GUID itself would be the 64-bit integer, which could be
wrapped into an LSID package, or used as a DOI number, or in some other
GUID system. I also believe the resolution service should be mirrored (via
robust and fast synchronization mechanisms) on hundreds or even thousands of
servers around the world -- at least for the "data commons" (e.g., names,
concepts, literature).
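
To illustrate the "wrapping" idea, here is a minimal Python sketch in which
the bare 64-bit integer is the identifier and the LSID is only packaging
around it. The authority "example.org", the namespace "TaxonNames", and the
function names are placeholders invented for this illustration, not part of
any agreed scheme.

```python
# Sketch only: the 64-bit integer is the GUID; the LSID is a wrapper around it.
# "example.org" and "TaxonNames" are placeholder values, not real registrations.

def wrap_as_lsid(guid: int, authority: str = "example.org",
                 namespace: str = "TaxonNames") -> str:
    """Package a 64-bit integer as urn:lsid:authority:namespace:object."""
    if not 0 <= guid < 2**64:
        raise ValueError("GUID must fit in an unsigned 64-bit integer")
    return f"urn:lsid:{authority}:{namespace}:{guid}"


def unwrap_lsid(lsid: str) -> int:
    """Recover the bare integer from an LSID built by wrap_as_lsid."""
    parts = lsid.split(":")
    if len(parts) < 5 or parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {lsid!r}")
    return int(parts[4])


guid = 1234567890123456789
lsid = wrap_as_lsid(guid)
print(lsid)
print(unwrap_lsid(lsid) == guid)
```

Because the integer survives the round trip unchanged, the same value could
in principle be repackaged under a different GUID system later on.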

I FULLY agree that it is very important to clearly define what objects
should be assigned TDWG-standard GUIDs. In my view, the two object-domains
in most need of GUIDs for the biological informatics community are taxonomic
names, and "documentation" instances (~= authored/dated references,
publications, etc.), with taxon concepts represented by the intersection of
these two domains. Unfortunately, *neither* of these objects has been
clearly defined within our community.  It would be nice if we could simply
adopt an existing literature-based GUID system developed by some other
community, but from what I have learned, none quite meets the particular
needs of the taxonomic informatics community (hence the emerging TDWG
Literature Subgroup). I single these two out from other data domains for
two reasons: 1) they are (or should be) central to virtually all taxonomic
data domains; and 2) they are particularly "thorny" in terms of unambiguous
natural keys and cross-dataset resolution.

Aloha,
Rich

Richard Pyle
...
-----Original Message-----
From: Taxonomic Databases Working Group GUID Project
[mailto:TDWG-GUID@LISTSERV.NHM.KU.EDU]On Behalf Of Kevin Richards
Sent: Sunday, September 11, 2005 9:55 AM
To: TDWG-GUID@LISTSERV.NHM.KU.EDU
Subject: Re: GUIDs, LSIDs, and metadata
Good points.
A few comments I have:
I think LSIDs are assumed to solve all conflicts in the various
datasets of taxonomic data.  However they are JUST resolvable
IDs, anything else is infrastructure surrounding the LSID
mechanisms.  An LSID refers to a specific set of bytes that
resides on some computer somewhere.  The assumption that an LSID
will refer to, for example, a global 'taxon concept' that all
other taxon records should point to, is not correct.  This relies
on a system to be in place that provides the functionality for
this global repository.
Also I feel one argument AGAINST LSIDs is that the initial
investment in infrastructure is large, ie the development and
setting up of authorities, etc.  So I think this would lean
people away from LSIDs, not towards them.  The advantage with the
LSID mechanism, I think, is that it is flexible enough to not
rely on existing software and internet configuration.
A GUID really needs to refer to a reasonably basic record, eg a
name object rather than the entire taxon concept (although you
could have a GUID for either).  This allows these individual
components to be referenced from other systems/datasets without
having to refer to and accept the entire concept.  It is probably
a good idea to map out which sort of taxonomic objects should get
GUIDs and how they relate to other objects.
Kevin Richards
...
...
...
deepreef@BISHOPMUSEUM.ORG 09/11/05 6:50 AM >>>
Lots of good discussion points on GUIDs -- thanks, Rod.  I need to get two
little people to two different soccer (football) games soon, so I have no
time for an elaborate response.  But I do want to comment on one point,
which I have been thinking a great deal about lately:
...
7. I think the first priority for assigning GUIDs is museum specimens.
For taxon names (if not concepts) this is trivial, given that most name
databases have their own, internally unique ids (but not all -- those
databases that use names as primary keys, or which don't expose integer
identifiers will need to rethink their design).
I think it's critical that, whatever GUID system we establish for taxon
names (and concepts), we do it in the context of the next several decades
of informatic landscape; not just in the context of immediate needs or
current political climate.
As you said at the start of your message, GUIDs by themselves are trivial.
So the only real difference between establishing a system that is intuitive
for the current needs and a system that will serve longer-term future needs
is a little bit of careful forethought.
Official taxon name registration already exists for one of the major Codes
of Nomenclature (Bacterial), and within the next fortnight we will see a
public announcement of a plan for registration in another of the major
Codes.  I predict that all Codes of nomenclature will implement mandatory
registration for all new names by about 2010, and for all "available" names
(i.e., since Linnaeus) within five to ten years thereafter.  So the
medium-term future landscape in this case will be one in which all names are
issued a GUID through their respective Commission of Nomenclature.
Further, it's not unreasonable to predict that sometime within the next few
decades we will converge on a unified "BioCode" for all organism names,
meaning that the longer-term landscape has a single set of taxon names.
Wouldn't it be nice, after that time, if we didn't have to forever maintain
legacy GUIDs? In other words, wouldn't it be nice if the established GUID
system for all taxon names were the same *now*, at the outset, so it's a
non-issue to combine them all as one set of GUIDs later on?
I'm not entirely sold on LSIDs, but it does seem that a lot of smart and
knowledgeable people are leaning that way.  My hesitation is mainly that one
of the main reasons for leaning that way is that all sorts of software
already exists for resolving them, so there is less overhead in initial
implementation.  As long as LSIDs meet long-term needs, that shouldn't be a
problem.  But 50 years from now, I'm not sure how wise it will seem that the
universal GUID system adopted for biological data was influenced strongly by
the available software of the time.  Imagine being locked in now to a
universal system that was designed based on software that was available in
1955!
But, not being able to predict which GUID system will be the best in the
context of 2055, we really have no choice but to go with something that
makes a lot of sense now (which is justifiable, in that the delicate
transition from no universal GUIDs to widespread universal GUIDs is best
supported by keeping it as painless as possible during that transition
time).
But I still suggest we do things in a way that maximally keeps our options
open.  For example, in the context of LSIDs, consider different paradigms
for registering the fish name, Mygenus myspecies Hyam (Hi, Roger! :-) )
One paradigm might have each major database create its own LSID:
urn:lsid:catalogoffishes.org:SPNO:123456
urn:lsid:gbif.org:ECAT:876543
urn:lsid:itis.gov:TSN:567890
But then we're burdened with the task of cross-mapping each of these, and
also preserving the legacy IDs in perpetuity after we've eventually
converged on a single taxon name GUID system.
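As a rough illustration of that cross-mapping burden, the Python sketch below
parses the three example LSIDs and lists the equivalence assertions that
someone would have to maintain if every pair were mapped directly. The
Lsid type and the helper names are invented for this example; only the
urn:lsid:authority:namespace:object layout comes from the LSID syntax itself.

```python
from itertools import combinations
from typing import NamedTuple


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str


def parse_lsid(text: str) -> Lsid:
    """Split urn:lsid:authority:namespace:object into its parts."""
    parts = text.split(":")
    if len(parts) < 5 or parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    return Lsid(parts[2], parts[3], parts[4])


# The three per-database LSIDs above, all meant to denote the same name.
same_name = [parse_lsid(s) for s in (
    "urn:lsid:catalogoffishes.org:SPNO:123456",
    "urn:lsid:gbif.org:ECAT:876543",
    "urn:lsid:itis.gov:TSN:567890",
)]

# With direct pairwise mapping, each pair needs its own equivalence record.
cross_links = list(combinations(same_name, 2))
for a, b in cross_links:
    print(f"{a.authority}:{a.object_id} == {b.authority}:{b.object_id}")


def links_needed(n_databases: int) -> int:
    """Pairwise equivalence records required for one name across n databases."""
    return n_databases * (n_databases - 1) // 2
```

Under this pairwise assumption the number of records per name grows
quadratically with the number of databases, and all of them would still have
to be kept after any later convergence on a single GUID system.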
I was going to illustrate several other paradigms, but soccer departure time
approaches, so I'll cut to the chase.  In the LSID paradigm, I would propose
the following system:
urn:lsid:bioregistry.org:[Data Domain]:[randomly generated 64-bit integer]
The "bioregistry.org" part represents the decoupling of the GUID from the
institution that initially created the GUID.  It encompasses all domains of
biological data (taxon names, concepts, specimens, etc.).  It could be
"tdwg.org" or "gbif.org", but we're not sure those organizations will be
around 50 or 100 years from now.  I imagine that GBIF would create and
manage the bioregistry.org domain for the near-term.
The "Data Domain" represents a tag for the main domain of data (e.g.
"Specimens", or "TaxonNames", or whatever the major information domains end
up being).
The randomly generated 64-bit integer would be unique across all data
domains, so that it, by itself, is unique within bioregistry.org (no time
now to explain the rationale for this...)
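A minimal Python sketch of how such identifiers might be minted follows. It
assumes a single in-memory pool of issued integers shared by all data
domains; the Registry class and its methods are hypothetical, and a real
service would need persistent, shared storage rather than a dictionary.

```python
import secrets


class Registry:
    """Toy registry: one pool of 64-bit integers shared by every data domain."""

    def __init__(self, authority: str = "bioregistry.org"):
        self.authority = authority
        self._issued = {}  # integer -> data domain it was issued under

    def mint(self, domain: str) -> str:
        """Draw random 64-bit integers until one is unused, then register it."""
        while True:
            n = secrets.randbits(64)
            if n not in self._issued:  # checked across ALL domains
                self._issued[n] = domain
                return f"urn:lsid:{self.authority}:{domain}:{n}"

    def domain_of(self, n: int) -> str:
        """The integer alone is enough to find its record."""
        return self._issued[n]


registry = Registry()
name_id = registry.mint("TaxonNames")
specimen_id = registry.mint("Specimens")
print(name_id)
print(specimen_id)
```

Because uniqueness is checked against one pool rather than per domain, the
integer on its own identifies the object within bioregistry.org, and the
"Data Domain" tag becomes descriptive rather than essential.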
Gotta run....more later.
Aloha,
Rich