GUIDs, LSIDs, and metadata

Sun Sep 11 23:36:53 CEST 2005

On 11 Sep 2005, at 20:55, Kevin Richards wrote:

> I think LSIDs are assumed to solve all conflicts in the various
> datasets of taxonomic data.  However they are JUST resolvable IDs,
> anything else is infrastructure surrounding the LSID mechanisms.  An
> LSID refers to a specific set of bytes that resides on some computer
> somewhere.  The assumption that an LSID will refer to, for example, a
> global 'taxon concept' that all other taxon records should point to,
> is not correct.  This relies on a system to be in place that provides
> the functionality for this global repository.
>

I hope people don't make this assumption, because it's obviously
erroneous. LSIDs just provide a mechanism to assign GUIDs to metadata
and data. Whether there will be a global taxon concept is a completely
different question. I also doubt that deferring to some central
authority is the best way forward. I suspect the very nature and scale
of the task makes a distributed approach inevitable.

> Also I feel one argument AGAINST LSIDs is that the initial investment
> in infrastructure is large, ie the development and setting up of
> authorities, etc.  So I think this would lean people away from LSIDs,
> not towards them.  The advantage with the LSID mechanism, I think, is
> that it is flexible enough to not rely on existing software and
> internet configuration.
>

Actually it's pretty easy to do this. Basically it can be done with a
bunch of Perl scripts, and some fiddling with the DNS. If you can
program CGI scripts, you can implement LSIDs (like I said, if an
amateur like me can do it, it can't be that hard). Any GUID is going to
need some mechanism for associating the GUID with data, and a mechanism
to ensure uniqueness and persistence of the GUID. LSIDs do represent
more work than, say, simply using URLs. However, I suggest that any
solution is basically going to require (a) some way of resolving a
GUID, and (b) some way to return metadata, data, or both, about a
GUID. LSIDs can do this for us now.
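
To illustrate requirement (a), here is a minimal sketch in Python rather
than Perl; it is not the actual scripts referred to above. It assumes the
usual LSID layout urn:lsid:authority:namespace:object[:revision], and that
a client locates the authority's resolution service by asking DNS for an
SRV record named _lsid._tcp.<authority>. The function and class names are
made up for the example, and no network call is made.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    """The parts of an LSID (names chosen for this sketch)."""
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text: str) -> LSID:
    """Split 'urn:lsid:authority:namespace:object[:revision]' into parts."""
    parts = text.strip().split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    if not all(parts[2:5]):
        raise ValueError(f"empty component in LSID: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    # The authority is a DNS name, so lower-casing it is harmless.
    return LSID(parts[2].lower(), parts[3], parts[4], revision)


def srv_query_name(lsid: LSID) -> str:
    """DNS name to query for the authority's SRV record (assumed convention)."""
    return f"_lsid._tcp.{lsid.authority}"


if __name__ == "__main__":
    example = parse_lsid("urn:lsid:itis.gov:TSN:567890")  # illustrative only
    print(example)
    print(srv_query_name(example))
```

Requirement (b) would then be a request for metadata or data sent to
whatever service that DNS record names.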

> A GUID really needs to refer to a reasonably basic record, eg a name
> object rather than the entire taxon concept (although you could have a
> GUID for either).  This allows these individual components to be
> referenced from other systems/datasets without having to refer to and
> accept the entire concept.  It is probably a good idea to map out
> which sort of taxonomic objects should get GUIDs and how they relate
> to other objects.
>

One could have GUIDs for names and concepts. I think ultimately
anything that is worth making available to other people will/should get
GUIDs. This is, after all, why the web is so powerful -- bits of data
have GUIDs (albeit often rather fragile) in the form of URLs, and
people make use of them by linking (just look at blogs and RSS feeds as
the latest illustration of this power).

Maybe there's a confusion here between globally unique identifiers (a
way of uniquely identifying a bit of data), and a global authority
(specifying a particular view)?

Lastly, while GBIF and/or the commissions for the various codes of
nomenclature may feel they are the obvious authorities for serving
information on taxonomic names, it's not obvious to me that they will,
in fact, be so. Are we really to expect that the commissions will be
issuing GUIDs for all names within 10-15 years? Are we expected to wait
for them, when technically there's no reason why they couldn't start
doing this tomorrow?

The notion that we should wait for these bodies to get their act
together, and that we should defer to them strikes me as a recipe for
disaster (or at least inertia). There are various efforts already
underway out there, and perhaps we need a little healthy competition
and exploration of alternatives.  I suspect this area will be driven by
users and data providers addressing their actual needs, rather than
from "on high".  I take Richard's point that it would be nice to get
this right, but not at the cost of not actually doing something. And
regarding legacy GUIDs, in the case of LSIDs this can be handled fairly
easily via the DNS. It's rather like the case when company a.com buys
company b.com: the DNS record for b.com is changed to map to a.com.
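
The DNS point can be sketched with a toy lookup table standing in for
real DNS records. Everything in it is hypothetical (the host names, the
port, the "names" namespace and the helper resolver_for); all it is meant
to show is that the identifier itself never changes when the authority's
record is repointed.

```python
from typing import Dict, Tuple

# Toy stand-in for DNS: authority -> (host, port) of its LSID resolver.
# All host names and ports here are hypothetical.
srv_records: Dict[str, Tuple[str, int]] = {
    "a.com": ("lsid.a.com", 80),
    "b.com": ("lsid.b.com", 80),
}


def resolver_for(lsid: str, records: Dict[str, Tuple[str, int]]) -> Tuple[str, int]:
    """Look up the resolver for an LSID by its authority (third field)."""
    authority = lsid.split(":")[2].lower()
    return records[authority]


legacy = "urn:lsid:b.com:names:42"  # issued by b.com before the takeover
before = resolver_for(legacy, srv_records)

# a.com takes over b.com: only the DNS record changes, not the identifier.
srv_records["b.com"] = srv_records["a.com"]
after = resolver_for(legacy, srv_records)

print(legacy, before, after)
```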

I think we also need to be careful about the idea of a central registry
of GUIDs if this means that a single body will be responsible for
issuing them. There is a range of alternatives, such as the DOI model.
DOIs have two parts, one generated centrally, the other by the data
provider. There is a central repository of metadata associated with
DOIs (http://www.crossref.org), rather like GBIF has a local copy of
data provided by DiGIR servers. However, local providers are responsible
for providing the content that corresponds to a DOI, and for
constructing the second part of the DOI. In a sense this is pretty much
what my Taxonomic Search Engine does -- it generates LSIDs for the
databases that it queries, but retrieves the metadata on the fly from
the data providers.
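
The two-part structure of a DOI can be sketched as follows. The prefix
10.1234 and the suffixes are invented for illustration, and the helper
names are hypothetical; the only facts relied on are that a DOI prefix
begins with "10." and is separated from the provider-chosen suffix by
the first "/".

```python
from typing import Tuple


def split_doi(doi: str) -> Tuple[str, str]:
    """Return (centrally assigned prefix, provider-chosen suffix)."""
    prefix, sep, suffix = doi.partition("/")  # split at the first "/"
    if not sep or not suffix or not prefix.startswith("10."):
        raise ValueError(f"not a DOI: {doi!r}")
    return prefix, suffix


def mint_doi(prefix: str, local_id: str) -> str:
    """Build a DOI from a provider's prefix and its own local identifier."""
    if not prefix.startswith("10.") or not local_id:
        raise ValueError("need a '10.' prefix and a non-empty local id")
    return f"{prefix}/{local_id}"


if __name__ == "__main__":
    doi = mint_doi("10.1234", "fish/567890")  # made-up prefix and suffix
    print(doi, split_doi(doi))
```

The provider is free to put anything it likes after the prefix, including
its existing internal ids, which is what keeps the central part small.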

This note is starting to lack whatever coherence it might have had at
the start. Perhaps it's time to have some real examples to play with...

Regards

Rod

> Kevin Richards
>
>>>> deepreef at BISHOPMUSEUM.ORG 09/11/05 6:50 AM >>>
> Lots of good discussion points on GUIDs -- thanks, Rod.  I need to get
> two little people to two different soccer (football) games soon, so I
> have no time for an elaborate response.  But I do want to comment on
> one point, which I have been thinking a great deal about lately:
>
>> 7. I think the first priority for assigning GUIDs is museum specimens.
>> For taxon names (if not concepts) this is trivial, given that most
>> name databases have their own, internally unique ids (but not all --
>> those databases that use names as primary keys, or which don't expose
>> integer identifiers will need to rethink their design).
>
> I think it's critical that, whatever GUID system we establish for taxon
> names (and concepts), we do it in the context of the next several
> decades of informatic landscape; not just in the context of immediate
> needs or current political climate.
>
> As you said at the start of your message, GUIDs by themselves are
> trivial.  So the only real difference between establishing a system
> that is intuitive for the current needs and a system that will serve
> longer-term future needs, is a little bit of careful forethought.
>
> Official taxon name registration already exists for one of the major
> Codes of Nomenclature (Bacterial), and within the next fortnight we
> will see a public announcement of a plan for registration in another
> of the major Codes.  I predict that all Codes of nomenclature will
> implement mandatory registration for all new names by about 2010, and
> for all "available" names (i.e., since Linnaeus) within five to ten
> years thereafter.  So the medium-term future landscape in this case
> will be one in which all names are issued a GUID through their
> respective Commission of Nomenclature.
>
> Further, it's not unreasonable to predict that sometime within the
> next few decades we will converge on a unified "BioCode" for all
> organism names, meaning that the longer-term landscape has a single
> set of taxon names.  Wouldn't it be nice, after that time, if we
> didn't have to forever maintain legacy GUIDs? In other words, wouldn't
> it be nice if the established GUID system for all taxon names were the
> same *now*, at the outset, so it's a non-issue to combine them all as
> one set of GUIDs later on?
>
> I'm not entirely sold on LSIDs, but it does seem that a lot of smart
> and knowledgeable people are leaning that way.  My hesitation is
> mainly that one of the main reasons for leaning that way is that all
> sorts of software already exists for resolving them, so there is less
> overhead in initial implementation.  As long as LSIDs meet long-term
> needs, that shouldn't be a problem.  But 50 years from now, I'm not
> sure how wise it will seem that the universal GUID system adopted for
> biological data was influenced strongly by the available software of
> the time.  Imagine being locked in now to a universal system that was
> designed based on software that was available in 1955!
>
> But, not being able to predict which GUID system will be the best in
> the context of 2055, we really have no choice but to go with something
> that makes a lot of sense now (which is justifiable, in that it's also
> very important that the delicate transition from no universal GUIDs to
> widespread universal GUIDs will be best supported by keeping it as
> painless as possible in the context of that transition time).
>
> But I still suggest we do things in a way that maximally keeps our
> options open.  For example, in the context of LSIDs, consider
> different paradigms for registering the fish name, Mygenus myspecies
> Hyam (Hi, Roger! :-) )
>
> One paradigm might have each major database create its own LSID:
>
> urn:lsid:catalogoffishes.org:SPNO:123456
> urn:lsid:gbif.org:ECAT:876543
> urn:lsid:itis.gov:TSN:567890
>
> But then we're burdened with the task of cross-mapping each of these,
> and also preserving the legacy IDs into perpetuity after we've
> eventually converged on a single taxon name GUID system.
>
> I was going to illustrate several other paradigms, but soccer
> departure time approaches, so I'll cut to the chase.  In the LSID
> paradigm, I would propose the following system:
>
> urn:lsid:bioregistry.org:[Data Domain]:[randomly generated 64-bit integer]
>
> The "bioregistry.org" part represents the decoupling of the GUID from
> the institution that initially created the GUID.  It encompasses all
> domains of biological data (taxon names, concepts, specimens, etc.).
> It could be "tdwg.org" or "gbif.org", but we're not sure those
> organizations will be around 50 or 100 years from now.  I imagine that
> GBIF would create and manage the bioregistry.org domain for the
> near-term.
>
> The "Data Domain" represents a tag for the main domain of data (e.g.
> "Specimens", or "TaxonNames", or whatever the major information
> domains end up being).
>
> The randomly generated 64-bit integer would be unique across all data
> domains, so that it, by itself, is unique within bioregistry.org (no
> time now to explain the rationale for this...)
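
As an illustration of the scheme proposed above, here is a minimal
Python sketch of how such identifiers might be issued. The Registry
class and its in-memory set are assumptions made for the example (a
real registry would need persistent storage); the "bioregistry.org"
authority and the data domain tags are taken from the proposal, not
from any existing service.

```python
import secrets
from typing import Set


class Registry:
    """Hypothetical issuer of urn:lsid:<authority>:<domain>:<64-bit int>."""

    def __init__(self, authority: str = "bioregistry.org") -> None:
        self.authority = authority
        # One pool of issued integers shared by every data domain, so the
        # integer alone is unique within the authority.
        self._issued: Set[int] = set()

    def mint(self, data_domain: str) -> str:
        """Issue a new LSID whose object id is a fresh random 64-bit integer."""
        while True:
            number = secrets.randbits(64)  # 0 <= number < 2**64
            if number not in self._issued:
                self._issued.add(number)
                return f"urn:lsid:{self.authority}:{data_domain}:{number}"


if __name__ == "__main__":
    registry = Registry()
    print(registry.mint("TaxonNames"))
    print(registry.mint("Specimens"))
```

Because the check for collisions ignores the data domain, the same
integer can never be issued under both "TaxonNames" and "Specimens".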
>
> Gotta run....more later.
>
> Aloha,
> Rich
>
Professor Roderic D. M. Page
Editor, Systematic Biology
DEEB, IBLS
Graham Kerr Building
University of Glasgow
Glasgow G12 8QP
United Kingdom

Phone:    +44 141 330 4778
Fax:      +44 141 330 2792
email:    r.page at bio.gla.ac.uk
web:      http://taxonomy.zoology.gla.ac.uk/rod/rod.html
reprints: http://taxonomy.zoology.gla.ac.uk/rod/pubs.html

Search for taxon names at http://darwin.zoology.gla.ac.uk/~rpage/portal/