GUIDs, LSIDs, and metadata

Mon Sep 12 09:40:17 CEST 2005

Although a 1:1 mapping between identifiers and identifiable objects might seem
the best option from a theoretical point of view, it often turns out to be
utopian in practical situations ("In theory, there is no difference between
theory and practice. In practice, there is." Chuck Reid). Assigning multiple
identifiers to the same entity is therefore often unavoidable, and in some
cases even favourable for tracking and tracing purposes. It is up to the
informatics community to build intelligent information systems that can cope
with this sort of problem. Here is an account of how we are solving this kind
of issue in the field of microbiology.

As microbiologists work with living organisms that are transferred around the
globe among research institutions and culture collections, different tags
(called strain numbers) are assigned to a single isolate. From an information
technology point of view it is thus favourable to link downstream information
(literature references, experimental information such as sequences,
administrative information) to the specimen level, and not to the taxonomic
level, as the latter is subject to change over time (due to differing
opinions, changing taxonomy and new identification technologies). As such,
taxonomic status becomes decoupled from the downstream information, with the
specimen standing at the intermediate level.

Different strain numbers that tag the same specimen can be resolved by
maintaining an equivalence relation over the strain numbers in a central
repository (just as you could do to keep track of synonymous taxonomic names).
This is what is done in the Integrated Strain Database, where the equivalence
relation is managed automatically by applying accumulative learning principles
(computing the transitive closure to place each new strain number
incrementally into an equivalence class). Currently, information is gathered
from 42 microbial culture collections that cover all of the earth's continents
and range from small niche-specific research collections to large
general-purpose service collections. In addition, the information extracted
from two lists of bacterial type strains is incorporated. This integration
process has so far lumped over 600,000 strain numbers into some 250,000
equivalence classes that represent different strains of bacteria, archaea,
filamentous fungi and yeasts.
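
To make the incremental placement of strain numbers into equivalence classes
more concrete, here is a minimal sketch in Python. It is not the code behind
the Integrated Strain Database; it assumes a disjoint-set (union-find)
structure as one common way of maintaining a transitive closure
incrementally. The class name, method names and strain numbers ("AAA 1",
"BBB 22", ...) are all made up for illustration.

```python
class StrainEquivalence:
    """Hypothetical sketch: equivalence classes over strain numbers,
    maintained incrementally with a disjoint-set (union-find) forest."""

    def __init__(self):
        # Maps each known strain number to its parent in the forest.
        self._parent = {}

    def _find(self, number):
        # A strain number seen for the first time starts its own class.
        self._parent.setdefault(number, number)
        root = number
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every node on the path at the root.
        while self._parent[number] != root:
            self._parent[number], number = root, self._parent[number]
        return root

    def add_equivalence(self, a, b):
        """Record that strain numbers a and b tag the same strain."""
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self._parent[root_b] = root_a  # merge the two classes

    def same_strain(self, a, b):
        """True if a and b are (transitively) known to be equivalent."""
        return self._find(a) == self._find(b)

    def classes(self):
        """Return the current equivalence classes as a list of sets."""
        groups = {}
        for number in list(self._parent):
            groups.setdefault(self._find(number), set()).add(number)
        return list(groups.values())


# Made-up strain numbers, as if reported by three different catalogues.
eq = StrainEquivalence()
eq.add_equivalence("AAA 1", "BBB 22")     # catalogue 1: AAA 1 = BBB 22
eq.add_equivalence("CCC 333", "DDD 4")    # catalogue 2: CCC 333 = DDD 4
eq.add_equivalence("BBB 22", "CCC 333")   # catalogue 3 links both classes
print(eq.same_strain("AAA 1", "DDD 4"))
```

Note that each new pair is absorbed without recomputing anything from scratch.
The same property is what makes a single erroneous cross-reference dangerous:
one wrong pair silently merges two entire classes.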

As we live in an imperfect world, special attention has been paid to detecting
and correcting errors within the equivalence classes that arise from
irregularities in the data provided by the underlying information sources. To
this end, novel intelligent tools have been designed that automatically
discover breaches in the consistency of the integrated information. To give an
impression of why the information coming from different sources needs
checking: without thorough quality control of the integrated information, at
least 719 (11.89%) of the bacterial type strains would have been affected by
illegitimate merges into single equivalence classes.

While the strain equivalence classes are calculated incrementally, new unique
identifiers are assigned to strain numbers that were not previously
encountered during the integration procedure. This helps to resolve some of
the ambiguities that are a logical consequence of the local nature of the
strain number assignment process, and makes it possible to record
context-dependent resolutions of ambiguous strain numbers, which often require
some form of human intervention. Recording them is important so that the
tedious disambiguation of existing cross-references is preserved for correct
machine interpretation in the future. Moreover, it turns out that the
information content of the Integrated Strain Database offers the perfect
semantic context to guide the disambiguation process in a number of ways.
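
As a rough illustration of the two ideas in the previous paragraph, the
Python sketch below hands out a fresh identifier the first time a strain
number is seen and lets a curator record a context-dependent decision for an
ambiguous one. Everything in it is an assumption made for the example: the
class and method names, the use of plain integers as identifiers, and the
notion that a "context" is simply a label such as the catalogue in which the
strain number was found. It does not describe how the Integrated Strain
Database is implemented.

```python
from itertools import count


class StrainNumberRegistry:
    """Hypothetical sketch: identifiers for strain numbers, with curated,
    context-dependent overrides for ambiguous ones."""

    def __init__(self):
        self._counter = count(1)
        self._default = {}   # strain number -> identifier
        self._curated = {}   # (strain number, context) -> identifier

    def identifier(self, number, context=None):
        """Return the identifier for a strain number in a given context."""
        # A recorded human decision for this context takes precedence.
        if (number, context) in self._curated:
            return self._curated[(number, context)]
        # Otherwise assign an identifier on first encounter and reuse it.
        if number not in self._default:
            self._default[number] = next(self._counter)
        return self._default[number]

    def disambiguate(self, number, context):
        """Record that, in this context, the number denotes a distinct
        strain, and return the new identifier assigned to it."""
        ident = next(self._counter)
        self._curated[(number, context)] = ident
        return ident


# Made-up example: "XYZ 7" turns out to mean something else in catalogue B.
registry = StrainNumberRegistry()
first = registry.identifier("XYZ 7", context="catalogue A")
other = registry.disambiguate("XYZ 7", context="catalogue B")
print(first != other)
print(registry.identifier("XYZ 7", context="catalogue B") == other)
```

Once stored, the curator's decision is applied automatically whenever the
same strain number is met again in the same context, so the manual work does
not have to be repeated.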

To demonstrate the potential of this approach to fill the gap left by the
absence of a universally adopted system for assigning and recognizing
persistent and unique identifiers for biological resources, we have set up a
portal system called StrainInfo.net (www.straininfo.net), where we have
consolidated the strain information captured within the Integrated Strain
Database with relevant sequences and literature references assembled from
public repositories. Not only does this offer a de-duplicated view of the
downstream information that is available on micro-organisms worldwide, it also
allows all sorts of dynamic queries to be executed that automatically bridge
multiple web services that were physically separated before the integration
process. The presented cross-reference model will, however, only show its full
dynamic strength when reverse references to the Integrated Strain Database are
included in third-party databases, thus establishing a true divide-and-conquer
strategy for tracking related information within autonomously operating
biological information sources.

It seems that the solutions worked into the StrainInfo.net portal have much in
common with the problems encountered in integrating taxonomic names into a
single coherent system. In this context I also recommend that the Taxonomic
Databases Working Group take a look at the experimental work done by George
Garrity of Bergey's Manual Trust to work bacterial taxonomic names into the
DOI framework. After all, it seems to me that the DOI framework currently
offers a far more extensive set of software solutions and organisational
arrangements than LSIDs do at present. An essential thing that seems to be
missing in the LSID framework is a well-thought-out business plan to guarantee
the long-term survival of the GUID system. It also seems a bit like
reinventing the wheel to me to overlook a system that has already gone through
the 'proof-of-principle' stage. We already have a morbid growth of identifiers
piling up in our information systems, so it would be unwise to put our effort
into the proliferation of network standards that cover the same domain.