Although a 1:1 mapping between identifiers and identifiable objects might seem the best option from a theoretical point of view, it often turns out to be a utopia in practical situations ("In theory, there is no difference between theory and practice. In practice, there is." Chuck Reid). Assigning multiple identifiers to the same entity is therefore often unavoidable, and in some cases even favourable for tracking and tracing purposes. It is up to the informatics community to build intelligent information systems that can cope with this sort of problem. Here is how we are solving this kind of issue in the field of microbiology.
As microbiologists work with living organisms that are transferred around the globe among research institutions and culture collections, different tags (called strain numbers) are assigned to a single isolate. From an information technology point of view it is thus favourable to link downstream information (literature references, experimental information (e.g. sequences), administrative information) to the specimen level rather than to the taxonomic level, as the latter is vulnerable to change over time (due to differing opinions, changing taxonomy and new identification technologies). As such, taxonomic status becomes decoupled from the downstream information, with the specimen standing at the intermediate level.
Coping with different strain numbers that tag the same specimen can be resolved by maintaining the equivalence relation between strain numbers in a central repository (just as you could do to keep track of synonymous taxonomic names). This is what is done in the Integrated Strain Database, where the equivalence relation is managed automatically by applying accumulative learning principles (the transitive closure is calculated so that each new strain number is placed incrementally into an equivalence class). Currently, information is gathered from 42 microbial culture collections that cover all of the earth's continents and range from small niche-specific research collections to large general-purpose service collections. In addition, the information extracted from two lists of bacterial type strains is incorporated. This integration process has so far lumped over 600,000 strain numbers into some 250,000 equivalence classes that represent different strains of bacteria, archaea, filamentous fungi and yeasts.
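The incremental placement of strain numbers into equivalence classes can be sketched with a union-find (disjoint-set) structure, which keeps the transitive closure of pairwise equivalences up to date as they arrive. This is only a minimal illustration under my own assumptions, not the actual Integrated Strain Database code; the class name, method names and strain numbers are made up.

```python
class StrainEquivalence:
    """Disjoint-set over strain numbers (illustrative names only)."""

    def __init__(self):
        self._parent = {}

    def _find(self, strain):
        # A strain number seen for the first time starts its own class.
        self._parent.setdefault(strain, strain)
        root = strain
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every visited node directly at the root.
        while self._parent[strain] != root:
            self._parent[strain], strain = root, self._parent[strain]
        return root

    def add_equivalence(self, a, b):
        """Record that two strain numbers tag the same strain."""
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def same_strain(self, a, b):
        """True if a and b ended up in the same class (registers unseen numbers)."""
        return self._find(a) == self._find(b)

    def classes(self):
        """Return the current equivalence classes as a list of sets."""
        groups = {}
        for strain in list(self._parent):
            groups.setdefault(self._find(strain), set()).add(strain)
        return list(groups.values())


# Hypothetical strain numbers reported by three collections.
eq = StrainEquivalence()
eq.add_equivalence("COLL-A 1", "COLL-B 7")
eq.add_equivalence("COLL-B 7", "COLL-C 42")
eq.add_equivalence("COLL-A 2", "COLL-D 9")
```

Here "COLL-A 1" and "COLL-C 42" land in the same class although no source links them directly, which is the transitive step described above.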
As we live in an imperfect world, special attention has been paid to detecting and correcting errors within the equivalence classes that arise from irregularities in the data provided by the underlying information sources. To this end we designed novel intelligent tools that automatically discover violations of consistency in the integrated information. Just to give you an impression of why it is necessary to check the information coming from different sources: without thorough quality control of the integrated information, at least 719 (11.89%) of the bacterial type strains would have been affected by illegitimate merges into single equivalence classes.
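I have not described here how illegitimate merges are detected. Purely as an illustration, one plausible check (an assumption, not necessarily what the Integrated Strain Database does) is to flag every equivalence class that contains type strains listed for more than one species name. Synonymous names can trigger false alarms, so flagged classes would still need review. The function name and input format below are hypothetical.

```python
def suspicious_classes(classes, type_strain_of):
    """Return classes holding type strains of more than one species name.

    classes: iterable of sets of strain numbers.
    type_strain_of: dict mapping a strain number to the species name it is
        listed as type strain for (hypothetical input format).
    """
    flagged = []
    for members in classes:
        # Collect the species names anchored by type strains in this class.
        species = {type_strain_of[s] for s in members if s in type_strain_of}
        if len(species) > 1:
            flagged.append((sorted(members), sorted(species)))
    return flagged


# Made-up example data: the first class mixes two type strains.
example_classes = [{"X 1", "Y 2"}, {"X 3", "Y 4"}]
example_types = {"X 1": "Genus alpha", "Y 2": "Genus beta", "X 3": "Genus gamma"}
flagged = suspicious_classes(example_classes, example_types)
```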
While the strain equivalence classes are calculated incrementally, new unique identifiers are assigned to strain numbers that were not previously encountered during the integration procedure. This helps to resolve some of the ambiguities that are a logical consequence of the local nature of the strain number assignment process, and makes it possible to record context-dependent resolutions of ambiguous strain numbers, which often require some form of human intervention. Recording these resolutions is important so that the results of the tedious disambiguation of existing cross-references are preserved for correct machine interpretation in the future. Moreover, it turns out that the information content of the Integrated Strain Database offers the perfect semantic context to guide the disambiguation process in a number of ways.
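To make the idea of context-dependent resolution concrete, here is a small sketch. It assumes that an ambiguous strain number can be told apart by some context label (for instance the collection that issued it); that assumption, the class name and the method names are mine and are not taken from the Integrated Strain Database.

```python
import itertools


class StrainNumberRegistry:
    """Assigns internal identifiers to (strain number, context) pairs."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self._ids = {}  # strain number -> {context label: internal id}

    def register(self, strain_number, context):
        """Return the id for this pair, creating one on first encounter."""
        by_context = self._ids.setdefault(strain_number, {})
        if context not in by_context:
            by_context[context] = next(self._next_id)
        return by_context[context]

    def resolve(self, strain_number, context=None):
        """Return the id, or None if the number is unknown or ambiguous."""
        by_context = self._ids.get(strain_number, {})
        if context is not None:
            return by_context.get(context)
        if len(by_context) == 1:
            return next(iter(by_context.values()))
        return None  # unknown, or ambiguous without a context


# Hypothetical data: "S 1" is used by two collections for different strains.
registry = StrainNumberRegistry()
first = registry.register("S 1", "collection A")
second = registry.register("S 1", "collection B")
third = registry.register("S 2", "collection A")
```

Without a context, `registry.resolve("S 1")` returns None because the number is ambiguous, whereas `registry.resolve("S 2")` can be answered directly.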
To demonstrate the potential of this approach to fill the gap left by the absence of a universally adopted system for assigning and recognizing persistent and unique identifiers for biological resources, we have set up a portal called StrainInfo.net (www.straininfo.net), where we have consolidated the strain information captured within the Integrated Strain Database with relevant sequences and literature references assembled from public repositories. Not only does this offer a de-duplicated view of the downstream information that is available on micro-organisms worldwide, it also allows all sorts of dynamic queries to be executed that automatically bridge multiple web services that were physically separated before the integration process. The presented cross-reference model will, however, only show its full dynamic strength when reverse references to the Integrated Strain Database are included in third-party databases, thus establishing a true divide-and-conquer strategy for tracking related information within autonomously operating biological information sources.
It seems that the solutions worked into the StrainInfo.net portal have much in common with the problems encountered when integrating taxonomic names into a single coherent system. In this context I also recommend that the Taxonomic Databases Working Group take a look at the experimental work done by George Garrity of Bergey's Manual Trust to work bacterial taxonomic names into the DOI framework. After all, it seems to me that the DOI framework currently offers a far more extensive set of software solutions and organisational arrangements than LSIDs do at present. An essential thing that seems to be missing in the latter framework is a well-thought-out business plan to guarantee the long-term survival of the GUID system. It also seems a bit like reinventing the wheel to me to overlook a system that has already gone through the 'proof-of-principle' stage. We already have a morbid growth of identifiers piling up in our information systems, so it would be unwise to put our effort into the proliferation of network standards that cover the same domain.
Further reading:
[1] P. Dawyndt, M. Vancanneyt, H. De Meyer & J. Swings (2005). Knowledge Accumulation and Resolution of Data Inconsistencies during the Integration of Microbial Information Sources. IEEE Transactions on Knowledge and Data Engineering 17(8), 1111-1126. http://doi.ieeecomputersociety.org/10.1109/TKDE.2005.131
[2] P. Dawyndt, M. Vancanneyt & J. Swings (2004). On the integration of microbial information. WFCC Newsletter 38, 19-34. http://wdcm.nig.ac.jp/wfcc/NEWSLETTER/newsletter38/a3.pdf
[3] P. Dawyndt, B. De Baets, X. Zhou, J. Ma & J. Swings. StrainInfo.net: Holding a wealth of downstream information on microbial resources right in our hands. http://www.cpdr.ucl.ac.be/bioinf/papers/bioinf/bioinf_dawyndt.pdf
Also check out the background document and discussion papers that came out of the specialist workshop on "Exploring and exploiting microbiological commons: contributions of bioinformatics and intellectual property rights in sharing biological information" at http://lmg.ugent.be/bioinf-ipr/
Cheers,
Peter Dawyndt
-------------------------------------------------------------------------------
Peter Dawyndt
email: P e t e r . D a w y n d t @ U G e n t . b e
phone: +32 (0)9 264 5132
fax: +32 (0)9 264 5092
contact addresses:
Laboratory of Microbiology, Ghent University, K. L. Ledeganckstraat 35, B-9000 Ghent, Belgium.
Department of Applied Mathematics, Biometrics and Process Control, Ghent University, Coupure links 653, B-9000 Gent, Belgium.
-------------------------------------------------------------------------------