Although a 1:1 mapping between identifiers and identifiable objects might seem
the best option from a theoretical point of view, it often turns out to be
utopian in practical situations ("In theory, there is no difference between
theory and practice. In practice, there is." Chuck Reid). The assignment of
multiple identifiers to the same entity is therefore often unavoidable, and in
some cases even favourable for tracking and tracing purposes. It is up to the
informatics community to build intelligent information systems that can cope
with this sort of problem. Here is how we are tackling these issues in the
field of microbiology.
As microbiologists work with living organisms that are transferred around the
globe among research institutions and culture collections, different tags
(called strain numbers) are assigned to a single isolate. From an information
technology point of view it is thus favourable to link downstream information
(literature references, experimental information such as sequences,
administrative information) to the specimen level and not to the taxonomic
level, as the latter is vulnerable to change over time (due to differing
opinions, changing taxonomy and new identification technologies). As such,
taxonomic status becomes decoupled from the downstream information, with the
specimen standing at the intermediate level.
Coping with different strain numbers that tag the same specimen can be resolved
by maintaining an equivalence relation on the strain numbers in a central
repository (just as you could do to keep track of synonymous taxonomic names).
This is what is done in the Integrated Strain Database, where the equivalence
relation is managed automatically by applying accumulative learning principles
(calculating the transitive closure to place new strain numbers incrementally
into equivalence classes). Currently, information is gathered from 42 microbial
culture collections that cover all of Earth's continents and range from small
niche-specific research collections to large general-purpose service
collections. In addition, information extracted from two lists of bacterial type
strains is incorporated. This integration process has so far lumped over 600,000
strain numbers into some 250,000 equivalence classes that represent different
strains of bacteria, archaea, filamentous fungi and yeasts.
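To make the incremental placement concrete, here is a minimal Python sketch. It
is not the Integrated Strain Database implementation: it swaps in a union-find
(disjoint-set) structure, which keeps the classes of the transitive closure up
to date as equivalences arrive one at a time. The class name, the method names
and the strain numbers in the usage lines are all made up for illustration.

```python
class StrainEquivalence:
    """Union-find over strain numbers (hypothetical helper, illustration only)."""

    def __init__(self):
        # Maps each strain number to its parent; a root is its own parent.
        self.parent = {}

    def find(self, strain):
        """Return the representative of the class that `strain` belongs to."""
        # A strain number not seen before starts out as its own class.
        self.parent.setdefault(strain, strain)
        root = strain
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the walked path at the root.
        while self.parent[strain] != root:
            self.parent[strain], strain = root, self.parent[strain]
        return root

    def merge(self, a, b):
        """Record that strain numbers `a` and `b` tag the same strain."""
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

    def classes(self):
        """Return the current equivalence classes as a list of sets."""
        groups = {}
        for strain in list(self.parent):
            groups.setdefault(self.find(strain), set()).add(strain)
        return list(groups.values())


# Usage with invented strain numbers:
eq = StrainEquivalence()
eq.merge("AAA 1", "BBB 2")  # one catalogue lists these as the same strain
eq.merge("BBB 2", "CCC 3")  # another catalogue links these two
eq.find("DDD 4")            # a strain number with no known equivalents yet
print(eq.classes())
```

Because each merge only touches the two classes involved, new catalogue data
can be folded in without recomputing the whole relation. Note that a single
wrong equivalence silently merges two classes, which is why the quality control
discussed next matters.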
As we live in an imperfect world, special attention has been paid to detecting
and correcting errors within the equivalence classes that arise from
irregularities in the data provided by the underlying information sources. To
this end we designed novel intelligent tools that automatically discover
violations of consistency in the integrated information. To give an impression
of why checking the information from different sources is necessary: without
thorough quality control of the integrated information, at least 719 (11.89%)
of the bacterial type strains would have been affected by illegitimate merges
into single equivalence classes.
While incrementally calculating the strain equivalence classes, new unique
identifiers are assigned to strain numbers that were not previously encountered
during the integration procedure. This helps to resolve some of the ambiguities
that are a logical consequence of the local nature of the strain number
assignment process, and makes it possible to record context-dependent
resolutions of ambiguous strain numbers, which often require some form of human
intervention. Recording them is important to preserve the outcome of the tedious
disambiguation of existing cross-references, so that machines can interpret them
correctly in the future. Moreover, it turns out that the information content of
the Integrated Strain Database offers the perfect semantic context to guide the
disambiguation process in a number of ways.
To demonstrate the potential of this approach to fill the gap where there is no
universally adopted system for assigning and recognizing persistent and unique
identifiers for biological resources, we have set up a portal system called
StrainInfo.net (www.straininfo.net), where we have consolidated the strain
information captured within the Integrated Strain Database with relevant
sequences and literature references assembled within public repositories. Not
only does this offer a de-duplicated view of the downstream information that is
available on micro-organisms worldwide, it also allows for the execution of
all sorts of dynamic queries that automatically bridge multiple web
services that were physically separated before the integration process. The
presented cross-reference model will, however, only show its full dynamic
strength when reverse references to the Integrated Strain Database are included
in third-party databases, thus establishing a true divide-and-conquer strategy
for tracking related information within autonomously operating biological
information sources.
It seems that the solutions worked into the StrainInfo.net portal have much in
common with the problems encountered when integrating taxonomic names into a
single coherent system. In this context I also recommend that the Taxonomic
Databases Working Group take a look at the experimental work done by George
Garrity of Bergey's Manual Trust to work bacterial taxonomic names into the DOI
framework. After all, it seems to me that the DOI framework currently offers a
far more extensive set of software solutions and organisational arrangements
than LSIDs do at present. An essential thing that seems to be missing in the
latter framework is a well-thought-out business plan to guarantee the long-term
survival of the GUID system. It also seems a bit like reinventing the wheel to
overlook a system that has already gone through the 'proof-of-principle' stage.
We already have a morbid growth of identifiers piling up in our information
systems, so it would be unwise to put our effort into the proliferation of
network standards that cover the same domain.
Further reading:
[1] P. Dawyndt, M. Vancanneyt, H. De Meyer & J. Swings (2005). Knowledge
Accumulation and Resolution of Data Inconsistencies during the Integration of
Microbial Information Sources. IEEE Transactions on Knowledge and Data
Engineering 17(8), 1111-1126.
http://doi.ieeecomputersociety.org/10.1109/TKDE.2005.131
[2] P. Dawyndt, M. Vancanneyt & J. Swings (2004). On the integration of
microbial information. WFCC Newsletter 38, 19-34.
http://wdcm.nig.ac.jp/wfcc/NEWSLETTER/newsletter38/a3.pdf
[3] P. Dawyndt, B. De Baets, X. Zhou, J. Ma & J. Swings. StrainInfo.net: Holding
a wealth of downstream information on microbial resources right in our hands.
http://www.cpdr.ucl.ac.be/bioinf/papers/bioinf/bioinf_dawyndt.pdf
Also check out the background document and discussion papers that came out of
the specialist workshop on "Exploring and exploiting microbiological commons:
contributions of bioinformatics and intellectual property rights in sharing
biological information" at http://lmg.ugent.be/bioinf-ipr/
Cheers,
Peter Dawyndt
-------------------------------------------------------------------------------
Peter Dawyndt
email: P e t e r . D a w y n d t @ U G e n t . b e
phone: +32 (0)9 264 5132
fax: +32 (0)9 264 5092
contact addresses:
Laboratory of Microbiology,
Ghent University,
K. L. Ledeganckstraat 35,
B-9000 Ghent,
Belgium.
Department of Applied Mathematics,
Biometrics and Process Control,
Ghent University,
Coupure links 653,
B-9000 Gent,
Belgium.
-------------------------------------------------------------------------------
Another GUID system that might be of interest is ARK
(http://www.cdlib.org/inside/diglib/ark/). The site describing this
approach has a nice discussion of the issues surrounding persistent
identifiers, and various solutions such as DOIs and URNs (of which
LSIDs are an example). I think this system should be looked at closely,
before we limit discussion to DOIs or LSIDs.
Regards
Rod
Professor Roderic D. M. Page
Editor, Systematic Biology
DEEB, IBLS
Graham Kerr Building
University of Glasgow
Glasgow G12 8QP
United Kingdom
Phone: +44 141 330 4778
Fax: +44 141 330 2792
email: r.page(a)bio.gla.ac.uk
web: http://taxonomy.zoology.gla.ac.uk/rod/rod.html
reprints: http://taxonomy.zoology.gla.ac.uk/rod/pubs.html
Subscribe to Systematic Biology through the Society of Systematic
Biologists Website: http://systematicbiology.org
Search for taxon names at http://darwin.zoology.gla.ac.uk/~rpage/portal/
Good points.
A few comments I have:
I think LSIDs are assumed to solve all conflicts in the various datasets of taxonomic data. However, they are JUST resolvable IDs; anything else is infrastructure surrounding the LSID mechanism. An LSID refers to a specific set of bytes that resides on some computer somewhere. The assumption that an LSID will refer to, for example, a global 'taxon concept' that all other taxon records should point to is not correct. That relies on a system being in place that provides the functionality for such a global repository.
Also, I feel one argument AGAINST LSIDs is that the initial investment in infrastructure is large, i.e. the development and setting up of authorities, etc. So I think this would lean people away from LSIDs, not towards them. The advantage of the LSID mechanism, I think, is that it is flexible enough not to rely on existing software and internet configuration.
A GUID really needs to refer to a reasonably basic record, e.g. a name object rather than the entire taxon concept (although you could have a GUID for either). This allows these individual components to be referenced from other systems/datasets without having to refer to and accept the entire concept. It is probably a good idea to map out which sorts of taxonomic objects should get GUIDs and how they relate to other objects.
Kevin Richards
>>> deepreef(a)BISHOPMUSEUM.ORG 09/11/05 6:50 AM >>>
Lots of good discussion points on GUIDs -- thanks, Rod. I need to get two
little people to two different soccer (football) games soon, so I have no
time for an elaborate response. But I do want to comment on one point,
which I have been thinking a great deal about lately:
> 7. I think the first priority for assigning GUIDs is museum specimens.
> For taxon names (if not concepts) this is trivial, given that most name
> databases have their own, internally unique ids (but not all -- those
> databases that use names as primary keys, or which don't expose integer
> identifiers will need to rethink their design).
I think it's critical that, whatever GUID system we establish for taxon
names (and concepts), we do it in the context of the next several decades of
informatic landscape; not just in the context of immediate needs or current
political climate.
As you said at the start of your message, GUIDs by themselves are trivial.
So the only real difference between establishing a system that is intuitive
for the current needs and a system that will serve longer-term future needs,
is a little bit of careful forethought.
Official taxon name registration already exists for one of the major Codes
of Nomenclature (Bacterial), and within the next fortnight we will see a
public announcement of a plan for registration in another of the major
Codes. I predict that all Codes of nomenclature will implement mandatory
registration for all new names by about 2010, and for all "available" names
(i.e., since Linnaeus) within five to ten years thereafter. So the
medium-term future landscape in this case will be one in which all names are
issued a GUID through their respective Commission of Nomenclature.
Further, it's not unreasonable to predict that sometime within the next few
decades we will converge on a unified "BioCode" for all organism names,
meaning that the longer-term landscape has a single set of taxon names.
Wouldn't it be nice, after that time, if we didn't have to forever maintain
legacy GUIDs? In other words, wouldn't it be nice if the established GUID
system for all taxon names were the same *now*, at the outset, so it's a
non-issue to combine them all as one set of GUIDs later on?
I'm not entirely sold on LSIDs, but it does seem that a lot of smart and
knowledgeable people are leaning that way. My hesitation is mainly that one
of the main reasons for leaning that way is that all sorts of software
already exists for resolving them, so there is less overhead in initial
implementation. As long as LSIDs meet long-term needs, that shouldn't be a
problem. But 50 years from now, I'm not sure how wise it will seem that the
universal GUID system adopted for biological data was influenced strongly by
the available software of the time. Imagine being locked in now to a
universal system that was designed based on software that was available in
1955!
But, not being able to predict which GUID system will be the best in the
context of 2055, we really have no choice but to go with something that
makes a lot of sense now (which is justifiable, in that it's also very
important that the delicate transition from no universal GUIDs to widespread
universal GUIDs will be best supported by keeping it as painless as possible
in the context of that transition time).
But I still suggest we do things in a way that maximally keeps our options
open. For example, in the context of LSIDs, consider different paradigms
for registering the fish name, Mygenus myspecies Hyam (Hi, Roger! :-) )
One paradigm might have each major database create its own LSID:
urn:lsid:catalogoffishes.org:SPNO:123456
urn:lsid:gbif.org:ECAT:876543
urn:lsid:itis.gov:TSN:567890
But then we're burdened with the task of cross-mapping each of these, and
also preserving the legacy IDs in perpetuity after we've eventually
converged on a single taxon name GUID system.
I was going to illustrate several other paradigms, but soccer departure time
approaches, so I'll cut to the chase. In the LSID paradigm, I would propose
the following system:
urn:lsid:bioregistry.org:[Data Domain]:[randomly generated 64-bit integer]
The "bioregistry.org" part represents the decoupling of the GUID from the
institution that initially created the GUID. It encompasses all domains of
biological data (taxon names, concepts, specimens, etc.). It could be
"tdwg.org" or "gbif.org", but we're not sure those organizations will be
around 50 or 100 years from now. I imagine that GBIF would create and
manage the bioregistry.org domain for the near-term.
The "Data Domain" represents a tag for the main domain of data (e.g.
"Specimens", or "TaxonNames", or whatever the major information domains end
up being).
The randomly generated 64-bit integer would be unique across all data
domains, so that it, by itself, is unique within bioregistry.org (no time
now to explain the rationale for this...)
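A minimal Python sketch of the proposed scheme, under stated assumptions: the
authority "bioregistry.org" and the data domain names are the placeholders from
the proposal above, the function names are hypothetical, and the in-memory set
`issued` merely stands in for whatever persistent store a real registry would
use to guarantee uniqueness. The parser ignores the optional revision component
that LSIDs may carry.

```python
import re
import secrets

AUTHORITY = "bioregistry.org"  # placeholder authority from the proposal

# urn:lsid:<authority>:<namespace>:<object id>, "urn:lsid" case-insensitive.
LSID_PATTERN = re.compile(r"^urn:lsid:([^:]+):([^:]+):([^:]+)$", re.IGNORECASE)


def mint_lsid(data_domain, issued):
    """Mint an LSID whose object id is a random 64-bit integer.

    `issued` holds every integer handed out so far, regardless of data
    domain, so the integer alone is unique within the authority.
    """
    while True:
        object_id = secrets.randbits(64)
        if object_id not in issued:
            issued.add(object_id)
            return f"urn:lsid:{AUTHORITY}:{data_domain}:{object_id}"


def parse_lsid(lsid):
    """Split an LSID into (authority, namespace, object id) strings."""
    match = LSID_PATTERN.match(lsid)
    if match is None:
        raise ValueError(f"not an LSID: {lsid!r}")
    return match.groups()


# Usage: one shared pool of integers for every data domain.
issued = set()
name_id = mint_lsid("TaxonNames", issued)
specimen_id = mint_lsid("Specimens", issued)
print(name_id, specimen_id)
print(parse_lsid("urn:lsid:itis.gov:TSN:567890"))
```

Drawing the integers from one shared pool is what makes the number unique by
itself, independent of the data domain tag.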
Gotta run....more later.
Aloha,
Rich
> This thread has also raised the issue of mapping between multiple
> GUIDs. I think it is inevitable that we will have to deal with this,
I absolutely agree. But at the same time, I think we should try (at least)
to minimize duplicate GUID issuance to the same object (and we should
certainly not encourage it!)
> especially as there already exist major databases containing taxonomic
> information. For example, consider the task of mapping between
> mammalian names in DiGIR providers, and those used in GenBank (a
> relatively straightforward problem).
I think it would be a mistake to try to map DiGIR specimen/observation
instances directly to GenBank sequences via taxon names (although I think
GenBank sequences should be mapped directly to the specimen record from
which it was drawn, but that's independent of the taxonomy). Putting aside
the question of whether gene sequence blocks should fall within the same
data domain as specimen/observation objects, in both cases the taxon name is
a secondary attribute.
Instead, owners of each record should map their objects (specimen or
sequence) to the same universal GUID for the taxon name (or, preferably, to
the same taxon concept GUID) to which the specimen/observation/sequence
instance has been assigned. That way, when someone queries on the name (or
concept), the relevant DiGIR and GenBank objects show up in the results
because they are mapped via a common taxon GUID.
Consider the alternative where the DiGIR provider created its own taxon
GUID, separate from the taxon GUID assigned for the GenBank sequence. We'd
still be left with the task of mapping those two separate GUIDs as
representing the same taxon object (be it a name or a concept).
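As a rough Python sketch of the first arrangement: every record carries the
same taxon GUID, so a query on that GUID finds specimens and sequences alike
without any name matching. All record identifiers and GUID values below are
invented, and the function name is hypothetical.

```python
from collections import defaultdict

# Invented taxon GUIDs, following the placeholder scheme discussed earlier.
TAXON_A = "urn:lsid:bioregistry.org:TaxonNames:1001"
TAXON_B = "urn:lsid:bioregistry.org:TaxonNames:1002"

# Invented records; each data owner maps its own objects to the shared GUID.
records = [
    {"source": "DiGIR", "id": "specimen-001", "taxon_guid": TAXON_A},
    {"source": "GenBank", "id": "sequence-001", "taxon_guid": TAXON_A},
    {"source": "DiGIR", "id": "specimen-002", "taxon_guid": TAXON_B},
]


def index_by_taxon(records):
    """Group (source, id) pairs under the taxon GUID they are mapped to."""
    index = defaultdict(list)
    for record in records:
        index[record["taxon_guid"]].append((record["source"], record["id"]))
    return index


# Usage: one lookup returns objects from both providers.
index = index_by_taxon(records)
print(index[TAXON_A])
```

Had each provider minted its own taxon GUID, the lookup would first need a
cross-mapping table between the two GUIDs, which is exactly the extra task
described above.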
> In some lucky cases where we have
> specimen information in GenBank we can tie the two together that way,
Agreed! But that's a completely separate issue from how either instance is
mapped to a taxon GUID.
Maybe not completely separate. I don't think gene sequences should be
considered in the same data domain as specimens. They fit better with
morphological characters. In the ideal world (and admittedly, this may be
out of reach in the immediate future), neither sequences nor morphological
characters should link directly to taxon objects (names or concepts), but
rather inherit taxonomic attributes from a specimen object to which they are
attached (regardless of whether or not the specimen was vouchered in a
museum). Like I said, I'm probably reaching too far on this one.
> but for other names/sequences we aren't this lucky. If our databases
> are distributed, and run by organisations with different goals and
> agendas (I doubt biodiversity rates highly in NCBI's list of things to
> do), we will have to deal with this.
Agreed. And I think the cleanest way to deal with it is for the "public"
data domains (literature citations and taxon names & concepts) to be
established via a single mechanism for assigning GUIDs (shooting as best we
can for 1:1 GUID:object instance), and then all of the "private" data
domains (specimens, sequences, characters, microbial cultures, etc.) be
managed by their respective data owners, and the onus would be on them to
map their taxonomic & literature links to the common universal GUID system.
Aloha,
Rich
P.S. I promised I would shut up, and I apologize for breaking that promise.
Richard L. Pyle, PhD
Database Coordinator for Natural Sciences
Department of Natural Sciences, Bishop Museum
1525 Bernice St., Honolulu, HI 96817
Ph: (808)848-4115, Fax: (808)847-8252
email: deepreef(a)bishopmuseum.org
http://hbs.bishopmuseum.org/staff/pylerichard.html
On 11 Sep 2005, at 20:55, Kevin Richards wrote:
> I think LSIDs are assumed to solve all conflicts in the various
> datasets of taxonomic data. However they are JUST resolvable IDs,
> anything else is infrastructure surrounding the LSID mechanisms. An
> LSID refers to a specific set of bytes that resides on some computer
> somewhere. The assumption that an LSID will refer to, for example, a
> global 'taxon concept' that all other taxon records should point to,
> is not correct. This relies on a system to be in place that provides
> the functionality for this global repository.
>
I hope people don't make this assumption, because it's obviously
erroneous. LSIDs just provide a mechanism to assign GUIDs to metadata
and data. Whether there will be a global taxon concept is a completely
different question. I also doubt that deferring to some central
authority is the best way forward. I suspect the very nature and scale
of the task makes a distributed approach inevitable.
> Also I feel one argument AGAINST LSIDs is that the initial investment
> in infrastructure is large, ie the development and setting up of
> authorities, etc. So I think this would lean people away from LSIDs,
> not towards them. The advantage with the LSID mechanism, I think, is
> that it is flexible enough to not rely on existing software and
> internet configuration.
>
Actually it's pretty easy to do this. Basically it can be done with a
bunch of Perl scripts and some fiddling with the DNS. If you can
program CGI scripts, you can implement LSIDs (like I said, if an
amateur like me can do it, it can't be that hard). Any GUID is going to
need some mechanism for associating the GUID with data, and a mechanism
to ensure uniqueness and persistence of the GUID. LSIDs do represent
more work than, say, simply using URLs, but I suggest that any solution
is basically going to require (a) some way of resolving a GUID, and
(b) some way to return metadata, data, or both, about a GUID. LSIDs can
do this for us now.
> A GUID really needs to refer to a reasonably basic record, eg a name
> object rather than the entire taxon concept (although you could have a
> GUID for either). This allows these individual components to be
> referenced from other systems/datasets without having to refer to and
> accept the entire concept. It is probably a good idea to map out
> which sort of taxonomic objects should get GUIDs and how they relate
> to other objects.
>
One could have GUIDs for names and concepts. I think ultimately
anything that is worth making available to other people will/should get
GUIDs. This is, after all, why the web is so powerful -- bits of data
have GUIDs (albeit often rather fragile) in the form of URLs, and
people make use of them by linking (just look at blogs and RSS feeds as
the latest illustration of this power).
Maybe there's a confusion here between globally unique identifiers (a
way of uniquely identifying a bit of data), and a global authority
(specifying a particular view)?
Lastly, while GBIF and/or the commissions for the various codes of
nomenclature may feel they are the obvious authorities for serving
information on taxonomic names, it's not obvious to me that they will,
in fact, be so. Are we really to expect that the commissions will be
issuing GUIDs for all names within 10-15 years? Are we expected to wait
for them, when technically there's no reason why they couldn't start
doing this tomorrow?
The notion that we should wait for these bodies to get their act
together, and that we should defer to them strikes me as a recipe for
disaster (or at least inertia). There are various efforts already
underway out there, and perhaps we need a little healthy competition
and exploration of alternatives. I suspect this area will be driven by
users and data providers addressing their actual needs, rather than
from "on high". I take Richard's point that it would be nice to get
this right, but not at the cost of not actually doing something. And
regarding legacy GUIDs, in the case of LSIDs this can be handled fairly
easily via the DNS. It's rather like the case where company a.com buys
company b.com and the DNS record for b.com is changed to map to a.com.
I think we also need to be careful about the idea of a central registry
of GUIDs if this means that a single body will be responsible for
issuing them. There is a range of alternatives, such as the DOI model.
DOIs have two parts, one generated centrally, the other by the data
provider. There is a central repository of metadata associated with
DOIs (http://www.crossref.org), rather like GBIF has a local copy of
data provided by DiGIR servers. However, local providers are responsible
for providing the content that corresponds to a DOI, and for
constructing the second part of the DOI. In a sense this is pretty much
what my Taxonomic Search Engine does -- it generates LSIDs for the
databases that it queries, but retrieves the metadata on the fly from
the data providers.
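To illustrate the two-part structure, a small Python sketch: a DOI consists of
a prefix and a suffix separated by the first slash. The function name is
hypothetical, and the example DOI is the one from reference [1] earlier in this
thread.

```python
def split_doi(doi):
    """Split a DOI into its prefix and its provider-constructed suffix."""
    # Only the first slash separates the parts; a suffix may contain more.
    prefix, separator, suffix = doi.partition("/")
    if not separator or not prefix.startswith("10.") or not suffix:
        raise ValueError(f"not a DOI: {doi!r}")
    return prefix, suffix


# Usage with the DOI cited in reference [1]:
prefix, suffix = split_doi("10.1109/TKDE.2005.131")
print(prefix)  # the centrally generated part
print(suffix)  # the part constructed by the data provider
```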
This note is starting to lack whatever coherence it might have had at
the start. Perhaps it's time to have some real examples to play with...
Regards
Rod
Kevin Richards wrote:
> > somewhere. The assumption that an LSID will refer to, for example, a
> > global 'taxon concept' that all other taxon records should point to,
> > is not correct. This relies on a system to be in place that provides
> > the functionality for this global repository.
Rod Page replied:
> I hope people don't make this assumption, because it's obviously
> erroneous. LSIDs just provide a mechanism to assign GUIDs to metadata
> and data. Whether there will be a global taxon concept is a completely
> different question. I also doubt that deferring to some central
> authority is the best way forward. I suspect the very nature and scale
> of the task makes a distributed approach inevitable.
I'm not sure I understand, in the context of this discussion, what the phrase
"global 'taxon concept' that all other taxon records should point to" is
intended to mean.
If it means a single global taxonomy with a single "officially sanctioned"
taxon concept circumscription for each taxon name that everyone must conform
to, then certainly I agree that this should not (and cannot) be implemented.
However, I DO see value in designating a 'global' GUID to each defined
Taxonomic Concept (e.g., "Mygenus myspecies Hyam 2004 SEC Pyle 2005"), and
having all other taxon records in databases around the world point to this
same GUID whenever they specifically want to reference that particular
defined concept ("Mygenus myspecies Hyam 2004 SEC Pyle 2005"). Whether that
is a dedicated GUID, or a union of a Taxon Name GUID ("Mygenus myspecies
Hyam 2004") and a Literature GUID ("Pyle 2005"), is an issue that needs to
be discussed.
The problem of relying on a (central) system to be in place can be
dramatically mitigated if that system is mirrored with robust
synchronization protocols around the world. The effort and maintenance
should definitely be distributed among the taxonomic experts of the world.
> Lastly, while GBIF and/or the commissions for the various codes of
> nomenclature may feel they are the obvious authorities for serving
> information on taxonomic names, it's not obvious to me that they will,
> in fact, be so. Are we really to expect that the commissions will be
> issuing GUIDs for all names within 10-15 years? Are we expected to wait
> for them, when technically there's no reason why they couldn't start
> doing this tomorrow?
The reason (I think) we don't want to start issuing them tomorrow is that if
there is no effort in place to make sure the same object instance isn't
receiving multiple GUIDs from multiple issuers, then there's no real point
in assigning the GUIDs in the first place. Each major database has its own
internal LUIDs already. The work is in cross-mapping all of these LUIDs to
each other, so we can more easily exchange information. If each database
holder were to assign LSIDs to all of their records, what problem have we
solved? OK, so the IDs attached to each record are guaranteed to be
globally unique, and in some way embed resolving metadata (in the case of
LSIDs), and all of the database holders at least conform to a common system
of IDs. But these gains are trivial compared to the monumental task of
cross-linking all of the object instances that exist redundantly in dozens of
databases around the world.
My feeling is that the "brass ring" of one GUID per object instance (i.e., a
single common "flag pole" around which all data holders can rally, and
cross-link their own LUIDs to) is very much within reach, and is what will
be needed in the long run anyway.
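
A minimal sketch of the "flag pole" idea, assuming each data holder keeps its own LUIDs and only registers a link from (database, LUID) to one shared GUID. The class name CrossMap, its methods, and the GUID value used in the usage note are hypothetical; the database names and local IDs are the illustrative ones from the LSID examples in this thread.

```python
from collections import defaultdict


class CrossMap:
    """Links local IDs (LUIDs) from many databases to one shared GUID each."""

    def __init__(self):
        self._guid_of = {}                  # (database, luid) -> guid
        self._luids_of = defaultdict(set)   # guid -> {(database, luid), ...}

    def link(self, database, luid, guid):
        """Register that a local record refers to the object behind `guid`."""
        key = (database, luid)
        existing = self._guid_of.get(key)
        if existing is not None and existing != guid:
            # One local record must not rally around two flag poles.
            raise ValueError(f"{key} is already linked to GUID {existing}")
        self._guid_of[key] = guid
        self._luids_of[guid].add(key)

    def guid_for(self, database, luid):
        """Return the shared GUID for a local record, or None if unlinked."""
        return self._guid_of.get((database, luid))

    def equivalents(self, database, luid):
        """Return the other local records linked to the same GUID."""
        guid = self.guid_for(database, luid)
        if guid is None:
            return set()
        return self._luids_of[guid] - {(database, luid)}
```

For example, after cm.link("catalogoffishes.org", "123456", 42) and cm.link("itis.gov", "567890", 42), calling cm.equivalents("catalogoffishes.org", "123456") returns the ITIS record. Each holder makes one link per record instead of maintaining pairwise mappings to every other database.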
No single organization can be relied upon for safekeeping of "the" master
database into perpetuity. That's why "the" database needs to be mirrored
all over the world. All that's needed is a robust synchronization protocol.
The role of the Code Commissions, and GBIF, and TDWG would be to define the
protocols and standards, and establish the initial implementations. They
should not, in my opinion, be put into a position where they must be relied
upon over the next decades/centuries in order to facilitate the perpetual
exchange of data.
> The notion that we should wait for these bodies to get their act
> together, and that we should defer to them strikes me as a recipe for
> disaster (or at least inertia). There are various efforts already
> underway out there, and perhaps we need a little healthy competition
> and exploration of alternatives. I suspect this area will be driven by
> users and data providers addressing their actual needs, rather than
> from "on high". I take Richard's point that it would be nice to get
> this right, but not at the cost of not actually doing something. And
> regarding legacy GUIDs, in the case of LSIDs this can be handled fairly
> easily via the DNS. It's rather like the case when company a.com buys
> company b.com, the DNS record for b.com is changed to map to a.com
I also see your point about the inertia problem. But I've always thought
that the paradigm of independent solutions to this issue by competitive and
disconnected efforts, as has been ongoing for the past few decades, is
exactly the sort of chaos that has led to our current data exchange
problems that we are now trying to solve. My feeling is that we are now
ready to move past that phase and into the next phase. GBIF now has nearly
$1.5 million specifically to solve this problem, and at least one of the
historically "inertia-laden" Commissions is about to take a dramatic step
forward. It feels to me like we're rapidly approaching critical mass, and
personally I'd like to see how far we can push it forward, and capitalize on
the new paradigm by simultaneously solving as many problems as we can all at
once.
> I think we also need to be careful about the idea of a central registry
> of GUIDs if this means that a single body will be responsible for
> issuing them. There are a range of alternatives, such as the DOI model.
> DOIs have two parts, one generated centrally, the other by the data
> provider. There is a central repository of metadata associated with
> DOIs (http://www.crossref.org), rather like GBIF has a local copy of
> data provided by DiGIR server. However, local providers are responsible
> for providing the content that corresponds to a DOI, and for
> constructing the second part of the DOI. In a sense this is pretty much
> what my Taxonomic Search Engine does -- it generates LSIDs for the
> databases that it queries, but retrieves the metadata on the fly from
> the data providers.
My reason for looking at 64-bit integers is that there are ~10^19 of them
to go around. I can see them being issued to any institution that wants them
at, say, a billion numbers at a time (that allows for ~10 billion such
blocks of 1 billion numbers/block). Each institution/individual would then
assign them to data objects however they want, whenever they want. If
they've assigned the numbers to objects in conformance with TDWG standards
(yet to be developed), then the associated TDWG-compliant data/metadata for
each number can be uploaded to any one of the mirror servers, at which time
the link between the number and the data/metadata gets automatically
propagated to all of the mirror servers. The point is, the only time when a
single entity needs to be relied upon is the initial issuance of blocks of
numbers. And even this could be distributed (e.g., by pre-distributing
blocks of ~10^17 numbers to each of ~100 different issuers). My reason for
thinking in terms of simple integers is that they allow flexibility for
embedding within different GUID schemes, if the TDWG standard for the GUID
"package" needed to change in the future (i.e., the numbers could remain the
same, and the resolving metadata packaging can change).
Maybe I'm off my rocker here (very possible). But it seems so simple and
straightforward, and seems (to me, anyway) to leave options open for future
GUID packaging schemes.
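
A rough sketch of the block arithmetic described above, assuming unsigned 64-bit integers handed out in consecutive blocks of one billion. The function names and the consecutive numbering of blocks are assumptions made for illustration, not part of any TDWG standard.

```python
TOTAL = 2 ** 64          # number of distinct unsigned 64-bit integers
BLOCK_SIZE = 10 ** 9     # one billion numbers per issued block


def block_count(total=TOTAL, block_size=BLOCK_SIZE):
    """Number of whole blocks that fit in the integer space."""
    return total // block_size


def block_range(index, block_size=BLOCK_SIZE, total=TOTAL):
    """First and last integer (inclusive) of the block with this index."""
    start = index * block_size
    end = start + block_size - 1
    if index < 0 or end >= total:
        raise ValueError(f"block {index} is outside the 64-bit space")
    return start, end


def block_of(number, block_size=BLOCK_SIZE):
    """Index of the block that a given integer falls into."""
    return number // block_size
```

With these numbers block_count() is a little over 18 billion, the same order of magnitude as the ~10 billion estimate above. Because block_of() recovers the block from the integer alone, whoever holds the record of which block went to which institution can trace any number back to its issuer without consulting the institution.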
> This note is starting to lack whatever coherence it might have had at
> the start. Perhaps it's time to have some real examples to play with...
Ditto!! My apologies to all for the rambling (it's a slow Sunday afternoon
here). I'll shut up now.
Aloha,
Rich
Thanks, Kevin.
I didn't realize that the LSID infrastructure was large
compared to other GUID systems that have been suggested. Whenever I've been
involved with discussions about GUIDs with people who understand the
implications much better than I do, it always seems like the availability of
open-source software tools is one of the reasons people tend to favor LSIDs.
My vision of the GUID itself would be the 64-bit integer, which could be
wrapped into an LSID package, or used as a DOI number, or in some other
GUID system. I also believe the resolution service should be mirrored (via
robust and fast synchronization mechanisms) on hundreds or even thousands of
servers around the world -- at least for the "data commons" (e.g., names,
concepts, literature).
I FULLY agree that it is very important to clearly define what objects
should be assigned TDWG-standard GUIDs. In my view, the two object-domains
in most need of GUIDs for the biological informatics community are taxonomic
names, and "documentation" instances (~= authored/dated references,
publications, etc.), with taxon concepts represented by the intersection of
these two domains. Unfortunately, *neither* of these objects has been
clearly defined within our community. It would be nice if we could simply
adopt an existing literature-based GUID system developed by some other
community, but from what I have learned, none quite meets the particular
needs of the taxonomic informatics community (hence the emerging TDWG
Literature Subgroup). The reasons I single these two out from other data
domains are: 1) they are (or should be) central to virtually all taxonomic
data domains; and 2) they are particularly "thorny" in terms of unambiguous
natural keys and cross-dataset resolution.
Aloha,
Rich
Richard Pyle
> -----Original Message-----
> From: Taxonomic Databases Working Group GUID Project
> [mailto:TDWG-GUID@LISTSERV.NHM.KU.EDU]On Behalf Of Kevin Richards
> Sent: Sunday, September 11, 2005 9:55 AM
> To: TDWG-GUID(a)LISTSERV.NHM.KU.EDU
> Subject: Re: GUIDs, LSIDs, and metadata
>
>
> Good points.
> A few comments I have:
>
> I think LSIDs are assumed to solve all conflicts in the various
> datasets of taxonomic data. However they are JUST resolvable
> IDs, anything else is infrastructure surrounding the LSID
> mechanisms. An LSID refers to a specific set of bytes that
> resides on some computer somewhere. The assumption that an LSID
> will refer to, for example, a global 'taxon concept' that all
> other taxon records should point to, is not correct. This relies
> on a system to be in place that provides the functionality for
> this global repository.
>
> Also I feel one argument AGAINST LSIDs is that the initial
> investment in infrastructure is large, ie the development and
> setting up of authorities, etc. So I think this would lean
> people away from LSIDs, not towards them. The advantage with the
> LSID mechanism, I think, is that it is flexible enough to not
> rely on existing software and internet configuration.
>
> A GUID really needs to refer to a reasonably basic record, eg a
> name object rather than the entire taxon concept (although you
> could have a GUID for either). This allows these individual
> components to be referenced from other systems/datasets without
> having to refer to and accept the entire concept. It is probably
> a good idea to map out which sort of taxonomic objects should get
> GUIDs and how they relate to other objects.
>
> Kevin Richards
>
Not much is happening on the list side of things, so in the interest of
sparking discussion here are a few thoughts.
1. GUIDs by themselves are trivial. We are awash in them (book ISBNs,
GenBank accession numbers, etc.). Software developers generate them all
the time for things like Windows components, Firefox extensions, web
objects, etc. There are tools for making these, e.g. here's one:
AAF813DE-21E0-11DA-A940-000D93425524.
2. The key is to link GUIDs to information, and for that information to
be in a predictable form. For example, DOIs are widely used GUIDs, but
when you resolve a DOI you have no idea what to expect. You might get a
PDF or HTML view of a manuscript, or just an abstract, or a page asking
for money to view a manuscript. The format of the response varies
widely.
3. Of course, GUIDs ARE vital. The DiGIR protocol's biggest weakness,
in my opinion, is that it fails to provide GUIDs. Whereas it does
provide information in a standard form (Darwin Core), the user has no
way of getting a GUID. I'd briefly toyed with an interim solution for a
project I'm working on. A DiGIR GUID would be
digir.fieldmuseum.org:80/digir/DiGIR.php:MammalsDwC2:158106
which is the address of the DiGIR provider, the Resource name, and the
specimen number (in this case, the specimen is FMNH 158106). This plan
was scuppered by the fact that more than one specimen can have the same
specimen code. For example, the Museum of Vertebrate Zoology has three
specimens with the code MVZ 148946, corresponding to the taxa
Chaetodipus baileyi baileyi, Calidris mauri, and Rana cascadae. A DiGIR
request for specimen MVZ 148946 returns three totally different
specimens!
4. I like LSIDs (despite the overhead of setting them up), but for me
the main attraction is their use of metadata in RDF. This opens up a
world of tools from the Semantic Web community, such as triple stores
(databases for RDF). One can harvest metadata and store this in a
"knowledge base." As this knowledge base grows we can uncover new
facts. For example, NCBI doesn't know that Gliricidia ehrenbergii and
Hybosema ehrenbergii are synonyms, whereas IPNI does. If these databases
output RDF we can extract this information. If you have IBM's
LaunchPad and Internet Explorer 6, or Firefox with my LSID extension,
then this link
(lsidres:urn:lsid:ipni.org.lsid.zoology.gla.ac.uk:Id:1108320-2)
displays RDF for one of IPNI's records for Gliricidia ehrenbergii
(readers without any of these tools can view the raw RDF at
http://ipni.org.lsid.zoology.gla.ac.uk/authority/metadata?lsid=urn:lsid:ipni.org.lsid.zoology.gla.ac.uk:Id:1108320-2
). This RDF has links
to LSIDs for nomenclatural synonyms for this name, and if you follow
those you encounter Hybosema ehrenbergii. Hence, armed with consistent
metadata one can make inferences about names.
5. Another attraction of RDF is that it sidesteps the need for the huge,
bloated XML schemas which seem to bedevil the field at the moment. RDF
tends to be simple, flat, and there are a number of existing
vocabularies we can draw on (e.g., http://www.w3.org/2003/01/geo/)
6. I must confess I regard taxonomic concepts as a potential black
hole. I understand the arguments in favour, I just don't buy that this
is a tractable problem. I also think it is largely going to be of
historical interest as more and more data become linked to specimens
and to things like DNA barcodes. The fact that reconciling even two
taxonomic classifications can be a major undertaking does not bode well
for this project. For some more general thoughts on this issue, see
http://shirky.com/writings/ontology_overrated.html (a taxonomic
classification is an ontology).
7. I think the first priority for assigning GUIDs is museum specimens.
For taxon names (if not concepts) this is trivial, given that most name
databases have their own, internally unique ids (but not all -- those
databases that use names as primary keys, or which don't expose integer
identifiers will need to rethink their design).
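
The collision described in point 3 can be sketched as follows. This is only an illustration: the provider address and resource name for the MVZ records are hypothetical placeholders (the post does not give them), it assumes all three records are served under a single resource name, and the function names are made up. The specimen code and the three taxa are the ones quoted above.

```python
from collections import defaultdict

# Hypothetical placeholders; the real MVZ provider details are not given.
PROVIDER = "digir.example.org:80/digir/DiGIR.php"
RESOURCE = "ExampleResource"

# (taxon, specimen code) pairs for MVZ 148946 as listed in point 3.
RECORDS = [
    ("Chaetodipus baileyi baileyi", "148946"),
    ("Calidris mauri", "148946"),
    ("Rana cascadae", "148946"),
]


def digir_id(provider, resource, specimen_code):
    """Interim identifier: provider address, resource name, specimen code."""
    return f"{provider}:{resource}:{specimen_code}"


def find_collisions(records, provider=PROVIDER, resource=RESOURCE):
    """Return identifiers that are shared by more than one record."""
    by_id = defaultdict(list)
    for taxon, specimen_code in records:
        by_id[digir_id(provider, resource, specimen_code)].append(taxon)
    return {ident: taxa for ident, taxa in by_id.items() if len(taxa) > 1}
```

Calling find_collisions(RECORDS) returns a single identifier mapped to all three taxa, which is why an identifier built only from provider, resource, and specimen code cannot serve as a GUID here.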
Regards
Rod
Lots of good discussion points on GUIDs -- thanks, Rod. I need to get two
little people to two different soccer (football) games soon, so I have no
time for an elaborate response. But I do want to comment on one point,
which I have been thinking a great deal about lately:
> 7. I think the first priority for assigning GUIDs is museum specimens.
> For taxon names (if not concepts) this is trivial, given that most name
> databases have their own, internally unique ids (but not all -- those
> databases that use names as primary keys, or which don't expose integer
> identifiers will need to rethink their design).
I think it's critical that, whatever GUID system we establish for taxon
names (and concepts), we do it in the context of the next several decades of
informatic landscape; not just in the context of immediate needs or current
political climate.
As you said at the start of your message, GUIDs by themselves are trivial.
So the only real difference between establishing a system that is intuitive
for the current needs and a system that will serve longer-term future needs,
is a little bit of careful forethought.
Official taxon name registration already exists for one of the major Codes
of Nomenclature (Bacterial), and within the next fortnight we will see a
public announcement of a plan for registration in another of the major
Codes. I predict that all Codes of nomenclature will implement mandatory
registration for all new names by about 2010, and for all "available" names
(i.e., since Linnaeus) within five to ten years thereafter. So the
medium-term future landscape in this case will be one in which all names are
issued a GUID through their respective Commission of Nomenclature.
Further, it's not unreasonable to predict that sometime within the next few
decades we will converge on a unified "BioCode" for all organism names,
meaning that the longer-term landscape has a single set of taxon names.
Wouldn't it be nice, after that time, if we didn't have to forever maintain
legacy GUIDs? In other words, wouldn't it be nice if the established GUID
system for all taxon names were the same *now*, at the outset, so it's a
non-issue to combine them all as one set of GUIDs later on?
I'm not entirely sold on LSIDs, but it does seem that a lot of smart and
knowledgeable people are leaning that way. My hesitation is mainly that one
of the main reasons for leaning that way is that all sorts of software
already exists for resolving them, so there is less overhead in initial
implementation. As long as LSIDs meet long-term needs, that shouldn't be a
problem. But 50 years from now, I'm not sure how wise it will seem that the
universal GUID system adopted for biological data was influenced strongly by
the available software of the time. Imagine being locked in now to a
universal system that was designed based on software that was available in
1955!
But, not being able to predict which GUID system will be the best in the
context of 2055, we really have no choice but to go with something that
makes a lot of sense now (which is justifiable, in that it's also very
important that the delicate transition from no universal GUIDs to widespread
universal GUIDs will be best supported by keeping it as painless as possible
in the context of that transition time).
But I still suggest we do things in a way that maximally keeps our options
open. For example, in the context of LSIDs, consider different paradigms
for registering the fish name, Mygenus myspecies Hyam (Hi, Roger! :-) )
One paradigm might have each major database create its own LSID:
urn:lsid:catalogoffishes.org:SPNO:123456
urn:lsid:gbif.org:ECAT:876543
urn:lsid:itis.gov:TSN:567890
But then we're burdened with the task of cross-mapping each of these, and
also preserving the legacy IDs into perpetuity after we've eventually
converged on a single taxon name GUID system.
I was going to illustrate several other paradigms, but soccer departure time
approaches, so I'll cut to the chase. In the LSID paradigm, I would propose
the following system:
urn:lsid:bioregistry.org:[Data Domain]:[randomly generated 64-bit integer]
The "bioregistry.org" part represents the decoupling of the GUID from the
institution that initially created the GUID. It encompasses all domains of
biological data (taxon names, concepts, specimens, etc.). It could be
"tdwg.org" or "gbif.org", but we're not sure those organizations will be
around 50 or 100 years from now. I imagine that GBIF would create and
manage the bioregistry.org domain for the near-term.
The "Data Domain" represents a tag for the main domain of data (e.g.
"Specimens", or "TaxonNames", or whatever the major information domains end
up being).
The randomly generated 64-bit integer would be unique across all data
domains, so that it, by itself, is unique within bioregistry.org (no time
now to explain the rationale for this...)
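
A minimal sketch of the proposed scheme, assuming the integer is an unsigned 64-bit value drawn at random and checked against one registry of issued numbers shared by all data domains. The function names are hypothetical, the in-memory set stands in for whatever shared registry would actually be used, and "bioregistry.org" is the placeholder authority from the proposal above.

```python
import secrets

AUTHORITY = "bioregistry.org"  # placeholder authority from the proposal


def new_lsid(data_domain, issued, authority=AUTHORITY):
    """Mint an LSID whose integer is unique across every data domain.

    `issued` is the single set of integers already handed out, shared by
    all domains, so the integer alone identifies the object.
    """
    while True:
        number = secrets.randbits(64)   # random integer in [0, 2**64)
        if number not in issued:
            issued.add(number)
            return f"urn:lsid:{authority}:{data_domain}:{number}"


def parse_lsid(lsid):
    """Split an LSID of this form into (authority, data domain, integer)."""
    prefix, scheme, authority, data_domain, object_id = lsid.split(":", 4)
    if (prefix.lower(), scheme.lower()) != ("urn", "lsid"):
        raise ValueError(f"not an LSID: {lsid!r}")
    return authority, data_domain, int(object_id)
```

Because the uniqueness check ignores the data domain, the integer could later be repackaged in a different GUID scheme without renumbering anything.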
Gotta run....more later.
Aloha,
Rich