Different reasons for different GUIDs (Was: GUIDs, LSIDs, and metadata)

Peter Dawyndt Peter.Dawyndt at UGENT.BE
Wed Sep 14 04:28:27 CEST 2005


Donald Hobern wrote:

> 3. Users can assume that data records relate to the same real-world object
> if and only if they share the same GUID.

Here's an approach that has worked for us to assign unique identifiers to
objects for which there is no prior, universally adopted system for assigning
and recognizing persistent and unique identifiers (it uses the terminology of
objects, entities and instances originating from the object-oriented community
and commonly adopted by people working in the text mining field). In this
context, the task is to map each instance of a given object to its
corresponding entity (by means of a unique identifier). During this process,
instances of entities that have no unique identifier assigned to them yet are
assigned a new unique identifier (accumulative learning). How this mapping from
the instance of an entity (usually a textual description, such as the reference
to a publication, the label of a specimen or the name of a taxon) to the unique
identifier of the entity is carried out is object-dependent and may require the
input of expert knowledge.
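
To make this concrete, here's a minimal Python sketch of such an accumulative
assignment step. It is only an illustration under simplifying assumptions: the
registry is an in-memory dictionary, normalise() stands in for the
object-dependent expert knowledge, and all names are invented rather than
taken from any existing system.

    import uuid

    def normalise(description):
        # Stand-in for the object-dependent mapping logic; a real system
        # would apply expert knowledge here instead of a trivial case fold.
        return " ".join(description.lower().split())

    class EntityRegistry:
        def __init__(self):
            self._by_description = {}  # normalised description -> identifier

        def identify(self, description):
            # Map an instance (a textual description) to the identifier of
            # its entity, minting a new identifier when none exists yet
            # (accumulative learning).
            key = normalise(description)
            if key not in self._by_description:
                self._by_description[key] = uuid.uuid4().urn
            return self._by_description[key]

    registry = EntityRegistry()
    a = registry.identify("Escherichia  coli")
    b = registry.identify("escherichia coli")
    assert a == b  # two instances, one entity, one identifier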

Once a unique identifier is assigned to an entity, the entity can be made more
"intelligent" by incorporating it in solid cross-reference schemes with other
entities (using their unique identifiers). However, there are two problems that
occur during the mapping between instances (textual descriptions) and unique
identifiers:

1. The same unique identifier is assigned to instances of different real-world
entities (e.g. because they share the same textual description). These are
called false-positive assignments. Some ambiguities during the assignment
process can be resolved by making the mapping algorithm context-dependent,
taking more information into account than the textual description of the
instance alone, and in some cases the mapping algorithm could even decide that
it is impossible to automatically assign a unique identifier and ask for some
form of user intervention. Apart from that, it is also possible, after the
assignment of unique identifiers, to do intrusion detection by finding
inconsistencies in the assignment of identifiers that come out of the
cross-referencing of the entity with other entities (e.g. some taxon name is
assigned to specimens of both mammals and fungi, which could be the case if
the assignment of names to taxa has not been unique across all kingdoms; or
two sequences of the same gene from the same specimen are too dissimilar). If
such cases are detected, the instances need to be split and at least one new
identifier needs to be created to discriminate the two entities. (Even better
would be to create two new identifiers and link them to the original
identifier, which would become obsolete for future reference but could still
serve to resolve existing references that make use of it.) A sketch of such
conflict detection follows after point 2 below.

2. Different unique identifiers are assigned to instances of the same entity.
These are called false-negative assignments. Although this case is generally
less harmful than the previous one, it could mean that queries based on
cross-referencing schemes that make use of the unique identifiers come up with
incomplete information (whereas in the previous case they would come up with
incorrect information). When false-negative assignments are detected (and
again, many domain-dependent solutions have been suggested, in which an
investigation of the cross-referencing schemes usually plays an important
role), they can simply be resolved by merging the identifiers so that they
represent the same entity; a sketch of such merging follows below as well.
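
As promised under points 1 and 2, here are minimal Python sketches of both
repair steps. The data, identifiers and function names are purely illustrative
assumptions, not taken from any existing service.

    # Point 1: detecting false positives through cross-reference conflicts.
    # Each identifier maps to the kingdoms of the specimens it is
    # cross-referenced with (illustrative data only).
    cross_refs = {
        "urn:uuid:taxon-1": {"Animalia", "Fungi"},  # suspicious!
        "urn:uuid:taxon-2": {"Plantae"},
    }

    def conflicting(refs):
        # Identifiers whose linked specimens span several kingdoms are
        # candidates for splitting into new identifiers.
        return [i for i, kingdoms in refs.items() if len(kingdoms) > 1]

    print(conflicting(cross_refs))  # ['urn:uuid:taxon-1'] -> split required

    # Point 2: repairing false negatives by merging identifiers, so that
    # retired identifiers still resolve to the surviving one.
    merged_into = {}  # retired identifier -> surviving identifier

    def merge(keep, retire):
        merged_into[retire] = keep

    def canonical(identifier):
        # Follow merge links until the surviving identifier is reached.
        while identifier in merged_into:
            identifier = merged_into[identifier]
        return identifier

    merge("urn:uuid:strain-a", "urn:uuid:strain-b")
    assert canonical("urn:uuid:strain-b") == "urn:uuid:strain-a"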

Splitting and merging identifiers that represent different or identical
entities, respectively, can easily be done by representing them in a tree-like
structure (trees are either split completely, or the split subtrees remain
connected by the same root, which carries an obsoleted identifier), and
cross-referencing schemes can take into account the relationships represented
by these tree-like structures. A diversity of unique identifiers for the same
entity can be avoided by recommending the root of the tree (or the roots of
the subtrees in the case of split trees) as the identifier to use for future
cross-references.
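
A minimal sketch of this tree-like bookkeeping, again assuming a simple
in-memory structure with invented names:

    class IdentifierNode:
        def __init__(self, identifier):
            self.identifier = identifier
            self.children = []  # non-empty once this identifier is split
            self.obsolete = False

        def split(self, *new_identifiers):
            # Obsolete this identifier and hang the replacements under it,
            # so that existing references remain resolvable.
            self.obsolete = True
            self.children = [IdentifierNode(i) for i in new_identifiers]
            return self.children

        def resolve(self):
            # Return the current (leaf) identifiers an old reference maps
            # to; these are the ones recommended for future cross-references.
            if not self.children:
                return [self.identifier]
            return [leaf for child in self.children
                    for leaf in child.resolve()]

    root = IdentifierNode("urn:uuid:taxon-1")
    root.split("urn:uuid:taxon-1a", "urn:uuid:taxon-1b")
    print(root.resolve())  # ['urn:uuid:taxon-1a', 'urn:uuid:taxon-1b']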

This approach has a number of practical advantages, first and foremost that
the assignment of unique identifiers is completely decoupled from the process
of error detection and correction. The fact is that unique identifiers need to
be assigned to implement solid cross-referencing schemes, and that solid
cross-referencing schemes are valuable tools for the detection of erroneous
assignments of unique identifiers (which came first, the chicken or the egg?).
In this philosophy, the assignment of unique identifiers is thus no longer
seen as a one-step process, but as a perpetual process of identifier
assignment and quality control. And of course it should be a major concern to
keep the mapping between unique identifiers and entities as close to a 1:1
relationship as possible, by making the mapping algorithms as intelligent as
possible. But then again, a perfect 1:1 relationship only exists in a perfect
world (even if there is only one authority that can assign new identifiers),
and we should thus come up with schemes that work in practice, like the one
described above.
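
To make that perpetual process explicit, the whole cycle might be sketched as
follows. The registry operations are the ones sketched earlier, stubbed here
only to keep the loop self-contained; everything is illustrative.

    class StubRegistry:
        # Placeholders for the operations sketched in the examples above.
        def identify(self, text): return "urn:uuid:example"
        def conflicts(self): return []   # false-positive candidates
        def duplicates(self): return []  # false-negative candidates
        def split(self, identifier): pass
        def merge(self, keep, retire): pass

    def assignment_and_quality_control(batches, registry):
        for instances in batches:
            # Step 1: assignment enables cross-referencing.
            for text in instances:
                registry.identify(text)
            # Step 2: cross-referencing enables error detection, and the
            # corrected scheme feeds the next round of assignments.
            for identifier in registry.conflicts():
                registry.split(identifier)
            for keep, retire in registry.duplicates():
                registry.merge(keep, retire)

    assignment_and_quality_control([["Escherichia coli"]], StubRegistry())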

> 2. It may simplify a data provider's own task in providing cross-references
> between data elements which are all under their own control.

Most of the cross-references I see will be between entities that are not under
the control of a single data provider. At the least, most data providers will
want to hook up their own data within a broader network of life sciences
information sources. That is where actionable, persistent and globally unique
identifiers really come into play, I think: they keep autonomous and
heterogeneous data sources as independent as possible while making an
extensible network of cross-references feasible.

A lot of the cross-references are already out there somewhere. Having them
nicely stored in public databases would be ideal, and therefore a lot of work
is done to extract cross-reference relationships from public literature
databases such as PubMed. It is worthwhile to take a look at what is happening
in the community of people who try to extract functional pathways from the
free text of the scientific literature. All of them make use of one or another
system to assign unique identifiers (not with the purpose of making them
globally unique, but merely unique within the context of their application).
The intelligence they can currently reach with natural language processing to
avoid false-positive and false-negative cross-references (good old Chomsky's
theory is finally finding its way into practice) is quite amazing.

Cheers,

Peter Dawyndt



