Donald Hobern wrote:
- Users can assume that data records relate to the same real-world object
if and only if they share the same GUID.
Here's an approach that has worked for us to assign unique identifiers to objects for which there is no prior, universally adopted system for assigning and recognizing persistent and unique identifiers (it uses the terminology of objects, entities and instances originating from the object-oriented community, which is commonly adopted by people working in the text mining field). In this context, the task is to map each instance of a given object to its corresponding entity (by means of a unique identifier). During this process, instances of entities that have no unique identifier assigned to them are assigned a new one (accumulative learning). How this mapping from the instance of an entity (usually a textual description such as the reference to a publication, the label of a specimen or the name of a taxon) to the unique identifier of the entity is done is object-dependent and may require the input of expert knowledge.
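To make the accumulative learning step concrete, here is a minimal Python sketch. The normalisation step is only a stand-in for the object-dependent mapping logic (which in practice may need expert knowledge), and all names in it are invented for illustration:

    import uuid

    class EntityRegistry:
        """Accumulative assignment: unseen entities get fresh identifiers."""

        def __init__(self):
            self._by_description = {}  # normalised description -> identifier

        def _normalise(self, description):
            # Object-dependent in practice; here reduced to simple
            # whitespace and case folding.
            return " ".join(description.lower().split())

        def resolve(self, description):
            # Map an instance (a textual description) to its entity
            # identifier, minting a new identifier for unseen entities.
            key = self._normalise(description)
            if key not in self._by_description:
                self._by_description[key] = uuid.uuid4().urn
            return self._by_description[key]

    registry = EntityRegistry()
    a = registry.resolve("Escherichia coli")
    b = registry.resolve("escherichia  coli")
    assert a == b  # one entity, despite the textual variation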
Once a unique identifier is assigned to an entity, the entity can be made more "intelligent" by incorporating it into solid cross-reference schemes with other entities (using their unique identifiers). However, two problems occur during the mapping between instances (textual descriptions) and unique identifiers:
1. the same unique identifier is assigned to instances of different real-world entities (e.g. because they share the same textual description). These are called false-positive assignments. Some ambiguities during the assignment process can be resolved by making the mapping algorithm context-dependent, taking more information into account than the textual description of the instance alone, and in some cases the mapping algorithm could even decide that automatic assignment is impossible and ask for some form of user intervention. Apart from that, it is also possible to detect erroneous assignments after the fact, by finding inconsistencies that come out of cross-referencing the entity with other entities (e.g. some taxon name is assigned to both specimens of mammals and fungi, which could happen if the assignment of names to taxa has not been unique across all kingdoms; or two sequences of the same gene from the same specimen are too dissimilar; see the sketch after this list). If such cases are detected, the instances need to be split and at least one new identifier must be created to discriminate the two entities. (Even better would be to create two new identifiers and link them to the original identifier, which would become obsolete for future reference but could still serve to resolve existing references that make use of it.)
2. different unique identifiers are assigned to instances of the same entity. These are called false-negative assignments. Although this case is generally less harmful than the previous one, it means that queries based on cross-referencing schemes that use the unique identifiers may come up with incomplete information (whereas in the previous case they would come up with incorrect information). When false-negative assignments are detected (and again many domain-dependent solutions have been suggested, in which an investigation of the cross-referencing schemes usually plays an important role), they can simply be resolved by merging the identifiers so that they represent the same entity.
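As an illustration of the kind of consistency check meant under point 1, here is a hypothetical Python sketch that flags a taxon identifier when the specimens cross-referencing it span more than one kingdom. The record layout is an assumption made purely for illustration:

    from collections import defaultdict

    def find_conflicting_taxa(specimens):
        # specimens: iterable of (specimen_id, taxon_id, kingdom) tuples.
        kingdoms_by_taxon = defaultdict(set)
        for specimen_id, taxon_id, kingdom in specimens:
            kingdoms_by_taxon[taxon_id].add(kingdom)
        # Any taxon referenced from two kingdoms likely conflates two
        # entities and is a candidate for splitting its identifier.
        return {t: k for t, k in kingdoms_by_taxon.items() if len(k) > 1}

    records = [
        ("spec-1", "taxon-42", "Animalia"),
        ("spec-2", "taxon-42", "Fungi"),    # same name used in two kingdoms
        ("spec-3", "taxon-7", "Plantae"),
    ]
    print(find_conflicting_taxa(records))   # {'taxon-42': {'Animalia', 'Fungi'}}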
Splitting and merging identifiers that turn out to represent different or identical entities, respectively, can easily be done by representing them in a tree-like structure (trees are either split completely, or the split subtrees remain connected to the same root, which carries an obsoleted identifier), and cross-referencing schemes can take the relationships represented by these tree-like structures into account. A proliferation of unique identifiers for the same entity can be avoided by recommending the root of the tree (or of the subtrees, in the case of split trees) as the identifier to use in future cross-references.
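Here is a rough Python sketch of that tree-like bookkeeping, under the assumption that merges hang identifiers under a fresh common root and that splits obsolete the old root while minting new subtree roots. It is a sketch of the idea, not a production scheme:

    import uuid

    class IdentifierTree:
        def __init__(self):
            self._parent = {}      # identifier -> parent identifier (or None)
            self._obsolete = set()

        def mint(self):
            ident = uuid.uuid4().urn
            self._parent[ident] = None
            return ident

        def recommended(self, ident):
            # Walk up to the highest non-obsolete ancestor: the identifier
            # recommended for future cross-references.
            best = ident
            while self._parent[ident] is not None:
                ident = self._parent[ident]
                if ident not in self._obsolete:
                    best = ident
            return best

        def merge(self, a, b):
            # False negative: both identifiers denote one entity, so hang
            # them under a common root that becomes the recommended one.
            root = self.mint()
            self._parent[self.recommended(a)] = root
            self._parent[self.recommended(b)] = root
            return root

        def split(self, ident):
            # False positive: obsolete the identifier and mint two fresh
            # subtree roots that stay linked to it, so that existing
            # references remain resolvable via the obsoleted root.
            root = self.recommended(ident)
            self._obsolete.add(root)
            left, right = self.mint(), self.mint()
            self._parent[left] = root
            self._parent[right] = root
            return left, right

    tree = IdentifierTree()
    a, b = tree.mint(), tree.mint()
    root = tree.merge(a, b)                 # a and b named one entity
    assert tree.recommended(a) == root
    left, right = tree.split(root)          # root conflated two entities
    assert tree.recommended(left) == left   # subtree roots take over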
This approach has a number of practical advantages, first and foremost that the assignment of unique identifiers is completely decoupled from the process of error detection and correction. The fact is that unique identifiers need to be assigned to implement solid cross-referencing schemes, and that solid cross-referencing schemes are valuable tools for detecting erroneous assignments of unique identifiers (which came first, the chicken or the egg?). In this philosophy, the assignment of unique identifiers is thus no longer seen as a one-step process, but as a perpetual cycle of identifier assignment and quality control. Of course it should be a major concern to keep the mapping between entities and their unique identifiers as close to a 1:1 relationship as possible, by making the mapping algorithms as intelligent as possible. But a strict 1:1 relationship only exists in a perfect world (even if there is only one authority that can assign new identifiers), and we should therefore come up with schemes that work in practice, such as the one described above.
- It may simplify a data provider's own task in providing cross-references
between data elements which are all under their own control.
Most of the cross-references I see will be between entities that are not under the control of a single data provider. At the very least, most data providers will want to hook up their own data within a broader network of information sources across the life sciences. That is where actionable, persistent and globally unique identifiers really come into play, I think: they keep autonomous and heterogeneous data sources as independent as possible while making an extensible network of cross-references feasible.
A lot of the cross-references are already out there somewhere. Having them nicely stored in public databases would be ideal, and therefore a lot of work is being done to extract cross-reference relationships from public literature databases such as PubMed. It is worthwhile to take a look at what is happening in the community of people who try to extract functional pathways from the free text of the scientific literature. All of them make use of one system or another to assign unique identifiers (not with the purpose of making them globally unique, but merely unique within the context of their application). The accuracy they can currently reach with natural language processing to avoid false-positive and false-negative cross-references (good old Chomsky's theory is finally finding its way into practice) is quite amazing.
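As a toy illustration of what such systems do at their core (map textual mentions to application-local identifiers and record co-occurring pairs as candidate cross-references), here is a Python sketch. The lexicon and abstract are invented, and real systems use far richer NLP than substring matching:

    from itertools import combinations

    lexicon = {            # textual form -> application-local identifier
        "escherichia coli": "taxon:562",
        "lactose operon": "pathway:lac",
        "beta-galactosidase": "protein:lacZ",
    }

    def candidate_links(abstract):
        # Identifiers of all lexicon entries mentioned in the abstract.
        found = {ident for name, ident in lexicon.items()
                 if name in abstract.lower()}
        # Each co-occurring pair is a candidate cross-reference.
        return set(combinations(sorted(found), 2))

    abstract = ("Expression of beta-galactosidase from the lactose operon "
                "was measured in Escherichia coli.")
    print(candidate_links(abstract))
    # {('pathway:lac', 'protein:lacZ'), ('pathway:lac', 'taxon:562'),
    #  ('protein:lacZ', 'taxon:562')}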
Cheers,
Peter Dawyndt