How GUIDs will be used

Wed Feb 1 01:13:29 CET 2006

>> We need to decide under which circumstances we want this to be a
>> bidirectional implication, i.e. that:
>>
>> B)   O1 == O2 --> I1 == I2
>>
>
>My 30 second reaction to this is that it assumes that the metadata
>attached to I1 and I2 are the same, which may often not be the case.

Rod is of course right.  The example I gave was a major oversimplification.
I would say that we need to distinguish several more levels in order to
understand what we mean by "the same data object".

Let me expand a little with some extra symbols.

O1 and O2 are the (real-world) objects or events to which we wish to refer
in our computer systems.  Examples would be a specimen in a collection, or a
nomenclatural act.

D1 and D2 are digital representations of O1 and O2.  This immediately raises
a large number of questions which are probably external to the development
of a basic GUID infrastructure, but which need to be addressed in
applicability statements for different subdomains.  We need to be sure that
we agree on what is a data representation for the object/event and the
extent to which it must have a standard form (including whether or not
identity of byte streams matters to us).  In some cases it may be hard for
us to identify anything that we regard as essential to a canonical digital
representation of an object/event.  (This has already been identified as a
problem that we would need to address with LSIDs.)  It is also worth noting
that there are cases in which distinguishing between objects and events and
their digital representations is difficult or perhaps meaningless.  Some
objects of interest to us may not exist except in a digital form.

M1 and M2 are representations of metadata that describe D1 and D2.  Again
the distinction between a digital representation and its associated metadata
is rarely completely clear, so applicability statements are needed here too.

We then have the identifiers I1 and I2.  In different situations these may
serve as identifiers for O1 and O2 or for D1 and D2 or for D1+M1 and D2+M2.
None of these is necessarily right or wrong.  Appropriate practices must be
defined in each case.

The meaning of the identity I1 == I2 will vary according to these defined
practices.  If I1 == I2 --> O1 == O2 (and nothing more), we may retrieve
different data records and metadata as alternative resolutions of the same
identifier.
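
To make the symbols a little more concrete, here is a minimal Python sketch
of the idea.  Everything in it is hypothetical: the names Scope, Record and
consistent are mine, not part of any GUID or LSID specification, and it
assumes (as a simplification) that the levels nest, so that an identifier
for D also fixes O, and one for D+M also fixes D and O.

```python
from dataclasses import dataclass
from enum import Enum


class Scope(Enum):
    """What an identifier is declared to identify (hypothetical names)."""
    OBJECT = "O"                # the real-world object or event
    DATA = "D"                  # its digital representation
    DATA_AND_METADATA = "D+M"   # representation plus describing metadata


@dataclass(frozen=True)
class Record:
    identifier: str   # I
    obj: str          # O, a stand-in label for the real-world object
    data: bytes       # D
    metadata: tuple   # M, as (key, value) pairs


def consistent(scope: Scope, a: Record, b: Record) -> bool:
    """Check the one-way implication: equal identifiers imply equality
    at every level covered by the declared scope."""
    if a.identifier != b.identifier:
        return True   # unequal identifiers promise nothing here
    if a.obj != b.obj:
        return False
    if scope in (Scope.DATA, Scope.DATA_AND_METADATA) and a.data != b.data:
        return False
    if scope is Scope.DATA_AND_METADATA and a.metadata != b.metadata:
        return False
    return True


# Two resolutions of the same identifier that describe the same specimen
# but carry different data and metadata (made-up values).
r1 = Record("guid-1", "specimen-A", b"length=10", (("source", "catalogue"),))
r2 = Record("guid-1", "specimen-A", b"length=10mm", (("source", "label"),))

print(consistent(Scope.OBJECT, r1, r2))   # True: same O is all that is promised
print(consistent(Scope.DATA, r1, r2))     # False: D1 and D2 differ
```

With the identifier scoped to the object only, both records are acceptable
alternative resolutions; with any narrower scope they are not.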

Returning full circle to my original point, the inequality I1 != I2 may tell
us that M1 and M2 differ, or that D1 and D2 differ as well, or that O1 and O2
differ too; in other cases it may imply none of the above.  We need to be
sure which of these inferences we need to support for each subdomain.

On top of this we also need to consider whether the same identifier should
be used for D1 and for D1', where D1' is a later version of D1 (with some
corrections or modifications of the data elements).
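
One possible way to keep both answers available, sketched here in Python, is
to carry the version inside the identifier and compare with or without it.
The sketch uses the LSID layout urn:lsid:authority:namespace:object with an
optional trailing revision; the example identifiers, the example.org
authority and the helper names are made up, and the comparison is exact
string matching with no case normalisation.  It is an illustration of the
choice, not a recommendation.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]   # None when no revision is given


def parse_lsid(text: str) -> Lsid:
    """Split an LSID string into its components."""
    parts = text.split(":")
    if (len(parts) not in (5, 6)
            or parts[0].lower() != "urn"
            or parts[1].lower() != "lsid"):
        raise ValueError(f"not an LSID: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return Lsid(parts[2], parts[3], parts[4], revision)


def same_object(a: Lsid, b: Lsid) -> bool:
    """Ignore the revision: D1 and D1' count as the same thing."""
    return a[:3] == b[:3]


def same_version(a: Lsid, b: Lsid) -> bool:
    """Include the revision: D1 and D1' are distinct."""
    return a == b


d1 = parse_lsid("urn:lsid:example.org:specimens:12345:1")
d1_prime = parse_lsid("urn:lsid:example.org:specimens:12345:2")

print(same_object(d1, d1_prime))    # True
print(same_version(d1, d1_prime))   # False
```

Which of the two comparisons counts as "the same identifier" is exactly the
decision that would need to be made for each subdomain.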

I hope this makes some things clearer for someone.  It helps me with
thinking about the problems.

Thanks,

Donald

---------------------------------------------------------------
Donald Hobern (dhobern at gbif.org)
Programme Officer for Data Access and Database Interoperability
Global Biodiversity Information Facility Secretariat
Universitetsparken 15, DK-2100 Copenhagen, Denmark
Tel: +45-35321483   Mobile: +45-28751483   Fax: +45-35321480
---------------------------------------------------------------