[ I will be trying to provide some structure to discussions in this mailing list by raising specific topics and looking for comments. Please keep the Topic number in responses ]

Topic 1: What do we mean by GUID?

The most fundamental thing that we need to establish as we consider a GUID implementation is a definition for “GUID” in this context. We have been using a number of terms to describe the identifiers we need (unique, resolvable, persistent, etc.).

I’ve been spending some time following up on Rod Page’s recommendation that we consider the use of Archival Resource Keys (ARK) from the California Digital Library (see http://wiki.gbif.org/guidwiki/wikka.php?wakka=ARK). The CDL web site includes an excellent overview of this GUID model, which also serves as a good introduction to the issues involved. I would urge you all to read this document (it’s only nine pages long!):

http://www.cdlib.org/inside/diglib/ark/arkcdl.pdf

This document arrives at the following problem definition for persistent, actionable identifiers:

The goal: long-term actionable identifiers.

Requirement: that identifiers deliver you to objects (where feasible).
Requirement: that identifiers deliver you to object metadata.
Desirable: each object should wear its own identifier.
Requirement: that identifiers deliver you to statements of commitment.

The problem: URLs break for some objects (that is, associations between URLs and objects are not maintained), and we have no way to tell which ones will or won’t break.
Why URLs break: because objects are moved, removed, and replaced – completely normal activities – and the provider in each case demonstrates insufficient commitment to update indirection tables, or to plan identifier assignment carefully. Persistence is in the mission of few organizations.
Conventional hypothesis: use indirect names (PURLs, URNs, Handles) instead of URLs; what worked for DNS should work for digital object references. Wrong. Indirection is spectacularly successful and elegant in DNS, but it’s a side issue in the provision of digital object persistence.

This document clearly identifies issues around provider service commitments as the key problem that needs solving. The construction of ARKs seeks to address this in a couple of ways. It separates the role of Name Assigning Authority (i.e. who initially assigns the identifier) from that of the Name Mapping Authority (i.e. who is able to map the identifier to the data object at any particular time). It also defines a simple standard relationship between three things: the data object, the metadata for the object, and a commitment statement from the provider as to what aspects of persistence are guaranteed.
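
To make the separation of roles concrete, here is a rough Python sketch of how an ARK might be taken apart and reassembled. It assumes the general form http://NMA/ark:/NAAN/Name, and it assumes that appending "?" asks for the metadata and "??" for the commitment statement, which is my reading of the ARK scheme rather than something stated above. The class and function names are made up for illustration, and the identifier and the second host name are only examples.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class Ark:
    """Hypothetical holder for the parts of an ARK (names are illustrative)."""
    nma: str   # Name Mapping Authority: the host that currently resolves the ARK
    naan: str  # Name Assigning Authority Number: who originally assigned it
    name: str  # the name given to the object by the assigning authority

    def object_url(self) -> str:
        # Identifier that should deliver the object itself.
        return f"http://{self.nma}/ark:/{self.naan}/{self.name}"

    def metadata_url(self) -> str:
        # Assumption: a single trailing "?" requests the object's metadata.
        return self.object_url() + "?"

    def commitment_url(self) -> str:
        # Assumption: a trailing "??" requests the provider's commitment statement.
        return self.object_url() + "??"

    def with_nma(self, new_nma: str) -> "Ark":
        # The mapping authority can change; the assigned part stays the same.
        return Ark(new_nma, self.naan, self.name)


def parse_ark(url: str) -> Ark:
    """Split an ARK URL of the assumed form http://NMA/ark:/NAAN/Name."""
    parts = urlsplit(url)
    prefix = "/ark:/"
    if not parts.netloc or not parts.path.startswith(prefix):
        raise ValueError(f"not an ARK URL: {url!r}")
    naan, _, name = parts.path[len(prefix):].partition("/")
    if not naan or not name:
        raise ValueError(f"ARK is missing NAAN or Name: {url!r}")
    return Ark(parts.netloc, naan, name)


# Example identifier, used purely for illustration.
ark = parse_ark("http://ark.cdlib.org/ark:/13030/tf5p30086k")
moved = ark.with_nma("example.org")  # hypothetical new mapping authority
print(ark.metadata_url())
print(moved.object_url())
```

The point of the sketch is that only the host part changes when a different organisation takes over the mapping; the part issued by the assigning authority is untouched.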

ARK is a technology that we have not really considered up to this point. My question for discussion is: what, if anything, is missing from or wrong with the problem definition provided in this document? If we agree that it provides a crisp definition of what we need, that in itself will be a major step forward.

Please provide your thoughts.

Donald

---------------------------------------------------------------
Donald Hobern (dhobern@gbif.org)
Programme Officer for Data Access and Database Interoperability
Global Biodiversity Information Facility Secretariat
Universitetsparken 15, DK-2100 Copenhagen, Denmark
Tel: +45-35321483 Mobile: +45-28751483 Fax: +45-35321480
---------------------------------------------------------------