Topic 1: What do we mean by "GUID"?

Wed Oct 12 12:14:39 CEST 2005

I think this is a very nice statement of the issues.

My own view is that ARK is interesting, but I'm not sure ARK is the 
best way forward. Persistence is a (perhaps the) key issue, and it is a 
social one not a technological one, as the DOI people make very clear. 
DOIs only work because the publishing industry has invested in the 
infrastructure to support them.

In some ways, DOIs and ARK are very similar. If I use the DOI resolver 
to resolve a DOI

http://dx.doi.org/10.1086/303303
       \--------/ \-----/ \----/
           |         |      |
           |         |      Name
           |         Name Assigning Authority Number (NAAN)
           Name Mapping Authority Hostport (NMAH)

then I have a URL very like an ARK, where the authority assigning the 
name (such as a publisher, in this case the University of Chicago) is 
different from the authority that makes the identifier actionable 
(doi.org). One could imagine that if doi.org were to fall over, one 
could substitute another authority, such as doi.reborn.org. Indeed, some 
publishers already do something very close to this, for example 
http://www.journals.uchicago.edu/cgi-bin/resolve?id=doi:10.1086/303303 
(although this will only resolve local DOIs). ARK simply makes this 
possibility explicit. LSIDs are more strongly tied to the DNS (the 
uniqueness of an LSID is partly guaranteed by using Internet domain 
names), although they do have limited support for foreign authorities 
(other providers that can serve metadata for objects that those 
providers don't actually own).
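
As a rough illustration of the point about substituting authorities, 
here is a minimal Python sketch that splits the resolver URL into the 
three parts labelled in the diagram above and then re-hosts the same 
identifier on a different resolver. The function names are my own 
invention, and doi.reborn.org is the imaginary resolver mentioned 
above, not a real service.

```python
from urllib.parse import urlsplit


def split_actionable_id(url):
    """Split a resolver URL into (NMAH, NAAN, name), as in the diagram."""
    parts = urlsplit(url)
    # The path looks like "/10.1086/303303": the assigning authority's
    # number comes first, and everything after the next "/" is the name.
    naan, _, name = parts.path.lstrip("/").partition("/")
    return parts.netloc, naan, name


def rehost(url, new_nmah):
    """Return the same identifier served from a different resolver host."""
    parts = urlsplit(url)
    return parts._replace(netloc=new_nmah).geturl()


print(split_actionable_id("http://dx.doi.org/10.1086/303303"))
print(rehost("http://dx.doi.org/10.1086/303303", "doi.reborn.org"))
```

Only the host changes; the part assigned by the publisher (10.1086) 
and the name (303303) are untouched. Whether the new host can resolve 
the identifier is, of course, the social problem rather than the 
technical one.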

ARK also adds the ability to retrieve a statement of commitment. I'm 
less impressed by this, as a statement is all very well, but will 
service providers actually honour it? I guess this is an issue of 
trust. I suspect that users' ratings of service providers will be much 
more accurate than a rating supplied by the service provider itself.
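
For those who have not read the ARK document, my understanding is that 
the object, its metadata, and the commitment statement are all reached 
from the same ARK, by appending "?" for metadata and "??" for the 
commitment statement. A minimal Python sketch, assuming that 
convention; the example.org host, the 12345 authority number, and the 
name are all made up for illustration.

```python
def ark_requests(ark_url):
    """Return the three URLs associated with one ARK.

    Assumes the ARK convention that "?" asks for metadata and "??"
    asks for the provider's commitment statement.
    """
    return {
        "object": ark_url,
        "metadata": ark_url + "?",
        "commitment": ark_url + "??",
    }


# Hypothetical ARK: host, authority number and name are placeholders.
for kind, url in ark_requests("http://example.org/ark:/12345/x9abc").items():
    print(kind, url)
```

The sketch only builds the URLs. Nothing in it can tell us whether the 
provider will honour what the commitment statement says.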

One issue not on this list is who generates GUIDs. ARKs and DOIs 
require some degree of centralisation because both require unique 
identifiers for organisations providing data (e.g., 10.1086 identifies 
the University of Chicago Press). This in itself requires some degree 
of service commitment. LSIDs are decentralised, in that the unique 
identifier for an organisation is provided by the DNS. If, for example, 
GBIF took on the role of providing unique identifiers for 
organisations, but then closed due to funding issues (heaven forbid), 
then we have a problem. If the DNS goes belly up, then we will have 
much more pressing issues to worry about...
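
To make the contrast concrete, here is a minimal Python sketch that 
pulls the organisation identifier out of an LSID. It assumes the usual 
LSID layout, urn:lsid:authority:namespace:object with an optional 
revision, and the example LSID (authority example.org, namespace 
"specimens") is invented for illustration.

```python
def parse_lsid(lsid):
    """Split an LSID into its authority, namespace, object and revision."""
    parts = lsid.split(":")
    if len(parts) not in (5, 6) or [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
        raise ValueError("not an LSID: %r" % lsid)
    authority, namespace, object_id = parts[2:5]
    # The revision is optional, so it may be absent.
    revision = parts[5] if len(parts) == 6 else None
    return {
        "authority": authority,   # an Internet domain name, issued via the DNS
        "namespace": namespace,
        "object": object_id,
        "revision": revision,
    }


# Hypothetical LSID used only as an example.
print(parse_lsid("urn:lsid:example.org:specimens:12345"))
```

The authority here is simply a domain name, so no new registry of 
organisations is needed. With a DOI or an ARK the equivalent part 
(such as 10.1086) has to be issued by somebody.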

Regards

Rod

On 11 Oct 2005, at 15:37, Donald Hobern wrote:

>
> [ I will be trying to provide some structure to discussions in this 
> mailing list by raising specific topics and looking for comments. 
>  Please keep the Topic number in responses ]
>  
> Topic 1: What do we mean by GUID?
>  
> The most fundamental thing that we need to establish as we consider a 
> GUID implementation is a definition for “GUID” in this context.  We 
> have been using a number of terms to describe the identifiers we need 
> (unique, resolvable, persistent, etc.).  
>  
> I’ve been spending some time following up on Rod Page’s recommendation 
> that we consider the use of Archival Resource Keys (ARK) from the 
> California Digital Library (see 
> http://wiki.gbif.org/guidwiki/wikka.php?wakka=ARK).  The CDL web site 
> includes an excellent overview of this GUID model, which also serves 
> as an excellent introduction to the issues involved.  I would urge you 
> all to read this document (it’s only nine pages long!):
>  
> http://www.cdlib.org/inside/diglib/ark/arkcdl.pdf
>  
> This document arrives at the following problem definition for 
> persistent, actionable identifiers:
>  
> 1. The goal: long-term actionable identifiers.
>    a. Requirement: that identifiers deliver you to objects (where 
>       feasible).
>    b. Requirement: that identifiers deliver you to object metadata.
>    c. Desirable: each object should wear its own identifier.
>    d. Requirement: that identifiers deliver you to statements of 
>       commitment.
> 2. The problem: URLs break for some objects (that is, associations 
>    between URLs and objects are not maintained), and we have no way to 
>    tell which ones will or won’t break.
> 3. Why URLs break: because objects are moved, removed, and replaced – 
>    completely normal activities – and the provider in each case 
>    demonstrates insufficient commitment to update indirection tables, or 
>    to plan identifier assignment carefully. Persistence is in the mission 
>    of few organizations.
> 4. Conventional hypothesis: use indirect names (PURLs, URNs, Handles) 
>    instead of URLs; what worked for DNS should work for digital object 
>    references.  Wrong. Indirection is spectacularly successful and 
>    elegant in DNS, but it’s a side issue in the provision of digital 
>    object persistence.
>  
> This document clearly identifies issues around provider service 
> commitments as the key problem that needs solving.  The construction 
> of ARKs seeks to address this in a couple of ways.  It separates the 
> role of Name Assigning Authority (i.e. who initially assigns the 
> identifier) from that of the Name Mapping Authority (i.e. who is able 
> to map the identifier to the data object at any particular time).  It 
> also defines a simple standard relationship between three things: the 
> data object, the metadata for the object, and a commitment statement 
> from the provider as to what aspects of persistence are guaranteed.
>  
> ARK is a technology that we have not really considered up to this 
> point.  My question for discussion is what, if anything, is missing or 
> wrong about the problem definition provided in this document?  If we 
> agree that it provides a crisp definition of what we need, that in 
> itself will be a major step forward.
>  
> Please provide your thoughts.
>  
> Donald
>   
>  ---------------------------------------------------------------
>  Donald Hobern (dhobern at gbif.org)
>  Programme Officer for Data Access and Database Interoperability
>  Global Biodiversity Information Facility Secretariat
>  Universitetsparken 15, DK-2100 Copenhagen,  Denmark
>  Tel: +45-35321483   Mobile: +45-28751483   Fax: +45-35321480
>  ---------------------------------------------------------------
>  
>
Professor Roderic D. M. Page
Editor, Systematic Biology
DEEB, IBLS
Graham Kerr Building
University of Glasgow
Glasgow G12 8QP
United Kingdom

Phone:    +44 141 330 4778
Fax:      +44 141 330 2792
email:    r.page at bio.gla.ac.uk
web:      http://taxonomy.zoology.gla.ac.uk/rod/rod.html
reprints: http://taxonomy.zoology.gla.ac.uk/rod/pubs.html

Subscribe to Systematic Biology through the Society of Systematic
Biologists Website:  http://systematicbiology.org
Search for taxon names at http://darwin.zoology.gla.ac.uk/~rpage/portal/
