Re: Globally Unique Identifier - Part III

24 Sep 2004

      Richard Pyle wrote:

...
...
...
Perhaps there's a
delegation mechanism that can be used?  So when a DN can't be resolved,
the system backs down to a default DN, such as gbif.org that would then
indicate that smithsonian.org is now bishop.org?
But it's not that simple, is it?  If there is an LSID:
urn:lsid:bishopmuseum.org:Specimen:1234567
and another LSID, to a completely different specimen:
urn:lsid:smithsonian.gov:Specimen:1234567
...then simply re-directing all bishopmuseum.org requests to Smithsonian
wouldn't work....would it?  Or would Smithsonian recognize the domain and
deal with it accordingly?
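
A rough sketch of how such a fallback could keep the two specimens apart, assuming the usual LSID layout urn:lsid:<authority>:<namespace>:<object>[:<revision>]. The DELEGATIONS table, the function names and the idea of a "live authorities" set are hypothetical and not taken from any LSID specification. The point is only that the original authority stays part of the lookup key after redirection, so the two specimens numbered 1234567 do not collide.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text: str) -> Lsid:
    """Split 'urn:lsid:authority:namespace:object[:revision]' into parts."""
    parts = text.split(":")
    if len(parts) not in (5, 6) or [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
        raise ValueError(f"not an LSID: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    # Lowercasing the authority is an assumption made for this sketch.
    return Lsid(parts[2].lower(), parts[3], parts[4], revision)


# Hypothetical table kept by a fallback service:
# defunct authority -> authority now answering on its behalf.
DELEGATIONS = {"bishopmuseum.org": "smithsonian.gov"}


def resolving_host(lsid: Lsid, live_authorities: set) -> str:
    """Return the host that should be asked to resolve this LSID."""
    if lsid.authority in live_authorities:
        return lsid.authority
    if lsid.authority in DELEGATIONS:
        return DELEGATIONS[lsid.authority]
    raise LookupError(f"no resolver known for {lsid.authority}")


def lookup_key(lsid: Lsid) -> tuple:
    """Key the successor would look up; it keeps the ORIGINAL authority."""
    return (lsid.authority, lsid.namespace, lsid.object_id)
```

With this arrangement the successor has to recognize the old domain and store its records under the full key, rather than treating redirected requests as its own.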
It seems to me that a lot of complexity would disappear if we could all get
behind a single issuer of GUIDs, and mirror the capability to resolve those
GUIDs on dozens or hundreds of servers around the world, and only use the
GUIDs in a semantic context that is self-evident.
Perhaps.  But what about if an institution wants to provide IDs for more
than just specimen or name objects?  Should we always rely on a single
authority to provide a mechanism for doing that? I don't think that
would go very far.
...
Re-reading something I wrote:
...
...
I would go further to suggest (as I did above) that "Name" GUIDs should
also be a subtype of Name-Reference instances (non-exclusive of Concept
subtype instances), using the Name-Reference instance that represents the
Code-recognized original description of the name as the "handle" to the Name.
Actually, it's probably safe to say that all "name-bearing" Name+Reference
instances (i.e., original descriptions) are, virtually by definition,
also "concept-bearing" Name+Reference instances.  So, not only would
name-bearing and concept-bearing Name+Reference instances be non-exclusive
of each other, it would probably be safe to think of name-bearing instances
as a subset (Subtype) of Concept-bearing instances, which themselves are a
subset (Subtype) of all Name+Reference instances.
ok.
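
The subset relationship described above could be written down as a small class hierarchy. This is only an illustration; the class names are made up for the sketch and do not come from any agreed schema.

```python
class NameReference:
    """Any usage of a name in a reference (a Name+Reference instance)."""

    def __init__(self, name: str, reference: str):
        self.name = name
        self.reference = reference


class ConceptBearing(NameReference):
    """A Name+Reference instance that also circumscribes a taxon concept."""


class NameBearing(ConceptBearing):
    """An original description: it establishes the name and, by
    implication, also carries a concept."""
```

Every NameBearing instance is then automatically a ConceptBearing instance and a NameReference instance, while the reverse does not hold.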
...
...
...
My own answers to your questions:
1) Are LSIDs the most appropriate technology?
I'm increasingly coming to that conclusion.
I agree.  The LSID system is easy to implement, stable, scalable and
does everything we need.  The DOI system is good as well, but the fee
scheme bothers me (though I understand there are ways around that).
My understanding is that it would be easy to develop a DOI-like system that
is not part of the fee-based DOI system, and I still find it appealing
because it could be as simple as an integer ID and very basic context tag.
As for LSIDs -- If I understand correctly that the purpose of the
<DomainName> portion of the LSID is to point to the one (and only?) server
that can resolve the ID, then all of a sudden I don't like them at all.  If
it's true that the embedded Domain portion of an LSID *requires* that the
domain exist for as long as the GUID exists in order for the GUID to be
useful, then I definitely have reservations.  If, on the other hand, the
Domain portion can be seen as representing the issuer (somewhat analogous to
the function of "InstitutionCode" in DwC), and could be resolved by any
server set up to deal with the <namespace> part of the LSID, then I'm much
less concerned.
The DN portion is meant to be resolvable by the DNS system.  So yes,
there is a dependency on the continued existence of the DN, but it can
be set up to be resolved by any LSID service endpoint.
...
...
...
I think the best option would be central.  The next option would be fully
distributed.  Leaving it as an option would, in my opinion, be a BIG
mistake.
I disagree- the assignment of identifiers should be by the curators of
the data.  However, I do strongly consider that there should be some
sort of trust scheme in place, where identifiers are issued only by
entities trusted by the rest of the system.  A scheme similar to that
used by certificate authorities and delegates should be adequate.
Maybe I'm misunderstanding the use of the word "issuers", but in my mind,
the issuer's job is only to provide a guaranteed-unique set of ID's.  It
would not, necessarily, be the location where the ID is applied to its
associated data.
If we use the MAC approach + a context such as an LSID or DOI form, then
there is absolutely no need for a central issuing agency.  If data
providers are more careful about assuring they meet requirements about
their identifiers, then again there is no need for a central issuer.
...
In Donald's PowerPoint file, he made reference to "mechanisms for data
providers to request and use blocks of LSIDs from central service".  Here's
how I imagine a system would work:
GBIF (or some other central entity) establishes a service that can generate
unique <objectID> numbers within its own LSID context.  The same service
also maintains a complete set of data associated with each <objectID>.
Major (and minor) institutions (essentially your set of "Trusted" entities)
would establish mirrored copies of the complete set of all data (or,
perhaps, only a filtered subset of the complete data), but would not be able
to issue new GUIDs directly.  However, the mirrored sites could serve as
real-time "pass-through" to the central site so as to be functionally able
to provide new GUIDs in real time, by retrieving them directly (in real
time) from the central server.  Also, the mirrored sites would all maintain
synchrony of their copies of the data with the central "master" copy, on a
realistic time frame (e.g., every 24 hours, or on-demand if a data provider
chose to initiate a synchronization command).
Again, just use a MAC based GUID inside an LSID context.  If you have
any MS dev tools on your machine type "guidgen" at the command prompt.
Voila!  Globally unique identifiers.  No matter how many times you
push the "New GUID" button.
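
For those without the MS tools, roughly the same thing can be done with Python's standard uuid module. This is a minimal sketch: the helper name new_lsid and the example.org authority are placeholders, and uuid1() is used because it is the timestamp-plus-node variant (the node is normally the network card's MAC address).

```python
import uuid


def new_lsid(authority: str, namespace: str) -> str:
    """Mint an LSID whose object part is a time-and-node based UUID."""
    # uuid1() combines a timestamp with the host's node ID. The node is
    # normally the MAC address; Python substitutes a random node ID if no
    # hardware address can be found.
    object_id = uuid.uuid1()
    return f"urn:lsid:{authority}:{namespace}:{object_id}"


if __name__ == "__main__":
    # Placeholder authority and namespace, for illustration only.
    print(new_lsid("example.org", "specimen"))
```

No central service is contacted at any point; uniqueness rests on the node ID and the clock.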
...
If a curator of a local institution's data needed to assign a new batch of
numbers for a new set of specimens, the curator would issue a request to the
central server (or via one of mirrored sites as a pass-through request) for
a block of N numbers.  The central server would never re-issue those same
numbers again to anyone else.  But those numbers remain "empty" until the
curator assigns them to data, and uploads that data either to the central
server or to one of the mirrors.  In other words, even though the numbers
are "issued" by a central server, they are applied to real data only by
local curators.
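
The block-request idea in the paragraph above might look something like this. It is an in-memory sketch only: the class and method names are invented, and a real service would have to persist the counter and the issue log so that numbers are never handed out twice after a restart.

```python
import threading


class BlockIssuer:
    """Hands out contiguous, never-repeated blocks of integer object IDs."""

    def __init__(self, first_id: int = 1):
        self._next_id = first_id
        self._lock = threading.Lock()
        self.issued = []  # log of (requester, first_id, last_id)

    def request_block(self, requester: str, n: int) -> range:
        """Reserve n IDs for the requester; they stay 'empty' until used."""
        if n <= 0:
            raise ValueError("block size must be positive")
        with self._lock:  # serialize requests so blocks cannot overlap
            block = range(self._next_id, self._next_id + n)
            self._next_id += n
            self.issued.append((requester, block.start, block.stop - 1))
        return block
```

The issuer only reserves numbers. Attaching data to them is left entirely to the curator who asked for the block.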
A big issue, of course, is control over editing of data associated with a
given GUID.  In the case of specimens, the central server and mirrored sites
could (perhaps at the discretion of the data curator who initially requested
the number) restrict subsequent editing of those data to a defined set of
password-protected user accounts.  In the case of more public data, such as
taxon names and publications, the control of data editing would be less
restrictive (e.g., either fully accessible by the public, or accessible to
anyone who goes to the trouble to register themselves as a taxonomist with
the central server or with any of the mirrored sites).
Maybe this approach would not be practical for specimen data -- but I think
it would be the optimal approach for taxon data.  Perhaps those two
fundamentally different kinds of data (owned, vs. public domain) need
fundamentally different approaches to GUID issuance and assignment?
...
...
3) Which objects should receive identifiers?
Specimens, References, Name-Reference intersections (Assertions), and
perhaps Agents.  [TaxonNames and Concepts can be subsets of Name-Reference
intersections].
Any object. It doesn't matter what it is, just that it can be resolved,
and when you find it, you can figure out what it is.  Sensible use of
the NameSpace portion of the LSID will help a lot with this.  A trusted
organization should issue the NameSpace portion to avoid NS conflicts.
I'd have to think this through some more.  Leaving it too open might lead to
a plethora of (potentially overlapping, but not quite equivalent)
NameSpaces, which seems like it could turn into a real mess, really quickly.
Centralized ID systems such as social security numbers in the U.S.,
telephone numbers, etc. definitely have some advantage over totally open
systems.  I suppose that the pool of NS's would be self-cleaning simply by
use or non-use....but I still wonder how much better this approach would be
over the status quo.
That's why there needs to be some agreement over the issuance of namespaces.
...
...
...
3a) Should we develop a set of object classes for biodiversity informatics
and assign identifiers to instances of all of these?
I think so, yes. Of course, it depends a bit on who you mean by "we".  I'm
thinking sensu lato.
Sure, and these could be a core from which others can be built.  But we
should absolutely not restrict the capability of the "system" to accept
new classes - even classes that represent the same information in a
different way that may be appropriate to a group of users.
Again, I'll have to think about this some more.  I certainly don't think
that the "system" should be incapable of dealing with new classes -- sort of
like how anyone can develop their own Federation Schema and use DiGIR to
establish specific information networks.  But I'd hate to see a breakdown in
the global transmission of biodiversity information simply because different
subgroups establish their own special-needs, non-mutually-compatible classes
for dealing with essentially the same kinds of information (especially if
they do not also conform to a generalized international standard).
Bah.  That's the whole point of this - to facilitate data exchange.  If
a small subgroup wants to start exchanging data in an abbreviated
format, so what?  As long as the identifiers being used are able to
resolve the type of object being passed around, and the objects conform
to their definitions, it shouldn't be a problem.  By initially
establishing a robust framework for Scientific Names and perhaps
specimen data / collections, then there will be little need for others
to recreate new ways to represent that data.  The benefits of a robust
reliable representation and provision of cheap, effective software tools
will hopefully overcome the steep learning curve needed to even
understand what's in some of these schemas.
...
...
...
4) What should be done about existing records without identifiers?
As far as I know, ALL records are currently without identifiers (unless
someone established a widely accepted GUID system and I missed the
announcement...)
All records currently have some sort of identifier, the problem is their
uniqueness is not rigorously enforced or even evaluated, so their
usefulness is probably limited.
O.K., in that case I misunderstood the meaning of "identifiers".  All
historical identifiers (e.g., catalog numbers for specimens) should be
maintained, preserved, and cross-referenced to GUIDs just like any other
metadata about the physical object.  I think of catalog numbers not so much
as unique identifiers, but as "labels" -- not altogether unlike taxonomic
names. In the databases I manage, I do not use catalog numbers as
identifiers -- the computer generates the UID, which is never seen, read,
written, or typed by a human.  That's how I'd like to see the sorts of GUIDs
we're discussing be implemented -- i.e., for the benefit of
computer-computer data exchange; not human-human data exchange or
human-computer/computer-human data exchange.
But if I want to say to you, hey look at this specimen xxx while we're
chatting from around the world using an instant messenger while
collaborating on some project, wouldn't it be nice to just be able to
type in lsid:mymuseum.org:specimen:1234 and have your client retrieve
that exact data and associated metadata directly?  A trivial example but
one that can form the foundation of some cool stuff for data exchange
and interaction.  I thought that was the whole point of these GUID
things.  But maybe I'm mistaken?
...
...
...
4c) Should the provider software be modified to generate "soft" identifiers
(ones which we cannot guarantee in all cases to be unique) based e.g. on the
combination of InstitutionCode, CollectionCode and CatalogNumber?
As an interim solution, perhaps.  See my comments under "Slide 2" above.
Yes, but not soft.  The providers should assign their own identifiers,
but there must be a mechanism to ensure that identifiers are being
properly assigned.
Agreed -- but I still think of these as "soft" identifiers, because
CatalogNumber values can change over time, in certain circumstances.  GUIDs
should *never* need to be changed (even if the institution that issued them
vanishes without a trace from the face of the Earth).
Yep.  That would be the ideal.
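
To make the "soft" versus permanent distinction concrete, here is a small sketch. The soft key is just the three Darwin Core fields joined together; the GuidRegistry class, its method names and the ":" separator are hypothetical. When a catalog number changes, the soft key changes but the GUID it points to does not.

```python
import uuid


def soft_id(institution_code: str, collection_code: str, catalog_number: str) -> str:
    """Build the 'soft' key from the three Darwin Core fields."""
    parts = (institution_code, collection_code, catalog_number)
    return ":".join(p.strip() for p in parts)


class GuidRegistry:
    """Maps changeable soft keys onto GUIDs that never change."""

    def __init__(self):
        self._by_soft_id = {}

    def register(self, key: str) -> str:
        """Return the GUID for a soft key, minting one on first sight."""
        if key not in self._by_soft_id:
            self._by_soft_id[key] = str(uuid.uuid4())
        return self._by_soft_id[key]

    def recatalog(self, old_key: str, new_key: str) -> None:
        """Point a new soft key at the existing GUID; the old key is kept
        as an alias so historical references still resolve."""
        self._by_soft_id[new_key] = self._by_soft_id[old_key]
```

The catalog number is treated here as metadata that is cross-referenced to the GUID, rather than as the identifier itself.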
...
...
Revision information is very helpful in dealing with errors such as
keystroke errors or other such details that do not change the object.
I agree the revision information *can* be helpful in dealing with errors;
but I don't see that function as being integral to the assignment of GUID
values.
Except in the somewhat bizarre case when you need the old version of the
object.
...
...
Not many.  It seems most collections don't record any history in their
record edits, so without a major alteration in the way the data are
stored, it will be a significant undertaking to provide useful revision
information.
For what it's worth, the databases I have developed for my institution are
designed to log every change made to every field (except
performance-enhancing, purely derivative fields), including what the
previous value was, who made the change, and when the change was made.  When
records are deleted, a "snapshot" of the value of every non-null field is
logged, including the time the record was deleted, and by whom.  The reason
I say all of this is to underscore that my stance on not including
versioning IDs as part of a GUID system is NOT from lack of appreciation for
the value of preserving edit histories (something I clearly value very, very
much -- given that the total diskspace occupied by my edit logs exceeds the
total diskspace occupied by the "real" data!)
Awesome.  That's a nice way to do it.
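
A bare-bones sketch of that kind of field-level logging, for anyone wanting to try it. It is not Rich's actual design; the class names, the in-memory storage and the field names are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any


@dataclass
class Change:
    record_id: str
    field_name: str
    old_value: Any
    new_value: Any
    changed_by: str
    changed_at: datetime


@dataclass
class LoggedStore:
    records: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def set_field(self, record_id: str, field_name: str, value: Any, user: str) -> None:
        """Write a value and log the previous value, who changed it, and when."""
        record = self.records.setdefault(record_id, {})
        old = record.get(field_name)
        if old == value:
            return  # nothing changed, nothing to log
        record[field_name] = value
        now = datetime.now(timezone.utc)
        self.log.append(Change(record_id, field_name, old, value, user, now))

    def delete(self, record_id: str, user: str) -> None:
        """Remove a record, logging a snapshot of every non-null field."""
        removed = self.records.pop(record_id)
        now = datetime.now(timezone.utc)
        for name, value in removed.items():
            if value is not None:
                self.log.append(Change(record_id, name, value, None, user, now))
```

Note that the log lives alongside the data and is keyed by the record's own ID, so nothing about it requires a version number inside the GUID.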
...
In closing, I apologize to those who find my overly-long posts on this topic
to be an annoyance.  I also am starting to wonder:  is this the appropriate
email forum to have this discussion?
Yeah, good question.  Maybe this should be on the GBIF DADI list or TDWG
general?  Or even the LSID list?
...
Aloha,
Rich
Kia ora,
   Dave V.

Dave Vieglais