Globally Unique Identifier - Part III

Fri Sep 24 21:35:37 CEST 2004

Richard Pyle wrote:

...

>
>>Perhaps there's a
>>delegation mechanism that can be used?  So when a DN can't be resolved,
>>the system backs down to a default DN, such as gbif.org that would then
>>indicate that smithsonian.org is now bishop.org?
>
>
> But it's not that simple, is it?  If there is an LSID:
>
> urn:lsid:bishopmuseum.org:Specimen:1234567
>
> and another LSID, to a completely different specimen:
>
> urn:lsid:smithsonian.gov:Specimen:1234567
>
> ...then simply re-directing all bishopmuseum.org requests to Smithsonian
> wouldn't work....would it?  Or would Smithsonian recognize the domain and
> deal with it accordingly?
>
> It seems to me that a lot of complexity would disappear if we could all get
> behind a single issuer of GUIDs, and mirror the capability to resolve those
> GUIDs on dozens or hundreds of servers around the world, and only use the
> GUIDs in a semantic context that is self-evident.

Perhaps.  But what about if an institution wants to provide IDs for more
than just specimen or name objects?  Should we always rely on a single
authority to provide a mechanism for doing that? I don't think that
would go very far.
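
To make the collision question concrete, here is a minimal sketch that splits the two LSIDs quoted above into their parts and compares them.  The LSID class and the parse_lsid helper are hypothetical names for illustration only, and the sketch assumes the usual urn:lsid:<authority>:<namespace>:<object>[:<revision>] layout.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str              # domain name of the issuer
    namespace: str              # e.g. "Specimen"
    object_id: str              # e.g. "1234567"
    revision: Optional[str] = None


def parse_lsid(text: str) -> LSID:
    """Split 'urn:lsid:<authority>:<namespace>:<object>[:<revision>]'."""
    parts = text.split(":")
    prefix = [p.lower() for p in parts[:2]]
    if len(parts) not in (5, 6) or prefix != ["urn", "lsid"]:
        raise ValueError(f"not an LSID: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    # The authority is a domain name, so it is normalised to lower case.
    return LSID(parts[2].lower(), parts[3], parts[4], revision)


bishop = parse_lsid("urn:lsid:bishopmuseum.org:Specimen:1234567")
smithsonian = parse_lsid("urn:lsid:smithsonian.gov:Specimen:1234567")

# Namespace and object number match, but the full identifiers do not.
same_local_part = bishop[1:] == smithsonian[1:]
same_identifier = bishop == smithsonian
```

Because the authority is part of the identifier, a redirect that drops or rewrites it would merge two different specimens; any delegation scheme would have to carry the original authority along with the request.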

>
> Re-reading something I wrote:
>
>
>>>I would go further
>>>to suggest (as I did above) that "Name" GUIDs should also be a subtype of
>>>Name-Reference instances (non-exclusive of Concept subtype instances), using
>>>the Name-Reference instance that represents the Code-recognized original
>>>description of the name as the "handle" to the Name.
>
>
> Actually, it's probably safe to say that all "name-bearing" Name+Reference
> instances (i.e., original descriptions) are, virtually by definition,
> also "concept-bearing" Name+Reference instances.  So, not only would
> name-bearing and concept-bearing Name+Reference instances be non-exclusive
> of each other, it would probably be safe to think of name-bearing instances
> as a subset (Subtype) of Concept-bearing instances, which themselves are a
> subset (Subtype) of all Name+Reference instances.
>

ok.

>
>>>My own answers to your questions:
>>>
>>>1) Are LSIDs the most appropriate technology?
>>>
>>>        I'm increasingly coming to that conclusion.
>>
>>I agree.  The LSID system is easy to implement, stable, scalable and
>>does everything we need.  The DOI system is good as well, but the fee
>>scheme bothers me (though I understand there are ways around that).
>
>
> My understanding is that it would be easy to develop a DOI-like system that
> is not part of the fee-based DOI system, and I still find it appealing
> because it could be as simple as an integer ID and a very basic context tag.
>
> As for LSIDs -- If I understand correctly that the purpose of the
> <DomainName> portion of the LSID is to point to the one (and only?) server
> that can resolve the ID, then all of a sudden I don't like them at all.  If
> it's true that the embedded Domain portion of an LSID *requires* that the
> domain exist for as long as the GUID exists in order for the GUID to be
> useful, then I definitely have reservations.  If, on the other hand, the
> Domain portion can be seen as representing the issuer (somewhat analogous to
> the function of "InstitutionCode" in DwC), and could be resolved by any
> server set up to deal with the <namespace> part of the LSID, then I'm much
> less concerned.
>

The DN portion is meant to be resolvable by the DNS system.  So yes,
there is a dependency on the continued existence of the DN, but it can
be set up to be resolved by any LSID service endpoint.
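
A rough sketch of that lookup, combined with the "back down to a default DN" idea quoted at the top of this message.  As I understand the LSID specification, the authority is located through a DNS SRV record named _lsid._tcp.<authority>; everything else here is an assumption for illustration: the FAKE_DNS table stands in for real DNS, the host names in it are invented, and gbif.org as the fallback is just the example used earlier in the thread.

```python
from typing import Dict, Optional, Tuple

Endpoint = Tuple[str, int]  # (host, port) of an LSID resolution service

# Hypothetical stand-in for DNS: SRV query name -> service endpoint.
FAKE_DNS: Dict[str, Endpoint] = {
    "_lsid._tcp.bishopmuseum.org": ("lsid.bishopmuseum.org", 80),
    "_lsid._tcp.gbif.org": ("lsid.gbif.org", 80),
}

DEFAULT_AUTHORITY = "gbif.org"  # assumed fallback, not an agreed convention


def srv_name(authority: str) -> str:
    """Build the SRV query name for an LSID authority."""
    return f"_lsid._tcp.{authority.lower()}"


def find_endpoint(lsid: str,
                  dns: Dict[str, Endpoint] = FAKE_DNS) -> Optional[Endpoint]:
    """Try the LSID's own authority first, then the default authority."""
    authority = lsid.split(":")[2]
    for name in (srv_name(authority), srv_name(DEFAULT_AUTHORITY)):
        if name in dns:
            return dns[name]
    return None


own = find_endpoint("urn:lsid:bishopmuseum.org:Specimen:1234567")
fallback = find_endpoint("urn:lsid:smithsonian.gov:Specimen:1234567")
```

In this toy table smithsonian.gov has no SRV entry, so the second lookup lands on the default endpoint, which would then have to know where that authority's data went.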

>
>>>        I think the best option would be central.  The next option
>>>would be fully distributed.  Leaving it as an option would, in my
>>>opinion, be a BIG mistake.
>>
>>I disagree- the assignment of identifiers should be by the curators of
>>the data.  However, I do strongly consider that there should be some
>>sort of trust scheme in place, where identifiers are issued only by
>>entities trusted by the rest of the system.  A scheme similar to that
>>used by certificate authorities and delegates should be adequate.
>
>
> Maybe I'm misunderstanding the use of the word "issuers", but in my mind,
> the issuer's job is only to provide a guaranteed-unique set of ID's.  It
> would not, necessarily, be the location where the ID is applied to its
> associated data.
>

If we use the MAC approach + a context such as an LSID or DOI form, then
there is absolutely no need for a central issuing agency.  If data
providers are more careful about assuring they meet requirements about
their identifiers, then again there is no need for a central issuer.

> In Donald's PowerPoint file, he made reference to "mechanisms for data
> providers to request and use blocks of LSIDs from central service".  Here's
> how I imagine a system would work:
>
> GBIF (or some other central entity) establishes a service that can generate
> unique <objectID> numbers within its own LSID context.  The same service
> also maintains a complete set of data associated with each <objectID>.
> Major (and minor) institutions (essentially your set of "Trusted" entities)
> would establish mirrored copies of the complete set of all data (or,
> perhaps, only a filtered subset of the complete data), but would not be able
> to issue new GUIDs directly.  However, the mirrored sites could serve as
> real-time "pass-through" to the central site so as to be functionally able
> to provide new GUIDs in real time, by retrieving them directly (in real
> time) from the central server.  Also, the mirrored sites would all maintain
> synchrony of their copies of the data with the central "master" copy, on a
> realistic time frame (e.g., every 24 hours, or on-demand if a data provider
> chose to initiate a synchronization command).

Again, just use a MAC based GUID inside an LSID context.  If you have
any MS dev tools on your machine, type "guidgen" at the command prompt.
Voila!  Globally unique identifiers.  No matter how many times you
push the "New GUID" button.
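
For anyone without the MS tools, a minimal sketch of the same idea in Python.  uuid.uuid1() builds the value from the host's hardware address plus a timestamp (Python falls back to a random node value if it cannot read a MAC).  The new_lsid helper, the mymuseum.org authority and the specimen namespace are placeholders, not an agreed scheme.

```python
import uuid


def new_lsid(authority: str, namespace: str) -> str:
    """Mint an LSID locally, with no central issuer involved."""
    # uuid1() = hardware (MAC) address + timestamp + clock sequence.
    object_id = uuid.uuid1()
    return f"urn:lsid:{authority}:{namespace}:{object_id}"


# Placeholder authority and namespace, for illustration only.
ids = [new_lsid("mymuseum.org", "specimen") for _ in range(1000)]
all_unique = len(set(ids)) == len(ids)
```

The UUID text uses hyphens rather than colons, so it drops into the object part of the LSID without disturbing the colon-separated structure.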

>
> If a curator of a local institution's data needed to assign a new batch of
> numbers for a new set of specimens, the curator would issue a request to the
> central server (or via one of mirrored sites as a pass-through request) for
> a block of N numbers.  The central server would never re-issue those same
> numbers again to anyone else.  But those numbers remain "empty" until the
> curator assigns them to data, and uploads that data either to the central
> server or to one of the mirrors.  In other words, even though the numbers
> are "issued" by a central server, they are applied to real data only by
> local curators.
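
The block-request scheme in the quoted paragraph above can be sketched as follows.  This is only an illustration under stated assumptions: the BlockIssuer class, its method names and the requester strings are hypothetical, and a real central service would need persistent storage and authentication.

```python
from typing import List, Tuple


class BlockIssuer:
    """Hands out consecutive object numbers in blocks, never twice."""

    def __init__(self, first: int = 1) -> None:
        self._next = first
        self.issued: List[Tuple[str, range]] = []  # (requester, block)

    def request_block(self, requester: str, n: int) -> range:
        if n <= 0:
            raise ValueError("block size must be positive")
        block = range(self._next, self._next + n)
        self._next += n  # these numbers are never handed out again
        self.issued.append((requester, block))
        return block


issuer = BlockIssuer()
first_block = issuer.request_block("curator-a", 100)
second_block = issuer.request_block("curator-b", 50)

# Numbers stay "empty" until a curator attaches data to them.
records = {}
records[first_block[0]] = {"note": "hypothetical specimen data"}
unused = [n for n in first_block if n not in records]
```

The issuer only guarantees that blocks never overlap; which numbers actually carry data is decided locally, as the quoted text describes.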
>
> A big issue, of course, is control over editing of data associated with a
> given GUID.  In the case of specimens, the central server and mirrored sites
> could (perhaps at the discretion of the data curator who initially requested
> the number) restrict subsequent editing of those data to a defined set of
> password-protected user accounts.  In the case of more public data, such as
> taxon names and publications, the control of data editing would be less
> restrictive (e.g., either fully accessible by the public, or accessible to
> anyone who goes to the trouble to register themselves as a taxonomist with
> the central server or with any of the mirrored sites).
>
> Maybe this approach would not be practical for specimen data -- but I think
> it would be the optimal approach for taxon data.  Perhaps those two
> fundamentally different kinds of data (owned, vs. public domain) need
> fundamentally different approaches to GUID issuance and assignment?
>
>
>>>3) Which objects should receive identifiers?
>>>
>>>        Specimens, References, Name-Reference intersections (Assertions),
>>>and perhaps Agents.  [TaxonNames and Concepts can be subsets of
>>>Name-Reference intersections].
>>
>>Any object. It doesn't matter what it is, just that it can be resolved,
>>and when you find it, you can figure out what it is.  Sensible use of
>>the NameSpace portion of the LSID will help a lot with this.  A trusted
>>organization should issue the NameSpace portion to avoid NS conflicts.
>
>
> I'd have to think this through some more.  Leaving it too open might lead to
> a plethora of (potentially overlapping, but not quite equivalent)
> NameSpaces, which seems like it could turn into a real mess, really quickly.
> Centralized ID systems such as social security numbers in the U.S.,
> telephone numbers, etc. definitely have some advantage over totally open
> systems.  I suppose that the pool of NS's would be self-cleaning simply by
> use or non-use....but I still wonder how much better this approach would be
> over the status quo.

That's why there needs to be some agreement over the issuance of namespaces.

>
>
>>>3a) Should we develop a set of object classes for biodiversity informatics
>>>and assign identifiers to instances of all of these?
>>>
>>>        I think so, yes. Of course, it depends a bit on who you mean by
>>>"we".  I'm thinking sensu lato.
>>
>>Sure, and these could be a core from which others can be built.  But we
>>should absolutely not restrict the capability of the "system" to accept
>>new classes - even classes that represent the same information in a
>>different way that may be appropriate to a group of users.
>
>
> Again, I'll have to think about this some more.  I certainly don't think
> that the "system" should be incapable of dealing with new classes -- sort of
> like how anyone can develop their own Federation Schema and use DiGIR to
> establish specific information networks.  But I'd hate to see a breakdown in
> the global transmission of biodiversity information simply because different
> subgroups establish their own special-needs, non-mutually-compatible classes
> for dealing with essentially the same kinds of information (especially if
> they do not also conform to a generalized international standard).

Bah.  That's the whole point of this - to facilitate data exchange.  If
a small subgroup wants to start exchanging data in an abbreviated
format, so what?  As long as the identifiers being used are able to
resolve the type of object being passed around, and the objects conform
to their definitions, it shouldn't be a problem.  By initially
establishing a robust framework for Scientific Names and perhaps
specimen data / collections, then there will be little need for others
to recreate new ways to represent that data.  The benefits of a robust
reliable representation and provision of cheap, effective software tools
will hopefully overcome the steep learning curve needed to even
understand what's in some of these schemas.

>
>
>>>4) What should be done about existing records without identifiers?
>>>
>>>        As far as I know, ALL records are currently without identifiers
>>>(unless someone established a widely accepted GUID system and I missed
>>>the announcement...)
>>
>>All records currently have some sort of identifier, the problem is their
>>uniqueness is not rigorously enforced or even evaluated, so their
>>usefulness is probably limited.
>
>
> O.K., in that case I misunderstood the meaning of "identifiers".  All
> historical identifiers (e.g., catalog numbers for specimens) should be
> maintained, preserved, and cross-referenced to GUIDs just like any other
> metadata about the physical object.  I think of catalog numbers not so much
> as unique identifiers, but as "labels" -- not altogether unlike taxonomic
> names. In the databases I manage, I do not use catalog numbers as
> identifiers -- the computer generates the UID, which is never seen, read,
> written, or typed by a human.  That's how I'd like to see the sorts of GUIDs
> we're discussing be implemented -- i.e., for the benefit of
> computer-computer data exchange; not human-human data exchange or
> human-computer/computer-human data exchange.
>

But if I want to say to you, hey look at this specimen xxx while we're
chatting from around the world using an instant messenger while
collaborating on some project, wouldn't it be nice to just be able to
type in lsid:mymuseum.org:specimen:1234 and have your client retrieve
that exact data and associated metadata directly?  A trivial example but
one that can form the foundation of some cool stuff for data exchange
and interaction.  I thought that was the whole point of these GUID
things.  But maybe I'm mistaken?

>
>>>4c) Should the provider software be modified to generate "soft" identifiers
>>>(ones which we cannot guarantee in all cases to be unique) based e.g. on
>>>the combination of InstitutionCode, CollectionCode and CatalogNumber?
>>>
>>>        As an interim solution, perhaps.  See my comments under "Slide 2"
>>>above.
>>
>>Yes, but not soft.  The providers should assign their own identifiers,
>>but there must be a mechanism to ensure that identifiers are being
>>properly assigned.
>
>
> Agreed -- but I still think of these as "soft" identifiers, because
> CatalogNumber values can change over time, in certain circumstances.  GUIDs
> should *never* need to be changed (even if the institution that issued them
> vanishes without a trace from the face of the Earth).
>

Yep.  That would be the ideal.
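
A small sketch of the difference between the two kinds of identifier.  The Specimen class, the soft_id format and the field values are all made up for illustration, and a random UUID stands in for whatever GUID scheme ends up being used.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Specimen:
    institution_code: str
    collection_code: str
    catalog_number: str
    # Assigned once at creation and never derived from the other fields.
    guid: str = field(default_factory=lambda: str(uuid.uuid4()))

    def soft_id(self) -> str:
        """'Soft' identifier built from the three descriptive fields."""
        return ":".join(
            [self.institution_code, self.collection_code, self.catalog_number]
        )


s = Specimen("MYMUSEUM", "FISH", "1234")  # hypothetical values
guid_before, soft_before = s.guid, s.soft_id()

s.catalog_number = "1234a"  # the specimen is re-catalogued
guid_after, soft_after = s.guid, s.soft_id()
```

After the catalog number changes, the soft identifier no longer matches anything recorded earlier, while the GUID still points at the same object.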

>
>>Revision information is very helpful in dealing with errors such as
>>keystroke errors or other such details that do not change the object.
>
>
> I agree the revision information *can* be helpful in dealing with errors;
> but I don't see that function as being integral to the assignment of GUID
> values.

Except in the somewhat bizarre case when you need the old version of the
object.

>
>
>>Not many.  It seems most collections don't record any history in their
>>record edits, so without a major alteration in the way the data are
>>stored, it will be a significant undertaking to provide useful revision
>>information.
>
>
> For what it's worth, the databases I have developed for my institution are
> designed to log every change made to every field (except
> performance-enhancing, purely derivative fields), including what the
> previous value was, who made the change, and when the change was made.  When
> records are deleted, a "snapshot" of the value of every non-null field is
> logged, including the time the record was deleted, and by whom.  The reason
> I say all of this is to underscore that my stance on not including
> versioning IDs as part of a GUID system is NOT from lack of appreciation for
> the value of preserving edit histories (something I clearly value very, very
> much -- given that the total diskspace occupied by my edit logs exceeds the
> total diskspace occupied by the "real" data!)

Awesome.  That's a nice way to do it.
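
For collections that keep no edit history, the kind of logging described in the quoted text might look roughly like this.  It is a guess at the general shape, not a description of Rich's databases; the LoggedRecord class, the field names and the user name are invented for the example.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List


class LoggedRecord:
    """A record whose edits and deletion are appended to a change log."""

    def __init__(self, **fields: Any) -> None:
        self.fields: Dict[str, Any] = dict(fields)
        self.log: List[Dict[str, Any]] = []

    def set(self, name: str, value: Any, user: str) -> None:
        previous = self.fields.get(name)
        if previous == value:
            return  # nothing changed, so nothing is logged
        self.log.append({
            "field": name,
            "previous": previous,
            "new": value,
            "user": user,
            "when": datetime.now(timezone.utc),
        })
        self.fields[name] = value

    def delete(self, user: str) -> None:
        # Snapshot every non-null field before the record is emptied.
        snapshot = {k: v for k, v in self.fields.items() if v is not None}
        self.log.append({
            "deleted": snapshot,
            "user": user,
            "when": datetime.now(timezone.utc),
        })
        self.fields.clear()


rec = LoggedRecord(locality="Oahu", collector=None)
rec.set("locality", "O'ahu", user="curator1")
rec.delete(user="curator1")
```

The log lives alongside the record and is keyed by nothing but order and time, which fits the view that edit history can be kept without building version numbers into the GUID itself.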

>
> In closing, I apologize to those who find my overly-long posts on this topic
> to be an annoyance.  I also am starting to wonder:  is this the appropriate
> email forum to have this discussion?
>

Yeah, good question.  Maybe this should be on the GBIF DADI list or TDWG
general?  Or even the LSID list?

> Aloha,
> Rich
>

Kia ora,
   Dave V.