Globally Unique Identifier - Part III

Fri Sep 24 00:54:54 CEST 2004

> > It seems to me that a lot of complexity would disappear if we could all
> > get behind a single issuer of GUIDs, and mirror the capability to resolve
> > those GUIDs on dozens or hundreds of servers around the world, and only use
> > the GUIDs in a semantic context that is self-evident.
>
> Perhaps.  But what if an institution wants to provide IDs for more
> than just specimen or name objects?  Should we always rely on a single
> authority to provide a mechanism for doing that? I don't think that
> would go very far.

No -- we come together as a community to agree upon mechanisms for GUID
assignment and exchange among well-defined classes that we routinely wish to
share and aggregate information about (Specimen/Observations,
Names/Concepts/Assertions, References, Agents, and maybe a couple of
others). If Bishop Museum wanted to provide IDs for, say, Baseball cards, or
Star Wars memorabilia, or some other object type, then it would coordinate
with other holders of data relating to baseball cards [or perhaps trading
cards in general] or Star Wars memorabilia [or perhaps movie memorabilia in
general], and they would decide amongst themselves whether DOIs or LSIDs or
UUIDs, or MACs, or whatever make the most sense for their particular needs.

If Bishop Museum wanted to provide IDs for something like condition reports
of specimens over time, or changing dynamics of wild populations, or DNA
sequences, or Images, or any number of other objects that are relevant to
bioinformatics, then they would propose a new class of objects to TDWG,
along with an appropriate schema and recommended standards for implementation,
and other institutions with like-data would discuss the optimal approach to
dealing with those sorts of data. The "community" would debate the various
options in the context of integration with existing data exchange protocols,
etc., and a standard would emerge.

Don't get me wrong....the letter "G" is my favorite letter in DiGIR.  And
the more I think about it, the more I understand why you want the IDs to
have embedded context that can be resolved automatically, without any custom
tweaking to accommodate new classes of IDs.  But the costs (or at least my
perception of the costs) do tend to frighten me a bit.  It makes a lot more
sense for specimen data.  If Bishop Museum's server is down -- tough luck,
you can't have our data.  But I would *HATE* to rely on any one particular
server to get public-domain information like taxon names whenever I needed
it.  Yes, the single-point failure problem is real.  But there are
technological solutions to minimize those sorts of concerns (on an
intermittent basis) down to something indistinguishable from zero.  I don't
think I have ever gone to Google and found it down.  And there can be
redundancy built into the system.

So now let me ask you this:  Are the advantages of serving the GUID needs of
both specimens and taxon names via an identical, generalized GUID scheme
greater than the advantages of custom-tailoring the GUID scheme for each
different sort of data (owned, vs. public-domain)?  I think I already know
the short answer is "Yes", but I'm interested in the reasons (e.g., common
set of software tools to deal with both, etc.)

> The DN portion is meant to be resolvable by the DNS system.  So yes,
> there is a dependency on the continued existence of the DN, but it can
> be set up to be resolved by any LSID service endpoint.

I think I need to learn more about LSIDs before forming too strong an
opinion either for or against them.

> If we use the MAC approach + a context such as an LSID or DOI form, then
>   there is absolutely no need for a central issuing agency.

Agreed!  And that might be the better approach (for a lot of reasons).  My
only concern would be to what extent desktop database applications can use
MAC values as primary keys (compared to using something like long integers)
efficiently and effectively, when manipulating large datasets in real time.
I guess this may be a trivial point in the grand scheme of things -- but
ultimately taxonomists will want to be able to work with large datasets on
their personal computers.
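
One common way to sidestep the primary-key concern is to keep a compact
integer key for local joins and store the GUID in a separate unique column
that is used only for exchange. The following is a minimal sketch of that
idea, not a recommendation from this thread: the table name, column names,
and helper functions are all hypothetical, and a random UUID stands in for
whatever GUID scheme is eventually chosen.

```python
import sqlite3
import uuid

# In-memory database, used here purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE taxon_name ("
    " id INTEGER PRIMARY KEY,"     # compact local key, used for joins
    " guid BLOB NOT NULL UNIQUE,"  # 16-byte global identifier, used for exchange
    " name TEXT NOT NULL)"
)


def add_name(name):
    """Insert a name, returning (local integer id, newly minted GUID)."""
    guid = uuid.uuid4()  # assumption: any GUID flavor would work the same way
    cur = conn.execute(
        "INSERT INTO taxon_name (guid, name) VALUES (?, ?)",
        (guid.bytes, name),
    )
    return cur.lastrowid, guid


def local_id_for(guid):
    """Translate an incoming GUID to the local integer key, or None."""
    row = conn.execute(
        "SELECT id FROM taxon_name WHERE guid = ?", (guid.bytes,)
    ).fetchone()
    return row[0] if row else None


local_id, guid = add_name("Aus bus")  # placeholder name
```

With this layout the desktop application works with long integers
internally and only touches the GUID column when importing or exporting
records.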

> Again, just use a MAC based GUID inside an LSID context.  If you have
> any MS dev tools on your machine type "guidgen" at the command prompt.
>   Voila!  Globally unique identifiers.  No matter how many times you
> push the "New GUID" button.

Yes, and this may be the best approach. MACs are scary for people to read,
but as I said before, people really shouldn't be reading them.
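
For readers without the MS tools, the same idea can be sketched with
Python's standard uuid module. uuid.uuid1() builds its value from a
timestamp plus a node field that is normally the machine's MAC address.
The make_lsid helper is hypothetical, and the authority and namespace
values are simply the ones used as an example elsewhere in this thread.

```python
import uuid


def make_lsid(authority, namespace):
    """Mint an identifier locally and wrap it in LSID URN form.

    No central issuing agency is contacted: uniqueness comes from the
    timestamp and node fields of the version 1 UUID.
    """
    # Note: if Python cannot determine a MAC address it substitutes a
    # random node value, so the "MAC based" part is not guaranteed.
    object_id = uuid.uuid1()
    return f"urn:lsid:{authority}:{namespace}:{object_id}"


lsid = make_lsid("mymuseum.org", "specimen")
print(lsid)
```

Calling make_lsid repeatedly plays the role of pushing the "New GUID"
button: each call yields a different identifier.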

But in order to avoid having a bottleneck for data resolution at the point
of MAC issuance, you would need to get the data mirrored efficiently across
many servers. This could be done like DNS (as I understand it), where the
changes propagate through a haphazard web of servers.  But I think I'd be
more comfortable having some sort of centralized hub or coordinator (like
GBIF) to ensure data mirroring is done efficiently and completely.  On this
point, I could very well be persuaded to change my view.
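
From the client's side, mirroring mostly means "try the next server if
this one is down". The sketch below illustrates only that failover step,
under the assumption that every mirror holds a copy of the same records.
The mirrors are simulated in memory, and all names (MirrorDown,
make_mirror, resolve) are hypothetical rather than part of any existing
protocol.

```python
class MirrorDown(Exception):
    """Raised by a simulated mirror that is offline."""


def make_mirror(records, online=True):
    """Build a lookup function standing in for one mirror server."""
    def lookup(guid):
        if not online:
            raise MirrorDown()
        return records.get(guid)  # None if this mirror lacks the record
    return lookup


def resolve(guid, mirrors):
    """Return the first record found, skipping mirrors that are down."""
    for lookup in mirrors:
        try:
            record = lookup(guid)
        except MirrorDown:
            continue  # this server is offline, so try the next one
        if record is not None:
            return record
    return None  # no reachable mirror had the record


records = {"abc": "Aus bus"}  # placeholder GUID and name
mirrors = [
    make_mirror(records, online=False),  # offline server
    make_mirror({}),                     # online, but not yet synchronized
    make_mirror(records),                # online and up to date
]
```

How the mirrors get synchronized in the first place (a central hub versus
DNS-style propagation) is exactly the open question above, and this
sketch does not address it.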

> > Again, I'll have to think about this some more.  I certainly don't think
> > that the "system" should be incapable of dealing with new classes -- sort
> > of like how anyone can develop their own Federation Schema and use DiGIR
> > to establish specific information networks.  But I'd hate to see a
> > breakdown in the global transmission of biodiversity information simply
> > because different subgroups establish their own special-needs,
> > non-mutually-compatible classes for dealing with essentially the same
> > kinds of information (especially if they do not also conform to a
> > generalized international standard).
>
> Bah.  That's the whole point of this - to facilitate data exchange.  If
> a small subgroup wants to start exchanging data in an abbreviated
> format, so what?

No problem if they also continue to provide the relevant bits via the
"conventional" means.  But BIG problem if everyone in Europe gravitates
towards one conventional standard, and everyone in the U.S. gravitates
towards a different flavor, and everyone in Asia gravitates towards yet a
different flavor.  I'm thinking Cell Phone standards, NTSC vs. PAL,
competing HDTV standards, DVD+R vs. DVD-R, etc., etc.

Already we have some problems in dealing with the fact that ICZN and ICBN do
not have perfectly compatible Codes of nomenclature.  Imagine if different
geographic regions adopted their own versions of nomenclatural Codes; or if
the Fish people got together and decided they wanted slightly different rules
to apply to their names.  Such freedom would not likely contribute favorably
to scientific progress.  Taxonomists conform to respective Codes of
nomenclature not because they are perfect in how they establish names, but
because the community has converged on a single standard.

> As long as the identifiers being used are able to
> resolve the type of object being passed around,

...which they presumably wouldn't be if the providing server were
offline....no?

> and the objects conform
> to their definitions, it shouldn't be a problem.  By initially
> establishing a robust framework for Scientific Names and perhaps
> specimen data / collections, then there will be little need for others
> to recreate new ways to represent that data.

...so why compromise the optimality of the system in order to accommodate
those who might prefer to define objects in a slightly different way from
everyone else?

> The benefits of a robust
> reliable representation and provision of cheap, effective software tools
> will hopefully overcome the steep learning curve needed to even
> understand what's in some of these schemas.

Yes, but this is something we both agree on!

> But if I want to say to you, hey look at this specimen xxx while we're
> chatting from around the world using an instant messenger while
> collaborating on some project, wouldn't it be nice to just be able to
> type in lsid:mymuseum.org:specimen:1234 and have your client retrieve
> that exact data and associated metadata directly?

But such a chat would certainly provide an opportunity to suggest context.
Wouldn't it be easier if I said "Hey, go to GBIF and look up SpecimenID
1234"? This scenario seems to apply equally to both of our world
perspectives. The only advantage I would see for LSIDs in this case is if I
forgot to mention to my colleague that I wanted her to look up a specimen,
rather than a taxon name, and just told her to "look up 1234".  Not likely
in a human-human conversation.
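
The difference between the two styles is easier to see from the
software's point of view: a bare "1234" carries no context, while the
LSID form can be split into an authority, a namespace, and an object ID
without anyone explaining what it refers to. Here is a minimal sketch of
that split, assuming the colon-separated layout shown in the example
above (with or without a leading "urn:" and with an optional trailing
revision). The Lsid type and parse_lsid function are hypothetical.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str   # e.g. the issuing institution's domain name
    namespace: str   # the class of object, e.g. "specimen"
    object_id: str   # the identifier within that namespace
    revision: Optional[str] = None


def parse_lsid(text):
    """Split an LSID-style string into its parts, or raise ValueError."""
    parts = text.strip().split(":")
    if parts[0].lower() == "urn":
        parts = parts[1:]  # accept the full "urn:lsid:..." form too
    if not parts or parts[0].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    fields = parts[1:]
    # Require authority, namespace, and object ID; revision is optional.
    if len(fields) not in (3, 4) or not all(fields[:3]):
        raise ValueError(f"malformed LSID: {text!r}")
    return Lsid(*fields)


parsed = parse_lsid("lsid:mymuseum.org:specimen:1234")
```

Here parsed.namespace is "specimen", so a client can tell it is being
asked for a specimen rather than a taxon name, whereas parse_lsid("1234")
raises ValueError because the context is missing.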

> A trivial example but
> one that can form the foundation of some cool stuff for data exchange
> and interaction.  I thought that was the whole point of these GUID
> things.  But maybe I'm mistaken?

Yes, I certainly agree that this is the whole point of GUIDs!  What I don't
understand is why LSIDs (with domain-active requirements) fulfill this more
effectively (all costs and benefits considered) than other GUID systems.

> Except in the somewhat bizarre case when you need the old version of the
> object.

Yeah, but as you say this is a bizarre case.  In those bizarre cases, you
can send an email to the data manager and ask for a report on the edit
history of the record.  If they got it, they got it. But I don't see this as
being such a routine need that it needs to be accommodated by the GUID
system (unless, as I suggested earlier, it was otherwise completely
transparent and could be ignored without consequence).

> Yeah, good question.  Maybe this should be on the GBIF DADI list or TDWG
> general?  Or even the LSID list?

If it moves, let me know where it goes.  Right now, it's time for bed....

Cheers,
Rich