How GUIDs will be used

Mon Jan 30 09:56:05 CET 2006

> I'm not worried about centralised taxonomy, I'm simply wondering who is
> going to do all this work of deciding what GUID gets allocated for,
> say, a name

The same people who would do the work in your scenario.  It's just that they
would check to see if the name(s) had already been assigned GUIDs before
assigning new ones.
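
A minimal sketch of that "check before assigning" step, under stated assumptions: the NameGuidRegistry class, its in-memory dictionary, the whitespace/case normalization, and the placeholder names "Aus bus" and "Aus cus" are all hypothetical illustrations, not an existing service or anyone's actual implementation.

```python
import uuid


class NameGuidRegistry:
    """Hypothetical shared registry holding one GUID per name string."""

    def __init__(self):
        # Maps a normalized Name+Author string to its GUID.
        self._guid_by_name = {}

    @staticmethod
    def _normalize(name):
        # Assumption: collapsing whitespace and ignoring case is enough to
        # recognise two strings as the same name. Real matching is harder.
        return " ".join(name.split()).casefold()

    def get_or_assign(self, name):
        """Return the existing GUID for a name, minting one only if absent."""
        key = self._normalize(name)
        if key not in self._guid_by_name:
            self._guid_by_name[key] = str(uuid.uuid4())
        return self._guid_by_name[key]


registry = NameGuidRegistry()
first = registry.get_or_assign("Aus bus Smith, 1900")
second = registry.get_or_assign("aus  bus Smith, 1900")  # same name, messier string
other = registry.get_or_assign("Aus cus Jones, 1950")
```

Here the second call reuses the GUID minted by the first, so no duplicate is created for the same name.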

> Yes, in some cases things are simple. For example, we could simply ask
> uBio to store every name string (which is pretty much what they are
> doing already), and use their ids as the basis of name GUIDs. But
> mapping between some of the "higher-level" name databases is not
> trivial.

I agree it's not trivial -- which is why I think it would best be done as a
coordinated effort, rather than multiple independent efforts -- so that
we're not stuck with the non-trivial task of managing all those duplicate
GUIDs.  And in my ideal world, the full dataset would not live on only one
server.  It would live in replicated form across hundreds of servers, with
robust synchronization protocols.

> Are IPNI and MOBOT going to sit down and go through their databases and
> match things up, are we then going to do the same thing with IPNI,
> MOBOT, NCBI  and TreeBASE?

We would have to do this iteratively like that if they all created their
own public GUIDs for the same set of names.  If they were coordinated about
it, each dataset would have to be mapped to a single common GUID-set only
once.
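
The arithmetic behind that can be sketched as follows, using the four databases named above. The variable names and the "common GUID set" label are illustrative assumptions only.

```python
from itertools import combinations

providers = ["IPNI", "MOBOT", "NCBI", "TreeBASE"]

# Uncoordinated: every provider's GUIDs must be matched against every other's.
pairwise_mappings = list(combinations(providers, 2))

# Coordinated: each provider is mapped once to a single common GUID set.
common_mappings = [(provider, "common GUID set") for provider in providers]

print(len(pairwise_mappings))  # 6
print(len(common_mappings))    # 4
```

With n providers the pairwise count grows as n(n-1)/2, while the coordinated count grows as n, so the gap widens as more datasets join.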

> Will we wait until this is done before
> assigning GUIDs? And given that mapping between databases can be
> contentious (is this name really the same as that name, how do we know,
> etc.) -- and I should point out that current attempts to do this, such
> as NCBI's LinkOut which uses names are riddled with errors -- it seems
> this is knowledge that will evolve over time.

I guess my point is that the Name+Author strings alone (sans GUIDs) serve
almost the same function as individual-provider-assigned GUIDs would, in
the case of taxon names.  It seems to me that the GUIDs only become really
useful when they are used to unambiguously cross-link multiple datasets
together.  So I don't see how publicly exposing provider-assigned GUIDs
(with multiple redundancies) does anything useful for us beyond what we
already can do when the providers expose Name+Author strings -- that is, the
GUIDs really only serve as an extra layer.

> In the same vein, I suggest that we are likely to make more progress
> if we have resolvable GUIDs now so that major data sources open their
> data up, then we use data mining tools to go in and find mappings,
> inconsistencies, etc. Many of these things can be computed, i.e. can be
> automated. Being open could encourage anybody to have a go at examining
> mappings.
>
> I'm probably being wildly naive, but I think concern for getting it
> "right" might get in the way of getting it "done".

Fair enough -- I don't disagree with that concern at all (in fact I share it
quite strongly -- I'm a huge fan of getting it done). Any passion I may have
expressed in this or recent posts is a passion not to be misunderstood --
not a passion for centralized/coordinated GUIDs for shared data objects.
The only thing I'm adamant about is keeping an open mind about this stuff,
and keeping my personal biases in check.

> Ducks incoming flames/brickbats/etc.

Definitely not from me! As I have said before, I think the amount we agree
on vastly exceeds what we disagree on -- and even where we disagree, we
really only lie at different positions on the same continuum.

Aloha,
Rich