I'm not worried about centralised taxonomy; I'm simply wondering who is going to do all this work of deciding what GUID gets allocated for, say, a name.
The same people who would do the work in your scenario. It's just that they would check to see if the name(s) had already been assigned GUIDs before assigning new ones.
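To make that check-before-assign step concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the in-memory `registry` dict stands in for whatever shared store would actually be used, `normalize` and `get_or_assign_guid` are hypothetical names, random UUIDs stand in for whatever GUID scheme is chosen, and "Aus bus" is a placeholder name. Real name matching is much harder than folding case and whitespace.

```python
import uuid

# Hypothetical shared registry: normalized Name+Author string -> GUID.
registry = {}

def normalize(name_author):
    """Collapse whitespace and case so trivially different strings match."""
    return " ".join(name_author.split()).lower()

def get_or_assign_guid(name_author):
    """Return the existing GUID for a name, or mint a new one if none exists."""
    key = normalize(name_author)
    if key not in registry:
        # Assumption: random UUIDs stand in for the real GUID scheme.
        registry[key] = str(uuid.uuid4())
    return registry[key]

# Two providers submit the same name with different spacing and case.
first = get_or_assign_guid("Aus bus Smith, 1900")
second = get_or_assign_guid("aus  bus  Smith, 1900")
```

Both calls return the same GUID, so the second provider reuses the identifier rather than creating a duplicate.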
Yes, in some cases things are simple. For example, we could simply ask uBio to store every name string (which is pretty much what they are doing already), and use their ids as the basis of name GUIDs. But mapping between some of the "higher-level" name databases is not trivial.
I agree it's not trivial -- which is why I think it would best be done as a coordinated effort, rather than multiple independent efforts -- so that we're not stuck with the non-trivial task of managing all those duplicate GUIDs. And in my ideal world, the full dataset would not live on only one server. It would live in replicated form across hundreds of servers, with robust synchronization protocols.
Are IPNI and MOBOT going to sit down and go through their databases and match things up? Are we then going to do the same thing with IPNI, MOBOT, NCBI and TreeBASE?
We would have to do this iteratively like that if they all created their own public GUIDs for the same set of names. If they were coordinated about it, each dataset would have to be mapped to a single common GUID-set only once.
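A quick back-of-the-envelope count shows the difference. This is only a sketch under a simplifying assumption: every pair of providers needs its own mapping in the uncoordinated case, and each provider needs exactly one mapping to the common GUID-set in the coordinated case. The function names are made up for illustration.

```python
def pairwise_mappings(n):
    """Mappings needed if every provider is matched against every other."""
    return n * (n - 1) // 2

def hub_mappings(n):
    """Mappings needed if each provider maps once to a common GUID-set."""
    return n

providers = ["IPNI", "MOBOT", "NCBI", "TreeBASE"]
n = len(providers)
print(pairwise_mappings(n), hub_mappings(n))  # prints: 6 4
```

With four providers the gap is small (6 versus 4), but the pairwise count grows quadratically, so it widens quickly as more providers join.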
Will we wait until this is done before assigning GUIDs? And given that mapping between databases can be contentious (is this name really the same as that name, how do we know, etc.) -- and I should point out that current attempts to do this, such as NCBI's LinkOut, which uses names, are riddled with errors -- it seems this is knowledge that will evolve over time.
I guess my point is that the Name+Author strings alone (sans GUIDs) serve almost the same function as individual-provider-assigned GUIDs would, in the case of taxon names. It seems to me that the GUIDs only become really useful when they are used to unambiguously cross-link multiple datasets together. So I don't see how publicly exposing provider-assigned GUIDs (with multiple redundancies) does anything useful for us beyond what we already can do when the providers expose Name+Author strings -- that is, the GUIDs really only serve as an extra layer.
In the same vein, I suggest that we are likely to make more progress if we have resolvable GUIDs now so that major data sources open their data up, then we use data mining tools to go in and find mappings, inconsistencies, etc. Many of these things can be computed, i.e. can be automated. Being open could encourage anybody to have a go at examining mappings.
I'm probably being wildly naive, but I think concern for getting it "right" might get in the way of getting it "done".
Fair enough -- I don't disagree with that concern at all (in fact I share it quite strongly -- I'm a huge fan of getting it done). Any passion I may have expressed in this or recent posts is a passion not to be misunderstood -- not a passion for centralized/coordinated GUIDs for shared data objects. The only thing I'm adamant about is keeping an open mind about this stuff, and keeping my personal biases in check.
Ducks incoming flames/brickbats/etc.
Definitely not from me! As I have said before, I think the amount we agree on vastly exceeds what we disagree on -- and even where we disagree, we really only lie at different positions on the same continuum.
Aloha, Rich
Richard Pyle