I certainly agree that collaborative assignment of identifiers is one of the most critical issues we need to address in our discussions.
Here are some of my thoughts attempting to approach the problem.
If identifier I1 refers to data object O1 and identifier I2 refers to data object O2, our minimal requirement for GUIDs would be that:
A) I1 == I2 --> O1 == O2 (for all relevant purposes)
We need to decide under which circumstances we want this to be a bidirectional implication, i.e. that:
B) O1 == O2 --> I1 == I2
I believe that we will need to address this question separately for different classes of data. We need to relate the objects we identify to clearly defined data classes. If a nomenclatural GUID is actually the identifier for a record in a nomenclatural database rather than for the associated nomenclatural event, we may not have to worry about getting IPNI and MOBOT to use the same identifier for their records. The possible downside is that such an approach is an unambitious one that only solves a subset of our problems.
We should consider what it would take for us to devise a system that could support (enforce?) bidirectional inference in those cases in which it matters to us. It seems pretty clear to me that such a system would operate by layering further standards and best practices on top of the actual identifier system.
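As a rough illustration of conditions A and B above, the following Python sketch checks both against a list of identifier assignments. It assumes that "O1 == O2 (for all relevant purposes)" can be reduced to comparing some agreed object key, which is exactly the part that would need per-class standards. The function name, the key strings and the "guid:" values are hypothetical, chosen only for this example.

```python
from collections import defaultdict


def check_guid_properties(assignments):
    """Check conditions A and B over (identifier, object_key) pairs.

    object_key is an assumed stand-in for "the same object for all
    relevant purposes"; deciding what that key is lies outside this sketch.
    Returns (violations_a, violations_b).
    """
    objects_by_id = defaultdict(set)
    ids_by_object = defaultdict(set)
    for identifier, object_key in assignments:
        objects_by_id[identifier].add(object_key)
        ids_by_object[object_key].add(identifier)

    # A) I1 == I2 --> O1 == O2: an identifier must not point at two objects.
    violations_a = {i: objs for i, objs in objects_by_id.items() if len(objs) > 1}
    # B) O1 == O2 --> I1 == I2: an object must not carry two identifiers.
    violations_b = {o: ids for o, ids in ids_by_object.items() if len(ids) > 1}
    return violations_a, violations_b


# Hypothetical assignments: two providers issue their own identifier
# for the same name object.
assignments = [
    ("guid:1", "Aus bus Smith 1955"),
    ("guid:2", "Aus bus Smith 1955"),
    ("guid:3", "Aus xus Brown 1935"),
]
violations_a, violations_b = check_guid_properties(assignments)
```

Here condition A holds (no identifier is reused for different objects), while condition B fails for "Aus bus Smith 1955", which carries two identifiers. An identifier system alone can guarantee A; B depends on the agreed notion of object equality for each data class.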
In general we need to think hard about management of GUIDs for each data class. I suggest that the goal of these workshops is to allow us to select a framework of identifiers that will meet our needs, but that TDWG and others should then develop applicability statements which define exactly how we expect to use them in different contexts. These statements would in each case also make explicit the semantic and operational characteristics that a provider is expected to support.
Donald
---------------------------------------------------------------
Donald Hobern (dhobern@gbif.org)
Programme Officer for Data Access and Database Interoperability
Global Biodiversity Information Facility Secretariat
Universitetsparken 15, DK-2100 Copenhagen, Denmark
Tel: +45-35321483 Mobile: +45-28751483 Fax: +45-35321480
---------------------------------------------------------------
-----Original Message-----
From: Taxonomic Databases Working Group GUID Project [mailto:TDWG-GUID@LISTSERV.NHM.KU.EDU] On Behalf Of Roderic Page
Sent: 30 January 2006 13:37
To: TDWG-GUID@LISTSERV.NHM.KU.EDU
Subject: Re: How GUIDs will be used
I'm not worried about centralised taxonomy, I'm simply wondering who is going to do all this work of deciding what GUID gets allocated for, say, a name (and yes, we DO need GUIDs for names).
Yes, in some cases things are simple. For example, we could simply ask uBio to store every name string (which is pretty much what they are doing already), and use their ids as the basis of name GUIDs. But mapping between some of the "higher-level" name databases is not trivial.
Are IPNI and MOBOT going to sit down and go through their databases and match things up? Are we then going to do the same thing with IPNI, MOBOT, NCBI and TreeBASE? Will we wait until this is done before assigning GUIDs? And given that mapping between databases can be contentious (is this name really the same as that name, how do we know, etc.) -- and I should point out that current attempts to do this, such as NCBI's LinkOut, which uses names, are riddled with errors -- it seems this is knowledge that will evolve over time.
In the same vein, I suggest that we are likely to make more progress if we have resolvable GUIDs now so that major data sources open their data up, then we use data mining tools to go in and find mappings, inconsistencies, etc. Many of these things can be computed, i.e. can be automated. Being open could encourage anybody to have a go at examining mappings.
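To make "can be computed" a little more concrete, here is a minimal Python sketch of one such mining step: grouping records from different providers whose name strings agree after a crude normalisation. The normalisation rule, the record layout and the identifiers shown are all assumptions made for illustration; real matching would need far more care, and the output is a list of candidates to review, not a set of asserted mappings.

```python
from collections import defaultdict


def normalize(name):
    """Crude normalisation (an assumption): lower-case, drop commas,
    collapse runs of whitespace."""
    return " ".join(name.lower().replace(",", " ").split())


def candidate_mappings(records):
    """Group (provider, guid, name_string) records by normalised name.

    Returns only the groups that span more than one provider, i.e.
    candidate cross-database mappings that still need checking.
    """
    groups = defaultdict(list)
    for provider, guid, name in records:
        groups[normalize(name)].append((provider, guid))
    return {
        key: members
        for key, members in groups.items()
        if len({provider for provider, _ in members}) > 1
    }


# Hypothetical records; the identifiers are invented for this example.
records = [
    ("IPNI", "ipni:1", "Aus bus Smith, 1955"),
    ("MOBOT", "mobot:9", "Aus  bus Smith 1955"),
    ("IPNI", "ipni:2", "Aus xus Brown 1935"),
]
candidates = candidate_mappings(records)
```

With these records the only candidate is the pair ipni:1 / mobot:9. Anyone with access to the resolvable records could run a comparison like this and publish the results, including the inconsistencies it turns up.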
I'm probably being wildly naive, but I think concern for getting it "right" might get in the way of getting it "done".
Ducks incoming flames/brickbats/etc.
Regards
Rod
On 30 Jan 2006, at 11:59, Richard Pyle wrote:
Hi Rod,
To me centralisation is red rag to a bull, especially as the objects of interest (names and concepts) are things we might reasonably disagree over.
Please don't misunderstand what I'm talking about here. Of course we might reasonably disagree over which names to regard as valid and which to regard as synonyms. We will also disagree about the scope of organisms to include within the circumscription of a taxon concept. However, in most cases, we will not disagree that Smith (1955) described the species "bus", and placed it in the genus "Aus" (i.e., the taxon name object "Aus bus Smith 1955"); or that Jones (1975) regarded Smith's "Aus bus" as a junior synonym of Brown's (1935) "Aus xus" (i.e., the taxon concept object "Aus xus Brown 1935 SEC Jones 1975"; the circumscription of which includes the taxon concept object "Aus bus Smith 1955 SEC Smith 1955").
Centralizing the issuance of GUIDs for things like taxon name objects and concepts/usage instances does NOT, in any way, centralize "taxonomy". It simply serves to avoid issuing 150 different GUIDs for the taxon name object "Aus bus Smith 1955" -- one GUID from each of 150 different data providers that happen to list that name in their taxonomic authority table.
Why not let users decide this, by which I mean, if a provider comes up with a comprehensive list of names with good supporting metadata, users will gravitate towards using them. There will also be a "market" for people building services that map between GUIDs (I'm thinking of making one for TreeBASE, for example). Why centralise this activity?
So that we don't need a "market" for services to cross-map duplicate GUIDs that never needed to be created in the first place. Instead, we should "market" services that utilize a common/shared set of GUIDs for objective name objects (and concept/usage objects) to assist *taxonomy*. (And, in the shorter term, market tools that allow data providers to cross-map their internal taxonomic authorities to shared GUIDs.)
We certainly can't eliminate duplicates, but at least we can try to minimize the unnecessary duplicates. I spend an inordinate chunk of my time doing two things that I should not have to do: 1) cross-mapping large datasets to a common shared authority (like taxon names); and 2) cleaning up the database messes created by earlier workers who were pressed for time, and opted for the quick & dirty solution.
Frankly, I'm not sure why we even need GUIDs for things like Taxon Names, other than to mitigate these two kinds of problems. I thought the point was to facilitate electronic information flow. How have we facilitated electronic information flow if you assign one GUID for "Aus bus Smith 1955", and I assign another GUID to the same taxon name, and a pair of human eyes is required to ascertain that they are, indeed, two pointers to the same abstract data object?
I see the point that multiple GUIDs for the same thing can be a pain (for papers we have DOIs, PubMed ids, Google Scholar ids, DSpace handles, etc.), but in the end centralised GUID assignment reeks of committees, etc., in other words, impediments to actually getting things done.
Again, please do not confuse the idea of centralized (or at least coordinated) issuance of GUIDs for unambiguously shared data objects (like taxon name objects), with some sort of ill-advised centralized effort for a "shared taxonomy". I have not seen anybody in recent years even suggest the possibility of the latter.
I agree that software tools to "cross-walk multiple independent datasets with broadly overlapping data objects" would be very nice, but let's separate this from centralising GUID assignment. One of the lessons of the web, IMHO, is that centralisation doesn't scale.
You can't scale much bigger than the global pool of IP addresses, which are, ultimately, issued in blocks in a coordinated, semi-centralized way (not altogether unlike a model of GUID issuance that I have previously suggested).
Aloha, Rich
Richard L. Pyle, PhD Database Coordinator for Natural Sciences Department of Natural Sciences, Bishop Museum 1525 Bernice St., Honolulu, HI 96817 Ph: (808)848-4115, Fax: (808)847-8252 email: deepreef@bishopmuseum.org http://hbs.bishopmuseum.org/staff/pylerichard.html
------------------------------------------------------------------------
Professor Roderic D. M. Page
Editor, Systematic Biology
DEEB, IBLS
Graham Kerr Building
University of Glasgow
Glasgow G12 8QP
United Kingdom
Phone: +44 141 330 4778 Fax: +44 141 330 2792 email: r.page@bio.gla.ac.uk web: http://taxonomy.zoology.gla.ac.uk/rod/rod.html reprints: http://taxonomy.zoology.gla.ac.uk/rod/pubs.html
Subscribe to Systematic Biology through the Society of Systematic Biologists Website: http://systematicbiology.org Search for taxon names at http://darwin.zoology.gla.ac.uk/~rpage/portal/ Find out what we know about a species at http://ispecies.org