Re: Globally Unique Identifier & Donald Hobern's PPT
- The 'GUID combination' is not enforced and therefore not always used.
- Some collections belong to 2 or more Institutions or to none.
- If part of the collection moves to another institute, the GUID combination is changed for that part.
- The InstitutionCode should be unique, and providers were asking what to do if the code they wanted to use was already chosen, and who decides which institute may use an InstitutionCode if two institutes want to use it. There is no body responsible for that and there are no rules: the first Institute can claim a code, or the biggest or the most well known??
- In different science areas different InstitutionCodes within one Organisation were in use; which one to choose?
- This 'GUID' can only be used for specimens, not for other life science objects.
Wholeheartedly agree on all counts!!! That's why I still see it as a "soft" ID (even with enforcement of unique registered Institution Codes, and enforced uniqueness of CollectionCode+CatalogNumber within a single InstitutionCode). It's a stop-gap to solve some of the problems, until a real GUID system is up, running, and broadly adopted.
Now let's look at LSID syntax:
urn:lsid:authority:namespace:object_identifier(:revision_number)
About the first part, authority: it is natural to want this to be unique. Therefore we can expect the same problems as mentioned above, plus a lack of clarity about the difference between issuing_authority and current_authority for the data.
As to uniqueness, I think that's (part of) the point of using a URL, instead of just an institution name or abbreviation. URLs seem to be effectively unique. As to the confusion about "issuing_authority" vs. "current_authority", count me among the befuddled. My interpretation of Dave Vieglais' posts was that the "Authority" URL was assumed to be the URL where the GUID is resolved to the data it represents. But Bob Morris' posts suggest otherwise ("The authority name is the /issuing/ authority. It's an authority for the LSID, not for its resolution or the underlying data."). Perhaps I misunderstood Dave's post? My primary concern about LSIDs is that (I thought) the URL used for the "authority" portion of the LSID must be live, online, active, and perpetual in order to resolve the data. If this is not the case (i.e., if, as Bob says, it is only intended to indicate the *issuer*, not the current authority), then my concerns about LSIDs are greatly reduced.
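To make the syntax under discussion concrete, here is a minimal sketch that splits an LSID into the parts named above. The names `Lsid` and `parse_lsid` and the example identifier are hypothetical; the sketch follows only the colon-separated layout quoted in this thread and makes no attempt to validate against the OMG specification.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    """The parts of urn:lsid:authority:namespace:object_identifier(:revision)."""
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text: str) -> Lsid:
    """Split an LSID string into its parts (hypothetical helper)."""
    parts = text.split(":")
    # Expect "urn", "lsid", three required parts, and an optional revision.
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    authority, namespace, object_id = parts[2:5]
    if not (authority and namespace and object_id):
        raise ValueError(f"empty LSID component in {text!r}")
    # An absent or empty trailing part means "no revision".
    revision = parts[5] or None if len(parts) == 6 else None
    return Lsid(authority, namespace, object_id, revision)


# Made-up example identifier, for illustration only.
example = parse_lsid("urn:lsid:example.org:specimens:12345:2")
```

Note that the parser treats the authority as an opaque string. Whether that string names the issuer or the place where the data currently resolve is exactly the question raised above, and the syntax alone does not answer it.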
The problems with authority are important for the involved authorities only, not for the rest of the life science community.
Agreed!! And further, the authority makes sense for "local" or "owned" data (e.g., specimens, and attributes thereof), but not for "public" data (e.g., taxa, and attributes thereof).
So discussions about it and establishing an authority that takes decisions in political conflicts are a waste of time. We can solve it by using a unique number only and maintaining a list that gives information about each number. It should be clear that these are only the initial issuing authority/authorities.
Agreed for sure on the last sentence! And if I interpret the penultimate sentence correctly, then full agreement there as well. I had started writing a response to Bob Morris' post last night, but it got too late so I didn't finish it. It included the following:
*************************************************** Bob Morris wrote:
I find nothing in the LSID current(?) proposed recommendation http://www.omg.org/cgi-bin/doc?lifesci/2003-12-02 that specifies that the "object identification" part of an LSID may not be a string of hex digits and dashes. (Though I continue to fail to see why people are so in love with these).
I'm not particularly enamored by MACs or hex strings over other forms of unique identifiers, but I WOULD like to see a protocol for bioinformatics LSID generation that decreased the possibility of duplicate <ObjectID> portions of the LSIDs from different issuers to effectively zero. Maybe there is no meaningful technical reason to do this -- but I can't help but feel that if LSIDs are later determined not to be ideal for bioinformatics purposes, it may prove to be DAMN useful that the <ObjectID> portion alone serves effectively as its own gUid (emphasis on "U" intentional), even stripped of the remaining LSID components. ***************************************************
Wouter: Is this what you meant by using a unique number?
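One way to get an <ObjectID> that serves as its own identifier, even when stripped of the rest of the LSID, is a randomly generated UUID. The sketch below assumes Python's standard `uuid` module; the helper names and the authority and namespace values are hypothetical, and nothing in this thread says LSIDs must be generated this way.

```python
import uuid
from typing import Optional


def new_object_id() -> str:
    """Return a random (version 4) UUID as a string of hex digits and dashes."""
    return str(uuid.uuid4())


def make_lsid(authority: str, namespace: str, object_id: Optional[str] = None) -> str:
    """Build an LSID string; a fresh object ID is generated when none is given."""
    if object_id is None:
        object_id = new_object_id()
    return f"urn:lsid:{authority}:{namespace}:{object_id}"


def object_id_of(lsid: str) -> str:
    """Extract the object ID, which contains dashes but no colons."""
    return lsid.split(":")[4]


# Hypothetical issuer and namespace, for illustration only.
lsid = make_lsid("example.org", "specimens")
bare_id = object_id_of(lsid)
```

Because the object ID is generated without reference to the issuer, two issuers choosing the same value is vanishingly unlikely, and `bare_id` remains usable as an identifier if the surrounding LSID components are later dropped.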
I also agree with Wouter (in response to Kevin Richards' concern about single-server bottleneck) that this should not be a concern. I seriously doubt that the sum total of all Life Sciences data calls would ever even approach the level of calls that Google receives. True, there will likely not be the same sort of money backing a life sciences GUID system that is behind Google's server farm -- but even still, given the fundamental importance such a system would have to such a wide variety of people receiving such a large chunk of grant money, I can certainly imagine justifying money on the order of 6 or 7 figures ($/euro) for such a server system, and I imagine that would be ample money by today's technological benchmarks (or, more likely, the benchmark a few more cycles of Moore's Law hence, which is probably when a system of this sort would start to receive really high volume requests) to get a system capable of meeting the demand. I think the real bottlenecks are sociological (i.e., getting these damned fickle life science practitioners to agree on anything), not technological.
The last part; revision id: whether you need it depends: do you give the physical objects a GUID or the data records?
Agreed! Another excerpt from my attempted reply to Bob Morris:
*************************************************** Bob Morris wrote:
This brings up a point I may have missed in this discussion: LSIDs are designed to give an identifier to /data/ not to physical objects. This probably means that a fully compliant use of LSIDs for "specimens" will be assigned to specimen records, not to specimen objects. This is good, because if a physical object is moved it presumably gets a new specimen record, which record, perhaps, has some metadata that tells the curation history of the underlying object, including its previous LSIDs. Seems like the analog of taxonomic synonymy to me....
I disagree -- as I stated before, I strongly feel that the number should be applied to the physical *object* (or virtual representation thereof -- such as an original description), not the electronic data record. If Bishop Museum sends a specimen to Smithsonian, the specimen now curated at Smithsonian is not, to my mind, a synonym of its earlier life at Bishop Museum. Nor is its data. Both the specimen, and its associated data, should be considered as fixed into perpetuity. As more and more data are transferred from the physical to an electronic form, those data should be associated to the same GUID for the object -- not multiple versions of the electronic representation of data related to that object. If the ID *must* be tied to the data record, instead of the object, then I take back everything I said earlier about versioning. In the data-centric scenario, versioning becomes absolutely *vital*. Personally, I think trying to manage GUIDs as record identifiers, rather than object identifiers, would introduce unnecessary and excessive complexity. Biologists are interested in the objects. The data records are just a convenient mechanism of information conveyance -- not important entities unto themselves. This may not apply to all aspects of Life Sciences, but I think it should apply to objects we're discussing here (specimens, taxa, references, etc.). ***************************************************
With the first choice you do not need a revision number because the physical object will not change (or do they with living collections?).
Living collections do change, and so do unvouchered observations, and so do records representing populations (rather than specific physical organisms). And even preserved specimens change over time (tracking condition, preservation status, etc.). But I think the GUID should be fixed to the physical object, and that whatever dynamic properties of that object are worth recording over time should be associated back to the physical object.
Where things get more complicated is how to define the "object". To some collections, the unit of "Object" may be multi-taxon/multi-specimen (e.g., fossil), or single-taxon/multi-specimen (e.g., lot), or single-taxon/single-specimen (individual specimen), or single-taxon/partial-specimen (a part of a specimen, like a skeleton vs. a skin). Single "objects" of any of these sorts may be fractioned (e.g., Isotypes, or simply splitting up a multi-specimen lot to send out to different institutions). So, one important question in such cases is whether one of the "child" objects retains the GUID of the "parent" object, and new GUIDs are assigned only to the remaining "child" objects (the way Linnaean taxonomy works for taxonomic concepts, and the way most institutions deal with catalog numbers for specimens). Or, do *all* child objects receive new GUIDs, each referring back to an historical "parent" object that no longer exists? The temptation is to support the latter, but in this case, what of a specimen that partially deteriorates and a portion of it is destroyed, rather than sent to a different institution? Logically, the remaining specimen should be treated no differently than it would have been if the deteriorated portion had instead been sent to a different institution, rather than destroyed, and hence receive a new GUID. But that's starting to sound an awful lot like condition monitoring. Perhaps this distinction should be left to the discretion of the GUID issuer/Object owner on a case-by-case basis? (Yikes! Inconsistency!) Or, perhaps this is where versioning comes in (where the versions are actual object versions, not electronic data versions)? This seems like a more complicated problem than the ones we have been discussing so far.
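The two options for fractioned objects can be put side by side in a small sketch. Everything here is hypothetical: the `ObjectRecord` fields, the function names, and the use of random UUIDs as GUIDs are assumptions made only to illustrate the choice, not a proposal for how any collection system actually works.

```python
import uuid
from dataclasses import dataclass, replace
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ObjectRecord:
    guid: str
    description: str
    parent_guid: Optional[str] = None  # GUID of the object this was split from
    retired: bool = False              # True once the object no longer exists


def _new_guid() -> str:
    return str(uuid.uuid4())


def split_keep_parent(parent: ObjectRecord, parts: List[str]) -> List[ObjectRecord]:
    """Option 1: the first part retains the parent's GUID; the rest get new ones."""
    if not parts:
        raise ValueError("need at least one part")
    kept = replace(parent, description=parts[0])
    others = [ObjectRecord(_new_guid(), d, parent_guid=parent.guid) for d in parts[1:]]
    return [kept] + others


def split_all_new(parent: ObjectRecord, parts: List[str]) -> Tuple[ObjectRecord, List[ObjectRecord]]:
    """Option 2: the parent is retired and every part gets a new GUID."""
    if not parts:
        raise ValueError("need at least one part")
    retired_parent = replace(parent, retired=True)
    children = [ObjectRecord(_new_guid(), d, parent_guid=parent.guid) for d in parts]
    return retired_parent, children


# Made-up lot, split two ways for comparison.
lot = ObjectRecord("lot-1", "lot of 3 specimens")
option1 = split_keep_parent(lot, ["2 specimens", "1 specimen sent on loan"])
retired_lot, option2 = split_all_new(lot, ["2 specimens", "1 specimen sent on loan"])
```

Under option 2, a specimen that loses a portion to deterioration would, by the same logic, be passed through `split_all_new` with a single remaining part, which is where the approach starts to resemble condition monitoring.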
If a GUID for a 'physical object' is chosen, a thing like a species name or author name or country should not get a GUID.
I disagree. For species names, the GUID would apply to the name's original description/creation event. Metadata for such never change -- they can only be corrected. For author names, I would argue that the object to which the GUID is applied should be thought of as the *name* of the author as a virtual physical object; not the author as a physical object. Multiple AuthorName objects could be linked to each other via an Alias scheme, and/or tied to a common "Person" (which could either be a separate GUID namespace, or be defined as a set of linked AuthorName objects). In the context of biological objects, place descriptors of all sorts are really just surrogates to defined two(three?)-dimensional physical spaces. The GUIDs for such should primarily be established for the physical space, not the name or other descriptors applied to that space. The GEOnet Names Server (GNS; http://earth-info.nima.mil/gns/html/index.html) seems to me to be a useful model to follow. They have two ID numbers for each record. One is the "Unique Feature Identifier" (UFI): "A number which uniquely identifies the feature [=place].", and the other is the "Unique Name Identifier" (UNI): "A number which uniquely identifies a name.".
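The two-identifier idea can be sketched as a small lookup structure in which every name has its own ID and points at the ID of the place it names. This is only an illustration of the UFI/UNI distinction described above: the class names, fields, and ID values are hypothetical and are not the actual GNS schema or data.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class PlaceName:
    uni: int   # identifies this particular name
    ufi: int   # identifies the place (feature) the name refers to
    text: str


class Gazetteer:
    """Toy registry: many names (UNI) may point at one place (UFI)."""

    def __init__(self) -> None:
        self._names: Dict[int, PlaceName] = {}

    def add_name(self, uni: int, ufi: int, text: str) -> None:
        if uni in self._names:
            raise ValueError(f"UNI {uni} is already assigned")
        self._names[uni] = PlaceName(uni, ufi, text)

    def feature_for_name(self, uni: int) -> int:
        return self._names[uni].ufi

    def names_for_feature(self, ufi: int) -> List[str]:
        # Names are returned in the order they were registered.
        return [n.text for n in self._names.values() if n.ufi == ufi]


# Made-up identifiers, for illustration only.
gaz = Gazetteer()
gaz.add_name(uni=1, ufi=100, text="Hawaii")
gaz.add_name(uni=2, ufi=100, text="Hawai'i")
gaz.add_name(uni=3, ufi=200, text="Oahu")
```

The same shape would fit the AuthorName case: each AuthorName object keeps its own identifier, and an alias link or shared "Person" identifier plays the role that the UFI plays here.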
The point is, these representations to which I think GUIDs should be applied are effectively permanent/persistent.
Aloha, Rich
Richard L. Pyle, PhD
Natural Sciences Database Coordinator, Bishop Museum
1525 Bernice St., Honolulu, HI 96817
Ph: (808)848-4115, Fax: (808)847-8252
email: deepreef@bishopmuseum.org
http://www.bishopmuseum.org/bishop/HBS/pylerichard.html