Some comments in text below.
Dave V.
Richard Pyle wrote:
I want to start by wholeheartedly endorsing Wouter's plea for non-information-bearing (meaningless) GUIDs. This feature is CRITICAL to the long-term success of any GUID system. It is absolutely imperative that there NEVER be any motivation to change the content of a GUID (i.e., it should be permanent). If the GUID itself contains any information whatsoever, there may be motivation to change that information at a later time.
I have to disagree, kind of. A non-information-bearing GUID such as one generated from a MAC address, e.g.
{92AB5B37-70E9-4f05-9E97-CBABD08513ED}
is completely useless unless it only appears within the context of a system that provides more information about what it actually is. That's the point of the LSID or DOI: they provide GUIDs that identify which system can be used to resolve them. If GUIDs for names or specimens or whatever are to be used in other systems, then it is essential that the GUID can be associated with a resolving system.
For this reason, I had initially preferred the DOI approach, but over time, I am gradually warming up to the LSID approach. While components of an LSID do, indeed, represent information, they represent the one piece of information that I think may legitimately belong embedded within a GUID: context. That is, the context, or domain, of the GUID itself. The context in this case would be the "issuer" of the GUID -- not necessarily the current "owner" of the GUID (see more discussion on this below). Though the organization that issued a GUID may eventually disappear, the fact that the organization was the one to issue the GUID in the first place will never change, and thus represents a permanent and unchanging component of the GUID. Without the context portion, the GUID itself is really nothing more than a random string of characters. In summary, I'm warming up to the LSID approach because it represents embedded context, without the risk of temptation to change the content of a GUID after it has been issued.
Both the DOI and LSID approaches are structured and provide context. The DOI system uses the NISO Z39.84-2000 standard for categorization; the LSID uses the domain name system. Both provide a context essential for reuse of an identifier outside its original context.
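To make the "embedded context" point concrete, here is a minimal sketch of how an LSID splits into its components. It assumes the usual layout urn:lsid:<authority>:<namespace>:<object>[:<revision>]; the LSID class and the parse_lsid helper are hypothetical names used only for illustration, and the function is not a full validator.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str            # issuer's domain name, e.g. "bishopmuseum.org"
    namespace: str            # object class, e.g. "Specimen"
    object_id: str            # identifier local to authority + namespace
    revision: Optional[str]   # optional revision, e.g. "1.1"


def parse_lsid(text: str) -> LSID:
    """Split an LSID string into its parts (illustrative only)."""
    parts = text.split(":")
    is_lsid = (
        len(parts) in (5, 6)
        and parts[0].lower() == "urn"
        and parts[1].lower() == "lsid"
    )
    if not is_lsid:
        raise ValueError(f"not an LSID: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return LSID(parts[2], parts[3], parts[4], revision)


# Both example identifiers are taken from this thread.
specimen = parse_lsid("urn:lsid:bishopmuseum.org:Specimen:1234567")
name = parse_lsid("urn:lsid:ipni.org:TaxonName:82090-3:1.1")
print(specimen.authority, specimen.namespace, specimen.object_id)
print(name.object_id, name.revision)
```

A bare GUID such as {92AB5B37-70E9-4f05-9E97-CBABD08513ED} is rejected by this sketch, because nothing in the string says which authority could resolve it.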
Regarding Donald's PPT file, I have a couple of comments and questions: (Assumes Title slide is "Slide 1")
Slide 2: You note there is "No reliable mechanism" to relate the same record from different providers to each other. But in the context of DarwinCore, the combination of [InstitutionCode]+[CollectionCode]+[CatalogNumber] should represent a virtual GUID (provided that the Global Provider Registry ensures no duplication of [InstitutionCode]). I do realize that words like "should" and "reliable" are critical here. Perhaps the DarwinCore implementation should enforce the requirement of uniqueness of [CollectionCode]+[CatalogNumber] within a single [InstitutionCode], and further ensure globally unique [InstitutionCode] values via the Global Provider Registry.
This was one of the first recommendations to GBIF: to provide a registry of institution codes for exactly this purpose. Having a tool that verified the uniqueness of records within a collection as exposed by its provider (either BioCASE or DiGIR) would help with this uniqueness problem. Now that the UDDI registry is available, we could in theory use the institution identifiers in there.
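A verification tool of that kind could be quite small. The following is only a sketch, assuming records arrive as dictionaries keyed by the DarwinCore field names; triplet_key and find_duplicates are hypothetical helpers, the sample records are invented, and the normalization rule (trim and upper-case the codes) is an assumption rather than anything the registry defines.

```python
from collections import Counter


def triplet_key(record):
    """Build the 'virtual GUID' from the three DarwinCore fields.

    Trimming and upper-casing the codes is an assumption made here so that
    'aaa' and 'AAA ' compare equal; a real registry would set its own rules.
    """
    return (
        record["InstitutionCode"].strip().upper(),
        record["CollectionCode"].strip().upper(),
        record["CatalogNumber"].strip(),
    )


def find_duplicates(records):
    """Return every triplet that occurs more than once, in sorted order."""
    counts = Counter(triplet_key(r) for r in records)
    return sorted(key for key, n in counts.items() if n > 1)


# Invented sample records, for illustration only.
records = [
    {"InstitutionCode": "AAA", "CollectionCode": "Fish", "CatalogNumber": "1001"},
    {"InstitutionCode": "AAA", "CollectionCode": "Fish", "CatalogNumber": "1002"},
    {"InstitutionCode": "aaa", "CollectionCode": "fish ", "CatalogNumber": "1001"},
    {"InstitutionCode": "BBB", "CollectionCode": "Fish", "CatalogNumber": "1001"},
]

duplicates = find_duplicates(records)
print(duplicates)
```

Note that the same catalog number under a different institution code (the "BBB" record) is not flagged; only a globally unique [InstitutionCode] makes the triplet usable as an identifier.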
Slide 3: Wouldn't most of the problems indicated in the first four bulleted points be largely solved by the Global Provider Registry? Using the [InstitutionCode] would allow lookup in the registry for a (current/active) metadata URL, and the metadata URL would provide information on where to access a particular [CollectionCode]+[CatalogNumber] piece of data.
The issue of specimens changing numbers and/or collections is problematic, of course.
The issue of versioning is a bit dicey, in my mind (e.g., at what resolution of information change)? Some things, like changing taxonomic determinations (i.e., "real" changes) need to be handled in a robust way. Other things, like the correction of typos and different styles of representing the exact same information (e.g., R.L. Pile==>R.L. Pyle; or R.L. Pyle==>Pyle, R.L.) probably don't need to be versioned. Other sorts of changes (e.g., the elaboration of previously existing information, such as the addition of retroactively-generated georeference coordinates) fall somewhere in-between.
Slide 4: We should all get behind SEEK in addressing these issues (Taxon concept mapping). Ultimately, we minimally need a GUID pool for References (inclusive of unpublished works), and a GUID pool for what I call "Protonyms" (original creations of IC_N Code-compliant names). The union of these two GUIDs (what I would call "Assertions") would itself represent a GUID to a "potential concept" (Berendsohn). (Note: my preference would be to define Protonyms as a subtype of Assertions, and therefore Protonym GUIDs would be a subset drawn from the same pool as Assertion GUIDs -- but this is a technical discussion for another time).
Good progress is being made on this; prototypes should be ready for evaluation of the LSID approach soon.
Slide 5: Nice summary!!
Slide 6: Good stuff here, but I'll respond with some of my personal opinions:
RevisionID: see points of concern already expressed above
Specimen Record LSIDs: I gather from subsequent slides that you recognize two alternative approaches: having the "owner" of a specimen assign the LSID within the context of their own <domainName>, or adopting GBIF as the international standard issuer for ALL specimen GUIDs. In other words, GBIF would represent the centralized issuer of GUIDs for all biological specimens, and the biological specimen community would/should rally around GBIF for this purpose, and adopt GBIF specimen GUIDs as their own. I personally have no problem with this (I do not live in fear of "Big Brother" centralization when it serves the benefit of all, as I believe it would in this case) -- but I know there are many who might have a problem with it, and therefore it might not garner widespread adoption without large volumes of "fuss".
I strongly disagree that there should be a single GUID issuer or resolver. What we really need is an organization that operates kind of like a certificate authority: GBIF could act as the root from which other trusted GUID issuers may be created. In this way we can avoid the arbitrary creation of GUIDs yet still provide considerable flexibility and de-centralization in the community.
If, on the other hand, each organization issues its own GUIDs for its own set of specimens, then the question is when, if ever, GBIF would assign a specimen GUID? Perhaps as a surrogate for institutions that lack the technological ability to assign their own LSIDs? But I wonder, how many institutions that could serve electronic data of their holdings to the internet would lack the ability to assign their own LSIDs?
It would be a relatively simple task to include an LSID resolver service along with a DiGIR provider. I prototyped such a system a while back, but other issues prevented deployment. With such an implementation, it would be trivial to assign unique identifiers to specimens - but first the problems institutions seem to have even providing unique identifiers within a collection must be resolved.
As you've outlined in subsequent slides, I see two alternative paths: A) Get the biological world to rally around GBIF as the centralized provider of GUIDs for specimens for all collections; or B) Have each collection/institution issue its own set of LSIDs for its own specimens, and have GBIF adopt those LSIDs for its own internal purposes. I could get behind either approach, but I see danger in the adoption of a mixture of these two approaches. I'll defer elaboration, but a lot of it has to do with potential confusion about whether the GUID applies fundamentally to the physical specimen, or the electronic conglomeration of data associated with the specimen. Also, I think we should avoid the risk of assigning two separate GUIDs for the same "single data element" (sensu your Slide 5).
A mixture would still work, provided there was appropriate coordination between the efforts.
Name record LSIDs: I understand the example of an IPNI LSID for a plant name, and presumably there would be analogous "Catalog of Fishes" LSIDs for each fish name, etc. But I don't think that would be a wise approach. Unlike specimen records, where there are fairly unambiguous "owner" institutions (or at least "original owner" institutions that issued a GUID), taxonomic aggregators (IPNI, ITIS, Species2000, GBIF, uBio, etc.) are most certainly not owners of the taxonomic names that they include in their databases. We would want to avoid the risk of duplicate GUIDs for the same name, and thus the need for mapping, e.g., an IPNI GUID for a name to its ITIS equivalent. Again, I can't help but think that the world will be a better place if we can avoid assigning multiple GUIDs to the same "single data element".
One approach would be to rally around GBIF, and rely on them to issue GUIDs for all taxon names. However, I also recognize that we do not exist in a political/personality vacuum with regards to "ownership" of taxonomic names, or the electronic representations thereof. Therefore, the closest thing that exists to an "owner" of a taxonomic name is the Commission of Nomenclature (and its respective Code of Nomenclature) under which the name was established. Thus, when it comes to assigning GUIDs for names (not concepts), I would propose the following:
urn:lsid:ICZN.org:TaxonName:XXXXXX (all zoological names)
urn:lsid:ICBN.org:TaxonName:XXXXXX (all botanical names)
urn:lsid:ICNB[or LBSN??].org:TaxonName:XXXXXX (all bacteriological names)
urn:lsid:ICTV[or ICVCN??].org:TaxonName:XXXXXX (all virus names)
In an ideal world, we'd get to the point where there would be a need for only one registrar of nomenclature, e.g.: urn:lsid:BioCode.org:TaxonName:XXXXXXX
Or, perhaps: urn:lsid:gbif.net:TaxonName:XXXXXXX
It is quite likely that there will be multiple LSID generators and issuers. There is no real reason why this should be prevented, except to ensure that appropriate measures are taken to avoid duplication of GUIDs for the same object (taxonomic concept in this case). So a critical piece of infrastructure for a name service that was intending to assign GUIDs would be a mechanism for determining if the object they are about to assign the GUID to is not already present in the system, held at some other location. There needs to be something like a global "findThisObject(taxon_object)" that absolutely guarantees that the instance doesn't exist some other place. And if duplicates were to occur, then there must also be a mechanism for indicating equivalence between GUIDs, or perhaps a way of deleting the duplicate (how to decide which is the duplicate?).
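The two mechanisms asked for here (a lookup before assignment, and a way to record equivalence between GUIDs that turn out to be duplicates) can be sketched roughly as follows. Everything in this example is hypothetical: the NameRegistry class, the idea of matching on a whitespace- and case-normalized name string, and the example.org identifiers. A real service would need far stronger matching than string comparison.

```python
class NameRegistry:
    """Toy in-memory registry; a real one would be a shared network service."""

    def __init__(self):
        self._by_key = {}    # normalized object key -> GUID already assigned
        self._same_as = {}   # duplicate GUID -> GUID it is equivalent to

    @staticmethod
    def _normalize(key):
        # Assumption: objects match if their name strings agree after
        # lower-casing and collapsing whitespace.
        return " ".join(key.lower().split())

    def find_this_object(self, key):
        """Return the GUID already assigned to this object, or None."""
        return self._by_key.get(self._normalize(key))

    def register(self, key, proposed_guid):
        """Assign proposed_guid only if the object is not yet known."""
        existing = self.find_this_object(key)
        if existing is not None:
            return existing
        self._by_key[self._normalize(key)] = proposed_guid
        return proposed_guid

    def canonical(self, guid):
        """Follow equivalence links to the GUID that should be used."""
        while guid in self._same_as:
            guid = self._same_as[guid]
        return guid

    def mark_equivalent(self, duplicate, keep):
        """Record that 'duplicate' identifies the same object as 'keep'."""
        root, dup_root = self.canonical(keep), self.canonical(duplicate)
        if root != dup_root:          # avoids creating a cycle
            self._same_as[dup_root] = root


registry = NameRegistry()
first = registry.register("Aus bus Smith 1900", "urn:lsid:example.org:TaxonName:1")
second = registry.register("  aus  BUS smith 1900 ", "urn:lsid:example.net:TaxonName:77")
print(first == second)

# A GUID minted elsewhere before the lookup existed can still be reconciled.
registry.mark_equivalent("urn:lsid:example.net:TaxonName:77", first)
print(registry.canonical("urn:lsid:example.net:TaxonName:77"))
```

Recording equivalence rather than deleting the duplicate sidesteps the question of which GUID to remove: both keep resolving, and one is simply designated as the form to use.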
Forcing the use of a single DN such as BioCode.org for all names would seem to be a mistake, since that implies a single resolver service for all names, with obvious implications in case of failure. Perhaps there can be multiple resolver services with a single DN? That would probably work fine then.
But I don't think we're quite there yet.
In any case, the idea would be for the taxon name aggregators to adopt the unambiguously unique GUID for each taxon name.
Taxonomic concepts are a whole 'nother ball of wax....
Slide 8: I actually prefer this approach (GBIF as the central issuer of specimen GUIDs), for a variety of reasons. One of the main reasons is that it would assure uniqueness of an integer within a given <namespace> (e.g., Specimens), which would make things a bit easier for those of us who like to use integers as primary keys in databases. In other words, it avoids the possibility of urn:lsid:bishopmuseum.org:Specimen:1234567 colliding with urn:lsid:usnm.gov:Specimen:1234567, when reducing the GUID to just its integer component for local application purposes (where context can be enforced by other means). However, I should point something out regarding the "Advantage" part of this slide, which is that the "problem" of transferring record locations doesn't exist, provided that the <domainName> component of the LSID is taken as the issuer of the GUID, not as the current owner of the specimen. In other words, if Bishop Museum assigned GUID urn:lsid:bishopmuseum.org:Specimen:1234567 to a specimen, and then gave that specimen to Smithsonian, then Smithsonian would retain the complete GUID intact as: urn:lsid:bishopmuseum.org:Specimen:1234567.
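The collision described above is easy to demonstrate. This is a small sketch using the two example LSIDs from the paragraph; integer_part and scoped_key are hypothetical helper names, and the code assumes LSIDs without a revision component.

```python
def integer_part(lsid):
    """Reduce an LSID to just its trailing integer (loses all context)."""
    return int(lsid.rsplit(":", 1)[1])


def scoped_key(lsid):
    """Keep issuer and namespace alongside the integer, so keys stay unique."""
    _urn, _lsid, authority, namespace, object_id = lsid.split(":")
    return (authority, namespace, int(object_id))


bishop = "urn:lsid:bishopmuseum.org:Specimen:1234567"
usnm = "urn:lsid:usnm.gov:Specimen:1234567"

print(integer_part(bishop) == integer_part(usnm))   # the bare integers collide
print(scoped_key(bishop) == scoped_key(usnm))       # the scoped keys do not
```

With distributed issuers, a local database would therefore need a composite key (or its own surrogate integer) rather than the integer taken from the LSID.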
The danger comes when you try to use the <domainName> component as metadata to represent the current location of the specimen and/or its electronically represented data. This is where Wouter's original point about 'meaningless' GUIDs comes into play. If the whole point of using LSIDs is to embed the "current location" information within the ID itself so that applications can retrieve additional data associated with the GUID directly, then I have some concerns (mostly addressed already).
The LSID service must be able to resolve the object. When the object moves some other place, then there will need to be a mechanism for the LSID service to forward the resolution to the appropriate service. The really big problem is when an institution no longer exists - so in the hypothetical example of Bishop Museum consuming all the Smithsonian fish collections, the Smithsonian LSID resolver would perhaps no longer exist, and so those LSIDs become meaningless. Perhaps there's a delegation mechanism that can be used? So when a DN can't be resolved, the system backs down to a default DN, such as gbif.org, that would then indicate that smithsonian.org is now bishop.org?
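One way such a delegation mechanism might look is sketched below. It is only an assumption about how a fallback could behave, not part of the LSID specification: the resolve function, the in-memory dictionaries standing in for resolver services, and the sample record are all hypothetical. The domain names are the ones used in the example above.

```python
# Resolver services that are still running, keyed by domain name.
# Each maps full LSID strings to whatever data the resolver returns.
live_resolvers = {
    "bishop.org": {
        "urn:lsid:smithsonian.org:Specimen:42": {"note": "invented sample record"},
    },
}

# Forwarding table assumed to be kept by the default DN (gbif.org here):
# "requests for this issuer are now answered by that domain".
gbif_forwarding = {"smithsonian.org": "bishop.org"}


def resolve(lsid, resolvers, forwarding, max_hops=5):
    """Try the issuing authority first, then follow forwarding entries.

    Returns (domain that answered, record). The LSID itself never changes,
    so the original issuer stays embedded in the identifier.
    """
    authority = lsid.split(":")[2]
    for _ in range(max_hops):
        records = resolvers.get(authority)
        if records is not None and lsid in records:
            return authority, records[lsid]
        next_authority = forwarding.get(authority)
        if next_authority is None:
            break
        authority = next_authority
    raise LookupError(f"cannot resolve {lsid}")


served_by, record = resolve(
    "urn:lsid:smithsonian.org:Specimen:42", live_resolvers, gbif_forwarding
)
print(served_by, record)
```

The hop limit is there only to guard against forwarding loops; how the forwarding table would be maintained and trusted is the open question.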
Why is there a reference to urn:lsid:gbif.net:TaxonConcept:106734 at the top of this slide?
Slide 9: Again, I'm not sure I understand why there is a reference on this slide to urn:lsid:ipni.org:TaxonName:82090-3:1.1. Also, in this model, what function does the LSID serve that is not met by the concatenated [InstitutionCode]+[CollectionCode]+[CatalogNumber] (in the context of the Global Provider Registry)?
Slide 10 (taxon concepts and literature): This message is already getting too long... :-) I already touched on this above under "Slide 4". I definitely agree that we need a GUID system for References. This should include more than just published references. It doesn't quite exist yet among the existing Reference registrars (as far as I can tell) to accommodate the specific needs of taxonomists (e.g. referring to a subsection of a reference as representing an original taxonomic description), so I do see a need to create a Reference GUID system specific to biology. I could rant for pages on this, but I'll summarize simply with a plea to *DEFINE* a Concept GUID as an intersection between a Name GUID and a Reference GUID (i.e., what I would call an "Assertion"). Not all Name-Reference combinations will be worthy of recognition as a distinct "Concept", but all are *potentially* representative of a concept (Berendsohn), and thus all should be drawn from the same pool of GUIDs as Concept GUIDs. In other words, "Concepts" should be thought of as a subtype of Name-Reference instances. I would go further to suggest (as I did above) that "Name" GUIDs should also be a subtype of Name-Reference instances (non-exclusive of Concept subtype instances), using the Name-Reference instance that represents the Code-recognized original description of the name as the "handle" to the Name.
By this approach, you need only two GUID object classes <objectClass>: one for References, and one for Name-Reference intersections (Assertions). The latter of these could serve as the source for both Concept GUIDs and Name GUIDs.
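A rough data-model sketch of the two object classes may help. It is one possible reading of the proposal, not a settled design: the field names, the convention that a protonym Assertion points at itself, and the example.org identifiers with placeholder names and citations are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reference:
    guid: str        # drawn from the Reference GUID pool
    citation: str


@dataclass(frozen=True)
class Assertion:
    """A Name-Reference intersection: a name as used in one reference."""
    guid: str             # drawn from the single Assertion GUID pool
    name_string: str      # the name as it appears in the reference
    reference_guid: str   # which Reference this usage occurs in
    protonym_guid: str    # Assertion holding the original description

    @property
    def is_protonym(self) -> bool:
        # Assumed convention: the original description points at itself,
        # so its Assertion GUID doubles as the Name GUID.
        return self.guid == self.protonym_guid


# Placeholder data, invented for illustration.
original = Reference("urn:lsid:example.org:Reference:1", "Smith 1900")
later = Reference("urn:lsid:example.org:Reference:2", "Jones 1950")

described = Assertion(
    "urn:lsid:example.org:Assertion:1", "Aus bus", original.guid,
    "urn:lsid:example.org:Assertion:1",
)
reused = Assertion(
    "urn:lsid:example.org:Assertion:2", "Xus bus", later.guid, described.guid,
)

assertions = [described, reused]
name_guids = [a.guid for a in assertions if a.is_protonym]
usages = [a.guid for a in assertions if a.protonym_guid == described.guid]
print(name_guids)
print(usages)
```

In this reading, Name GUIDs are simply the subset of Assertion GUIDs flagged as protonyms, and any Assertion may additionally be treated as a potential Concept.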
Last Slide:
My own answers to your questions:
Are LSIDs the most appropriate technology?
I'm increasingly coming to that conclusion.
I agree. The LSID system is easy to implement, stable, scalable and does everything we need. The DOI system is good as well, but the fee scheme bothers me (though I understand there are ways around that).
Should identifiers be assigned and resolved centrally or via a fully distributed model (or should providers have the option of using either model)?
I think the best option would be central. The next option would be fully distributed. Leaving it as an option would, in my opinion, be a BIG mistake.
I disagree: the assignment of identifiers should be by the curators of the data. However, I do strongly consider that there should be some sort of trust scheme in place, where identifiers are issued only by entities trusted by the rest of the system. A scheme similar to that used by certificate authorities and delegates should be adequate.
Which objects should receive identifiers?
Specimens, References, Name-Reference intersections (Assertions), and perhaps Agents. [TaxonNames and Concepts can be subsets of Name-Reference intersections].
Any object. It doesn't matter what it is, just that it can be resolved, and when you find it, you can figure out what it is. Sensible use of the NameSpace portion of the LSID will help a lot with this. A trusted organization should issue the NameSpace portion to avoid NS conflicts.
3a) Should we develop a set of object classes for biodiversity informatics and assign identifiers to instances of all of these?
I think so, yes. Of course, it depends a bit on who you mean by "we". I'm thinking sensu lato.
Sure, and these could be a core from which others can be built. But we should absolutely not restrict the capability of the "system" to accept new classes - even classes that represent the same information in a different way that may be appropriate to a group of users.
3b) Should identifiers be associated with real world objects (e.g. specimens), or with digitised records representing them (e.g. perhaps multiple records representing different digitisation attempts by different researchers for the same specimen), or both?
I would say definitely real-world objects (treating things like Code-recognized original descriptions of taxon names, and citable references as "real-world objects"). I do NOT think we should have separate GUIDs for digital representations thereof. Alternative digital representations are simply clutter that will eventually be weeded out of the system, once we all get organized on this stuff, and harness the power of the internet to implement a global editing/QA system.
Yeah, we need to be very clear about what these identifiers are assigned to. There should be very clear documentation about this that is accepted by the relevant community. Where possible, it makes a lot of sense to use the same identifier in the electronic record as that associated with the physical object - after all, the electronic data is really just metadata about the physical object.
What should be done about existing records without identifiers?
As far as I know, ALL records are currently without identifiers (unless someone established a widely accepted GUID system and I missed the announcement...)
All records currently have some sort of identifier; the problem is that their uniqueness is not rigorously enforced or even evaluated, so their usefulness is probably limited.
4a) Should they be left alone?
Ultimately, no.
4b) Should they all be updated with identifiers?
Ultimately, yes.
All records that will be referenced by another entity need to have unique identifiers in order for a robust system that allows reuse of data to be properly implemented.
4c) Should the provider software be modified to generate "soft" identifiers (ones which we cannot guarantee in all cases to be unique) based e.g. on the combination of InstitutionCode, CollectionCode and CatalogNumber?
As an interim solution, perhaps. See my comments under "Slide 2" above.
Yes, but not soft. The providers should assign their own identifiers, but there must be a mechanism to ensure that identifiers are being properly assigned.
Are revision identifiers a useful feature?
I would like to think not. If the information is truly dynamic over time (e.g., re-determinations of taxonomic identity of specimens), then individual instances should probably receive their own set of GUIDs (as opposed to versions of the "parent" GUID). If the information is static over time, and changes represent objective corrections, then I don't see a real need to track that within the context of a GUID (record edit history may or may not need to be tracked, but this seems to me to be a separate issue from GUIDs).
Revision information is very helpful in dealing with errors such as keystroke errors or other such details that do not change the object.
5b) How many providers will be able to provide and handle them?
If versioning is incorporated, then it should be designed such that a "default" version is provided automatically when versioning is not handled.
Not many. It seems most collections don't record any history in their record edits, so without a major alteration in the way the data are stored, it will be a significant undertaking to provide useful revision information.
Sorry for the long post, but I feel that this issue is extremely important at this point in bioinformatics history.
Aloha, Rich
Richard L. Pyle, PhD
Natural Sciences Database Coordinator, Bishop Museum
1525 Bernice St., Honolulu, HI 96817
Ph: (808)848-4115, Fax: (808)847-8252
email: deepreef@bishopmuseum.org
http://www.bishopmuseum.org/bishop/HBS/pylerichard.html
-----Original Message-----
From: TDWG - Structure of Descriptive Data [mailto:TDWG-SDD@LISTSERV.NHM.KU.EDU] On Behalf Of Donald Hobern
Sent: Thursday, September 23, 2004 6:22 AM
To: TDWG-SDD@LISTSERV.NHM.KU.EDU
Subject: Re: Globally Unique Identifier
This is precisely one of the key questions we need to address with any identifier framework we adopt. I think we could easily use LSIDs in a way that should overcome your concerns, and I think that the built-in mechanisms for discovery and metadata access within the LSID model are really exciting.
I have just put together a PowerPoint presentation to explain some of what I think we could achieve with globally unique identifiers and particularly with LSIDs. It can be downloaded from:
http://circa.gbif.net/Public/irc/gbif/dadi/library?l=/architecture/globallyuniqueidentifier/
It may be clearest if you go through it as a slide show rather than in edit mode.
Thanks,
Donald
Donald Hobern (dhobern@gbif.org)
Programme Officer for Data Access and Database Interoperability
Global Biodiversity Information Facility Secretariat
Universitetsparken 15, DK-2100 Copenhagen, Denmark
Tel: +45-35321483 Mobile: +45-28751483 Fax: +45-35321480
-----Original Message-----
From: TDWG - Structure of Descriptive Data [mailto:TDWG-SDD@LISTSERV.NHM.KU.EDU] On Behalf Of Wouter Addink
Sent: 23. september 2004 17:38
To: TDWG-SDD@LISTSERV.NHM.KU.EDU
Subject: Re: Globally Unique Identifier
It seems that DOI allows for any existing IDs to be used as part of the unique identifier. That seems to me a fast-to-adopt short-term solution, but not a good idea for the long term. At first sight I very much liked the LSID specification, but the longer I think about it, the less I like some parts. What I think is missing in the LSID specification is that the unique identifier should be 'meaningless' apart from being an identifier, so that it becomes time independent (and avoids possible political problems). Any solution with a URN I can think of has some meaning, which makes solutions like a MAC-address generated GUID favorable in my opinion. And any meaning you need (like an authority of an object) can be specified in metadata instead of using it in the identifier. What is not very clear to me in the LSID specification is where the LSID generated by an LSIDAssigningService is actually stored.
Wouter Addink
----- Original Message -----
From: "Gregor Hagedorn" G.Hagedorn@BBA.DE
To: TDWG-SDD@LISTSERV.NHM.KU.EDU
Sent: Wednesday, September 08, 2004 6:20 PM
Subject: Re: Globally Unique Identifier
I am not quite sure, but to me it seems with "GUID" you refer to the numeric, MAC-address generated GUID type. I have nothing against these. However, any URN in my view is a GUID that has most of the properties you mention:
- it is guaranteed to be unique globally, and can be created anywhere, anytime by any server or client machine
- it has no meaning as to where the data is physically located and will therefore not confuse any user about this
- most id mechanisms, especially URI/URN ids, require a 'governing body' to handle namespaces/urls to ensure every URN is unique, whereas a GUID is always unique
The governing body is restricted to the primary web address, and in most cases such an address is already available. Being a member of a governmental institution that explicitly forbids the use without prior consent, and forbids the use of its domain name once you are no longer working for them, I realize there is some potential for problems.
I do think a URL of some kind would be useful for things such as global searches of multiple databases, as this will allow the search to go directly to the data source where the name, reference, etc. comes from. But this should not be part of its ID. Maybe a name/id should have several forms, a GUID for an ID and a URL + a GUID for a fully specified name.
What are the current thoughts on these ideas?
A GUID is only part of the problem. The other half of the problem is actually getting at the resource. URN schemes like DOI or LSID (I prefer the latter) intend to define resolution mechanisms. That does not yet make the URN a URL; in my view the good comes with the good: location and reorganization independence.
I believe GBIF should install such an LSID resolver, which is why in the UBIF proxy model, under Links, I propose to support a general URL (including potentially URNs), a typed LSID and a typed DOI. This could be simplified to have just a URN (LSID and DOI are URNs), but that would then require string parsing to determine and recognize the preferred resolvable GUID types. Comments on splitting/not splitting this are welcome!
There may be some need to define a non-resolvable URN/numeric GUID as well. However, that would not be under the linking question. Is it correct that linking requires resolvability, or am I thinking in the wrong direction?
Gregor
Gregor Hagedorn (G.Hagedorn@bba.de)
Institute for Plant Virology, Microbiology, and Biosafety
Federal Research Center for Agriculture and Forestry (BBA)
Koenigin-Luise-Str. 19, 14195 Berlin, Germany
Tel: +49-30-8304-2220
Fax: +49-30-8304-2203
Often wrong but never in doubt!