Hi, I found Donald's PowerPoint document really helpful on some of the issues discussed. Unfortunately, referring to it is a little difficult because the pages do not contain a unique identifier :)
But first, some real-world experience with the 'GUID' used today in the GBIF specimen network. The combination of InstitutionCode, CollectionCode and CatalogNumber was chosen for that. Problems experienced last year:
-The 'GUID combination' is not enforced and therefore not always used.
-Some collections belong to two or more institutions, or to none.
-If part of a collection moves to another institute, the GUID combination changes for that part.
-The InstitutionCode should be unique, and providers were asking what to do if the code they wanted to use was already taken, and who decides which institute may use an InstitutionCode if two institutes want it. There is no body responsible for that and there are no rules: can the first institute claim a code, or the biggest, or the most well known?
-In different areas of science, different InstitutionCodes were in use within one organisation; which one should be chosen?
-This 'GUID' can only be used for specimens, not for other life science objects.
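To make the third problem concrete, here is a minimal sketch of how an identifier built from the three fields changes when a specimen moves. The function name, the ':' separator and the example codes are assumptions for illustration only; they are not taken from any GBIF specification.

```python
def triplet_guid(institution_code: str, collection_code: str, catalog_number: str) -> str:
    """Join the three fields into one identifier (the ':' separator is an assumption)."""
    return f"{institution_code}:{collection_code}:{catalog_number}"


# Hypothetical specimen held by institute "AAA", later moved to institute "BBB".
before_move = triplet_guid("AAA", "Birds", "12345")
after_move = triplet_guid("BBB", "Birds", "12345")

# Same physical specimen, but the identifier is no longer the same.
print(before_move, after_move, before_move == after_move)
```

Any reference stored elsewhere under the first value no longer matches the record after the move.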
Now let's look at the LSID syntax: urn:lsid:authority:namespace:object_identifier (:revision_number) About the first part, authority: it is natural to want this to be unique. We can therefore expect the same problems as mentioned above, plus a lack of clarity about the difference between the issuing authority and the current authority for the data. The problems with authority matter only to the authorities involved, not to the rest of the life science community. So discussing them, and establishing a body that takes decisions in political conflicts, is a waste of time. We can solve this by using only a unique number and maintaining a list that gives information about each number. It should be clear that this list records only the initial issuing authority or authorities.
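As an illustration of the syntax and of the 'number plus list' idea, here is a minimal sketch in Python. It only splits the identifier on colons as written above and does not validate against the full LSID specification; the names LSID, parse_lsid and AUTHORITY_REGISTRY, and the number 1001, are hypothetical.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text: str) -> LSID:
    """Split urn:lsid:authority:namespace:object_identifier[:revision_number]."""
    parts = text.split(":")
    # Five parts without a revision, six with one; the prefix is compared case-insensitively.
    if len(parts) not in (5, 6) or parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    authority, namespace, object_id = parts[2:5]
    if not (authority and namespace and object_id):
        raise ValueError(f"empty component in LSID: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return LSID(authority, namespace, object_id, revision)


# Hypothetical list of authority numbers: it records only who issued the
# identifier initially, not who holds the data today.
AUTHORITY_REGISTRY = {"1001": "Example Institute (initial issuing authority)"}

lsid = parse_lsid("urn:lsid:1001:specimen:12345:2")
print(lsid, AUTHORITY_REGISTRY.get(lsid.authority))
```

With a plain number in the authority position, nobody has to argue about who owns a particular code; the list only has to stay up to date.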
About the second part, namespace: things like 'Specimen' or 'Experiment'. In contrast with the first part, problems with this part matter to the whole life science community, because applications will want to use the namespace to decide whether the data can be used for a specific purpose. Standardisation of namespaces is necessary. I think it should be divided into two parts (not currently present in LSID), like a MIME type such as image/jpeg: 'observation/abcd' or 'specimen/darwincore', for example. If we look at Donald's PPT we see that in model 1 (LSID assigned centrally) the namespace is chosen centrally, and in model 2 (LSID assigned by each provider) the provider is free to choose one. Even language variations in naming a namespace can already cause problems, which is why I strongly favor a central mechanism for assigning LSIDs, unless the provider is somehow forced to use a certain namespace class. I do not think the potential bottleneck is really an issue (see also DNS mechanisms). If we choose a central mechanism, the issuing authority will always be GBIF (or do we need different authorities for different parts of life science?), so the authority problems do not arise in this case either.
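A small sketch of how an application might use such a two-part namespace. This 'class/schema' form is my proposal and not part of LSID today; the function names and the list of known classes are hypothetical.

```python
from typing import Set, Tuple

# Hypothetical centrally maintained list of namespace classes.
KNOWN_CLASSES: Set[str] = {"specimen", "observation"}


def split_namespace(namespace: str) -> Tuple[str, str]:
    """Split 'class/schema' (for example 'specimen/darwincore') into its two parts."""
    object_class, sep, schema = namespace.partition("/")
    if not sep or not object_class or not schema:
        raise ValueError(f"expected 'class/schema', got {namespace!r}")
    # Lower-casing removes spelling variants such as 'Specimen' versus 'specimen'.
    return object_class.lower(), schema.lower()


def application_accepts(namespace: str, supported_classes: Set[str]) -> bool:
    """Decide from the namespace class alone whether an application can use the data."""
    object_class, _schema = split_namespace(namespace)
    return object_class in KNOWN_CLASSES and object_class in supported_classes


print(application_accepts("Specimen/DarwinCore", {"specimen"}))
print(application_accepts("observation/abcd", {"specimen"}))
```

If every provider invents its own class names, the check against the central list fails, which is the reason for assigning the namespace centrally.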
The third part, object id: no problems there.
The last part, revision id: whether you need it depends on whether you give a GUID to the physical objects or to the data records. With the first choice you do not need a revision number, because the physical object will not change (or does it, with living collections?). At first I thought that a GUID should be put on the physical object: if you are looking for data, you are looking for data about a certain physical object, and the source of the data is not (very) important. The same data elements in different sources about the same object should be equal; if they are not, there are errors. Donald's PPT gives the example of someone who wants to refer to an LSID in a publication as a source. In that case you want to refer to a data source at a certain version. Then you need to give a GUID to the data, and you also need revisions. Data is not persistent; it changes all the time. Giving a persistent identifier to it is very difficult, and not many data systems have full revision support. If a GUID for a 'physical object' is chosen, things like a species name, an author name or a country should not get a GUID. These are more a kind of attribute: most data will use one or more species names as 'metadata'. There needs to be a central datasource for each of these kinds of 'metadata', like a NameBank for species names (with its own ID). I am not sure whether LSID was designed as a GUID for data or for physical objects. The use of namespace and object id instead of database name and record id seems to indicate that it was designed for physical objects, but why then the optional revision id? Instead of a revision id you can also assign a new GUID with every change, but then how do you point from an old version of the data to a new one (if you have the GUID of the old version, how do you get the GUID of the new one)?
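One possible answer to that last question, purely as a sketch: keep a 'successor' link for every GUID that has been replaced and follow the links until none is left. The successor table, the function name and the example identifiers are assumptions; nothing like this is defined by LSID.

```python
from typing import Dict


def latest_version(guid: str, successor: Dict[str, str]) -> str:
    """Follow successor links from an old GUID to the newest one."""
    seen = {guid}
    while guid in successor:
        guid = successor[guid]
        if guid in seen:
            # Guard against a broken table that links versions in a circle.
            raise ValueError(f"cycle in successor links at {guid!r}")
        seen.add(guid)
    return guid


# Hypothetical table: each changed record got a new GUID and a link from the old one.
successor_links = {
    "urn:lsid:1001:specimen:500": "urn:lsid:1001:specimen:731",
    "urn:lsid:1001:specimen:731": "urn:lsid:1001:specimen:902",
}

print(latest_version("urn:lsid:1001:specimen:500", successor_links))
```

The drawback is that somebody has to maintain this table for as long as old GUIDs are in circulation, which a revision id on a single GUID avoids.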
Requirements in Donald's PPT:
-If a GUID is on a physical object, the GUID need not refer uniquely to a single data element; it only has to be unique itself. Nor is this a requirement in the LSID specification. There will be overlap between objects, so an object can belong to more than one ID. For instance, a researcher can have his or her own ID and also belong to the ID of the institute he or she works for. The data for overlapping elements, such as the researcher's name, must be equal.
-I would restrict the identifiers to life science objects.
Issues to be resolved in Donald's PPT: It would be beneficial to maintain the GUID in the datasource itself (at least for the owner of the datasource), but it is not absolutely necessary. I see GUIDs in data records as a 'tightly coupled' model (which requires some work on existing databases). I can also imagine a 'loosely coupled' model in which the provider software is modified to get the identifier from a central server (or mirror).
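A rough sketch of the 'loosely coupled' model, with an in-memory class standing in for the central server. The class, its method, the authority string 'example-authority' and the provider name are all hypothetical; a real service would be reached over the network and would store its assignments persistently.

```python
import itertools
from typing import Dict, Tuple


class CentralIdentifierService:
    """Stand-in for a central server that hands out identifiers to provider software."""

    def __init__(self, authority: str = "example-authority") -> None:
        self._authority = authority
        self._counter = itertools.count(1)
        self._assigned: Dict[Tuple[str, str, str], str] = {}

    def identifier_for(self, provider: str, namespace: str, local_key: str) -> str:
        """Return the identifier for a provider's record, assigning one on first request."""
        key = (provider, namespace, local_key)
        if key not in self._assigned:
            object_id = next(self._counter)
            self._assigned[key] = f"urn:lsid:{self._authority}:{namespace}:{object_id}"
        return self._assigned[key]


service = CentralIdentifierService()
# The provider database keeps only its own record key; the GUID lives at the server.
first = service.identifier_for("provider-a", "specimen", "record-17")
again = service.identifier_for("provider-a", "specimen", "record-17")
other = service.identifier_for("provider-a", "specimen", "record-18")
print(first, again == first, other)
```

In this model the existing database does not need a new column, but the identifier is only as stable as the provider's own record key.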
Wouter Addink