Richard Pyle wrote:
...
Perhaps there's a delegation mechanism that can be used? So when a DN can't be resolved, the system backs down to a default DN, such as gbif.org that would then indicate that smithsonian.org is now bishop.org?
But it's not that simple, is it? If there is an LSID:
urn:lsid:bishopmuseum.org:Specimen:1234567
and another LSID, to a completely different specimen:
urn:lsid:smithsonian.gov:Specimen:1234567
...then simply redirecting all bishopmuseum.org requests to the Smithsonian wouldn't work... would it? Or would the Smithsonian recognize the domain and deal with it accordingly?
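A minimal sketch of why the two LSIDs above cannot collide, and of what "recognizing the domain" would require of the new host. It assumes the usual `urn:lsid:authority:namespace:object[:revision]` layout; the `Lsid` type, `parse_lsid` function and the `holdings` table are hypothetical names used only for illustration.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text: str) -> Lsid:
    """Split an LSID string into its colon-separated parts."""
    parts = text.split(":")
    if len(parts) not in (5, 6) or [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
        raise ValueError(f"not an LSID: {text!r}")
    authority, namespace, object_id = parts[2:5]
    revision = parts[5] if len(parts) == 6 else None
    # Assumption: the authority is a domain name, so it is compared case-insensitively.
    return Lsid(authority.lower(), namespace, object_id, revision)


bishop = parse_lsid("urn:lsid:bishopmuseum.org:Specimen:1234567")
smithsonian = parse_lsid("urn:lsid:smithsonian.gov:Specimen:1234567")

# A host that takes over another institution's identifiers must key its
# lookups on the full LSID (authority included), not on the object ID alone.
holdings = {
    bishop: "specimen originally catalogued at Bishop Museum",
    smithsonian: "specimen originally catalogued at the Smithsonian",
}
```

Because the authority is part of the key, the same object ID under two authorities stays two distinct records even when one server answers for both.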
It seems to me that a lot of complexity would disappear if we could all get behind a single issuer of GUIDs, and mirror the capability to resolve those GUIDs on dozens or hundreds of servers around the world, and only use the GUIDs in a semantic context that is self-evident.
Perhaps. But what if an institution wants to provide IDs for more than just specimen or name objects? Should we always rely on a single authority to provide a mechanism for doing that? I don't think that would go very far.
Re-reading something I wrote:
I would go further to suggest (as I did above) that "Name" GUIDs should also be a subtype of Name-Reference instances (non-exclusive of Concept subtype instances), using the Name-Reference instance that represents the Code-recognized original description of the name as the "handle" to the Name.
Actually, it's probably safe to say that all "name-bearing" Name+Reference instances (i.e., original descriptions) are, virtually by definition, also "concept-bearing" Name+Reference instances. So, not only would name-bearing and concept-bearing Name+Reference instances be non-exclusive of each other, it would probably be safe to think of name-bearing instances as a subset (Subtype) of concept-bearing instances, which themselves are a subset (Subtype) of all Name+Reference instances.
ok.
My own answers to your questions:
Are LSIDs the most appropriate technology?
I'm increasingly coming to that conclusion.
I agree. The LSID system is easy to implement, stable, scalable and does everything we need. The DOI system is good as well, but the fee scheme bothers me (though I understand there are ways around that).
My understanding is that it would be easy to develop a DOI-like system that is not part of the fee-based DOI system, and I still find it appealing because it could be as simple as an integer ID and a very basic context tag.
As for LSIDs -- If I understand correctly that the purpose of the <DomainName> portion of the LSID is to point to the one (and only?) server that can resolve the ID, then all of a sudden I don't like them at all. If it's true that the embedded Domain portion of an LSID *requires* that the domain exist for as long as the GUID exists in order for the GUID to be useful, then I definitely have reservations. If, on the other hand, the Domain portion can be seen as representing the issuer (somewhat analogous to the function of "InstitutionCode" in DwC), and could be resolved by any server set up to deal with the <namespace> part of the LSID, then I'm much less concerned.
The DN portion is meant to be resolvable by the DNS system. So yes, there is a dependency on the continued existence of the DN, but it can be set up to be resolved by any LSID service endpoint.
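A rough sketch of that indirection, assuming the DNS SRV convention (`_lsid._tcp.<authority>`) that the LSID resolution specification describes, as far as I understand it. No real DNS query is made here; the `HYPOTHETICAL_SRV` table and the host `lsid-mirror.example.org` are invented stand-ins for what a DNS library would return.

```python
from typing import Dict, Optional, Tuple


def lsid_authority(lsid: str) -> str:
    """Return the authority (domain name) portion of an LSID."""
    parts = lsid.split(":")
    if len(parts) < 5 or [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
        raise ValueError(f"not an LSID: {lsid!r}")
    return parts[2].lower()


def srv_query_name(lsid: str) -> str:
    """Build the SRV record name a client would ask DNS for."""
    return f"_lsid._tcp.{lsid_authority(lsid)}"


# Hypothetical SRV data: the domain still exists in DNS, but its LSID
# service record points at a server run by someone else.
HYPOTHETICAL_SRV: Dict[str, Tuple[str, int]] = {
    "_lsid._tcp.bishopmuseum.org": ("lsid-mirror.example.org", 80),
}


def locate_endpoint(
    lsid: str, srv_table: Dict[str, Tuple[str, int]] = HYPOTHETICAL_SRV
) -> Optional[Tuple[str, int]]:
    """Look up the (host, port) that serves this LSID's authority, if any."""
    return srv_table.get(srv_query_name(lsid))
```

The point of the sketch: the domain name has to keep existing as a DNS entry, but the machine that actually answers can be any endpoint the record points to.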
I think the best option would be central. The next option would be fully distributed. Leaving it as an option would, in my opinion, be a BIG mistake.
I disagree: the assignment of identifiers should be by the curators of the data. However, I do strongly believe that there should be some sort of trust scheme in place, where identifiers are issued only by entities trusted by the rest of the system. A scheme similar to that used by certificate authorities and delegates should be adequate.
Maybe I'm misunderstanding the use of the word "issuers", but in my mind, the issuer's job is only to provide a guaranteed-unique set of ID's. It would not, necessarily, be the location where the ID is applied to its associated data.
If we use the MAC approach + a context such as an LSID or DOI form, then there is absolutely no need for a central issuing agency. If data providers are more careful about assuring they meet requirements about their identifiers, then again there is no need for a central issuer.
In Donald's PowerPoint file, he made reference to "mechanisms for data providers to request and use blocks of LSIDs from central service". Here's how I imagine a system would work:
GBIF (or some other central entity) establishes a service that can generate unique <objectID> numbers within its own LSID context. The same service also maintains a complete set of data associated with each <objectID>. Major (and minor) institutions (essentially your set of "Trusted" entities) would establish mirrored copies of the complete set of all data (or, perhaps, only a filtered subset of the complete data), but would not be able to issue new GUIDs directly. However, the mirrored sites could serve as a real-time "pass-through" to the central site so as to be functionally able to provide new GUIDs in real time, by retrieving them directly from the central server. Also, the mirrored sites would all maintain synchrony of their copies of the data with the central "master" copy, on a realistic time frame (e.g., every 24 hours, or on-demand if a data provider chose to initiate a synchronization command).
Again, just use a MAC-based GUID inside an LSID context. If you have any MS dev tools on your machine, type "guidgen" at the command prompt. Voila! Globally unique identifiers. No matter how many times you push the "New GUID" button.
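For those without guidgen, a minimal sketch of the same idea in Python. `uuid.uuid1()` builds a version-1 UUID from the host's hardware (MAC) address and a timestamp, falling back to a random node value if no address can be read. The `mint_lsid` helper and the authority and namespace passed to it are made up for illustration.

```python
import uuid


def mint_lsid(authority: str, namespace: str) -> str:
    """Wrap a freshly generated MAC/time-based UUID in an LSID."""
    object_id = uuid.uuid1()  # version 1: node (MAC) address + timestamp
    return f"urn:lsid:{authority}:{namespace}:{object_id}"


first = mint_lsid("mymuseum.org", "specimen")
second = mint_lsid("mymuseum.org", "specimen")
print(first)
print(second)
```

Each call yields a different identifier, and no central issuer is consulted at any point.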
If a curator of a local institution's data needed to assign a new batch of numbers for a new set of specimens, the curator would issue a request to the central server (or via one of mirrored sites as a pass-through request) for a block of N numbers. The central server would never re-issue those same numbers again to anyone else. But those numbers remain "empty" until the curator assigns them to data, and uploads that data either to the central server or to one of the mirrors. In other words, even though the numbers are "issued" by a central server, they are applied to real data only by local curators.
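The block-request step described above can be sketched as a simple counter that only ever moves forward. This is an illustration under stated assumptions, not a description of any existing GBIF service: `BlockIssuer`, `request_block` and the requester names are all hypothetical, and a real service would need persistent storage and authentication.

```python
from typing import Dict, List


class BlockIssuer:
    """Central issuer that hands out blocks of integers and never reuses them."""

    def __init__(self, start: int = 1) -> None:
        self._next = start
        self.issued: Dict[str, List[range]] = {}  # requester -> blocks received

    def request_block(self, requester: str, count: int) -> range:
        if count <= 0:
            raise ValueError("count must be positive")
        block = range(self._next, self._next + count)
        self._next += count  # advance past the block so it is never issued again
        self.issued.setdefault(requester, []).append(block)
        return block


issuer = BlockIssuer()
fish_block = issuer.request_block("curator-a", 100)
bird_block = issuer.request_block("curator-b", 50)

# The numbers stay "empty" until a curator attaches data to them locally.
local_data = {fish_block[0]: {"CatalogNumber": "example-0001"}}
```

The issuer only guarantees uniqueness; applying a number to an actual specimen record remains the curator's job.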
A big issue, of course, is control over editing of data associated with a given GUID. In the case of specimens, the central server and mirrored sites could (perhaps at the discretion of the data curator who initially requested the number) restrict subsequent editing of those data to a defined set of password-protected user accounts. In the case of more public data, such as taxon names and publications, the control of data editing would be less restrictive (e.g., either fully accessible by the public, or accessible to anyone who goes to the trouble to register themselves as a taxonomist with the central server or with any of the mirrored sites).
Maybe this approach would not be practical for specimen data -- but I think it would be the optimal approach for taxon data. Perhaps those two fundamentally different kinds of data (owned, vs. public domain) need fundamentally different approaches to GUID issuance and assignment?
Which objects should receive identifiers?
Specimens, References, Name-Reference intersections (Assertions), and perhaps Agents. [TaxonNames and Concepts can be subsets of Name-Reference intersections].
Any object. It doesn't matter what it is, just that it can be resolved, and when you find it, you can figure out what it is. Sensible use of the NameSpace portion of the LSID will help a lot with this. A trusted organization should issue the NameSpace portion to avoid NS conflicts.
I'd have to think this through some more. Leaving it too open might lead to a plethora of (potentially overlapping, but not quite equivalent) NameSpaces, which seems like it could turn into a real mess, really quickly. Centralized ID systems such as social security numbers in the U.S., telephone numbers, etc. definitely have some advantage over totally open systems. I suppose that the pool of NS's would be self-cleaning simply by use or non-use....but I still wonder how much better this approach would be over the status quo.
That's why there needs to be some agreement over the issuance of namespaces.
3a) Should we develop a set of object classes for biodiversity informatics and assign identifiers to instances of all of these?
I think so, yes. Of course, it depends a bit on who you mean by "we". I'm thinking sensu lato.
Sure, and these could be a core from which others can be built. But we should absolutely not restrict the capability of the "system" to accept new classes - even classes that represent the same information in a different way that may be appropriate to a group of users.
Again, I'll have to think about this some more. I certainly don't think that the "system" should be incapable of dealing with new classes -- sort of like how anyone can develop their own Federation Schema and use DiGIR to establish specific information networks. But I'd hate to see a breakdown in the global transmission of biodiversity information simply because different subgroups establish their own special-needs, non-mutually-compatible classes for dealing with essentially the same kinds of information (especially if they do not also conform to a generalized international standard).
Bah. That's the whole point of this - to facilitate data exchange. If a small subgroup wants to start exchanging data in an abbreviated format, so what? As long as the identifiers being used are able to resolve the type of object being passed around, and the objects conform to their definitions, it shouldn't be a problem. By initially establishing a robust framework for Scientific Names and perhaps specimen data / collections, then there will be little need for others to recreate new ways to represent that data. The benefits of a robust reliable representation and provision of cheap, effective software tools will hopefully overcome the steep learning curve needed to even understand what's in some of these schemas.
What should be done about existing records without identifiers?
As far as I know, ALL records are currently without identifiers (unless someone established a widely accepted GUID system and I missed the announcement...)
All records currently have some sort of identifier; the problem is that their uniqueness is not rigorously enforced or even evaluated, so their usefulness is probably limited.
O.K., in that case I misunderstood the meaning of "identifiers". All historical identifiers (e.g., catalog numbers for specimens) should be maintained, preserved, and cross-referenced to GUIDs just like any other metadata about the physical object. I think of catalog numbers not so much as unique identifiers, but as "labels" -- not altogether unlike taxonomic names. In the databases I manage, I do not use catalog numbers as identifiers -- the computer generates the UID, which is never seen, read, written, or typed by a human. That's how I'd like to see the sorts of GUIDs we're discussing be implemented -- i.e., for the benefit of computer-computer data exchange; not human-human data exchange or human-computer/computer-human data exchange.
But if I want to say to you, "hey, look at this specimen xxx" while we're chatting from around the world using an instant messenger while collaborating on some project, wouldn't it be nice to just be able to type in lsid:mymuseum.org:specimen:1234 and have your client retrieve that exact data and associated metadata directly? A trivial example, but one that can form the foundation of some cool stuff for data exchange and interaction. I thought that was the whole point of these GUID things. But maybe I'm mistaken?
4c) Should the provider software be modified to generate "soft" identifiers (ones which we cannot guarantee in all cases to be unique) based e.g. on the combination of InstitutionCode, CollectionCode and CatalogNumber?
As an interim solution, perhaps. See my comments under "Slide 2" above.
Yes, but not soft. The providers should assign their own identifiers, but there must be a mechanism to ensure that identifiers are being properly assigned.
Agreed -- but I still think of these as "soft" identifiers, because CatalogNumber values can change over time, in certain circumstances. GUIDs should *never* need to be changed (even if the institution that issued them vanishes without a trace from the face of the Earth).
Yep. That would be the ideal.
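To make the "soft" versus permanent distinction concrete, here is a small sketch. The `SpecimenRecord` class, the colon-joined format of the soft identifier, and the sample codes are assumptions for illustration; the GUID here is a random version-4 UUID, though any guaranteed-unique value would do.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class SpecimenRecord:
    institution_code: str
    collection_code: str
    catalog_number: str
    # Assigned once at creation and never edited afterwards.
    guid: str = field(default_factory=lambda: str(uuid.uuid4()))

    def soft_identifier(self) -> str:
        """Derive an identifier from fields that curators may later change."""
        return ":".join(
            (self.institution_code, self.collection_code, self.catalog_number)
        )


record = SpecimenRecord("XYZ", "Fish", "1234")
guid_before = record.guid
soft_before = record.soft_identifier()

# The specimen is re-catalogued: the soft identifier changes, the GUID does not.
record.catalog_number = "1234a"
soft_after = record.soft_identifier()
```

Anything that cited the derived identifier breaks after the re-cataloguing, while a reference to the GUID still points at the same record.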
Revision information is very helpful in dealing with errors such as keystroke errors or other such details that do not change the object.
I agree that revision information *can* be helpful in dealing with errors; but I don't see that function as being integral to the assignment of GUID values.
Except in the somewhat bizarre case when you need the old version of the object.
Not many. It seems most collections don't record any history in their record edits, so without a major alteration in the way the data are stored, it will be a significant undertaking to provide useful revision information.
For what it's worth, the databases I have developed for my institution are designed to log every change made to every field (except performance-enhancing, purely derivative fields), including what the previous value was, who made the change, and when the change was made. When records are deleted, a "snapshot" of the value of every non-null field is logged, including the time the record was deleted, and by whom. The reason I say all of this is to underscore that my stance on not including versioning IDs as part of a GUID system is NOT from lack of appreciation for the value of preserving edit histories (something I clearly value very, very much -- given that the total diskspace occupied by my edit logs exceeds the total diskspace occupied by the "real" data!)
Awesome. That's a nice way to do it.
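A minimal in-memory sketch of that kind of change log, kept entirely separate from the record's identifier. The `LoggedStore` and `Change` names and the `"<deleted>"` marker are hypothetical; this is not the actual database design described above, only an illustration of logging the previous value, the editor, and the time, plus a snapshot of non-null fields on deletion.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Dict, List


@dataclass
class Change:
    record_id: str
    field_name: str
    old_value: Any
    new_value: Any
    changed_by: str
    changed_at: datetime


class LoggedStore:
    def __init__(self) -> None:
        self.records: Dict[str, Dict[str, Any]] = {}
        self.log: List[Change] = []

    def set_field(self, record_id: str, field_name: str, value: Any, user: str) -> None:
        record = self.records.setdefault(record_id, {})
        old = record.get(field_name)
        if old == value:
            return  # nothing changed, so nothing is logged
        record[field_name] = value
        now = datetime.now(timezone.utc)
        self.log.append(Change(record_id, field_name, old, value, user, now))

    def delete(self, record_id: str, user: str) -> None:
        record = self.records.pop(record_id)
        # Snapshot every non-null field at the moment of deletion.
        snapshot = {k: v for k, v in record.items() if v is not None}
        now = datetime.now(timezone.utc)
        self.log.append(Change(record_id, "<deleted>", snapshot, None, user, now))


store = LoggedStore()
store.set_field("rec-1", "locality", "Oahu", "rich")
store.set_field("rec-1", "locality", "O'ahu", "dave")
store.delete("rec-1", "rich")
```

The record ID never changes across edits; the history lives in the log, not in a version suffix on the identifier.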
In closing, I apologize to those who find my overly-long posts on this topic to be an annoyance. I also am starting to wonder: is this the appropriate email forum to have this discussion?
Yeah, good question. Maybe this should be on the GBIF DADI list or TDWG general? Or even the LSID list?
Aloha, Rich
Kia ora, Dave V.