Hi Jason,
Thank you for posting these questions, as I have many of the same questions for an LSID system that we plan to implement over the course of the next few months.
So... a quick review of the basics, to make sure that my understanding is consistent with that of others on this list:
LSIDs refer to (resolve to) "data" and "metadata". The "data" must never change for a given LSID, and is usually a binary digital object (an image file or a PDF file, for example). The "metadata" for a given LSID may change without changing the underlying LSID. The "data" part is optional -- LSIDs containing only metadata are thought of as "conceptual" or "abstract" LSIDs -- referring to a "concept" that has no inherent digital manifestation.
The "notion" of a particular image (e.g., corresponding to the shutter release of a camera) might be represented by a "conceptual/abstract" LSID, while each digital manifestation of the captured image (RAW, JPEG, TIFF, different crops, color-corrections, etc.) would have its own data-bearing LSID, where the "data" would be the binary/bit stream digital image file. The "conceptual" (data-less) LSID for the "notion" of the image could serve as a "hub": the metadata of each data-bearing image-file LSID would include the conceptual LSID, so that all of the digital renderings could be referred back to the same image "concept" (i.e., the same shutter-release event).
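A minimal sketch of the data/metadata distinction and the "hub" idea, in Python. Everything here is hypothetical and for illustration only: the `LsidObject` class, the `renderingOf` metadata key, and the `example.org` identifiers are my own assumptions, not part of the LSID specification or any existing resolver.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LsidObject:
    """Toy model of what an LSID resolves to (hypothetical, not a real API)."""
    lsid: str
    metadata: dict = field(default_factory=dict)  # may change over time
    data: Optional[bytes] = None                  # must never change once issued

    @property
    def is_conceptual(self) -> bool:
        # A "conceptual"/"abstract" LSID has metadata but no data.
        return self.data is None


# Data-less LSID for the "notion" of the image (the shutter-release event).
hub = LsidObject("urn:lsid:example.org:images:shutter-0001",
                 {"title": "Reef scene"})

# Data-bearing LSIDs for two digital renderings; each points back to the hub.
jpeg = LsidObject("urn:lsid:example.org:imagefiles:0001-jpeg",
                  {"format": "JPEG", "renderingOf": hub.lsid},
                  data=b"\xff\xd8\xff")  # placeholder bytes, not a real image
tiff = LsidObject("urn:lsid:example.org:imagefiles:0001-tiff",
                  {"format": "TIFF", "renderingOf": hub.lsid},
                  data=b"II*\x00")       # placeholder bytes, not a real image


def renderings_of(hub_lsid: str, objects: list) -> list:
    """Return the LSIDs of all objects whose metadata refer back to the hub."""
    return [o.lsid for o in objects
            if o.metadata.get("renderingOf") == hub_lsid]
```

Here `renderings_of(hub.lsid, [hub, jpeg, tiff])` collects the two data-bearing LSIDs, which is the sense in which the conceptual LSID acts as a hub.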
If I have all of the above more or less right, then I have a few questions of my own in the same context.
I gather from your email that you're mostly asking about LSIDs that do not (necessarily) have any digital/binary data associated with them, and you're wondering about how to manage versioning, etc. Before I address your specific questions, I want to point out that I used the word "necessarily" above -- because the first issue that I would like to understand in terms of how others approach LSIDs is the following question:
Does the LSID represent the *specimen*, or does it represent the *database record* for the specimen? It seems like a subtle distinction -- and it is -- but I think it's an important one. My understanding of data-less LSIDs is that they are specifically intended to represent things that have no inherent digital manifestation. In the case of a biological specimen, there is no inherent digital manifestation, so just like the notion of the "concept" of an image, the "concept" of a specimen seems an appropriate abstract object to which a data-less LSID could be assigned. In that view of the world, the database record only exists as a tool to associate the LSID with the metadata in a form that's easy to distribute electronically (i.e., as opposed to writing the LSID down with ink in a paper ledger book, and writing down associated metadata next to it).
I believe this is how most biologists/collection managers think about the problem, and think about how they would use LSIDs -- sort of like globally unique catalog numbers. We all have databases of specimens, and our specimens have catalog numbers, but those electronic databases and catalog numbers are not the "real" units of concern to us -- rather the physical specimens on shelves are what we're worried about. The databases are just tools to help us organize and track information about the specimens (tools that happen to have more practical value than hand-written labels physically tied to specimens -- that otherwise serve the same purpose).
The other perspective, which was foreign to me at first, but I'm now beginning to appreciate more, is that the LSID is *NOT* assigned to the physical specimen, but rather the electronic *database record* representing the specimen. In this case, the LSID *is* assigned to something with a digital manifestation, and therefore *can* (and *should*) be a data-bearing LSID. The data, in this case, would be the binary blob representing a concatenation of the complete database record, in some specified format. The metadata, in this case, would probably not be the data fields we think of for a specimen, but rather information about how to interpret and parse the binary data represented by the LSID, which itself would resolve to information associated with the specimen.
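To make this second perspective concrete, here is a rough Python sketch of a data-bearing record LSID. The field names, the `|` delimiter, and the `parse_info` layout are all assumptions of mine for illustration; nothing here comes from the LSID specification.

```python
# Hypothetical field order for a specimen record.
FIELDS = ["catalogNumber", "collector", "preservation"]


def serialize(record: dict, fields: list, delimiter: str = "|") -> bytes:
    """Concatenate the record's fields into one fixed binary blob (the "data").

    Naive on purpose: a value that itself contains the delimiter would
    break parsing, so a real format would need escaping.
    """
    return delimiter.join(record[f] for f in fields).encode("utf-8")


def parse(data: bytes, parse_info: dict) -> dict:
    """Use the LSID's metadata (parse_info) to split the blob back into fields."""
    values = data.decode(parse_info["encoding"]).split(parse_info["delimiter"])
    return dict(zip(parse_info["fields"], values))


record = {"catalogNumber": "FISH-0001",
          "collector": "R. Pyle",
          "preservation": "ethanol"}

# The "data" for the LSID: never changes once issued.
blob = serialize(record, FIELDS)

# The "metadata" for the LSID: how to interpret the blob, not the specimen fields.
parse_info = {"encoding": "utf-8", "delimiter": "|", "fields": FIELDS}
```

In this view, the specimen fields live inside the blob, and the metadata only tell a client how to unpack it with `parse(blob, parse_info)`.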
Personally, I'm still very firmly in the first camp -- that is, the assignment of "conceptual" (data-less) LSIDs to physical specimen objects, the metadata for which would be our standard specimen data fields. The LSID effectively serves as the globally unique catalog number, with the added bonus of self-resolution -- which is what, I think, the biodiversity community needs most right now. However, I'm keeping an open mind on this, so I would very much like to hear from others on this list who feel that the object represented by a specimen LSID should be the digital database record, rather than the physical specimen.
The reason I wrote all of the above is that I think it has direct bearing on the answers to your questions.
> At this stage, we are only concerned with assigning LSIDs to collections/collection events, specimens and versions of each. Since these aren't represented by bytecode we don't have to be concerned about issuing a new LSID each time the metadata changes (through improvements/changes in determination, geolocation etc), but we also don't want to throw away the previous revisions so the concept of a "hub" would serve well.
O.K., so let's assume we will create data-less (conceptual/abstract) LSIDs for our specimens, and that the metadata are the standard specimen data fields.
> This hub would allow us to have a single unchanging LSID that points to (or returns) the current metadata but also points to each LSID for the previous collection revisions. A change in the collection metadata would not change the LSID of the collection hub, it would just create a new collection version record which is issued a new LSID and promoted to "current". This collection hub would also point to a "hub" for each of the specimens that are represented by the collection and these specimen hubs would each point to the current metadata for the specimen as well as the previous versions. We would not be using the revision method of LSIDs, rather we would issue a totally new LSID for each version as recommended by TDWG.
I'm not sure I follow -- by "collection" do you mean collecting event? Or do you mean "collection" like "Bishop Museum Fish Collection"? I read the above a couple of times, and I *think* I understand what you are saying, but I'm not sure. Let me describe my approach to the same basic problem, and see if it makes sense in the context of the above.
Let's suppose we generate data-less LSIDs to represent our collecting events, and data-less LSIDs to represent our specimens. In both cases, the LSIDs represent the abstract notion of the collecting event or specimen, not the electronic database record per se. Our metadata for each LSID would correspond to our usual data fields for each kind of object (i.e., date, collector, etc. for collecting events; and preservation method, determinations, etc. for specimens).
The question, it seems (at least the question I have, and I think the question you have) is how do we manage edit histories of metadata elements. I can think of three scenarios to deal with this:
1) Assign the LSID to the database record object, not the conceptual collecting event/specimen. In this scenario, the LSID would represent the database record itself, and would have data. The data would be a binary concatenation of all the elements of a typical specimen data record, with some sort of delimiter between elements (fields). This binary digital object would be fixed and permanent, and would never change. Metadata associated with each of these LSIDs would include information on how to parse the binary data blob into its component fields/elements, so they could be rendered, searched, etc. If some data element needed to change (e.g., the collector's name was originally misspelled), then a replacement LSID would be generated for the new binary concatenated data blob, and this new LSID would use the versioning feature of LSIDs (i.e., it would differ from the original LSID only in the revision ID part of the LSID). Thus, every data edit would automatically be issued a new LSID, because the data component itself has changed. As I stated above, I'm not too keen on this approach, based on my current understanding.
2) Utilize the Revision ID part of an LSID to track the history of metadata changes. In this scenario, the LSIDs would themselves be data-less, and the metadata would be our typical data fields. If any of our data fields changed, we would issue an LSID differing from the original only by the Revision ID component. This way, each version of the data gets its own LSID, and resolving any one of the versions automatically redirects to the latest/most recent version, using the LSID versioning features. If you strip the revision ID part of the LSID, you're essentially left with an LSID that applies to the "concept" of the specimen (i.e., the "hub" LSID). This seems almost the same method you described above (if I understood you correctly), except that the new LSIDs are generated by altering only the Revision ID part of the LSID, rather than creating a new LSID with a different Object ID. I'm not sure why you would want to issue new LSIDs with new Object ID components for what effectively represent different versions of metadata for the same object. The main problem I have with this approach is that I don't think this is what was intended by the LSID revision ID component. I believe the revision ID was intended as a mechanism to allow altering the *data*, not to track changes to the metadata. In other words, I think it goes against the spirit of the intent of LSIDs to use the revision ID (or issue new LSIDs with different Object IDs) to track changing metadata. But I may well be wrong about this.
3) Issue data-less LSIDs without using the revision ID feature, and track data change history separately from the LSIDs. In this scenario, the LSIDs are *not* used as a tool to track versioning of metadata. Rather, they are issued to the "concept" of an object (e.g., collecting event or specimen), with no inherent binary "data", and the metadata resolved for the LSID would be a function of whatever resolve service is used. Tracking historical changes to the metadata would be the responsibility of the data issuer, but would not involve the generation of new LSIDs. Indeed, there's nothing stopping the resolver service from maintaining a complete log of metadata changes as part of the metadata associated with the LSID.
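For reference, an LSID has the general form urn:lsid:authority:namespace:object, with an optional trailing :revision. The second scenario above (new LSIDs that differ only in the revision part, with the revision-less form acting as the "hub") can be sketched roughly as follows. The helper names and the `example.org` identifier are hypothetical, and treating revisions as incrementing integers is purely my assumption for the example.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None

    def __str__(self) -> str:
        parts = ["urn", "lsid", self.authority, self.namespace, self.object_id]
        if self.revision is not None:
            parts.append(self.revision)
        return ":".join(parts)


def parse_lsid(text: str) -> Lsid:
    """Split an LSID string into its components (simple, non-validating)."""
    parts = text.split(":")
    if (len(parts) not in (5, 6)
            or parts[0].lower() != "urn" or parts[1].lower() != "lsid"):
        raise ValueError(f"not an LSID: {text!r}")
    return Lsid(*parts[2:])


def hub_of(lsid: Lsid) -> Lsid:
    """Strip the revision, leaving the LSID for the "concept" of the object."""
    return lsid._replace(revision=None)


def next_revision(lsid: Lsid) -> Lsid:
    """Issue the LSID for the next metadata version.

    Assumption: revisions are integers, and an LSID without a revision
    counts as revision 1.
    """
    current = int(lsid.revision) if lsid.revision is not None else 1
    return lsid._replace(revision=str(current + 1))


v1 = parse_lsid("urn:lsid:example.org:specimens:12345:1")
v2 = next_revision(v1)  # issued after, say, a typo in the collector name is fixed
```

Both `v1` and `v2` reduce to the same revision-less LSID under `hub_of`, which is what makes this scenario so close to the hub arrangement described earlier, only without minting a new Object ID each time.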
I personally favor the third approach, because only a small fraction of people are concerned with metadata edit history. I say this in the context that multiple historical determinations are *not*, in my mind, examples of metadata edit history. To me, a determination is an object in its own right, perhaps worthy of its own LSID. Part of the metadata of a specimen could be selecting from among multiple determinations which is deemed to be correct/current from the perspective of the specimen owner (=museum collection). But when I think of metadata edits and versioning, I think of correcting typos and otherwise fixing mistakes -- not the act of linking new information to an existing LSID (as a determination would be).
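As a rough illustration of the third approach, the sketch below keeps one unchanging LSID per specimen and records edits in a separate history that the issuer may or may not expose as metadata. The `SpecimenRecord` class, its method names, and the `editHistory` key are hypothetical; this is only one way an issuer might do it.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class SpecimenRecord:
    """Hypothetical issuer-side record behind a data-less specimen LSID."""
    lsid: str                                     # never changes
    metadata: dict                                # current values only
    history: list = field(default_factory=list)   # edit log kept by the issuer

    def edit(self, field_name: str, new_value: str, when=None) -> None:
        """Correct a metadata field and log the change; no new LSID is issued."""
        when = when or datetime.now(timezone.utc)
        self.history.append({
            "field": field_name,
            "old": self.metadata.get(field_name),
            "new": new_value,
            "when": when.isoformat(),
        })
        self.metadata[field_name] = new_value

    def resolve_metadata(self, include_history: bool = False) -> dict:
        """What a resolver might return; the edit log is optional extra metadata."""
        result = dict(self.metadata)
        if include_history:
            result["editHistory"] = list(self.history)
        return result


rec = SpecimenRecord("urn:lsid:example.org:specimens:12345",
                     {"collector": "R. Pyel"})
rec.edit("collector", "R. Pyle")  # fix a typo; the LSID stays the same
```

Most clients would call `rec.resolve_metadata()` and see only the current values, while the few who care about edit history could ask for it with `include_history=True`.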
I'm not sure if any of this addresses your questions, but I think these issues are all inter-related. I would very much like to hear from others on this stuff.
Aloha, Rich
Richard L. Pyle, PhD
Database Coordinator for Natural Sciences
and Associate Zoologist in Ichthyology
Department of Natural Sciences, Bishop Museum
1525 Bernice St., Honolulu, HI 96817
Ph: (808)848-4115, Fax: (808)847-8252
email: deepreef@bishopmuseum.org
http://hbs.bishopmuseum.org/staff/pylerichard.html