[tdwg-guid] First step in implementing LSIDs?

Sat Jun 2 00:49:19 CEST 2007

Hi Jason,

Thank you for posting these questions, as I have many of the same questions
for an LSID system that we plan to implement over the course of the next few
months.

So... a quick review of the basics to make sure that my understanding is
consistent with the understanding of others on this list:

LSIDs refer to (resolve to) "data" and "metadata".  The "data" must never
change for a given LSID, and is usually used for a binary digital object
(like an image file, or a PDF file, for example).  The "metadata" for a
given LSID may change, without changing the underlying LSID.  The "data"
part is optional -- LSIDs containing only metadata are thought of as
"conceptual" or "abstract" LSIDs -- referring to a "concept" that has no
inherent digital manifestation. The "notion" of a particular image (e.g.,
corresponding to the shutter release of a camera) might be represented by a
"conceptual/abstract" LSID, while each digital manifestation of the captured
image (RAW, JPEG, TIFF, different crops, color-corrections, etc.)
could/would each have their own data-bearing LSID, where the "data" would be
the binary/bit stream digital image file.  The "conceptual" (data-less) LSID
for the "notion" of the image could serve as a "hub", such that the metadata
of each of the data-bearing LSID image files might include the conceptual
LSID amongst their metadata, such that all of the digital renderings could
be referred back to the same image "concept" (i.e., the same shutter-release
event).
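
To make the data/metadata distinction and the "hub" idea concrete, here is a
rough Python sketch. Everything in it is an assumption made up for
illustration: the LsidObject class, the "rendering_of" metadata key, the
"example.org" authority, and the sample values. None of it comes from the
LSID specification or from any TDWG recommendation.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LsidObject:
    """Hypothetical stand-in for whatever an LSID resolves to."""
    lsid: str
    data: Optional[bytes] = None  # fixed bytes; None for a conceptual LSID
    metadata: dict = field(default_factory=dict)  # allowed to change

    @property
    def is_conceptual(self) -> bool:
        # A conceptual/abstract LSID has metadata but no data.
        return self.data is None


# Data-less "hub" LSID for the shutter-release event (invented values).
image_concept = LsidObject(
    lsid="urn:lsid:example.org:images:1001",
    metadata={"date": "2007-06-01"},
)

# Data-bearing LSIDs for two renderings; each names the hub in its metadata.
raw = LsidObject(
    lsid="urn:lsid:example.org:imagefiles:1001-raw",
    data=b"placeholder RAW bytes",
    metadata={"format": "RAW", "rendering_of": image_concept.lsid},
)
jpeg = LsidObject(
    lsid="urn:lsid:example.org:imagefiles:1001-jpeg",
    data=b"placeholder JPEG bytes",
    metadata={"format": "JPEG", "rendering_of": image_concept.lsid},
)


def renderings_of(hub: LsidObject, objects: list) -> list:
    """Return the LSIDs of all objects whose metadata point back to the hub."""
    return [o.lsid for o in objects
            if o.metadata.get("rendering_of") == hub.lsid]


print(renderings_of(image_concept, [image_concept, raw, jpeg]))
```

The only point of the sketch is that the hub carries no bytes of its own,
while each rendering carries fixed bytes plus a pointer back to the hub.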

If I have all of the above more or less right, then I have a few questions
of my own in the same context.

I gather from your email that you're mostly asking about LSIDs that do not
(necessarily) have any digital/binary data associated with them, and you're
wondering about how to manage versioning, etc.  Before I address your
specific questions, I want to point out that I used the word "necessarily"
above -- because the first issue that I would like to understand in terms of
how others approach LSIDs is the following question:

Does the LSID represent the *specimen*, or does it represent the *database
record* for the specimen?  It seems like a subtle distinction -- and it is
-- but I think it's an important one.  My understanding of data-less LSIDs
is that they are specifically intended to represent things that have no
inherent digital manifestation.  In the case of a biological specimen, there
is no inherent digital manifestation, so just like the notion of the
"concept" of an image, the "concept" of a specimen seems an appropriate
abstract object to which a data-less LSID could be assigned.  In that view
of the world, the database record only exists as a tool to associate the
LSID with the metadata in a form that's easy to distribute electronically
(i.e., as opposed to writing the LSID down with ink in a paper ledger book,
and writing down associated metadata next to it).

I believe this is how most biologists/collection managers think about the
problem, and think about how they would use LSIDs -- sort of like globally
unique catalog numbers.  We all have databases of specimens, and our
specimens have catalog numbers, but those electronic databases and catalog
numbers are not the "real" units of concern to us -- rather the physical
specimens on shelves are what we're worried about.  The databases are just
tools to help us organize and track information about the specimens (tools
that happen to have more practical value than hand-written labels physically
tied to specimens -- that otherwise serve the same purpose).

The other perspective, which was foreign to me at first, but I'm now
beginning to appreciate more, is that the LSID is *NOT* assigned to the
physical specimen, but rather the electronic *database record* representing
the specimen.  In this case, the LSID *is* assigned to something with a
digital manifestation, and therefore *can* (and *should*) be a data-bearing
LSID. The data, in this case, would be the binary blob representing a
concatenation of the complete database record, in some specified format.
The metadata, in this case, would probably not be the data fields we think
of for a specimen, but rather information about how to interpret and parse
the binary data represented by the LSID, which itself would resolve to
information associated with the specimen.
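
As a rough illustration of this second perspective, the Python sketch below
serializes a record into a fixed byte string (the "data") and keeps the
parsing instructions separately (the "metadata"). The function names, the
"|" delimiter, the field names, and the sample values are all assumptions
invented for the example; no particular format is specified anywhere.

```python
def to_blob(record: dict, field_order: list, delimiter: str = "|") -> bytes:
    """Concatenate the field values, in a fixed order, into one byte string."""
    values = [str(record[name]) for name in field_order]
    if any(delimiter in value for value in values):
        raise ValueError("delimiter occurs inside a field value")
    return delimiter.join(values).encode("utf-8")


def from_blob(blob: bytes, parse_info: dict) -> dict:
    """Rebuild the fields using only the parsing metadata."""
    values = blob.decode(parse_info["encoding"]).split(parse_info["delimiter"])
    return dict(zip(parse_info["fields"], values))


# Invented specimen record.
record = {
    "catalog_number": "EX 12345",
    "collector": "J. Smith",
    "date": "2007-06-01",
}
fields = ["catalog_number", "collector", "date"]

# The "data": a byte string that must never change for this LSID.
blob = to_blob(record, fields)

# The "metadata": how to interpret the bytes, not the specimen fields themselves.
parse_info = {"encoding": "utf-8", "delimiter": "|", "fields": fields}

print(blob)
print(from_blob(blob, parse_info))
```

Note that any correction to a field changes the bytes, so under this view a
corrected record could not keep the same LSID.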

Personally, I'm still very firmly in the first camp -- that is, the
assignment of "conceptual" (data-less) LSIDs to physical specimen objects,
the metadata for which would be our standard specimen data fields. The LSID
effectively serves as the globally unique catalog number, with the added
bonus of self-resolution -- which is what, I think, the biodiversity
community needs most right now. However, I'm keeping an open mind on this,
so I would very much like to hear from others on this list who feel that the
object represented by a specimen LSID should be the digital database record,
rather than the physical specimen.

The reason I wrote all of the above is that I think it has direct bearing on
the answers to your questions.

> At this stage, we are only concerned with assigning LSIDs to 
> collections/collection events, specimens and versions of 
> each. Since these aren't represented by bytecode we don't 
> have to be concerned about issuing a new LSID each time  the 
> metadata changes (through improvements/changes in 
> determination, geolocation etc),  but we also don't want to 
> throw away the previous revisions so the concept of a "hub" 
> would serve well. 

O.K., so let's assume we will create data-less (conceptual/abstract) LSIDs
for our specimens, and that the metadata are the standard specimen data
fields.

> This hub would allow us to have a single 
> unchanging LSID that points to (or returns) the current 
> metadata but also points to each LSID for the previous 
> collection revisions. A change in the collection metadata 
> would not change the LSID of the collection hub, it would 
> just create a new collection version record which is issued a 
> new LSID and promoted to "current". This collection hub would 
> also point to a "hub" for each of the specimens that are 
> represented by the collection and these specimen hubs would 
> each point to the current metadata for the specimen as well 
> as the previous versions. We would not be using the revision 
> method of LSIDs, rather we would issue a totally new LSID for 
> each version as recommended by TDWG.

I'm not sure I follow -- by "collection" do you mean collecting event?  Or
do you mean "collection" like "Bishop Museum Fish Collection"?  I read the
above a couple of times, and I *think* I understand what you are saying, but
I'm not sure. Let me describe my approach to the same basic problem, and see
if it makes sense in the context of the above.

Let's suppose we generate data-less LSIDs to represent our collecting events,
and data-less LSIDs to represent our specimens.  In both cases, the LSIDs
represent the abstract notion of the collecting event or specimen; not the
electronic database record per se.  Our metadata for each LSID would
correspond to our usual data fields for each kind of object (i.e., date,
collector, etc. for collecting events; and preservation method,
determinations, etc. for specimens).

The question, it seems (at least the question I have, and I think the
question you have) is how do we manage edit histories of metadata elements.
I can think of three scenarios to deal with this:

1) Assign the LSID to the database record object, not the conceptual
collecting event/specimen.
In this scenario, the LSID would represent the database record itself, and
would have data.  The data would be a binary concatenation of all the
elements of a typical specimen data record, with some sort of delimiter
between elements (fields).  This binary digital object would be fixed and
permanent, and would never change.  Metadata associated with each of these
LSIDs would include information on how to parse the binary data blob into
its component fields/elements, so they could be rendered, searched, etc.  If
some data element needed to change (e.g., the collector's name was
originally misspelled), then a replacement LSID would be generated for the
new binary concatenated data blob, and this new LSID would use the
versioning feature of LSIDs (i.e., it would differ from the original LSID
only in the revision ID part of the LSID).  Thus, every data edit would be
automatically issued a new LSID, because the data component itself has
changed. As I stated above, I'm not too keen on this approach, based on my
current understanding.

2) Utilize the Revision ID part of an LSID to track the history of metadata
changes
In this scenario, the LSIDs would themselves be data-less, and the metadata
would be our typical data fields.  If any of our data fields changed, we
would issue an LSID differing from the original only by the Revision ID
component.  This way, each version of the data gets its own LSID, and
resolving any one of the versions automatically redirects to the latest/most
recent version, using the LSID versioning features.  This way, if you strip
the revision ID part of the LSID, you're essentially left with an LSID that
applies to the "concept" of the specimen (i.e., the "hub" LSID).  This seems
almost the same method you described above (if I understood you correctly),
except that the new LSIDs are generated by altering only the Revision ID
part of the LSID, rather than creating a new LSID with a different Object
ID.  I'm not sure why you would want to issue new LSIDs with new Object ID
components for what effectively represent different versions of metadata for
the same object.  The main problem I have with this approach is that I don't
think this is what was intended by the LSID revision ID component.  I
believe the revision ID was intended as a mechanism to allow altering the
*data*, not to track changes to the metadata.  In other words, I think it
goes against the spirit of the intent of LSIDs to use the revision ID (or
issue new LSIDs with different Object IDs) to track changing metadata.  But
I may well be wrong about this.

3) Issue data-less LSIDs without using the revision ID feature, and track
data change history separately from the LSIDs
In this scenario, the LSIDs are *not* used as a tool to track versioning of
metadata.  Rather, they are issued to the "concept" of an object (e.g.,
collecting event or specimen), with no inherent binary "data", and the
metadata resolved for the LSID would be a function of whatever resolver
service is used.  Tracking historical changes to the metadata would be the
responsibility of the data issuer, but would not involve the generation of
new LSIDs.  Indeed, there's nothing stopping the resolver service from
maintaining a complete log of metadata changes as part of the metadata
associated with the LSID.
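
To compare scenarios 2 and 3 side by side, here is a minimal Python sketch.
It assumes the usual LSID layout of
urn:lsid:authority:namespace:object[:revision]. The Lsid, parse_lsid, hub_of
and SpecimenRecord names, the "example.org" authority, and the sample values
are hypothetical and exist only for this illustration.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None

    def __str__(self) -> str:
        parts = ["urn", "lsid", self.authority, self.namespace, self.object_id]
        if self.revision is not None:
            parts.append(self.revision)
        return ":".join(parts)


def parse_lsid(text: str) -> Lsid:
    """Split an LSID string into its components (no further validation)."""
    parts = text.split(":")
    if (len(parts) not in (5, 6) or not all(parts)
            or [p.lower() for p in parts[:2]] != ["urn", "lsid"]):
        raise ValueError(f"not an LSID: {text!r}")
    return Lsid(*parts[2:])


def hub_of(lsid: Lsid) -> Lsid:
    """Scenario 2: dropping the revision leaves the 'hub' LSID."""
    return lsid._replace(revision=None)


class SpecimenRecord:
    """Scenario 3: one unchanging LSID; edits are logged by the issuer."""

    def __init__(self, lsid: str, metadata: dict):
        self.lsid = lsid
        self.metadata = dict(metadata)
        self.history = []  # (field, old value, new value) tuples

    def correct(self, field_name: str, new_value: str) -> None:
        old_value = self.metadata.get(field_name)
        self.history.append((field_name, old_value, new_value))
        self.metadata[field_name] = new_value


# Scenario 2: the second revision and its hub.
v2 = parse_lsid("urn:lsid:example.org:specimens:12345:2")
print(hub_of(v2))

# Scenario 3: fixing a typo changes the metadata and the log, not the LSID.
record = SpecimenRecord("urn:lsid:example.org:specimens:12345",
                        {"collector": "J. Smyth"})
record.correct("collector", "J. Smith")
print(record.lsid, record.history)
```

In scenario 3 the history list could simply be returned as part of the
metadata by the resolver, for the few users who care about it.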

I personally favor the third approach, because only a small fraction of
people are concerned with metadata edit history.  I say this in the context
that multiple historical determinations are *not*, in my mind, examples of
metadata edit history.  To me, a determination is an object in its own
right, perhaps worthy of its own LSID.  Part of the metadata of a specimen
could be selecting from among multiple determinations which is deemed to be
correct/current from the perspective of the specimen owner (=museum
collection).  But when I think of metadata edits and versioning, I think of
correcting typos and otherwise fixing mistakes -- not the act of linking new
information to an existing LSID (as a determination would be). 
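
A small Python sketch of that distinction, under the assumption that each
determination gets its own LSID and the specimen metadata merely point to the
one currently accepted. The dictionary keys, the add_determination and
current_taxon helpers, the "example.org" authority, and the taxon names are
all invented placeholders.

```python
# Specimen with a data-less LSID and, initially, no determinations.
specimen = {
    "lsid": "urn:lsid:example.org:specimens:12345",
    "determinations": [],
    "current_determination": None,
}

# Determinations are separate objects, keyed by their own LSIDs.
determinations = {}


def add_determination(specimen, determinations, det_lsid, det_metadata,
                      make_current=False):
    """Link a new determination to the specimen without touching its LSID."""
    determinations[det_lsid] = dict(det_metadata, specimen=specimen["lsid"])
    specimen["determinations"].append(det_lsid)
    if make_current:
        specimen["current_determination"] = det_lsid


def current_taxon(specimen, determinations):
    """Return the taxon name of the determination the owner treats as current."""
    return determinations[specimen["current_determination"]]["taxon"]


add_determination(specimen, determinations,
                  "urn:lsid:example.org:determinations:1",
                  {"taxon": "Examplus primus", "year": 1985},
                  make_current=True)
add_determination(specimen, determinations,
                  "urn:lsid:example.org:determinations:2",
                  {"taxon": "Examplus secundus", "year": 2007},
                  make_current=True)

print(current_taxon(specimen, determinations))
print(specimen["determinations"])
```

Both determinations remain linked to the specimen; only the pointer to the
current one moves, and no new specimen LSID is issued.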

I'm not sure if any of this addresses your questions, but I think these
issues are all inter-related.  I would very much like to hear from others on
this stuff.

Aloha,
Rich

Richard L. Pyle, PhD
Database Coordinator for Natural Sciences
  and Associate Zoologist in Ichthyology
Department of Natural Sciences, Bishop Museum
1525 Bernice St., Honolulu, HI 96817
Ph: (808)848-4115, Fax: (808)847-8252
email: deepreef at bishopmuseum.org
http://hbs.bishopmuseum.org/staff/pylerichard.html