[tdwg-guid] First step in implementing LSIDs?
Greetings GUIDers,
We are in the process of adding new functionality to our biodiversity database application that would include adding LSID "compliance" to our existing database of botanical collections. I have some questions about how the rest of the community is addressing some particular issues.
I've read through the TDWG GUID Wiki and there seems to be some debate about how LSIDs should be applied to various entities in a system. I like the ideas I've read about "conceptual entities" and "hubs" because this lines up well with many of our ideas of how LSIDs could work for us. We do have a lot of images but right now our primary concern is with the collection metadata, so for the sake of this email, I'm going to exclude any questions about image or other bytecode. We will not be implementing an LSID resolver immediately, but we do want to make sure our schema and implementation of LSIDs will allow us an easy and seamless transition to having a proper resolver.
At this stage, we are only concerned with assigning LSIDs to collections/collection events, specimens and versions of each. Since these aren't represented by bytecode we don't have to be concerned about issuing a new LSID each time the metadata changes (through improvements/changes in determination, geolocation etc), but we also don't want to throw away the previous revisions so the concept of a "hub" would serve well. This hub would allow us to have a single unchanging LSID that points to (or returns) the current metadata but also points to each LSID for the previous collection revisions. A change in the collection metadata would not change the LSID of the collection hub, it would just create a new collection version record which is issued a new LSID and promoted to "current". This collection hub would also point to a "hub" for each of the specimens that are represented by the collection and these specimen hubs would each point to the current metadata for the specimen as well as the previous versions. We would not be using the revision method of LSIDs, rather we would issue a totally new LSID for each version as recommended by TDWG.
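A minimal sketch of the hub arrangement described above, assuming an in-memory store. The authority "example.org", the namespaces, and the names Hub, mint_lsid, revise and resolve are hypothetical and only illustrate the idea; they are not taken from any TDWG recommendation or existing resolver.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Dict, List, Optional

AUTHORITY = "example.org"   # hypothetical authority, for illustration only
_next_id = count(1)


def mint_lsid(namespace: str) -> str:
    """Mint a brand-new LSID with a fresh object ID and no revision part."""
    return f"urn:lsid:{AUTHORITY}:{namespace}:{next(_next_id)}"


@dataclass
class Hub:
    """Unchanging LSID that points at the current and the previous versions."""
    lsid: str
    current: Optional[str] = None                 # LSID of the current version
    previous: List[str] = field(default_factory=list)
    versions: Dict[str, dict] = field(default_factory=dict)

    def revise(self, metadata: dict) -> str:
        """Store metadata under a new version LSID and promote it to current."""
        version_lsid = mint_lsid("version")
        self.versions[version_lsid] = dict(metadata)
        if self.current is not None:
            self.previous.append(self.current)
        self.current = version_lsid
        return version_lsid

    def resolve(self) -> dict:
        """Return the current metadata plus pointers to earlier versions."""
        return {
            "lsid": self.lsid,
            "current_version": self.current,
            "metadata": self.versions.get(self.current),
            "previous_versions": list(self.previous),
        }
```

The hub LSID never changes; each call to revise() issues a new version LSID, which is the behaviour the paragraph above asks about.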
So the big questions are, 1) does this sound in line with the spec and 2) does this method of issuing LSIDs for "hubs" and revisions fit in with the approach others are taking? Any suggestions or caveats are appreciated.
Thanks, Jason
Jason Best IT Manager - Andes to Amazon Biodiversity Program Botanical Research Institute of Texas atrium.andesamazon.org
Hi Jason,
Thank you for posting these questions, as I have many of the same questions for an LSID system that we plan to implement over the course of the next few months.
So....a quick review of the basics to make sure that my understanding is consistent with the understanding of others on this list:
LSIDs refer to (resolve to) "data" and "metadata". The "data" must never change for a given LSID, and is usually used for a binary digital object (like an image file, or a PDF file, for example). The "metadata" for a given LSID may change, without changing the underlying LSID. The "data" part is optional -- LSIDs containing only metadata are thought of as "conceptual" or "abstract" LSIDs -- referring to a "concept" that has no inherent digital manifestation. The "notion" of a particular image (e.g., corresponding to the shutter release of a camera) might be represented by a "conceptual/abstract" LSID, while each digital manifestation of the captured image (RAW, JPEG, TIFF, different crops, color-corrections, etc.) could/would each have their own data-bearing LSID, where the "data" would be the binary/bit stream digital image file. The "conceptual" (data-less) LSID for the "notion" of the image could serve as a "hub", such that the metadata of each of the data-bearing LSID image files might include the conceptual LSID amongst their metadata, such that all of the digital renderings could be referred back to the same image "concept" (i.e., the same shutter-release event).
If I have all of the above more or less right, then I have a few questions of my own in the same context.
I gather from your email that you're mostly asking about LSIDs that do not (necessarily) have any digital/binary data associated with them, and you're wondering about how to manage versioning, etc. Before I address your specific questions, I want to point out that I used the word "necessarily" above -- because the first issue that I would like to understand in terms of how others approach LSIDs is the following question:
Does the LSID represent the *specimen*, or does it represent the *database record* for the specimen? It seems like a subtle distinction -- and it is -- but I think it's an important one. My understanding of data-less LSIDs is that they are specifically intended to represent things that have no inherent digital manifestation. In the case of a biological specimen, there is no inherent digital manifestation, so just like the notion of the "concept" of an image, the "concept" of a specimen seems an appropriate abstract object to which a data-less LSID could be assigned. In that view of the world, the database record only exists as a tool to associate the LSID with the metadata in a form that's easy to distribute electronically (i.e., as opposed to writing the LSID down with ink in a paper ledger book, and writing down associated metadata next to it).
I believe this is how most biologists/collection managers think about the problem, and think about how they would use LSIDs -- sort of like globally unique catalog numbers. We all have databases of specimens, and our specimens have catalog numbers, but those electronic databases and catalog numbers are not the "real" units of concern to us -- rather the physical specimens on shelves are what we're worried about. The databases are just tools to help us organize and track information about the specimens (tools that happen to have more practical value than hand-written labels physically tied to specimens -- that otherwise serve the same purpose).
The other perspective, which was foreign to me at first, but I'm now beginning to appreciate more, is that the LSID is *NOT* assigned to the physical specimen, but rather the electronic *database record* representing the specimen. In this case, the LSID *is* assigned to something with a digital manifestation, and therefore *can* (and *should*) be a data-bearing LSID. The data, in this case, would be the binary blob representing a concatenation of the complete database record, in some specified format. The metadata, in this case, would probably not be the data fields we think of for a specimen, but rather information about how to interpret and parse the binary data represented by the LSID, which itself would resolve to information associated with the specimen.
Personally, I'm still very firmly in the first camp -- that is, the assignment of "conceptual" (data-less) LSIDs to physical specimen objects, the metadata for which would be our standard specimen data fields. The LSID effectively serves as the globally unique catalog number, with the added bonus of self-resolution -- which is what, I think, the biodiversity community needs most right now. However, I'm keeping an open mind on this, so I would very much like to hear from others on this list who feel that the object represented by a specimen LSID should be the digital database record, rather than the physical specimen.
The reason I wrote all of the above is that I think it has direct bearing on the answers to your questions.
At this stage, we are only concerned with assigning LSIDs to collections/collection events, specimens and versions of each. Since these aren't represented by bytecode we don't have to be concerned about issuing a new LSID each time the metadata changes (through improvements/changes in determination, geolocation etc), but we also don't want to throw away the previous revisions so the concept of a "hub" would serve well.
O.K., so let's assume we will create data-less (conceptual/abstract) LSIDs for our specimens, and that the metadata are the standard specimen data fields.
This hub would allow us to have a single unchanging LSID that points to (or returns) the current metadata but also points to each LSID for the previous collection revisions. A change in the collection metadata would not change the LSID of the collection hub, it would just create a new collection version record which is issued a new LSID and promoted to "current". This collection hub would also point to a "hub" for each of the specimens that are represented by the collection and these specimen hubs would each point to the current metadata for the specimen as well as the previous versions. We would not be using the revision method of LSIDs, rather we would issue a totally new LSID for each version as recommended by TDWG.
I'm not sure I follow -- by "collection" do you mean collecting event? Or do you mean "collection" like "Bishop Museum Fish Collection"? I read the above a couple of times, and I *think* I understand what you are saying, but I'm not sure. Let me describe my approach to the same basic problem, and see if it makes sense in the context of the above.
Let's suppose we generate data-less LSIDs to represent our collecting events, and data-less LSIDs to represent our specimens. In both cases, the LSIDs represent the abstract notion of the collecting event or specimen; not the electronic database record per se. Our metadata for each LSID would correspond to our usual data fields for each kind of object (i.e., date, collector, etc. for collecting events; and preservation method, determinations, etc. for specimens).
The question, it seems (at least the question I have, and I think the question you have) is how do we manage edit histories of metadata elements. I can think of three scenarios to deal with this:
1) Assign the LSID to the database record object, not the conceptual collecting event/specimen. In this scenario, the LSID would represent the database record itself, and would have data. The data would be a binary concatenation of all the elements of a typical specimen data record, with some sort of delimiter between elements (fields). This binary digital object would be fixed and permanent, and would never change. Metadata associated with each of these LSIDs would include information on how to parse the binary data blob into its component fields/elements, so they could be rendered, searched, etc. If some data element needed to change (e.g., the collector's name was originally misspelled), then a replacement LSID would be generated for the new binary concatenated data blob, and this new LSID would use the versioning feature of LSIDs (i.e., it would differ from the original LSID only in the revision ID part of the LSID). Thus, every data edit would be automatically issued a new LSID, because the data component itself has changed. As I stated above, I'm not too keen on this approach, based on my current understanding.
2) Utilize the Revision ID part of an LSID to track the history of metadata changes. In this scenario, the LSIDs would themselves be data-less, and the metadata would be our typical data fields. If any of our data fields changed, we would issue an LSID differing from the original only by the Revision ID component. This way, each version of the data gets its own LSID, and resolving any one of the versions automatically redirects to the latest/most recent version, using the LSID versioning features. This way, if you strip the revision ID part of the LSID, you're essentially left with an LSID that applies to the "concept" of the specimen (i.e., the "hub" LSID). This seems almost the same method you described above (if I understood you correctly), except that the new LSIDs are generated by altering only the Revision ID part of the LSID, rather than creating a new LSID with a different Object ID. I'm not sure why you would want to issue new LSIDs with new Object ID components for what effectively represent different versions of metadata for the same object. The main problem I have with this approach is that I don't think this is what was intended by the LSID revision ID component. I believe the revision ID was intended as a mechanism to allow altering the *data*, not to track changes to the metadata. In other words, I think it goes against the spirit of the intent of LSIDs to use the revision ID (or issue new LSIDs with different Object IDs) to track changing metadata. But I may well be wrong about this.
3) Issue data-less LSIDs without using the revision ID feature, and track data change history separately from the LSIDs. In this scenario, the LSIDs are *not* used as a tool to track versioning of metadata. Rather, they are issued to the "concept" of an object (e.g., collecting event or specimen), with no inherent binary "data", and the metadata resolved for the LSID would be a function of whatever resolver service is used. Tracking historical changes to the metadata would be the responsibility of the data issuer, but would not involve the generation of new LSIDs. Indeed, there's nothing stopping the resolver service from maintaining a complete log of metadata changes as part of the metadata associated with the LSID.
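To make "stripping the revision ID" in scenario 2 concrete, here is a small sketch that splits an LSID of the general form urn:lsid:authority:namespace:object[:revision] into its parts and drops the revision. The example identifiers and the names LSID, parse_lsid and hub are hypothetical, and the parser is deliberately loose; it is not a validator for the full LSID specification.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None

    def __str__(self) -> str:
        parts = ["urn", "lsid", self.authority, self.namespace, self.object_id]
        if self.revision is not None:
            parts.append(self.revision)
        return ":".join(parts)

    def hub(self) -> "LSID":
        """Drop the revision part, leaving the LSID of the 'concept'."""
        return self._replace(revision=None)


def parse_lsid(text: str) -> LSID:
    """Split an LSID string into authority, namespace, object and revision."""
    parts = text.split(":")
    is_lsid = (
        len(parts) in (5, 6)
        and parts[0].lower() == "urn"
        and parts[1].lower() == "lsid"
        and all(parts[2:])          # no empty components
    )
    if not is_lsid:
        raise ValueError(f"not an LSID: {text!r}")
    return LSID(*parts[2:])
```

Under this sketch, two revisions of the same object share one hub: parse_lsid("urn:lsid:example.org:specimens:12345:1").hub() and the same LSID with revision 2 compare equal.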
I personally favor the third approach, because only a small fraction of people are concerned with metadata edit history. I say this in the context that multiple historical determinations are *not*, in my mind, examples of metadata edit history. To me, a determination is an object in its own right, perhaps worthy of its own LSID. Part of the metadata of a specimen could be selecting from among multiple determinations which is deemed to be correct/current from the perspective of the specimen owner (=museum collection). But when I think of metadata edits and versioning, I think of correcting typos and otherwise fixing mistakes -- not the act of linking new information to an existing LSID (as a determination would be).
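The third approach could look roughly like the sketch below, assuming the issuer keeps an edit log next to the record and the LSID itself never changes. SpecimenRecord, edit, resolve and the sample values are hypothetical names chosen for illustration, not part of any resolver implementation.

```python
from datetime import datetime, timezone


class SpecimenRecord:
    """Metadata behind one data-less LSID; edits never mint a new LSID."""

    def __init__(self, lsid, metadata):
        self.lsid = lsid
        self.metadata = dict(metadata)
        self.history = []   # edit log kept by the issuer, outside the LSID

    def edit(self, field_name, new_value):
        """Change one metadata field and log the old and new values."""
        self.history.append({
            "field": field_name,
            "old": self.metadata.get(field_name),
            "new": new_value,
            "when": datetime.now(timezone.utc).isoformat(),
        })
        self.metadata[field_name] = new_value

    def resolve(self, include_history=False):
        """Return current metadata; optionally attach the full edit log."""
        result = {"lsid": self.lsid, **self.metadata}
        if include_history:
            result["history"] = list(self.history)
        return result
```

Most callers would use resolve() and see only current metadata; the few who care about edit history can ask for it explicitly.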
I'm not sure if any of this addresses your questions, but I think these issues are all inter-related. I would very much like to hear from others on this stuff.
Aloha, Rich
Richard L. Pyle, PhD Database Coordinator for Natural Sciences and Associate Zoologist in Ichthyology Department of Natural Sciences, Bishop Museum 1525 Bernice St., Honolulu, HI 96817 Ph: (808)848-4115, Fax: (808)847-8252 email: deepreef@bishopmuseum.org http://hbs.bishopmuseum.org/staff/pylerichard.html
Jason, Rich,
Read these timely posts with interest as we (CABI : 'herb.IMI' and LCR NZ - Kevin Richards) are soon to implement a demonstration project for TDWG on using LSIDs in the context of biological specimens (fungi). The reason this collection was attractive (IMHO) for this demonstrator is that most of the near 400,000 specimens have up to two LSIDs as part of their metadata - that for the current determination (an IndexFungorum LSID) and for the associated organism (either an IndexFungorum LSID or an IPNI LSID ... unfortunately no LSIDs yet for animals so all the fungi we have on or from animals are lacking in LSIDs in this part of the metadata).
So, without making things too complicated as we 'start to walk' in this domain of biodiversity informatics my vote is for a variation of scenario 3) from Rich. The reason I vote for this is that in the fullness of time, and the 'herb.IMI' database has already started this, much of the metadata will be LSIDs and its correctness (i.e. sorting out typos etc) will be delegated to the entities who issue those LSIDs. As IPNI improves the quality of the metadata associated with the LSIDs they issue (and if I understand correctly they do use the scenario 3) from Rich) so the quality of the metadata associated with a 'herb.IMI' LSID improves. The reason I prefer the data + metadata 'model' is that in this instance the data is fixed ... who changes collection/accession numbers? ... so perfect for this role. Even if a collection moves to a new owner the original data need not 'disappear' in the same way that DOIs move with the objects as book and journal titles change from one publisher to another.
Final point, the 'data' is the 'herb.IMI' accession number; in context this is a GUID because of the existence of Index Herbariorum. So, our data will be 123456 not IMI123456 because ... in the fullness of time we will include an Index Herbariorum LSID to 'identify' the 'institutional acronym' element of the metadata.
Any comments on this before Kevin starts coding will be much appreciated.
Cheers,
Paul
ps I use 'herb.IMI' in quotes because fungi are not plants, although they are traditionally considered as, and currently for their nomenclature treated as, plants.
________________________________
From: tdwg-guid-bounces@lists.tdwg.org on behalf of Richard Pyle Sent: Fri 01/06/2007 23:49 To: 'Jason Best'; tdwg-guid@lists.tdwg.org Subject: RE: [tdwg-guid] First step in implementing LSIDs?[Scanned]
Paul and List,
First, I should clarify something about my earlier post. I wrote at the start of Scenario 3:
"3) Issue data-less LSIDs without using the revision ID feature, and track data change history separately from the LSIDs"
That should have been "...and track *metadata* change history separately from the LSIDs" (metadata, not data).
So, without making things too complicated as we 'start to walk' in this domain of biodiversity informatics my vote is for a variation of scenario 3) from Rich. The reason I vote for this is that in the fullness of time, and the 'herb.IMI' database has already started this, much of the metadata will be LSIDs and its correctness (i.e. sorting out typos etc.) will be delegated to the entities who issue those LSIDs. As IPNI improves the quality of the metadata associated with the LSIDs they issue (and if I understand correctly they do use scenario 3) from Rich) so the quality of the metadata associated with a 'herb.IMI' LSID improves. The reason I prefer the data + metadata 'model' is that in this instance the data is fixed ... who changes collection/accession numbers? ... so perfect for this role. Even if a collection moves to a new owner the original data need not 'disappear', in the same way that DOIs move with the objects as book and journal titles change from one publisher to another.
So...if I understand correctly, you differ from my scenario 3 in that you do generate data-bearing LSIDs for specimens, but the data part is limited to only the Accession number, not the complete set of data fields associated with the record -- correct? So, in effect, the object the LSID actually applies to is the binary accession number, not the "concept" of the specimen. I can imagine in this case that the LSID can be thought of as representing the "concept of the specimen" because the accession number itself is a surrogate for the physical specimen. The only thing that concerns me about this approach is that there is a non-zero incidence of accidental duplicate catalog numbers within a given collection, and possibly errors in associating catalog numbers. For example, if the computer database for a collection had an error created by a technician who entered the metadata for accession number IMI1234569 by mistake, when it should have been IMI1234596 (and vice versa), then branding the accession number as "data" for the LSID means that the LSID technically *must* stay with the accession number (not the specimen associated with the metadata for that LSID) after the error is discovered. Not a huge problem, but it could surprise people who had indexed the LSID before the error was discovered, and who then came back to resolve it again after the error was fixed (i.e., they would get totally different information). Given how rare this problem is likely to be (against a backdrop of many far more likely problems we will have to overcome), I don't see this as a strong reason not to proceed with your plan.
Final point, the 'data' is the 'herb.IMI' accession number; in context this is a GUID because of the existence of Index Herbariorum. So, our data will be 123456 not IMI123456 because ... in the fullness of time we will include an Index Herbariorum LSID to 'identify' the 'institutional acronym' element of the metadata.
Is the binary data for the accession number in 8-bit, or 16-bit? I'm assuming 8-bit would be fine, as I suspect all collections would have accession numbers that can be rendered with an 8-bit (256-character) extended ASCII set. Is there any "wrapper" to the number as binary data, or is it a straight ASCII binary representation (e.g.: 001100010011001000110011001101000011010100110110 for "123456")?
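The question above can be made concrete with a small sketch. It assumes the accession number is held as plain text and encoded at one byte per character (ASCII). Nothing in this thread says the LSID specification prescribes that encoding or any wrapper, and the function names `to_bit_string` and `from_bit_string` are made up for illustration.

```python
def to_bit_string(text: str) -> str:
    """Return the 8-bit-per-character binary rendering of an ASCII string."""
    raw = text.encode("ascii")  # raises UnicodeEncodeError for non-ASCII characters
    return "".join(format(byte, "08b") for byte in raw)


def from_bit_string(bits: str) -> str:
    """Invert to_bit_string by reading the bits back eight at a time."""
    if len(bits) % 8 != 0:
        raise ValueError("bit string length must be a multiple of 8")
    raw = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return raw.decode("ascii")


bits = to_bit_string("123456")
print(len(bits))              # number of bits: 8 per character
print(from_bit_string(bits))  # round trip back to the original text
```

Each character costs eight bits under this assumption, so a 48-bit string corresponds to six characters.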
I'm not sure I follow the logic of how embedding the accession number as data for the LSID allows the LSID to move to a new owner. I would think the opposite. Isn't it likely that the new owner would create their own accession number for the specimen? In this case, they would be forced to generate a new LSID if they were following the same practice of encoding the accession number as "data", rather than metadata.
Also, wouldn't it make more sense to include the acronym (IMI) as part of the data for the LSID? At least that way the "12345" would have *some* context.
Finally, this approach would work only for collections where there is a strict 1:1 correlation between accession numbers and specimen objects for which an LSID is desired.
Thanks for your comments -- this thread is already forcing me to think about things in a way I hadn't thought of them before.
Aloha, Rich
Richard L. Pyle, PhD Database Coordinator for Natural Sciences and Associate Zoologist in Ichthyology Department of Natural Sciences, Bishop Museum 1525 Bernice St., Honolulu, HI 96817 Ph: (808)848-4115, Fax: (808)847-8252 email: deepreef@bishopmuseum.org http://hbs.bishopmuseum.org/staff/pylerichard.html
Yes Rich, our plan is to apply the LSID to the accession 'number' (actually an accession 'code' as we have an historical legacy of suffix 'a', 'b', etc for subdivisions of the original collection which in many cases is a collection of objects rather than one physical object - a bag of leaves for example). And yes, there are some possible problems with errors associated with the metadata but ... in the context of a DBMS where the accession number is set to unique values only, duplications are in reality impossible, and yes there are far more important challenges to address than this ... ;-)
I assume you are correct about the 001100010011001000110011001101000011010100110110 ... I'm a systematist leaning towards nomenclature rather than an IT person.
I guess the 'change of ownership' comment was directed at the importance of retaining the accession number as this is cited in the literature, and the utility of keeping this as a resolvable LSID.
A rather complex model is required for 'managing' the objects of a collecting event and what subsequently happens to those objects, which others have more experience of and valid opinions on - I refer, for example, to a pit trap for insects where multiple objects are assigned an initial accession number, the objects are subsequently divided and divided again and again and finally a few may end up on pins as name bearing types.
Cheers,
Paul
________________________________
From: Richard Pyle [mailto:deepreef@bishopmuseum.org] Sent: Sat 02/06/2007 10:08 To: 'Paul Kirk'; 'Jason Best'; tdwg-guid@lists.tdwg.org Subject: RE: [tdwg-guid] First step in implementing LSIDs?[Scanned]
Paul and List,
First, I should clarify something about my earlier post. I wrote at the start of Scenario 3:
"3) Issue data-less LSIDs without using the revision ID feature, and track data change history separately from the LSIDs"
That should have been "...and track *metadata* change history separately from the LSIDs" (metadata, not data).
[... remainder of quoted message omitted]
Thanks for the additional information, Paul.
Yes Rich, our plan is to apply the LSID to the accession 'number' (actually an accession 'code' as we have an historical legacy of suffix 'a', 'b', etc for subdivisions of the original collection which in many cases is a collection of objects rather than one physical object - a bag of leaves for example). And yes, there are some possible problems with errors associated with the metadata but ... in the context of a DBMS where the accession number is set to unique values only, duplications are in reality impossible, and yes there are far more important challenges to address than this ... ;-)
O.K., thanks -- and I agree!
I assume you are correct about the
001100010011001000110011001101000011010100110110
... I'm a systematist leaning towards nomenclature rather than an IT person.
You and me both. I was able to create the binary conversion not because I'm a techno-whiz, but because I know just enough about how to use Google to be dangerous (i.e., http://www.theskull.com/javascript/ascii-binary.html). But the main question was about exactly how one would convert the text "12345" into a binary data blob for the LSID "data" (as opposed to metadata).
I guess the 'change of ownership' comment was directed at the importance of retaining the accession number as this is cited in the literature, and the utility of keeping this as a resolvable LSID.
Ah! That makes sense. But still, I'm a little uneasy about "committing" an LSID to an accession number by branding it with data. The main advantage I see is that no matter how much manipulation happens with the metadata, there will still be "something" permanently included with the LSID as data that a human might be able to use to sort things out if the metadata get changed too much. Even still, though, I think I would want to also include an institutional and/or collection prefix, so that the embedded number is (potentially) more interpretable to an outside observer.
A rather complex model is required for 'managing' the objects of a collecting event and what subsequently happens to those objects, which others have more experience of and valid opinions on - I refer, for example, to a pit trap for insects where multiple objects are assigned an initial accession number, the objects are subsequently divided and divided again and again and finally a few may end up on pins as name bearing types.
Right -- that's a common occurrence in natural history collections. The number "12345" is assigned to a multi-species lot, and then later that lot is split up into constituent parts. I don't see this as a major problem, though. In cases where institutions typically retain the original number for one of the original parts, and simply assign new numbers to the bits that were "removed", then the only thing that changes on the original accession number/LSID is the metadata for its contents, and the new accession numbers/LSIDs would, presumably, include pointers back to the original number/LSID as a "removed from" indicator -- but again, it only affects metadata. Conversely, institutions that assign new numbers to all components of a split-up lot (effectively deprecating the original number and retaining its meaning to the original multi-part lot) will also only need to manage metadata changes.
The more I think about it, the more I like your approach of branding the LSID to the accession event, rather than some conceptual notion of a "specimen" (which is actually more dynamic than I think most people realize). But I'd like to see TDWG create some sort of standard that we can all collectively follow in how to actually do this. Personally, I'd like to see that standard include institution code and collection code within the binary data blob.
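A rough sketch of the first case, where the institution keeps the original number for one part and issues new numbers for the parts that were removed. The record layout, the `removed_from` field, and the LSID strings are all hypothetical. The point is only that no identifier changes and the split touches metadata alone.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class LotRecord:
    lsid: str                                           # fixed once issued
    accession: str                                      # fixed once issued
    contents: List[str] = field(default_factory=list)   # mutable metadata
    removed_from: Optional[str] = None                  # mutable metadata: LSID of the source lot


def split_lot(registry: Dict[str, LotRecord], source_lsid: str,
              taken: List[str], new_lsid: str, new_accession: str) -> LotRecord:
    """Move the `taken` items out of an existing lot into a newly numbered lot."""
    source = registry[source_lsid]
    missing = [item for item in taken if item not in source.contents]
    if missing:
        raise ValueError(f"not in source lot: {missing}")
    if new_lsid in registry:
        raise ValueError("an LSID must never be issued twice")
    # Only the metadata of the original lot changes; its LSID and number stay put.
    source.contents = [item for item in source.contents if item not in taken]
    part = LotRecord(new_lsid, new_accession, list(taken), removed_from=source_lsid)
    registry[new_lsid] = part
    return part


ORIGINAL = "urn:lsid:example.org:lots:12345"   # hypothetical LSID
registry = {
    ORIGINAL: LotRecord(ORIGINAL, "12345", ["beetle A", "beetle B", "ant C"]),
}
part = split_lot(registry, ORIGINAL, ["ant C"],
                 "urn:lsid:example.org:lots:12346", "12346")
print(part.removed_from)
print(registry[ORIGINAL].contents)
```

The second case (new numbers for every part) would look the same, except that the original record ends up with empty contents and every new record points back to it.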
Aloha, Rich
Hi Rich (et al.), I'm going to join this particular discussion in spite of the fact that I have not been able to follow the entire GUID discussion over the past couple of years and I may be repeating things that have been resolved.
Let's continue to investigate whether an LSID applies to the physical specimen or the database record (or both?).
What about the record(s) for that same physical object in the literature? As we mark up literature, we are going to generate LSIDs for specimen records that will need to be resolved to be related to the same physical object (in a collection) and the data record (usually in that same collection's database).
Let's look at the example that Chris Lyal and I are contemplating as we work on implementing an INOTAXA pilot to show in Bratislava:
1) a weevil specimen here at USNM (a type described in the BCA)
2) a record for it in the museum's database (we do have a type database for insects, and it will be available in a year or two), available on the museum's website, through GBIF, and through INOTAXA
3) a record from digitized and parsed BCA in INOTAXA (presumably shortly also available to GBIF in some form)
4) a record for the same weevil from a paper published in the 1950s available through INOTAXA (presumably shortly also available to GBIF in some form)
5) a record for that weevil from a paper published in the 1990s available through INOTAXA (presumably shortly also available to GBIF in some form)
6) a published image (or series of images) in the paper from the 1990s -- but now also digitized and made available through INOTAXA (presumably shortly also available to GBIF in some form)
7) a digitized image (or series of images) made in our imaging project and made available through the museum's database, INOTAXA, GBIF and MorphoBank
Either each of these (1-7) will need to have its own LSID (or an equivalent in the case of the specimen itself) or they will all need to have the same LSID. If the former, they will all have to resolve to the same parent LSID--is this for the specimen or the record in its home database?--in order for the overall biodiversity information system to really work.
Or let's take that a step further and make that a fish, where not only is there a record in the museum's database with its LSID, but that same record for the same fish was imported some years ago into FishBase (now out of date perhaps, but still available to GBIF and via FishBase). At the time, it was imported without an LSID and FishBase has (presumably) assigned its own LSID...
Or let's say that someone else digitized their copy of the same BCA volume, followed INOTAXA (taXMLit), and assigned yet another LSID for the specimen record...is that really the same 'record' or different from the one in #3?
I would like to think that in the long run we do not need multiple LSIDs for records that refer to the same specimen or record (as long as we can be truly certain that they are 'the same'). After all, the literature markup has a whole series of unique IDs for its various parts already, so can't we refer to 'the use of LSID 123 in workID 987' or 'the use of LSID 123 on pageID 456 in workID 987'?
There are a lot of IDs here, but unless every collection database already has an LSID that we can 'grab' and use in INOTAXA we are going to have to create our own LSIDs and count on a community resolver to sort it all out (and even if that were true, not all the specimens that we are going to be referring to from INOTAXA have been put in electronic form anyplace else, so we will have to assign LSIDs at least temporarily--Paul did not mention how they are going to deal with the Zoological name LSIDs as at least a temporary solution--but I assume that they have a similar problem).
I'm sure I don't know what the best solution is, but that's what I'm counting on the computer scientists in this group to tell me. I just hope they tell me soon, since we're going to need answers soon!
Cheers, Anna
Anna L. Weitzman, PhD Botanical and Biodiversity Informatics Research National Museum of Natural History Smithsonian Institution
office: 202.633.0846 mobile: 202.415.4684 weitzman@si.edu
________________________________
From: tdwg-guid-bounces@lists.tdwg.org on behalf of Richard Pyle Sent: Sat 02-Jun-07 5:08 AM To: 'Paul Kirk'; 'Jason Best'; tdwg-guid@lists.tdwg.org Subject: RE: [tdwg-guid] First step in implementing LSIDs?[Scanned]
[... quoted message omitted]
On 6/2/07, Weitzman, Anna WEITZMAN@si.edu wrote:
[... 7 examples omitted]
Either each of these (1-7) will need to have its own LSID (or an equivalent in the case of the specimen itself) or they will all need to have the same LSID. If the former, they will all have to resolve to the same parent LSID--is this for the specimen or the record in its home database?--in order for the overall biodiversity information system to really work.
Two different objects cannot have the same LSID by definition. [This is more or less the sole overarching point of GUIDs].
I don't know what is meant by "parent LSID", but TDWG requires that an LSID resolution service return its metadata in RDF, the Resource Description Framework semantic web language. By its design, RDF is especially good at expressing relations between things it describes, so there is plenty of room for the LSID metadata to express whatever relations between these examples each of its resolution services might wish to. Furthermore, the emergent TDWG ontology standards (see TDWG-TAG) support some particularly convenient ways to do this, should the various interest groups be motivated to visit this question. That would be a Good Thing, so that different resolvers of similar objects might actually offer similar metadata, or at least metadata whose relations are easily comparable. Still, each subgroup is likely to need to thrash these issues out separately. The TCS group is historically ahead of everybody else in this regard, since they expressed a fixed set of relations among Taxon Concepts more or less ab initio.
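To make the point about relations concrete, here is a toy sketch that models metadata as plain subject-predicate-object triples rather than real RDF. Every LSID and predicate name (`ex:describes`, `ex:depicts`) is invented for illustration; none comes from the TDWG ontology. Each object keeps its own LSID, and the shared specimen is found by following relations rather than by sharing an identifier.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]

SPECIMEN = "urn:lsid:example.org:specimen:1"   # hypothetical LSID

triples: List[Triple] = [
    ("urn:lsid:example.org:dbrecord:10", "ex:describes", SPECIMEN),
    ("urn:lsid:example.org:litrecord:20", "ex:describes", SPECIMEN),
    ("urn:lsid:example.org:litrecord:21", "ex:describes", SPECIMEN),
    ("urn:lsid:example.org:image:30", "ex:depicts", SPECIMEN),
    ("urn:lsid:example.org:image:31", "ex:depicts", "urn:lsid:example.org:specimen:2"),
]


def related_to(target: str, graph: List[Triple]) -> Set[str]:
    """Return every subject that points at `target`, whatever the predicate."""
    return {subj for subj, _pred, obj in graph if obj == target}


def share_target(a: str, b: str, graph: List[Triple]) -> bool:
    """True when two distinct LSIDs are linked to at least one common object."""
    objs_a = {obj for subj, _pred, obj in graph if subj == a}
    objs_b = {obj for subj, _pred, obj in graph if subj == b}
    return bool(objs_a & objs_b)


print(sorted(related_to(SPECIMEN, triples)))
print(share_target("urn:lsid:example.org:dbrecord:10",
                   "urn:lsid:example.org:litrecord:20", triples))
```

In this reading, what was called a "parent" is simply the object that several records point to; no two of them need the same LSID.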
Bob,
Thanks for the explanation.
Excuse my lack of knowledge about this--but I am trying to understand this in a way that taxonomists like myself will need to (and I need to understand it in terms that I can use to explain it to other taxonomists). So much of what we are doing now in TDWG is so foreign to taxonomists, and I fear that you are going to completely leave us (even those of us who are 'relatively technically inclined') behind--which I don't think is helpful.
Your explanation does help (though I think my calling it a parent LSID vs. resolving to something in RDF is somewhat semantic--if the resolver does not allow all of the things mentioned in 1-7 (and so on) to resolve to the same "something" that relates them all to the same 'parent' (a term that taxonomists will understand) specimen it really isn't going to work--but I assume that you CS guys have that sorted and I just need to read and ask more questions so that I can translate it somehow into terminology that I and other taxonomists understand).
So, to follow on that line: 'all' we, in INOTAXA, have to do is assign LSIDs within INOTAXA (temporarily at least); that we come up with the ontology for that in the Taxonomic literature interest group (but I assume that it will be better if they are similar to those for similar objects described by every other interest group since nearly everything that we will assign LSIDs to will relate to other interest groups); and that the resolver, once we have all this designed will be able to relate all of the things I referred to together?
Finally, what does what you just said mean for Rich's question about whether the LSID applies to the specimen or the data record that describes it? Following your logic, isn't it really better if we think of them each as having an LSID and making sure that we can bring all of them together somehow? Or, perhaps the specimen does not have an 'official' LSID, but it should have some sort of GUID that allows the institution that holds them to link the specimen to the record that has the LSID (if only that were true--our Entomology Dept. gave up requiring GUIDs on specimens that match to records in the database--even for types--years ago and are only now starting to see that this was not a wise decision!). In the latter case, clearly Rich needs to think of the LSID as applying to the record and not to the specimen, correct?
Thanks, Anna
Anna L. Weitzman, PhD Botanical and Biodiversity Informatics Research National Museum of Natural History Smithsonian Institution
office: 202.633.0846 mobile: 202.415.4684 weitzman@si.edu
________________________________
From: Bob Morris [mailto:morris.bob@gmail.com] Sent: Sat 02-Jun-07 3:29 PM To: Weitzman, Anna Cc: Richard Pyle; Paul Kirk; Jason Best; tdwg-guid@lists.tdwg.org Subject: Re: [tdwg-guid] First step in implementing LSIDs?[Scanned]
[... quoted message omitted]
On 6/2/07, Weitzman, Anna WEITZMAN@si.edu wrote:
Bob,
Thanks for the explanation.
Excuse my lack of knowledge about this--but I am trying to understand this in a way that taxonomists like myself will need to (and I need to understand it in terms that I can use to explain it to other taxonomists). So much of what we are doing now in TDWG is so foreign to taxonomists, and I fear that you are going to completely leave us (even those of us who are 'relatively technically inclined') behind--which I don't think is helpful.
Use scenarios all have to come from the users, so things like your 1-7 are valuable.
Your explanation does help (though I think my calling it a parent LSID vs. resolving to something in RDF is somewhat semantic--if the resolver does not allow all of the things mentioned in 1-7 (and so on) to resolve to the same "something" that relates them all to the same 'parent' (a term that taxonomists will understand) specimen it really isn't going to work--but I assume that you CS guys have that sorted and I just need to read and ask more questions so that I can translate it somehow into terminology that I and other taxonomists understand).
I think which of your 7 things deserve GUIDs is not so much about GUIDs but about what the community wants to do with them, especially across applications. The simplest use of course is to make sure that two mentions are talking about the same thing. That's the minimal guarantee of GUIDs. If your data or document and my data or document mention the same GUID, then they must be referring to the same object (digital or physical). No resolution at all is needed for that use case. Anything further than that is basically a community issue. For example, if you want to use GUIDs to help guarantee that two Taxon Concepts are described in the same publication, then you need GUIDs on pubs and both TCs need to use the same GUID. [An object can have more than one GUID, but no two objects can have the same GUID]. But does your GUID resolution need to say on what page the description for the TC appears? Not necessarily.
More generally, in the case of LSIDs each community has to ask three questions, the first of which is usually---but not always---easy:
(a) Under what circumstances do we want to distinguish two objects from one another?
(b) What information about an object do we want to say fundamentally will never change, for the entire future of the universe, even if there are no people, computers, or anything but black holes left, including the object itself, if it was physical?
(c) What information do we want to associate with an object that might change as our views about it---scientific or curatorial---themselves might change?
For (a), the LSID spec requires a guarantee that nowhere, nohow, nowhen will the same LSID be issued twice. LSID resolvers are obliged to offer the stuff in (b) as LSID resolution data, but TDWG takes no position about how. LSID resolvers are required by TDWG to offer the stuff in (c) as LSID resolution metadata, represented in RDF. TDWG may ultimately require that this be done using the TDWG ontology. This is less onerous than it might seem because that is extensible. Or rather, it is no more onerous than RDF is in the first place. An important special case of (c) is the question of what, if any, are the relationships among the things we say about these objects and about other objects.
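Read as a data model, questions (a) to (c) might be sketched as follows. This is an illustration under assumptions, not an LSID resolver: the class name `LsidRegistry`, its methods, the in-memory storage, and the use of a plain dictionary in place of RDF are all simplifications introduced here.

```python
from typing import Dict


class LsidRegistry:
    """Toy issuer: (a) uniqueness, (b) immutable data, (c) mutable metadata."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}
        self._metadata: Dict[str, Dict[str, str]] = {}

    def issue(self, lsid: str, data: bytes = b"") -> None:
        # (a) the same LSID is never issued twice
        if lsid in self._data:
            raise ValueError(f"{lsid} has already been issued")
        # (b) whatever is stored as data is fixed from now on; no setter exists
        self._data[lsid] = bytes(data)
        self._metadata[lsid] = {}

    def get_data(self, lsid: str) -> bytes:
        return self._data[lsid]

    def get_metadata(self, lsid: str) -> Dict[str, str]:
        # return a copy so callers cannot edit the stored metadata by accident
        return dict(self._metadata[lsid])

    def update_metadata(self, lsid: str, **changes: str) -> None:
        # (c) metadata may be revised as scientific or curatorial views change
        self._metadata[lsid].update(changes)


LSID = "urn:lsid:example.org:herb:123456"   # hypothetical LSID
reg = LsidRegistry()
reg.issue(LSID, b"123456")
reg.update_metadata(LSID, determination="Aus bus")
reg.update_metadata(LSID, determination="Aus cus")   # a later redetermination
print(reg.get_data(LSID))
print(reg.get_metadata(LSID))
```

A data-less LSID, as in scenario 3 earlier in the thread, is the same model with empty data, so that everything of interest lives in the metadata.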
One can see from this that the LSID resolution metadata holds the most interesting, useful, and potentially complex information about most biodiversity digital objects and records in catalogs of physical objects or events. The term metadata here is confusing to database folks. It would have been closer to the model in most people's heads had that stuff been called the data and the persistent stuff called something else.
From the beginning, and still, I've believed that adopting RDF as the exchange format for the only interesting part of LSID resolution, the metadata, is technically sweet but running much further ahead of the TDWG membership than XML Schema did. This is at least in part because, although RDF is over 10 years old, the enterprise tools are only now emerging. Further, a lot of XML instance documents make sense both to machine and human readers without much knowledge of XML Schema, whereas the corresponding thing for RDF is, in my opinion, much less the case. I believe that for several more years TDWG communities will need fairly high-powered programmers to turn answers to the questions (a-c) above into actual LSID resolvers and applications exploiting them.
Meanwhile, my students and I are happy to join in funding proposals to be the interim code jockeys. Hah, hah, just serious.
Bob
p.s. I might be wrong about the time frame if Wasabi can or has replaced Steve Perry, who I understand has gone to industry.
So, to follow on that line: 'all' we, in INOTAXA, have to do is assign LSIDs within INOTAXA (temporarily at least); that we come up with the ontology for that in the Taxonomic literature interest group (but I assume that it will be better if they are similar to those for similar objects described by every other interest group since nearly everything that we will assign LSIDs to will relate to other interest groups); and that the resolver, once we have all this designed will be able to relate all of the things I referred to together?
Finally, what does what you just said mean for Rich's question about whether the LSID applies to the specimen or the data record that describes it? Following your logic, isn't it really better if we think of them each as having an LSID and making sure that we can bring all of them together somehow? Or, perhaps the specimen does not have an 'official' LSID, but it should have some sort of GUID that allows the institution that holds it to link the specimen to the record that has the LSID (if only that were true--our Entomology Dept. gave up requiring GUIDs on specimens that match to records in the database--even for types--years ago and are only now starting to see that this was not a wise decision!). In the latter case, clearly Rich needs to think of the LSID as applying to the record and not to the specimen, correct?
Thanks, Anna
Anna L. Weitzman, PhD Botanical and Biodiversity Informatics Research National Museum of Natural History Smithsonian Institution
office: 202.633.0846 mobile: 202.415.4684 weitzman@si.edu
From: Bob Morris [mailto:morris.bob@gmail.com] Sent: Sat 02-Jun-07 3:29 PM To: Weitzman, Anna Cc: Richard Pyle; Paul Kirk; Jason Best; tdwg-guid@lists.tdwg.org Subject: Re: [tdwg-guid] First step in implementing LSIDs?[Scanned]
On 6/2/07, Weitzman, Anna WEITZMAN@si.edu wrote:
[... 7 examples omitted]
Either each of these (1-7) will need to have its own LSID (or an equivalent in the case of the specimen itself) or they will all need to have the same LSID. If the former, they will all have to resolve to the same parent LSID--is this for the specimen or the record in its home database?--in order for the overall biodiversity information system to really work.
Two different objects cannot have the same LSID by definition. [This is more or less the sole overarching point of GUIDs].
I don't know what is meant by "parent LSID", but TDWG requires that an LSID resolution service return its metadata in RDF, the Resource Description Framework semantic web language. By its design, RDF is especially good at expressing relations between things it describes, so there is plenty of room for the LSID metadata to express whatever relations between these examples each of its resolution services might wish to. Furthermore, the emergent TDWG ontology standards (see TDWG-TAG) support some particularly convenient ways to do this, should the various interest groups be motivated to visit this question. That would be a Good Thing, so that different resolvers of similar objects might actually offer similar, or at least as to relations, easily comparable, metadata. Still, each subgroup is likely to need to thrash these issues out separately. The TCS group is historically ahead of everybody else in this regard, since they expressed a fixed set of relations among Taxon Concepts more or less ab initio.
Hi Bob, I'll get to your most recent message later, but right now, I want to focus in on one thing that you said earlier this afternoon:
"Two different objects cannot have the same LSID by definition. [This is more or less the sole overarching point of GUIDs]."
There is a fundamental semantic issue here in what we (probably most computer scientists vs. most taxonomists) call objects. You have described LSIDs for digital 'objects' in the main, which makes perfect sense from your perspective.
To most taxonomists, digital 'objects' are not objects at all but non-physical representations (images or metadata) of or references to objects (and of course most taxonomists refer to data and metadata interchangeably--partially because metadata about an object can also need metadata to describe it, so the line becomes very fuzzy).
With the taxonomists' world view, it makes perfect sense to use the same GUID each time the same object (by this we are most likely going to mean an object or lot in our physical collections; a taxon name; a taxon concept; a person (whether the action being referred to in the instance is collector, taxon author, publication author, curator, etc); and the like) is referred to anywhere--hence my examples 1-7 and beyond. In fact we would like the world to work like that, and have you computer scientists give us a perfect solution to make that happen seamlessly with as little thought for us as possible. Effectively, what we want is to give an ID to example 1, and let you tell us how to make sure that every instance that refers to it (e.g., 2-7) is tied to that ID throughout the entire digital world used by the taxonomic community.
I realize that this seems simplistic and that what you are defining is more flexible, but I'm not entirely convinced that it is as useful to the taxonomic community (at least for the foreseeable future) as what I described.
Making sure that the same ID for these digital objects in all different publications refers to the same Collection Object is one of the main reasons that I want to parse the literature in the way that I am working toward. By doing that, I can then compare and analyze the Taxon Name that the various authors applied to the same Collection Objects over time. This will then allow me to understand the Taxon Concepts that the different authors who studied the Collection Objects and Taxon Names had. It is that history of Taxon Concepts that Chris and I tried to describe in a talk at the New Zealand TDWG meeting--and how that history is entirely different from the history of the Taxon Name.
I hope that clarifies what this taxonomist at least is looking for (at least in part...there are many other examples that I can use at later times).
Cheers, Anna
It sounds like you have issues that are discussed in the TNC (Taxonomic Names and Concepts) http://www.tdwg.org/activities/tnc/ and NCD (Natural Collections Descriptions) groups http://www.tdwg.org/activities/ncd/
In the case of physical objects, as Rich or Paul mentioned, the idiom in the LSID world is that the LSID data resolution is empty (you can't return the physical object to the invoker without a Star Trek transporter :-) ) and whatever interests you about the physical object is returned in the metadata resolution, including relations to other stuff (including the LSIDs of other real objects).
I think we need to be clear what gets an LSID (or a GUID in general).
Some of the things listed by Anna are digital records, such as an image. It seems simplest to give these GUIDs that identify the image, with metadata linking the image to the thing the image depicts (there are existing RDF vocabularies to do this).
Some things listed, such as a specimen, are physical objects. These are different from digital objects, and the way in which GUIDs that identify real things are handled has caused all manner of discussion (see http://www.w3.org/DesignIssues/HTTP-URI and related pages bookmarked at http://del.icio.us/rdmpage/303). LSIDs don't handle this well, unless we rely on metadata saying "the thing identified by."
So, at least on this level to say that all seven things get the same GUID is clearly a non starter.
Relationships between things can be easily specified in metadata ("is part of", "depicts", "is kind of").
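As a rough sketch of what such metadata amounts to, the three relationships quoted above can be written as subject-predicate-object statements, which is the shape RDF gives them. Plain tuples stand in for real RDF here, and every identifier below is made up for illustration.

```python
# Hypothetical identifiers; each distinct thing keeps its own GUID.
SPECIMEN = "urn:lsid:example.org:specimens:1"
IMAGE = "urn:lsid:example.org:images:7"
LOT = "urn:lsid:example.org:lots:3"
TAXON = "urn:lsid:example.org:names:99"

# (subject, predicate, object) statements, as the metadata might express them.
statements = [
    (IMAGE, "depicts", SPECIMEN),
    (SPECIMEN, "is part of", LOT),
    (SPECIMEN, "is kind of", TAXON),
]


def subjects_of(statements, predicate, obj):
    """Everything that stands in the given relation to obj."""
    return [s for s, p, o in statements if p == predicate and o == obj]


def objects_of(statements, subject, predicate):
    """Everything the subject is related to by the given predicate."""
    return [o for s, p, o in statements if s == subject and p == predicate]


print(subjects_of(statements, "depicts", SPECIMEN))
```

The image and the specimen have different GUIDs, yet anyone holding the specimen's GUID can still find the image through the "depicts" statement.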
The final issue is GUID reuse, that is, if somebody uses an INOTAXA record, they should at a minimum refer to the INOTAXA LSID. This would particularly apply to aggregators such as GBIF, who should not present their own identifiers unless GBIF has actually created the data. You often state "presumably shortly also available to GBIF in some form". It's not clear to me what that means, but if it's in GBIF because INOTAXA serves it, then I think GBIF should use INOTAXA LSIDs to refer to INOTAXA records.
Clearly, generating a plethora of new, effectively local ids (masquerading as global) is not a recipe for progress. If we don't reuse GUIDs we are wasting our time.
Regards
Rod
On 2 Jun 2007, at 18:53, Weitzman, Anna wrote:
Hi Rich (et al.), I'm going to join this particular discussion in spite of the fact that I have not been able to follow the entire GUID discussion over the past couple of years and I may be repeating things that have been resolved.
Let's continue to investigate whether an LSID applies to the physical specimen or the database record (or both?).
What about the record(s) for that same physical object in the literature? As we mark up literature, we are going to generate LSIDs for specimen records that will need to be resolved to be related to the same physical object (in a collection) and the data record (usually in that same collection's database).
Let's look at the example that Chris Lyal and I are contemplating as we work on implementing an INOTAXA pilot to show in Bratislava:
1) a weevil specimen here at USNM (a type described in the BCA)
2) a record for it in the museum's database (we do have a type database for insects, and it will be available in a year or two), available on the museum's website, through GBIF, and through INOTAXA
3) a record from digitized and parsed BCA in INOTAXA (presumably shortly also available to GBIF in some form)
4) a record for the same weevil from a paper published in the 1950s available through INOTAXA (presumably shortly also available to GBIF in some form)
5) a record for that weevil from a paper published in the 1990s available through INOTAXA (presumably shortly also available to GBIF in some form)
6) a published image (or series of images) in the paper from the 1990s -- but now also digitized and made available through INOTAXA (presumably shortly also available to GBIF in some form)
7) a digitized image (or series of images) made in our imaging project and made available through the museum's database, INOTAXA, GBIF and MorphoBank
Either each of these (1-7) will need to have its own LSID (or an equivalent in the case of the specimen itself) or they will all need to have the same LSID. If the former, they will all have to resolve to the same parent LSID--is this for the specimen or the record in its home database?--in order for the overall biodiversity information system to really work.
Or let's take that a step further and make that a fish, where not only is there a record in the museum's database with its LSID, but that same record for the same fish was imported some years ago into FishBase (now out of date perhaps, but still available to GBIF and via FishBase). At the time, it was imported without an LSID and FishBase has (presumably) assigned its own LSID...
Or let's say that someone else digitized their copy of the same BCA volume and followed the INOTAXA (taXMLit) and assigned yet another LSID for the specimen record...is that really the same 'record' or different from the one in #3?
I would like to think that in the long run we do not need multiple LSIDs for records that refer to the same specimen or record (as long as we can be truly certain that they are 'the same'). After all, the literature markup has a whole series of unique IDs for its various parts already, so can't we refer to 'the use of LSID 123 in workID 987' or 'the use of LSID 123 on pageID 456 in workID 987'?
There are a lot of IDs here, but unless every collection database already has an LSID that we can 'grab' and use in INOTAXA we are going to have to create our own LSIDs and count on a community resolver to sort it all out (and even if that were true, not all the specimens that we are going to be referring to from INOTAXA have been put in electronic form anyplace else, so we will have to assign LSIDs at least temporarily--Paul did not mention how they are going to deal with the Zoological name LSIDs as at least a temporary solution--but I assume that they have a similar problem).
I'm sure I don't know what the best solution is, but that's what I'm counting on the computer scientists in this group to tell me. I just hope they tell me soon, since we're going to need answers soon!
Cheers, Anna
Anna L. Weitzman, PhD Botanical and Biodiversity Informatics Research National Museum of Natural History Smithsonian Institution
office: 202.633.0846 mobile: 202.415.4684 weitzman@si.edu
From: tdwg-guid-bounces@lists.tdwg.org on behalf of Richard Pyle Sent: Sat 02-Jun-07 5:08 AM To: 'Paul Kirk'; 'Jason Best'; tdwg-guid@lists.tdwg.org Subject: RE: [tdwg-guid] First step in implementing LSIDs?[Scanned]
Paul and List,
First, I should clarify something about my earlier post. I wrote at the start of Scenario 3:
"3) Issue data-less LSIDs without using the revision ID feature, and track data change history separately from the LSIDs"
That should have been "...and track *metadata* change history separately from the LSIDs" (metadata, not data).
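For readers unfamiliar with the "revision ID feature" mentioned in Scenario 3: an LSID has the form urn:lsid:authority:namespace:object with an optional trailing :revision. The parser below is a minimal sketch of that layout only (it does not validate the characters allowed in each part), and the `example.org` identifiers are hypothetical.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]  # None when the revision feature is not used


def parse_lsid(text):
    """Split urn:lsid:authority:namespace:object[:revision] into its parts."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    if any(part == "" for part in parts[2:5]):
        raise ValueError(f"empty component in LSID: {text!r}")
    revision = parts[5] if len(parts) == 6 and parts[5] else None
    return Lsid(parts[2], parts[3], parts[4], revision)


# Scenario 3 style: no revision; the change history lives elsewhere.
plain = parse_lsid("urn:lsid:example.org:specimens:1234")
# The alternative: the same object identifier carrying a revision.
revised = parse_lsid("urn:lsid:example.org:specimens:1234:2")
```

Under Scenario 3 every issued LSID would look like `plain`, with `revision` always `None`.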
So, without making things too complicated as we 'start to walk' in this domain of biodiversity informatics my vote is for a variation of scenario 3) from Rich. The reason I vote for this is that in the fullness of time, and the 'herb.IMI' database has already started this, much of the metadata will be LSIDs and its correctness (i.e. sorting out typos etc) will be delegated to the entities who issue those LSIDs. As IPNI improves the quality of the metadata associated with the LSIDs they issue (and if I understand correctly they do use the scenario 3) from Rich) so the quality of the metadata associated with a 'herb.IMI' LSID improves. The reason I prefer the data + metadata 'model' is that in this instance the data is fixed ... who changes collection/accession numbers? ... so perfect for this role. Even if a collection moves to a new owner the original data need not 'disappear' in the same way that DOIs move with the objects as book and journal titles change from one publisher to another.
So...if I understand correctly, you differ from my scenario 3 in that you do generate data-bearing LSIDs for specimens, but the data part is limited to only the Accession number, not the complete set of data fields associated with the record -- correct? So, in effect, the object the LSID actually applies to is the binary accession number, not the "concept" of the specimen. I can imagine in this case that the LSID can be thought of as representing the "concept of the specimen" because the accession number itself is a surrogate for the physical specimen. The only thing that concerns me about this approach is that there is a non-zero incidence of accidental duplicate catalog numbers within a given collection, and possibly errors in associating catalog numbers. For example, if the computer database for a collection had an error created by a technician who, for example, entered the metadata for accession number IMI1234569 by mistake, when it should have been IMI1234596 (and vice versa), then branding the accession number as "data" for the LSID means that the LSID technically *must* stay with the accession number (not the specimen associated with the metadata for that LSID), after the error is discovered. Not a huge problem, but could surprise people who had indexed the LSID before the error was discovered, who then came back to resolve it again after the error was fixed (i.e., they would get totally wrong information). Given how rare this problem is likely to be (against a backdrop of many far more likely problems we will have to overcome), I don't see this as a strong reason not to proceed with your plan.
Final point, the 'data' is the 'herb.IMI' accession number; in context this is a GUID because of the existence of Index Herbariorum. So, our data will be 123456 not IMI123456 because ... in the fullness of time we will include an Index Herbariorum LSID to 'identify' the 'institutional acronym' element of the metadata.
Is the binary data for the accession number in 8-bit, or 16-bit? I'm assuming 8-bit would be fine, as I suspect all collections would have accession numbers that can be rendered with 256-character ASCII. Is there any "wrapper" to the number as binary data, or is it a straight ASCII binary representation (e.g.: 001100010011001000110011001101000011010100110110 for "123456")?
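For concreteness, here is a small sketch of the "straight ASCII binary representation" being asked about, assuming one unwrapped 8-bit byte per character. Read that way, the 48-bit string quoted above corresponds to the six characters 123456. The function names are hypothetical.

```python
def ascii_bits(accession):
    """Render an accession number as a string of 8-bit ASCII codes, no wrapper."""
    return "".join(f"{byte:08b}" for byte in accession.encode("ascii"))


def bits_to_ascii(bits):
    """Inverse of ascii_bits: read the bit string back eight bits at a time."""
    if len(bits) % 8 != 0:
        raise ValueError("bit string length must be a multiple of 8")
    codes = [int(bits[i:i + 8], 2) for i in range(0, len(bits), 8)]
    return bytes(codes).decode("ascii")


QUOTED = "001100010011001000110011001101000011010100110110"
print(bits_to_ascii(QUOTED))
print(ascii_bits("IMI123456"))  # with the acronym kept as part of the data
```

Whether the acronym belongs inside these bytes or in the metadata is exactly the design question raised in the surrounding paragraphs; the encoding itself is indifferent to it.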
I'm not sure I follow the logic of how embedding the accession number as data for the LSID allows the LSID to move to a new owner. I would think the opposite. Isn't it likely that the new owner would create their own accession number for the specimen? In this case, they would be forced to generate a new LSID if they were following the same practice of encoding the accession number as "data", rather than metadata.
Also, wouldn't it make more sense to include the acronym (IMI) as part of the data for the LSID? At least that way the "12345" would have *some* context.
Finally, this approach would work only for collections where there is a strict 1:1 correlation between accession numbers and specimen objects for which an LSID is desired.
Thanks for your comments -- this thread is already forcing me to think about things in a way I hadn't thought of them before.
Aloha, Rich
Richard L. Pyle, PhD Database Coordinator for Natural Sciences and Associate Zoologist in Ichthyology Department of Natural Sciences, Bishop Museum 1525 Bernice St., Honolulu, HI 96817 Ph: (808)848-4115, Fax: (808)847-8252 email: deepreef@bishopmuseum.org http://hbs.bishopmuseum.org/staff/pylerichard.html
tdwg-guid mailing list tdwg-guid@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-guid
---------------------------------------- Professor Roderic D. M. Page Editor, Systematic Biology DEEB, IBLS Graham Kerr Building University of Glasgow Glasgow G12 8QP United Kingdom
Phone: +44 141 330 4778 Fax: +44 141 330 2792 email: r.page@bio.gla.ac.uk web: http://taxonomy.zoology.gla.ac.uk/rod/rod.html iChat: aim://rodpage1962 reprints: http://taxonomy.zoology.gla.ac.uk/rod/pubs.html
Subscribe to Systematic Biology through the Society of Systematic Biologists Website: http://systematicbiology.org Search for taxon names: http://darwin.zoology.gla.ac.uk/~rpage/portal/ Find out what we know about a species: http://ispecies.org Rod's rants on phyloinformatics: http://iphylo.blogspot.com Rod's rants on ants: http://semant.blogspot.com
Anybody got any views (strong, otherwise or proxy for others' views) on whether the LSID should refer to data+metadata or just metadata?
From where I sit, closer to physical objects than bits in a bit stream, I favour the former. Take names for example. Strings of characters and spaces whose form is governed by Codes and one of the means, if not the primary means, by which we communicate (verbally, in print or electronically) about biodiversity. For LSIDs applied to names my understanding is that they must resolve to an unchanging bit stream representing the name (we implemented this in Index Fungorum 1st May 2005 when we set up the demo resolver) but the associated metadata may change. If I'm correct on this one how does it work for LSIDs only resolving metadata, which is not fixed? I know Roger tried to explain this one to me but I'm still not sure it's entirely logical.
I think I'm with Rod on the LSIDs for specimens - they do not represent the physical object but are a sort of digital substitute (or substitutes) of that object.
And I also support Rod's view that we should as far as possible avoid the duplication of GUIDs. Thus, for names it appears logical (although I must declare an 'interest' here so others may see a conflict) that the globally recognized nomenclators (IPNI, IF, ZooBank (soon), the bacterial list, the algal list, the virus database - I forget the acronyms here) be charged with providing these GUIDs (currently as LSIDs) for all of us to use. And following on from that, the 'institution' which is charged with providing the digital representation of specimens is the institution which is the custodian of the physical object.
Regards,
Paul
________________________________
From: Roderic Page [mailto:r.page@bio.gla.ac.uk] Sent: Sun 03/06/2007 12:03 To: Weitzman, Anna Cc: Richard Pyle; Paul Kirk; Jason Best; tdwg-guid@lists.tdwg.org Subject: Re: [tdwg-guid] First step in implementing LSIDs?[Scanned]
I think we need to be clear what gets an LSID (or a GUID in general).
Some of the things listed by Anna are digital records, such as an image. It seems simplest to give these GUIDs that identify the image, with metadata linking the image to the thing the image depicts (there are existing RDF vocabularies to do this).
Some things listed, such as a specimen, are physical objects. These are different from digital objects, and they way in which GUIDs that identify real things are handled has caused all manner of discussion (see http://www.w3.org/DesignIssues/HTTP-URI and related pages bookmarked at http://del.icio.us/rdmpage/303). LSIDs don't handle this well, unless we rely on metadata saying "the thing identified by."
So, at least on this level to say that all seven things get the same GUID is clearly a non starter.
Relationships between things can be easily specified in metadata ("is part of", "depicts", "is kind of").
The final issue is GUID reuse, that is, if somebody uses a INOTAXA record, they should at a minimum refer to the INOTAXA LSID. This would particularly apply to aggregators such as GBIF, who should not present their own identifiers unless GBIF has actually created the data. You often state "presumably shortly also available to GBIF in some form". It's not clear to what that means, but if it's GBIF because INOTAXA serves it, then I think GBIF should use INOTAXA LSIDs to refer to INOTAXA records.
Clearly, generating a plethora a new, effectively local ids (masquerading as global) is not a recipe for progress. If we don't reuse GUIDs we are wasting our time.
Regards
Rod
On 2 Jun 2007, at 18:53, Weitzman, Anna wrote:
Hi Rich (et al.), I'm going to join this particular discussion in spite of the fact that I have not been able to follow the entire GUID discussion over the past couple of years and I may be repeating things that have been resolved.
Let's continue to investigate whether an LSID applies to the physical specimen or the database record (or both?).
What about the record(s) for that same physical object in the literature? As we mark up literature, we are going to generate LSIDs for specimen records that will need to be resolved to be related to the same physical object (in a collection) and the data record (usually in that same collection's database).
Let's look at the example that Chris Lyal and I are contemplating as we work on implementing an INOTAXA pilot to show in Bratislava: 1) a weevil specimen here at USNM (a type described in the BCA) 2) a record for it in the museum's database (we do have a type database for insects, and it will be available in a year or two), available on the museum's website, through GBIF, and through INOTAXA 3) a record from digitized and parsed BCA in INOTAXA (presumably shortly also available to GBIF in some form) 4) a record for the same weevil from a paper published in the 1950s available through INOTAXA (presumably shortly also available to GBIF in some form) 5) a record for that weevil from a paper published in the 1990s available through INOTAXA (presumably shortly also available to GBIF in some form) 6) a published image (or series of images) in the paper from the 1990s -- but now also digitized and made available through INOTAXA (presumably shortly also available to GBIF in some form) 7) a digitized image (or series of images) made in our imaging project and made available through the museum's database, INOTAXA, GBIF and MorphoBank
Either each of these (1-7) will need to have its own LSID (or an equivalent in the case of the specimen itself) or they will all need to have the same LSID. If the former, they will all have to resolve to the same parent LSID--is this for the specimen or the record in its home database?--in order for the overall biodiversity information system to really work.
Or let's take that a step further and make that a fish, where not only is there a record in the museum's database with its LSID, but that same record for the same fish that was imported some years ago into FishBase (now out of date perhaps, but still available to GBIF and via Fishbase). At the time, it was imported without an LSID and FishBase has (presumably) assigned it's own LSID...
Or let's say that someone else digitized their copy of the same BCA volume and followed the INOTAXA (taXMLit) and assigned yet another LSID for the specimen record...is that really the same 'record' or different from the one in #3?
I would like to think that in the long run we do not need multiple LSIDs for records that refer to the same specimen or record (as long as we can be truly certain that they are 'the same'). After all, the literature markup has a whole series of unique IDs for its various parts already, so can't we refer to 'the use of LSID 123 in workID 987' or 'the use of LSID 123 on pageID 456 in workID 987'?
There are a lot of IDs here, but unless every collection database already has an LSID that we can 'grab' and use in INOTAXA we are going to have to create our own LSIDs and count on a community resolver to sort it all out (and even if that were true, not all the specimens that we are going to be referring to from INOTAXA have been put in electronic form anyplace else, so we will have to assign LSIDs at least temporarily--Paul did not mention how they are going to deal with the Zoological name LSIDs as at least a temporary solution--but I assume that they have a similar problem).
I'm sure I don't know what the best solution is, but that's what I'm counting on the computer scientists in this group to tell me. I just hope they tell me soon, since we're going to need answers soon!
Cheers, Anna
Anna L. Weitzman, PhD Botanical and Biodiversity Informatics Research National Museum of Natural History Smithsonian Institution
office: 202.633.0846 mobile: 202.415.4684 weitzman@si.edu
________________________________
From: tdwg-guid-bounces@lists.tdwg.org on behalf of Richard Pyle Sent: Sat 02-Jun-07 5:08 AM To: 'Paul Kirk'; 'Jason Best'; tdwg-guid@lists.tdwg.org Subject: RE: [tdwg-guid] First step in implementing LSIDs?[Scanned]
Paul and List,
First, I should clarify something about my earlier post. I wrote at the start of Scenario 3:
"3) Issue data-less LSIDs without using the revision ID feature, and track data change history separately from the LSIDs"
That should have been "...and track *metadata* change history separately from the LSIDs" (metadata, not data).
So, without making things too complicated as we 'start to walk' in this domain of biodiversity informatics, my vote is for a variation of scenario 3) from Rich. The reason I vote for this is that in the fullness of time, and the 'herb.IMI' database has already started this, much of the metadata will be LSIDs and its correctness (i.e. sorting out typos etc.) will be delegated to the entities who issue those LSIDs. As IPNI improves the quality of the metadata associated with the LSIDs they issue (and if I understand correctly they do use scenario 3) from Rich), so the quality of the metadata associated with a 'herb.IMI' LSID improves. The reason I prefer the data + metadata 'model' is that in this instance the data is fixed ... who changes collection/accession numbers? ... so perfect for this role. Even if a collection moves to a new owner the original data need not 'disappear', in the same way that DOIs move with the objects as book and journal titles change from one publisher to another.
So...if I understand correctly, you differ from my scenario 3 in that you do generate data-bearing LSIDs for specimens, but the data part is limited to only the Accession number, not the complete set of data fields associated with the record -- correct? So, in effect, the object the LSID actually applies to is the binary accession number, not the "concept" of the specimen. I can imagine in this case that the LSID can be thought of as representing the "concept of the specimen" because the accession number itself is a surrogate for the physical specimen. The only thing that concerns me about this approach is that there is a non-zero incidence of accidental duplicate catalog numbers within a given collection, and possibly errors in associating catalog numbers. For example, if the computer database for a collection had an error created by a technician who, for example, entered the metadata for accession number IMI1234569 by mistake, when it should have been IMI1234596 (and vice versa), then branding the accession number as "data" for the LSID means that the LSID technically *must* stay with the accession number (not the specimen associated with the metadata for that LSID), after the error is discovered. Not a huge problem, but it could surprise people who had indexed the LSID before the error was discovered, and who then came back to resolve it again after the error was fixed (i.e., they would get totally wrong information). Given how rare this problem is likely to be (against a backdrop of many far more likely problems we will have to overcome), I don't see this as a strong reason not to proceed with your plan.
Final point, the 'data' is the 'herb.IMI' accession number; in context this is a GUID because of the existence of Index Herbariorum. So, our data will be 123456 not IMI123456 because ... in the fullness of time we will include an Index Herbariorum LSID to 'identify' the 'institutional acronym' element of the metadata.
Is the binary data for the accession number in 8-bit, or 16-bit? I'm assuming 8-bit would be fine, as I suspect all collections would have accession numbers that can be rendered with 256-character ASCII. Is there any "wrapper" to the number as binary data, or is it a straight ASCII binary representation (e.g.: 001100010011001000110011001101000011010100110110 for "123456")?
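For anyone wanting to check what a "straight" 8-bit representation looks like, here is a minimal Python sketch. It assumes plain ASCII with no wrapper, length prefix or byte-order mark; the helper name to_bits is made up for illustration and is not part of any LSID tooling. Note that a 48-bit string like the one quoted above corresponds to six characters.

```python
def to_bits(text: str, encoding: str = "ascii") -> str:
    """Return the encoded bytes of `text` as a string of 0s and 1s, 8 bits per byte."""
    return "".join(f"{byte:08b}" for byte in text.encode(encoding))


if __name__ == "__main__":
    bits = to_bits("123456")
    print(bits)       # 001100010011001000110011001101000011010100110110
    print(len(bits))  # 48, i.e. six 8-bit characters
```

Any wrapper or different encoding would change the byte stream, which matters if the bytes are treated as the unchanging "data" of the LSID.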
I'm not sure I follow the logic of how embedding the accession number as data for the LSID allows the LSID to move to a new owner. I would think the opposite. Isn't it likely that the new owner would create their own accession number for the specimen? In this case, they would be forced to generate a new LSID if they were following the same practice of encoding the accession number as "data", rather than metadata.
Also, wouldn't it make more sense to include the acronym (IMI) as part of the data for the LSID? At least that way the "12345" would have *some* context.
Finally, this approach would work only for collections where there is a strict 1:1 correlation between accession numbers and specimen objects for which an LSID is desired.
Thanks for your comments -- this thread is already forcing me to think about things in a way I hadn't thought of them before.
Aloha, Rich
Richard L. Pyle, PhD Database Coordinator for Natural Sciences and Associate Zoologist in Ichthyology Department of Natural Sciences, Bishop Museum 1525 Bernice St., Honolulu, HI 96817 Ph: (808)848-4115, Fax: (808)847-8252 email: deepreef@bishopmuseum.org http://hbs.bishopmuseum.org/staff/pylerichard.html
_______________________________________________ tdwg-guid mailing list tdwg-guid@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-guid
---------------------------------------- Professor Roderic D. M. Page Editor, Systematic Biology DEEB, IBLS Graham Kerr Building University of Glasgow Glasgow G12 8QP United Kingdom
Phone: +44 141 330 4778 Fax: +44 141 330 2792 email: r.page@bio.gla.ac.uk web: http://taxonomy.zoology.gla.ac.uk/rod/rod.html iChat: aim://rodpage1962 reprints: http://taxonomy.zoology.gla.ac.uk/rod/pubs.html
Subscribe to Systematic Biology through the Society of Systematic Biologists Website: http://systematicbiology.org Search for taxon names: http://darwin.zoology.gla.ac.uk/~rpage/portal/ Find out what we know about a species: http://ispecies.org Rod's rants on phyloinformatics: http://iphylo.blogspot.com Rod's rants on ants: http://semant.blogspot.com
************************************************************************ The information contained in this e-mail and any files transmitted with it is confidential and is for the exclusive use of the intended recipient. If you are not the intended recipient please note that any distribution, copying or use of this communication or the information in it is prohibited.
Whilst CAB International trading as CABI takes steps to prevent the transmission of viruses via e-mail, we cannot guarantee that any e-mail or attachment is free from computer viruses and you are strongly advised to undertake your own anti-virus precautions.
If you have received this communication in error, please notify us by e-mail at cabi@cabi.org or by telephone on +44 (0)1491 829199 and then delete the e-mail and any copies of it.
CABI is an International Organization recognised by the UK Government under Statutory Instrument 1982 No. 1071.
**************************************************************************
Wow, this is a big thread to appear over the weekend!
I am posting without having time to read and digest everything in its entirety but only to answer Jason's original question I hope.
Received wisdom is: don't use the version part of an LSID. If you did, you would be creating new LSIDs anyhow, so in a way it doesn't matter. The identifier is supposed to be opaque, so the client should never break the LSID into parts anyhow and would only do byte-identical comparisons of the LSIDs themselves to see whether they are the same things or not.
If you want to do versioning create an LSID for the "thing that changes" and an LSID for each version of that thing.
The versions are linked to each other by dcterm:replaces and dcterm:isReplacedBy.
http://purl.org/dc/terms/isReplacedBy
http://purl.org/dc/terms/replaces
Each version points to the LSID for the "thing that changes" with dcterm:isVersionOf
http://purl.org/dc/terms/isVersionOf
We have our own vocabulary item for the one link in the chain that isn't supported by Dublin Core: tcom:versionedAs
http://rs.tdwg.org/ontology/voc/Common#versionedAs
This points from the "thing that changes" to the current version. i.e. the version that has identical metadata to itself. Anyone who caches the data for the current version of the LSID can know which version they have so if it becomes retrospectively important to get back to the actual version it is possible.
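To make the linking pattern concrete, here is a rough Python sketch that builds the set of links for one "thing that changes" and its versions as plain (subject, predicate, object) tuples. The predicate URLs are the ones given above; the LSIDs, the function name version_links and the tuple representation are invented for illustration only, and a real provider would presumably serve these as RDF from getMetadata.

```python
DCTERMS = "http://purl.org/dc/terms/"
TCOM = "http://rs.tdwg.org/ontology/voc/Common#"


def version_links(hub: str, versions: list[str]) -> set[tuple[str, str, str]]:
    """Build (subject, predicate, object) links for a hub LSID and its versions.

    `versions` is ordered oldest to newest; the last entry is the current version.
    """
    triples = set()
    # Every version points back to the "thing that changes".
    for lsid in versions:
        triples.add((lsid, DCTERMS + "isVersionOf", hub))
    # Consecutive versions are chained in both directions.
    for older, newer in zip(versions, versions[1:]):
        triples.add((newer, DCTERMS + "replaces", older))
        triples.add((older, DCTERMS + "isReplacedBy", newer))
    # The hub points at the current version only.
    if versions:
        triples.add((hub, TCOM + "versionedAs", versions[-1]))
    return triples


if __name__ == "__main__":
    # Hypothetical identifiers, not issued by any real authority.
    hub = "urn:lsid:example.org:specimen:1001"
    v1 = "urn:lsid:example.org:specimenversion:5001"
    v2 = "urn:lsid:example.org:specimenversion:5002"
    for triple in sorted(version_links(hub, [v1, v2])):
        print(triple)
```

When a new version is created, only the hub's versionedAs link moves; the hub LSID and every earlier version LSID stay as they were.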
Philosophically when do you version? Only the provider can say that a change is significant enough to warrant a new version - but if you have gone to the trouble of implementing version control you may as well do it for any change. Only the data provider can say whether the "thing that changes" has changed so much that it is no longer the same thing.
Personally I believe this approach nails the versioning issues.
There is the perennial debate about whether an LSID points to a physical object or not (when it doesn't have a byte stream associated with it). The answer is easy. It points to a digital object. If you doubt this, try destroying a physical specimen and then asking whether you should do away with the LSID and associated data. Clearly you would maintain a record of something you once had so that you could still return the data. Likewise, if you gave the specimen to another institution you would maintain a record of having had it but would, hopefully, link to the new institution's record of it.
Hope this is helpful.
Roger
On 3 Jun 2007, at 15:08, Paul Kirk wrote:
Anybody got any views (strong, otherwise or proxy for others' views) on whether the LSID should refer to data+metadata or just metadata?
From where I sit, closer to physical objects than bits in a bit stream, I favour the former. Take names for example. Strings of characters and spaces whose form is governed by Codes and one of the means, if not the primary means, by which we communicate (verbally, in print or electronically) about biodiversity. For LSIDs applied to names my understanding is that they must resolve to an unchanging bit stream representing the name (we implemented this in Index Fungorum 1st May 2005 when we set up the demo resolver) but the associated metadata may change. If I'm correct on this one, how does it work for LSIDs only resolving metadata, which is not fixed? I know Roger tried to explain this one to me but I'm still not sure it's entirely logical.
I think I'm with Rod on the LSIDs for specimens - they do not represent the physical object but are a sort of digital substitute (or substitutes) of that object.
And I also support Rod's view that we should as far as possible avoid the duplication of GUIDs. Thus, for names it appears logical (although I must declare an 'interest' here so others may see a conflict) that the globally recognized nomenclators (IPNI, IF, ZooBank (soon), the bacterial list, the algal list, the virus database - I forget the acronyms here) be charged with providing these GUIDs (currently as LSIDs) for all of us to use. And following on from that, the 'institution' which is charged with providing the digital representation of specimens is the institution which is the custodian of the physical object.
Regards,
Paul
From: Roderic Page [mailto:r.page@bio.gla.ac.uk] Sent: Sun 03/06/2007 12:03 To: Weitzman, Anna Cc: Richard Pyle; Paul Kirk; Jason Best; tdwg-guid@lists.tdwg.org Subject: Re: [tdwg-guid] First step in implementing LSIDs?[Scanned]
I think we need to be clear what gets an LSID (or a GUID in general).
Some of the things listed by Anna are digital records, such as an image. It seems simplest to give these GUIDs that identify the image, with metadata linking the image to the thing the image depicts (there are existing RDF vocabularies to do this).
Some things listed, such as a specimen, are physical objects. These are different from digital objects, and the way in which GUIDs that identify real things are handled has caused all manner of discussion (see http://www.w3.org/DesignIssues/HTTP-URI and related pages bookmarked at http://del.icio.us/rdmpage/303). LSIDs don't handle this well, unless we rely on metadata saying "the thing identified by."
So, at least on this level, to say that all seven things get the same GUID is clearly a non-starter.
Relationships between things can be easily specified in metadata ("is part of", "depicts", "is kind of").
The final issue is GUID reuse, that is, if somebody uses an INOTAXA record, they should at a minimum refer to the INOTAXA LSID. This would particularly apply to aggregators such as GBIF, who should not present their own identifiers unless GBIF has actually created the data. You often state "presumably shortly also available to GBIF in some form". It's not clear to me what that means, but if it's in GBIF because INOTAXA serves it, then I think GBIF should use INOTAXA LSIDs to refer to INOTAXA records.
Clearly, generating a plethora of new, effectively local ids (masquerading as global) is not a recipe for progress. If we don't reuse GUIDs we are wasting our time.
Regards
Rod
On 2 Jun 2007, at 18:53, Weitzman, Anna wrote:
Hi Rich (et al.), I'm going to join this particular discussion in spite of the fact that I have not been able to follow the entire GUID discussion over the past couple of years and I may be repeating things that have been resolved.
Let's continue to investigate whether an LSID applies to the physical specimen or the database record (or both?).
To answer Paul's question directly, I think it is not a good idea to put a string representation of the name (or any other string) in the getData call of the LSID. This is for two reasons, one of which is debatable and one of which isn't.
1) TaxonNames are abstract objects in that the spelling may not be stable. The change from æ to ae does not necessarily create a new name although I would argue that it would be a good policy for a nomenclator to act as if it did or was a spelling mistake etc. (this is the one that could be debated about). What couldn't be debated is putting any serialization of any XML in the getData call as the byte stream is entirely reliant on the serializer implementation.
2) A string of characters is not a stable byte stream unless an encoding is specified. The bytes for Poa annua in Latin 1 are the same as in UTF-8 but not in UTF-16. There are many, many encodings out there. We could specify only bytes below 128 (plain ASCII) but then we couldn't express incorrect names that use non-basic ASCII characters. Any client looking at the data would be far better off looking at the nice RDF/XML metadata where the character encoding is explicitly stated. There is therefore no benefit in ever calling the getData method, especially as the client would have already called the getMetadata() method before the getData call just to find out what kind of object they were dealing with and would therefore already be in possession of the information in the getData call.
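The encoding point is easy to check. The following Python sketch (the function name byte_forms is just for illustration) encodes the same string under three encodings and compares the bytes; UTF-16 is taken here as little-endian without a byte-order mark, which is itself one of several possible choices.

```python
def byte_forms(text: str) -> dict[str, bytes]:
    """Encode `text` under a few common encodings and return the raw bytes."""
    return {
        "latin-1": text.encode("latin-1"),
        "utf-8": text.encode("utf-8"),
        "utf-16-le": text.encode("utf-16-le"),  # little-endian, no byte-order mark
    }


if __name__ == "__main__":
    plain = byte_forms("Poa annua")
    print(plain["latin-1"] == plain["utf-8"])    # True: pure ASCII characters
    print(plain["utf-8"] == plain["utf-16-le"])  # False: two bytes per character

    # A non-ASCII character already differs between Latin 1 and UTF-8.
    ligature = byte_forms("æ")
    print(ligature["latin-1"])  # b'\xe6'
    print(ligature["utf-8"])    # b'\xc3\xa6'
```

So two providers returning "the same" name from getData could return different byte streams unless the encoding is fixed by agreement.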
All the best,
Roger
On 3 Jun 2007, at 15:08, Paul Kirk wrote:
Anybody got any views (strong, otherwise or proxy for others views) on whether the LSID should refer to data+metadata or just metadata?
From where I sit, closer to physical objects than bits in a bit stream, I favour the former. Take names for example. Strings of characters and spaces whose form is governed by Codes and one of the means, if not the primary mean, by which we communicate (verbally, in print or electronically) about biodiversity. For LSIDs applied to names my understanding is that they must resolve to an unchanging bit stream representing the name (we implemented this in Index Fungorum 1st May 2005 when we set up the demo resolver) but the associated metadata may change. If I'm correct on this one how does it work for LSIDs only resolving metadata, which is not fixed. I know Roger tried to explain this one to me but I'm still not sure it's entirely logical.
I think I'm with Rod on the LSIDs for specimens - they do not represent the physical object but are a sort of digital substitute (or substitutes) of that object.
And I also support Rods view that we should as far as possible avoid the duplication of GUIDs. Thus, for names it appears logical (although I must declare an 'interest' here so others may see a conflict) that the globally recognized nomenclators (IPNI, IF, ZooBank (soon), the bacterial list, the algal list, the virus database - I forget the acronyms here) be charged with providing these GUIDs (currently as LSIDs) for all of us to use. And following on from that, the 'institution' which is charged with providing the digital representation of specimens is the institution which is the custodian of the physical object.
Regards,
Paul
From: Roderic Page [mailto:r.page@bio.gla.ac.uk] Sent: Sun 03/06/2007 12:03 To: Weitzman, Anna Cc: Richard Pyle; Paul Kirk; Jason Best; tdwg-guid@lists.tdwg.org Subject: Re: [tdwg-guid] First step in implementing LSIDs?[Scanned]
I think we need to be clear what gets an LSID (or a GUID in general).
Some of the things listed by Anna are digital records, such as an image. It seems simplest to give these GUIDs that identify the image, with metadata linking the image to the thing the image depicts (there are existing RDF vocabularies to do this).
Some things listed, such as a specimen, are physical objects. These are different from digital objects, and they way in which GUIDs that identify real things are handled has caused all manner of discussion (see http://www.w3.org/DesignIssues/HTTP-URI and related pages bookmarked at http://del.icio.us/rdmpage/303). LSIDs don't handle this well, unless we rely on metadata saying "the thing identified by."
So, at least on this level to say that all seven things get the same GUID is clearly a non starter.
Relationships between things can be easily specified in metadata ("is part of", "depicts", "is kind of").
The final issue is GUID reuse, that is, if somebody uses a INOTAXA record, they should at a minimum refer to the INOTAXA LSID. This would particularly apply to aggregators such as GBIF, who should not present their own identifiers unless GBIF has actually created the data. You often state "presumably shortly also available to GBIF in some form". It's not clear to what that means, but if it's GBIF because INOTAXA serves it, then I think GBIF should use INOTAXA LSIDs to refer to INOTAXA records.
Clearly, generating a plethora a new, effectively local ids (masquerading as global) is not a recipe for progress. If we don't reuse GUIDs we are wasting our time.
Regards
Rod
On 2 Jun 2007, at 18:53, Weitzman, Anna wrote:
Hi Rich (et al.), I'm going to join this particular discussion in spite of the fact that I have not been able to follow the entire GUID discussion over the past couple of years and I may be repeating things that have been resolved.
Let's continue to investigate whether an LSID applies to the physical specimen or the database record (or both?).
What about the record(s) for that same physical object in the literature? As we mark up literature, we are going to generate LSIDs for specimen records that will need to be resolved to be related to the same physical object (in a collection) and the data record (usually in that same collection's database).
Let's look at the example that Chris Lyal and I are contemplating as we work on implementing an INOTAXA pilot to show in Bratislava:
1) a weevil specimen here at USNM (a type described in the BCA)
2) a record for it in the museum's database (we do have a type database for insects, and it will be available in a year or two), available on the museum's website, through GBIF, and through INOTAXA
3) a record from digitized and parsed BCA in INOTAXA (presumably shortly also available to GBIF in some form)
4) a record for the same weevil from a paper published in the 1950s available through INOTAXA (presumably shortly also available to GBIF in some form)
5) a record for that weevil from a paper published in the 1990s available through INOTAXA (presumably shortly also available to GBIF in some form)
6) a published image (or series of images) in the paper from the 1990s -- but now also digitized and made available through INOTAXA (presumably shortly also available to GBIF in some form)
7) a digitized image (or series of images) made in our imaging project and made available through the museum's database, INOTAXA, GBIF and MorphoBank
Either each of these (1-7) will need to have its own LSID (or an equivalent in the case of the specimen itself) or they will all need to have the same LSID. If the former, they will all have to resolve to the same parent LSID--is this for the specimen or the record in its home database?--in order for the overall biodiversity information system to really work.
Or let's take that a step further and make that a fish, where not only is there a record in the museum's database with its LSID, but that same record for the same fish was imported some years ago into FishBase (now out of date perhaps, but still available to GBIF and via FishBase). At the time, it was imported without an LSID and FishBase has (presumably) assigned its own LSID...
Or let's say that someone else digitized their copy of the same BCA volume, followed the INOTAXA (taXMLit) schema, and assigned yet another LSID for the specimen record...is that really the same 'record' or different from the one in #3?
I would like to think that in the long run we do not need multiple LSIDs for records that refer to the same specimen or record (as long as we can be truly certain that they are 'the same'). After all, the literature markup has a whole series of unique IDs for its various parts already, so can't we refer to 'the use of LSID 123 in workID 987' or 'the use of LSID 123 on pageID 456 in workID 987'?
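One way to picture "the use of LSID 123 on pageID 456 in workID 987" is as a small usage record that points at a single shared identifier and adds the literature context. This is only a sketch under assumptions: the `Usage` class, its field names, and the `example.org` LSID are hypothetical, and the work and page IDs are just the placeholder numbers from the paragraph above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Usage:
    """One citation of an existing LSID inside a marked-up work (hypothetical model)."""
    lsid: str                      # the reused identifier, not a newly minted one
    work_id: str                   # ID of the publication in the markup
    page_id: Optional[str] = None  # optional finer-grained location


def usages_of(lsid: str, usages: List[Usage]) -> List[Usage]:
    """Return every recorded use of `lsid`, in the order given."""
    return [u for u in usages if u.lsid == lsid]


# Made-up LSID standing in for "LSID 123".
SPECIMEN_LSID = "urn:lsid:example.org:specimen:123"

USAGES = [
    Usage(SPECIMEN_LSID, work_id="987"),
    Usage(SPECIMEN_LSID, work_id="987", page_id="456"),
    Usage("urn:lsid:example.org:specimen:999", work_id="987"),
]

if __name__ == "__main__":
    for usage in usages_of(SPECIMEN_LSID, USAGES):
        print(usage)
```

The specimen keeps one identifier, and each appearance in the literature is addressed by the combination of that identifier with the work and page IDs the markup already has.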
There are a lot of IDs here, but unless every collection database already has an LSID that we can 'grab' and use in INOTAXA we are going to have to create our own LSIDs and count on a community resolver to sort it all out (and even if that were true, not all the specimens that we are going to be referring to from INOTAXA have been put in electronic form anyplace else, so we will have to assign LSIDs at least temporarily--Paul did not mention how they are going to deal with the Zoological name LSIDs as at least a temporary solution--but I assume that they have a similar problem).
I'm sure I don't know what the best solution is, but that's what I'm counting on the computer scientists in this group to tell me. I just hope they tell me soon, since we're going to need answers soon!
Cheers, Anna
Anna L. Weitzman, PhD Botanical and Biodiversity Informatics Research National Museum of Natural History Smithsonian Institution
office: 202.633.0846 mobile: 202.415.4684 weitzman@si.edu
From: tdwg-guid-bounces@lists.tdwg.org on behalf of Richard Pyle
Sent: Sat 02-Jun-07 5:08 AM
To: 'Paul Kirk'; 'Jason Best'; tdwg-guid@lists.tdwg.org
Subject: RE: [tdwg-guid] First step in implementing LSIDs?[Scanned]
Paul and List,
First, I should clarify something about my earlier post. I wrote at the start of Scenario 3:
"3) Issue data-less LSIDs without using the revision ID feature, and track data change history separately from the LSIDs"
That should have been "...and track *metadata* change history separately from the LSIDs" (metadata, not data).
So, without making things too complicated as we 'start to walk' in this domain of biodiversity informatics my vote is for a variation of scenario 3) from Rich. The reason I vote for this is that in the fullness of time, and the 'herb.IMI' database has already started this, much of the metadata will be LSIDs and its correctness (i.e. sorting out typos etc) will be delegated to the entities who issue those LSIDs. As IPNI improves the quality of the metadata associated with the LSIDs they issue (and if I understand correctly they do use the scenario 3) from Rich) so the quality of the metadata associated with a 'herb.IMI' LSID improves. The reason I prefer the data + metadata 'model' is that in this instance the data is fixed ... who changes collection/accession numbers? ... so perfect for this role. Even if a collection moves to a new owner the original data need not 'disappear' in the same way that DOIs move with the objects as book and journal titles change from one publisher to another.
So...if I understand correctly, you differ from my scenario 3 in that you do generate data-bearing LSIDs for specimens, but the data part is limited to only the Accession number, not the complete set of data fields associated with the record -- correct? So, in effect, the object the LSID actually applies to is the binary accession number, not the "concept" of the specimen. I can imagine in this case that the LSID can be thought of as representing the "concept of the specimen" because the accession number itself is a surrogate for the physical specimen. The only thing that concerns me about this approach is that there is a non-zero incidence of accidental duplicate catalog numbers within a given collection, and possibly errors in associating catalog numbers. For example, if the computer database for a collection had an error created by a technician who, for example, entered the metadata for accession number IMI1234569 by mistake, when it should have been IMI1234596 (and vice versa), then branding the accession number as "data" for the LSID means that the LSID technically *must* stay with the accession number (not the specimen associated with the metadata for that LSID), after the error is discovered. Not a huge problem, but could surprise people who had indexed the LSID before the error was discovered, who then came back to resolve it again after the error was fixed (i.e., they would get totally wrong information). Given how rare this problem is likely to be (against a backdrop of many far more likely problems we will have to overcome), I don't see this as a strong reason not to proceed with your plan.
Final point, the 'data' is the 'herb.IMI' accession number; in context this is a GUID because of the existence of Index Herbariorum. So, our data will be 123456 not IMI123456 because ... in the fullness of time we will include an Index Herbariorum LSID to 'identify' the 'institutional acronym' element of the metadata.
Is the binary data for the accession number in 8-bit, or 16-bit? I'm assuming 8-bit would be fine, as I suspect all collections would have accession numbers that can be rendered with 256-character ASCII. Is there any "wrapper" to the number as binary data, or is it a straight ASCII binary representation (e.g.: 001100010011001000110011001101000011010100110110 for "123456")?
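For reference, a "straight" representation of this kind is easy to generate and check. The sketch below assumes plain ASCII at 8 bits per character with no wrapper of any sort; the function names are hypothetical, and nothing here is meant to describe how 'herb.IMI' actually stores its data. Under that assumption, the 48-bit string quoted above corresponds to the six digits 123456.

```python
def ascii_bits(text: str) -> str:
    """Render `text` as a bit string, 8 bits per ASCII character, no wrapper."""
    data = text.encode("ascii")  # raises UnicodeEncodeError on non-ASCII input
    return "".join(format(byte, "08b") for byte in data)


def bits_to_ascii(bits: str) -> str:
    """Inverse of ascii_bits: decode a bit string back into ASCII text."""
    if len(bits) % 8 != 0:
        raise ValueError("bit string length must be a multiple of 8")
    chunks = (bits[i:i + 8] for i in range(0, len(bits), 8))
    return bytes(int(chunk, 2) for chunk in chunks).decode("ascii")


if __name__ == "__main__":
    print(ascii_bits("123456"))
    print(bits_to_ascii("001100010011001000110011001101000011010100110110"))
```

The same functions round-trip a value that carries the acronym, such as "IMI123456", so including or omitting the institutional prefix is a choice about content, not about encoding.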
I'm not sure I follow the logic of how embedding the accession number as data for the LSID allows the LSID to move to a new owner. I would think the opposite. Isn't it likely that the new owner would create their own accession number for the specimen? In this case, they would be forced to generate a new LSID if they were following the same practice of encoding the accession number as "data", rather than metadata.
Also, wouldn't it make more sense to include the acronym (IMI) as part of the data for the LSID? At least that way the "12345" would have *some* context.
Finally, this approach would work only for collections where there is a strict 1:1 correlation between accession numbers and specimen objects for which an LSID is desired.
Thanks for your comments -- this thread is already forcing me to think about things in a way I hadn't thought of them before.
Aloha, Rich
Richard L. Pyle, PhD Database Coordinator for Natural Sciences and Associate Zoologist in Ichthyology Department of Natural Sciences, Bishop Museum 1525 Bernice St., Honolulu, HI 96817 Ph: (808)848-4115, Fax: (808)847-8252 email: deepreef@bishopmuseum.org http://hbs.bishopmuseum.org/staff/pylerichard.html
tdwg-guid mailing list tdwg-guid@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-guid
Professor Roderic D. M. Page Editor, Systematic Biology DEEB, IBLS Graham Kerr Building University of Glasgow Glasgow G12 8QP United Kingdom
Phone: +44 141 330 4778 Fax: +44 141 330 2792 email: r.page@bio.gla.ac.uk web: http://taxonomy.zoology.gla.ac.uk/rod/rod.html iChat: aim://rodpage1962 reprints: http://taxonomy.zoology.gla.ac.uk/rod/pubs.html
Subscribe to Systematic Biology through the Society of Systematic Biologists Website: http://systematicbiology.org Search for taxon names: http://darwin.zoology.gla.ac.uk/~rpage/portal/ Find out what we know about a species: http://ispecies.org Rod's rants on phyloinformatics: http://iphylo.blogspot.com Rod's rants on ants: http://semant.blogspot.com
participants (7): Bob Morris; Jason Best; Paul Kirk; Richard Pyle; Roderic Page; Roger Hyam; Weitzman, Anna