
I've recently written a number of posts on the implications of the lack of specimen-level identifiers, which makes it very hard to link different sources of data together, such as GBIF and GenBank http://iphylo.blogspot.com/2012/02/linking-gbif-and-genbank.html , and is also a factor in creating duplicate records in GBIF http://iphylo.blogspot.com/2012/02/how-many-specimens-does-gbif-really.html
I know this is something of a hobby horse of mine, but we can have all the wonderful ontologies and vocabularies we want; if we don't have globally unique, shared identifiers to glue this stuff together, we are going to find ourselves making yet more silos...
Regards
Rod
--------------------------------------------------------- Roderic Page Professor of Taxonomy Institute of Biodiversity, Animal Health and Comparative Medicine College of Medical, Veterinary and Life Sciences Graham Kerr Building University of Glasgow Glasgow G12 8QQ, UK
Email: r.page@bio.gla.ac.uk Tel: +44 141 330 4778 Fax: +44 141 330 2792 AIM: rodpage1962@aim.com Facebook: http://www.facebook.com/profile.php?id=1112517192 Twitter: http://twitter.com/rdmpage Blog: http://iphylo.blogspot.com Home page: http://taxonomy.zoology.gla.ac.uk/rod/rod.html

I agree Rod, it would be ideal to have unique, shared identifiers for specimens, and for as many other types of data as possible. The problem here is the "shared" bit. This is what most people hope for, and hoped would come out of all the GUID and vocabulary work that has been done. But you know how hard it is to get different projects, organisations, and datasets to really share IDs. It is pretty much impossible, so I have moved on from this dream and hope to solve this through linkages and linked-data-type approaches instead.
Another problem is what the identifier refers to. As someone (I think Rich) said in a recent post, two different people may apply the same identifier to slightly different things, e.g. to the "name" of a person, or to the "person" itself. This is another barrier to reuse of shared identifiers. You may think that specimens should be very simple, since it is just a specimen that you refer to, but there can be subtle differences. For example, if one person has data about the accessioned physical specimen and another has an image of that specimen, both could well say that they are discussing the same specimen and so give these two "different" objects the same identifier.
Kevin
-----Original Message----- From: tdwg-tag-bounces@lists.tdwg.org [mailto:tdwg-tag-bounces@lists.tdwg.org] On Behalf Of Roderic Page Sent: Thursday, 23 February 2012 11:38 p.m. To: TDWG TAG Subject: [tdwg-tag] Specimen identifiers
_______________________________________________ tdwg-tag mailing list tdwg-tag@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-tag
Please consider the environment before printing this email Warning: This electronic message together with any attachments is confidential. If you receive it in error: (i) you must not read, use, disclose, copy or retain it; (ii) please contact the sender immediately by reply email and then delete the emails. The views expressed in this email may not be those of Landcare Research New Zealand Limited. http://www.landcareresearch.co.nz

Hi All, As I've said many times before, the "shared" bit is useful, but far less important than the "globally unique", "persistent", and "actionable" bits. As Kevin says, we can handle the non-shared GUIDs (as long as they meet the other three criteria) by simply building a cross-mapping service; but that's only useful to the extent that the identifiers are truly unique, persistent, and actionable (in that order of importance). Once we have a real infrastructure that achieves critical mass of adoption for integrating the silos, then I'm sure eventually our community will converge toward shared identifiers (specifically, towards the ones that are most robustly persistent, and provide the best services when actioned upon), and the superfluous identifiers will eventually fade into becoming historical metadata (like NODC numbers in the context of ITIS). But without an infrastructure to get people to come out of their silos and "plug in" to the biodiversity informatics "matrix", it's unlikely that we'll ever get to the point of collapsing multiple identical GUIDs into a single shared GUID for the same object. Aloha, Rich
-----Original Message----- From: tdwg-tag-bounces@lists.tdwg.org [mailto:tdwg-tag-bounces@lists.tdwg.org] On Behalf Of Kevin Richards Sent: Thursday, February 23, 2012 9:41 AM To: Roderic Page; TDWG TAG Subject: Re: [tdwg-tag] Specimen identifiers
This message is only intended for the addressee named above. Its contents may be privileged or otherwise protected. Any unauthorized use, disclosure or copying of this message or its contents is prohibited. If you have received this message by mistake, please notify us immediately by reply mail or by collect telephone call. Any personal opinions expressed in this message do not necessarily represent the views of the Bishop Museum.

Thought you all might be interested in a parallel universe, in which the Code4Lib community is also discussing this issue of shared identifiers, resolvability, and persistence at the same time as TDWG. Here's one particularly related message in that thread.... http://serials.infomotions.com/code4lib/archive/2012/201202/0680.html Jonathan makes some good points about the importance of shared identifiers, and some interesting remarks about whether expressing these in RDF really helps at all. Matt On Thu, Feb 23, 2012 at 12:09 PM, Richard Pyle <deepreef@bishopmuseum.org> wrote:

Dear Rich, I guess I'd argue the reverse, in that pumping data out with unique identifiers demonstrably doesn't get us very far. We've had "globally unique", "persistent", and "actionable" identifiers for years (LSIDs, URLs, DOIs, etc.) and very little to show for it. In other words, there isn't a biodiversity informatics "matrix" to "plug in" to. Building a "cross-mapping service" is not necessarily simple, again because the individual data providers rarely use existing identifiers for things outside their domain. Hence we have text strings for literature when perfectly good identifiers exist. The benefits of the "matrix" come from the links, and we aren't providing them. The notion this is all going to magically coalesce at some unspecified point in the future strikes me as wishful thinking. Someone is soon going to point out that the Emperor has no clothes... Regards Rod On 23 Feb 2012, at 21:09, Richard Pyle wrote:

I certainly agree; but we've been talking about shared identifiers for more than twenty years now, and that hasn't gotten us anywhere either. In fact, the focus on that goal may have hampered or delayed the development of the "matrix".
I used to be a HUGE proponent of shared identifiers, back in the day. It seemed like the obvious answer to data integration. But I've come to realize that the silos exist, that each silo has its own way of doing things that relies on its own internal identifiers, and that in most cases the silos don't have the resources (or the motivation) to update their systems to incorporate shared identifiers. Speaking as someone who manages the natural sciences data resources for a Museum, I CERTAINLY wouldn't want to make that sort of investment unless there were universally acknowledged identifiers for the data objects I manage (which, for most of our classes of data objects, there decidedly are not). I'd much rather leave my legacy systems in place, then build a small indexing system that cross-links my own identifiers to the ones that exist "out there" (TSNs, LSIDs, DOIs, OCLC, ISSN/ISBN, etc.). If others do the same, then I see that as a first step towards building a bridge between the silos.
By far, the most arduous part in the process is reconciling one's own identifiers against the identifiers of other data sources. This is particularly messy for items that are otherwise identified with "noisy text", like taxon names, people names, place names, and literature citations (which, collectively, represent by far the bulk of the records that overlap among datasets, and hence are the items for which we stand to gain the most by sharing the same identifiers). This work would have to be done anyway, regardless of whether we aimed for a shared identifier approach or a cross-mapped identifier approach. With the mapped-identifier approach, that's almost all the work that needs to be done. With the shared identifier approach, it's only part of the work that would need to be done; the other part is to upgrade all the applications used by all the silos to incorporate the shared identifiers. And, of course, there's the problem of actually converging on what the shared identifiers are.
It is indeed wishful thinking that by building the infrastructure, the identifiers will coalesce. However, it is also wishful thinking that all the players will ever come to an agreement on what the "one true" identifier is for shared objects, and moreover devote the necessary resources to convert existing systems to accommodate them. I've come to believe that the former wishful thinking is more plausible (if only slightly) than the latter wishful thinking. Also, the latter has had more time to demonstrate its infeasibility.
Aloha, Rich
From: Roderic Page [mailto:r.page@bio.gla.ac.uk] Sent: Thursday, February 23, 2012 11:54 AM To: TDWG TAG Cc: Kevin Richards; Richard Pyle Subject: Re: [tdwg-tag] Specimen identifiers
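As a rough illustration of the "small indexing system" Rich describes, the following Python sketch keeps a silo's own identifiers untouched and records links to external identifiers alongside them. It is only a sketch under stated assumptions: the class name `CrossMapIndex`, its methods, the scheme labels, and every identifier value are hypothetical placeholders, not anything proposed in the thread, and a real service would also need persistence and the reconciliation work Rich calls the arduous part.

```python
from collections import defaultdict


class CrossMapIndex:
    """Hypothetical index linking a silo's local IDs to external IDs."""

    def __init__(self):
        # local_id -> set of (scheme, external_id) pairs
        self._local_to_external = defaultdict(set)
        # (scheme, external_id) -> set of local_ids
        self._external_to_local = defaultdict(set)

    def link(self, local_id, scheme, external_id):
        """Record that local_id corresponds to external_id in the given scheme."""
        key = (scheme, external_id)
        self._local_to_external[local_id].add(key)
        self._external_to_local[key].add(local_id)

    def external_ids(self, local_id):
        """Return all external identifiers linked to a local record, sorted."""
        return sorted(self._local_to_external.get(local_id, set()))

    def local_ids(self, scheme, external_id):
        """Return all local records linked to an external identifier, sorted."""
        return sorted(self._external_to_local.get((scheme, external_id), set()))


index = CrossMapIndex()
# All identifiers below are made-up placeholders, not real records.
index.link("local:taxon/42", "TSN", "000000")
index.link("local:taxon/42", "LSID", "urn:lsid:example.org:names:42")
index.link("local:ref/7", "DOI", "10.0000/example.7")

print(index.external_ids("local:taxon/42"))
print(index.local_ids("DOI", "10.0000/example.7"))
```

The legacy system keeps using `local:taxon/42` internally; only the index knows about the external schemes, which is the point of the mapped-identifier approach as opposed to rewriting every application to use a shared identifier.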

On 23 February 2012 20:41, Kevin Richards <RichardsK@landcareresearch.co.nz> wrote:
Another problem is what the identifier refers to. As someone (I think Rich) said in a recent post, two different people may apply the same identifier to slightly different things - eg to the "name" of a person, or to the "person" itself. This is another barrier to reuse of shared identifiers.
My understanding of the semantic web is actually that giving DIFFERENT identifiers (and as Rich says: "globally unique", "persistent", and "actionable") to things is a GOOD thing. Depending on your purpose, two identifiers may or may not be the same. This is a standard problem, and I believe it is much more easily solved if I have 2 identifiers and separate sameAs assertions. I can use the existing sameAs as my default, but can easily differ in opinion (by using a contradictory sameAs, which may involve localizing and modifying the default sameAs, but at least it is solvable).

With regard to publications: Should a) an OpenAccess Preprint, b) an OpenAccess Postprint, and c) a ClosedAccess Elsevier Journal (well, rare to find these together :-) ) have the same ID or not? For many purposes they are sameAs, but not for all.

----

However, Rod is correct about the poor history (although I don't consider LSIDs actionable, they are about as actionable as the text strings - you can machine-resolve both, but...). I just think Rod should call for re-use of identifiers where it is believed to be identical for all purposes and new identifiers PLUS sameAs relations where uncertain.

Gregor
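The approach Gregor describes, separate identifiers plus sameAs assertions that a consumer can accept by default or locally override, might be sketched as below. This is only an illustration in Python: the identifiers (`ex:preprint` and so on) and the function name `same_as_clusters` are invented for the example and do not come from any existing vocabulary or tool.

```python
from collections import defaultdict


def same_as_clusters(identifiers, default_links, rejected=frozenset()):
    """Group identifiers into sameAs clusters.

    default_links: pairs asserted to be sameAs by some shared default source.
    rejected: pairs this particular consumer chooses NOT to accept.
    """
    parent = {i: i for i in identifiers}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in default_links:
        if (a, b) in rejected or (b, a) in rejected:
            continue  # local opinion overrides the default assertion
        parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for i in identifiers:
        clusters[find(i)].add(i)
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical identifiers for three versions of one article.
ids = ["ex:preprint", "ex:postprint", "ex:journal"]
links = [("ex:preprint", "ex:postprint"), ("ex:postprint", "ex:journal")]

# For citation counting, accept every default sameAs link.
citation_view = same_as_clusters(ids, links)

# For an access-oriented purpose, treat the paywalled version as distinct.
access_view = same_as_clusters(
    ids, links, rejected={("ex:postprint", "ex:journal")}
)

print(citation_view)
print(access_view)
```

The same set of identifiers yields one cluster for one purpose and two for another, which is the flexibility being argued for; the cost is that somebody has to publish and maintain the links.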

Dear Gregor, On 24 Feb 2012, at 08:53, Gregor Hagedorn wrote:
On 23 February 2012 20:41, Kevin Richards <RichardsK@landcareresearch.co.nz> wrote:
Another problem is what the identifier refers to. As someone (I think Rich) said in a recent post, two different people may apply the same identifier to slightly different things - eg to the "name" of a person, or to the "person" itself. This is another barrier to reuse of shared identifiers.
My understanding of the semantic web is actually that giving DIFFERENT identifiers (and as Rich says: "globally unique", "persistent", and "actionable") to things is a GOOD thing.
No, it's not ;)
Depending on your purpose, two identifiers may or may not be the same. This is a standard problem, and I believe it is much easier solved if I have 2 identifiers and separate sameAs assertions. I can use the existing sameAs as my default, but can easily differ in opinion (by using contradictory sameAs - which may involve localizing and modifying the default sameAs, but at least it is solvable.)
With regard to publications: Should a) an OpenAccess Preprint, b) an OpenAccess Postprint, and c) a ClosedAccess Elsevier Journal (well, rare to find these together :-) ) have the same ID or not?
For many purposes they are sameAs, but not for all.
One identifier for the publication (the "thing"), multiple identifiers for the different representations if you want to be that granular, but these all link to the overall id. Otherwise we have a mess. One of the most sensible things the science publishing industry did was assign DOIs to articles, not their individual representations (DOIs can support multiple resolutions, as can HTTP URIs through content negotiation). This means I can cite an article by linking to the DOI, and ignore what is for most purposes irrelevant (the representation). The citation network would be a hellish mess without this simplification. Contrast this with ISSNs, which are different depending on the representation (print or electronic). Result - mess, it's not clear what the unique identifier for a journal should be, and then people have to create tools to assert that two ISSNs are the "same". We seem determined to make this harder than it needs to be, especially for end users.
----
However, Rod is correct about the poor history (although I don't consider LSIDs actionable, they are about as actionable as the text strings - you can machine-resolve both, but...).
No, they are actionable, you just need the right tools. Despite the fact I think they suck, I consume them in numerous projects.
I just think Rod should call for re-use of identifiers where it is believed to be identical for all purposes and new identifiers PLUS sameAs relations where uncertain.
I'd make it simpler, just re-use identifiers wherever possible. If your metadata includes a bibliographic citation, include whatever bibliographic identifiers you can find (DOI, PubMed, ISSN, ISBN, etc.). Same for taxonomic names, if you include a name include associated identifiers. That way this cloud of data we are generating has a fighting chance of coalescing. I guess I'm letting my frustration show, but every time we introduce another layer of complexity, another epicycle in our models, a kitten dies. Regards Rod
Gregor
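To illustrate the point about one identifier serving several representations through content negotiation, here is a small Python sketch that prepares two requests for the same DOI with different `Accept` headers. The DOI is a made-up placeholder, nothing is sent over the network, and which media types are actually honoured depends on the DOI registration agency, so treat the media types as an assumption.

```python
import urllib.request

DOI_RESOLVER = "https://doi.org/"


def doi_request(doi, media_type):
    """Build (but do not send) a request for one representation of a DOI."""
    req = urllib.request.Request(DOI_RESOLVER + doi)
    req.add_header("Accept", media_type)
    return req


# Hypothetical placeholder DOI, used only for illustration.
EXAMPLE_DOI = "10.1234/example"

# Same identifier, two representations: a landing page for humans and
# citation metadata for machines.
html_req = doi_request(EXAMPLE_DOI, "text/html")
meta_req = doi_request(EXAMPLE_DOI, "application/vnd.citationstyles.csl+json")

print(html_req.full_url, html_req.get_header("Accept"))
print(meta_req.full_url, meta_req.get_header("Accept"))
```

The URL that gets cited is identical in both cases; only the requested representation differs.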

Perhaps the problem is that there is too little incentive in spending one's own resources, or giving finance to relations (sameAs or other). So this does not happen. Would the world look less grim if we had such incentives? But I guess that is a chicken and egg problem...

I disagree about the dois a bit: they make it real easy to find the behind-the-paywall article, but of course they make it real hard to find an open access version of the same article in an institutional or subject repository. I am not too charmed by this anyways, but the doi is a brilliant invention by commercial publishers to make it even less likely to work.

Otherwise I love dois, of course...

----

Gregor

Dear Gregor, On 24 Feb 2012, at 11:40, Gregor Hagedorn wrote:
Perhaps the problem is that there is too little incentive in spending one's own resources, or giving finance to relations (sameAs or other). So this does not happen. Would the world look less grim if we had such incentives? But I guess that is a chicken and egg problem...
I agree there seems little incentive to make the links. My concern is that if we don't invest in these then the promised benefits of RDF, etc. simply won't materialise.
I disagree about the dois a bit: they make it real easy to find the behind-the-paywall article, but of course they make it real hard to find an open access version of the same article in an institutional or subject repository. I am not too charmed by this anyways, but the doi is a brilliant invention by commercial publishers to make it even less likely to work.
Otherwise I love dois, of course...
Identity and access are two separate things, and DOIs are used by the flagship open access journals (and more recently by BHL). Finding open versions of an article is a separate problem (one which will, of course, be most efficiently solved with a service that uses DOIs to identify the article you are looking for). I think we've a lot to learn from DOIs and the associated infrastructure and services. Imagine if we'd been in charge of creating citation linking for journals... the horror, the horror. Regards Rod
----
Gregor

On 23/02/2012, at 9:37 PM, Roderic Page wrote:
I've recently written a number of posts on the implications of the lack of specimen-level identifiers, which makes it very hard to link different sources of data together, such as GBIF and Genbank http://iphylo.blogspot.com/2012/02/linking-gbif-and-genbank.html , and is also a factor in creating duplicate records in GBIF http://iphylo.blogspot.com/2012/02/how-many-specimens-does-gbif-really.html
This is definitely an issue. In AFD (which is not a specimen database), we hold a "museum code" and an "accession number" for type specimens. Ideally, I would like to be able to get from these two fields to a URI.

For instance, given the data

nameT | typeTypeT | museumT | museumDesc | accessonNo | materialElement | latLong | locality | comments
Holothuria bivittata Mitsukuri, 1912 | Syntype | TIU | Tokyo Imperial University, Tokyo, Japan | 1217 | | | Okinawa, Riu Kiu and Yayeyana Ils, Japan |
Holothuria bivittata Mitsukuri, 1912 | Syntype | TIU | Tokyo Imperial University, Tokyo, Japan | 1218 | | | |

I would like the AFD type specimen records (which are anonymous nodes in our profile data) to point to "http://collections.tiu.edu.jp/colleciton-X/1217" (or whatever), which could be generated from the data we already have. The key is the individual institutions holding collections.

The only way I can imagine this happening is for each institution with collections to state "you construct URIs from our accession numbers like so". With that declaration, stores exposing data (such as the boa silos) can perform the mapping when the news reaches them. Once this is in place, anyone handling (for instance) TIU accession numbers can publish correct URIs in their RDF. Most particularly, other institutions accepting specimens from TIU could publish that their new URI for the item is "owl:sameAs" the TIU one. And the whole thing begins to knit together.

Importantly: it is not necessary to actually make these URIs resolvable. Hopefully, one day there *would* be something at that URL which would issue a 303 redirect, but the existence of the identifier as an identifier doesn't rely on it. All that is needed is that commitment to the namespace on the part of the issuer.

My point is first, that this can be done in stages, and doesn't depend on everybody implementing a big and expensive solution right away or in synchrony; and second, that we don't need a top-down assignment of identifiers. A bottom-up solution can work. Perhaps the main thing missing is a forum on which an institution can announce its creation and assignment of a URI namespace for persistent identifiers.

Having said all that, Rod's point is about identification of individuals. An accession number is put on a "token", and of course a given individual may have many "tokens". A case in point is this record in AFD:

nameT | typeTypeT | museumT | museumDesc | accessonNo | materialElement | latLong | locality | comments
Bregmaceros pseudolanceolatus Torii, Javonillo & Ozawa, 2004 | Paratype | URM | University of the Ryukyus, Nishihara, Okinawa, Japan | P. 12156, 27508–27511, 29172, 29620, 33056 | | | |

The type specimen has 8 URM accession numbers, and there's really no way around that.

Even then, however, the question of identifying the individuals comes down to the same solution: if it's to happen, then it will have to be done by the curators of the collections - it's only the curators who actually know what items are from the same individual. A third party generating UUIDs for all these things just isn't going to work out - they won't get it right. What is needed is for the curator to announce, for instance, "individuals shall be identified by http://specimens.mymuseum.edu/<collection id>/<collector's field number for the individual>". It really doesn't matter how the URIs are done, as long as it's consistent, persistent, and public.
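The "you construct URIs from our accession numbers like so" idea could be sketched roughly as follows. Everything concrete here is an assumption made for illustration: the `example.org` URI templates, the registry dictionary, and the decision to drop the "P." prefix are not anything TIU, URM, or AFD has declared. The sketch also expands the accession range from the Bregmaceros record to show where the count of 8 comes from.

```python
import re

# Hypothetical registry of institutional declarations: museum code -> URI rule.
# A real rule would be published by the institution itself.
URI_TEMPLATES = {
    "TIU": "http://collections.example.org/TIU/{accession}",
    "URM": "http://collections.example.org/URM/{accession}",
}


def expand_accessions(text):
    """Expand 'P. 12156, 27508–27511, 29172' into individual numbers.

    Assumption: a leading alphabetic prefix such as 'P.' is dropped; how a
    prefix should appear in the URI is for the institution to declare.
    """
    numbers = []
    for part in re.split(r"\s*,\s*", text.strip()):
        part = re.sub(r"^[A-Za-z]+\.?\s*", "", part)
        bounds = re.split(r"\s*[–-]\s*", part)  # en dash or hyphen range
        if len(bounds) == 2:
            low, high = int(bounds[0]), int(bounds[1])
            numbers.extend(range(low, high + 1))
        else:
            numbers.append(int(part))
    return numbers


def mint_uris(museum_code, accession_text):
    """Apply an institution's declared template to each accession number."""
    template = URI_TEMPLATES[museum_code]
    return [template.format(accession=n) for n in expand_accessions(accession_text)]


tiu = mint_uris("TIU", "1217")
urm = mint_uris("URM", "P. 12156, 27508–27511, 29172, 29620, 33056")

print(tiu)
print(len(urm))
```

Note that nothing in this sketch requires the resulting URIs to resolve; it only requires that the template be announced somewhere and then left alone.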

Dear Paul,

A few quick comments.

Constructing URLs from specimen codes is a nice ideal, but in practice breaks down because museum acronyms are not globally unique, and specimen codes are not always unique within institutions (this is a big issue for vertebrate collections where the same code may be used for a fish, a herp, a mammal, and a bird). So we need ways to disambiguate these. The Darwin Core triplet I've been complaining about on my blog is one attempt to do this by using collectionCodes as part of the specimen code. But these are not terribly stable (a lot of the duplication in GBIF is due to museums mucking about with collection codes).

I personally don't hold out much hope for museums being able to develop and maintain rules for converting specimen codes into URIs. Let's be realistic, most museums have no idea about the web beyond creating pretty public interfaces. There are DiGiR servers at major museums running on machines with no domain name, just an IP address.

I suspect it's going to be easier to delegate resolving specimens to something like GBIF. As a data consumer, I'd much prefer going to one place and getting the codes resolved, rather than have to first figure out where to go to find out the rule. If I want metadata for a scientific article I go to CrossRef, not the individual publisher. Distributed begets centralised.

I think not insisting on resolvable identifiers is a big mistake. It's like saying it's OK to publish source code that you haven't actually bothered to check whether it compiles. If they don't have to resolve I can publish any identifier I want (witness the number of "fake" LSIDs in the wild) and I've made zero commitment that it means anything. And you've taken away the ability of the user to test whether your identifier is meaningful, and thus build any degree of trust. The acid test of whether you are serious is whether your identifiers are "live." The minute we say it's OK for them to be unresolvable we are buggered.
Regards Rod
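The two failure modes Rod mentions, codes that collide across collections within one institution and triplets that change when a collection code is renamed, can be made concrete with a few invented records. The institution code "XYZ", the collection codes, and the colon-joined triplet format are assumptions for this Python sketch only.

```python
from collections import Counter

# Invented records: (institutionCode, collectionCode, catalogNumber).
# The last row is the SAME fish as the first, re-exported after the museum
# renamed its collection code.
records = [
    ("XYZ", "Fish", "1234"),
    ("XYZ", "Herps", "1234"),
    ("XYZ", "Mammals", "1234"),
    ("XYZ", "Ichthyology", "1234"),
]


def triplet(record):
    """Join the three codes into a single triplet string."""
    return ":".join(record)


# Without the collection code, one key stands for several different animals.
pair_counts = Counter((inst, cat) for inst, _coll, cat in records)

# With the collection code, the renamed collection yields a fourth "specimen"
# even though only three physical specimens exist.
triplets = sorted({triplet(r) for r in records})

print(pair_counts[("XYZ", "1234")])
print(triplets)
```

The pair is too coarse and the triplet is too fragile, which is why a stable identifier that does not embed the collection code is attractive.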

On Fri, Feb 24, 2012 at 3:23 AM, Roderic Page <r.page@bio.gla.ac.uk> wrote:
[...]
I think not insisting on resolvable identifiers is a big mistake. It's like saying it's OK to publish source code that you haven't actually bothered to check whether it compiles. If they don't have to resolve I can publish any identifier I want (witness the number of "fake" LSIDs in the wild) and I've made zero commitment that it means anything. And you've taken away the ability of the user to test whether your identifier is meaningful, and thus build any degree of trust. The acid test of whether you are serious is whether your identifiers are "live." The minute we say it's OK for them to be unresolvable we are buggered.
Rod-

First, please forgive me for accusing you of thinking like a human. :-) Well of course I mean, thinking like a human about problems which have to be solved by machines.

Second, I agree with you that identifiers should be resolvable, but it is neither universally necessary nor does it always solve the problems one hopes. IMO for science data, the desire for resolution and dereferencing arises from the replicability practices in science that essentially require that original data supporting claims should always be examinable by third parties. But to a machine, this is not the only way, and sometimes not even the best way, to solve some problems that dereferencing solves. One alternative in information science lies in the theory and practice of software trust relationships, which, happily, often models the similarly named human theory and practice.

For example, to start with your analogy, there are important real cases where it is not actually necessary to compile your code to come to a belief that it is compilable. That case is the one where the source code has been generated by another program that is "known" to generate only compilable code.

Closer to the discussion at hand, consider a message that arrives at a software agent and whose content is, in human terms:

1. The URI http;//md5.hash/fb3d0c347e2c602f4ec650c0e777c1d3 designates specimen with accession number 3251 at the Harvard University Herbaria.
2. There are no other specimens at the Harvard University Herbaria with that accession number and never have been.
3. As of Fri Feb 24 14:36:45 UTC 2012 the most recent determination carried in the Harvard records for this specimen is Aus bus.
4. My name is Roderic Page http;//md5.hash/7cee01cb3cff705f850d15c357767ca0 and I approved this message.
5. This message has MD5 hash code 88f1c348afea5082f1f375910fe814f3 .

Even if NONE of the identifiers in the above are resolvable or dereferencable, and whether or not there is a dereferencable identifier at all for the specimen mentioned, there are scenarios in which the above kind of message is at least as trustworthy as information delivered via an http request based on an identifier for the specimen itself.

Going to the primary sources is a time honored scientific and scholarly practice---and following a community's human practices, can, if done with great care, produce more usable and trustworthy software than following the practices of software engineers--- but so is the use of trusted secondary sources, and the latter serve many purposes well. Hey, Rod, why do you think I read iPhylo at all? :-) Anyway, on the internet, \all/ acquisition of data and information is mediated by software, so in the end, trust by humans or software in the assertions about the real world that are delivered on the internet should never depend alone on whether the identifiers are resolvable and dereferencable.

Inside joke: in the message above, assertion 5 is the only one that would always have a very low probability of being correct. Why?

The Wikipedia plot summary of Borges' "The Library of Babel" ends with the wonderful paragraph, perhaps appropriate to tdwg-tag:

"Despite — indeed, because of — this glut of information, all books are totally useless to the reader, leaving the librarians in a state of suicidal despair. This leads some librarians to superstitions and cult-like behaviour, such as the "Purifiers", who arbitrarily destroy books they deem nonsense as they scour through the library seeking the "Crimson Hexagon" and its illustrated, magical books. Another is the belief that since all books exist in the library, somewhere one of the books must be a perfect index of the library's contents; some even believe that a messianic figure known as the "Man of the Book" has read it, and they travel through the library seeking him." http://en.wikipedia.org/w/index.php?title=Special:Cite&page=The_Library_of_B...

---
Bob Morris

--
Robert A. Morris
Emeritus Professor of Computer Science UMASS-Boston 100 Morrissey Blvd Boston, MA 02125-3390
IT Staff Filtered Push Project Harvard University Herbaria Harvard University
email: morris.bob@gmail.com web: http://efg.cs.umb.edu/ web: http://etaxonomy.org/mw/FilteredPush http://www.cs.umb.edu/~ram
===
The content of this communication is made entirely on my own behalf and in no way should be deemed to express official positions of The University of Massachusetts at Boston or Harvard University.
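One reading of the inside joke about assertion 5 is that a message cannot easily contain its own hash: writing the hash into the text changes the text, and therefore the hash. The Python sketch below illustrates that reading with a shortened, made-up message body; it is an interpretation, not something Bob spells out.

```python
import hashlib


def md5_hex(text):
    """Return the MD5 digest of a string as 32 hex characters."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Shortened stand-in for assertions 1-4 (wording invented for the example).
body = (
    "1. The URI designates specimen with accession number 3251.\n"
    "2. There are no other specimens with that accession number.\n"
    "3. The most recent determination is Aus bus.\n"
    "4. My name is Roderic Page and I approved this message.\n"
)

# Hash of the message BEFORE the hash line is appended.
claimed = md5_hex(body)

# Appending the claim alters the message, so its hash is no longer `claimed`.
full_message = body + "5. This message has MD5 hash code " + claimed + " .\n"
actual = md5_hex(full_message)

print(claimed)
print(actual)
print(claimed == actual)
```

A workable convention is to state that the hash covers everything except the hash line itself, which is roughly what signed messages do in practice.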

Dear Bob, Perhaps I used "trust" a little too loosely. It's not so much whether I trust your content (a whole separate question), it's whether I can trust your identifiers. Put another way, if identifiers are cheap to create, and there's no expectation that they resolve, then we can end up with identifiers that have no value, in which case why would I use them? I "trust" DOIs because they tend to work, they cost money, and the agencies that issue them frown on them not working. Hence it's unlikely that someone is going to use one to identify some data and make no commitment that the identifier will resolve, and that it will resolve to something useful. Given that, I'm more confident of linking my data to a DOI than, say, a URL from a publisher's web site. Given that I want to link stuff together I am reliant on using other people's identifiers to make those links. If those identifiers are labile then my hard work may be all for nought. So I need some way of judging whether an identifier is likely to persist or not (this may influence whether I decide to rely on the external resource being around, or whether I cache it locally, for example). So I guess I'm using the resolvability of identifiers as a proxy of whether to take someone seriously or not. If you can't be bothered to make them resolvable, then you clearly don't value your own content, and therefore why should I? Regards Rod

On 25/02/2012, at 11:46 PM, Roderic Page wrote:
Put another way, if identifiers are cheap to create, and there's no expectation that they resolve, then we can end up with identifiers that have no value, in which case why would I use them?
Which is why I made the point that commitment on the part of the organisation is key to the process. It has to be treated as serious business. One manifestation of this, for instance, is that AFD type specimen data available at our SPARQL service here is in blank RDF nodes attached to the names. I (the techo) *could* simply create IDs for them from the database sequence number, but it is terribly important that I not do so. When we do expose data from the herbarium database, there will be identifiers for the specimens that will be correctly built and of ongoing value. The identifications in that database, BTW, have APNI name ids on them. So it will all link together as it should.

This is directly in response to Rod's response to Paul. I think the two of you may have just articulated nearly the same idea, though you seem not to think you did.
Paul envisions institutions each declaring their own URI-creating formula (to resolve down to a specimen at that institution), promulgated at a "forum" location.
Rod envisions URI formulation as happening at a GBIFesque centralized site.
If Paul's forum were GBIF (or similar), with an added function that GBIF (or similar) renegotiates any institutional declaration that collides with a pre-existing declaration, does that map to the same thing for both of you?
-Dean -- Dean Pentcheff pentcheff@gmail.com dpentche@nhm.org
On Fri, Feb 24, 2012 at 12:23 AM, Roderic Page <r.page@bio.gla.ac.uk> wrote:
Dear Paul,
A few quick comments.
Constructing URLs from specimen codes is a nice ideal, but in practice it breaks down because museum acronyms are not globally unique, and specimen codes are not always unique within institutions (this is a big issue for vertebrate collections, where the same code may be used for a fish, a herp, a mammal, and a bird). So we need ways to disambiguate these. The Darwin Core triplet I've been complaining about on my blog is one attempt to do this by using collectionCodes as part of the specimen code. But these are not terribly stable (a lot of the duplication in GBIF is due to museums mucking about with collection codes).
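To make the duplication problem concrete, here is a minimal Python sketch. It assumes the triplet is written as institutionCode:collectionCode:catalogNumber; the institution code "XYZ", the collection codes and the catalogue number are invented for illustration and are not taken from GBIF.

```python
def triplet(institution_code, collection_code, catalog_number):
    """Join the three Darwin Core fields into one colon-separated key."""
    return f"{institution_code}:{collection_code}:{catalog_number}"

# Hypothetical records: one physical specimen published twice,
# before and after the museum renamed its collection code.
records = [
    {"institutionCode": "XYZ", "collectionCode": "Herp", "catalogNumber": "12345"},
    {"institutionCode": "XYZ", "collectionCode": "Herpetology", "catalogNumber": "12345"},
]

keys = {
    triplet(r["institutionCode"], r["collectionCode"], r["catalogNumber"])
    for r in records
}

# Two distinct keys now stand for the same specimen.
print(len(keys))  # 2
```

Dropping the collection code would collapse these two keys into one, but it would also merge a fish and a bird that happen to share a catalogue number, which is the ambiguity described above.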
I personally don't hold out much hope for museums being able to develop and maintain rules for converting specimen codes into URIs. Let's be realistic: most museums have no idea about the web beyond creating pretty public interfaces. There are DiGIR servers at major museums running on machines with no domain name, just an IP address.
I suspect it's going to be easier to delegate resolving specimens to something like GBIF. As a data consumer, I'd much prefer going to one place and getting the codes resolved, rather than having to first figure out where to go to find out the rule. If I want metadata for a scientific article I go to CrossRef, not the individual publisher. Distributed begets centralised.
I think not insisting on resolvable identifiers is a big mistake. It's like saying it's OK to publish source code without bothering to check whether it compiles. If they don't have to resolve, I can publish any identifier I want (witness the number of "fake" LSIDs in the wild) and I've made zero commitment that it means anything. And you've taken away the ability of the user to test whether your identifier is meaningful, and thus build any degree of trust. The acid test of whether you are serious is whether your identifiers are "live." The minute we say it's OK for them to be unresolvable we are buggered.
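The "live" test could be automated along these lines. This is only a sketch: the function names is_live and default_fetch are invented here, treating any 2xx or 3xx status as "live" is an assumption, and the example URIs use the reserved .example domain.

```python
import urllib.error
import urllib.request

def default_fetch(uri, timeout=10):
    """Send an HTTP HEAD request; return the status code, or None on failure."""
    request = urllib.request.Request(uri, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code          # the server answered, but with an error status
    except (urllib.error.URLError, OSError, ValueError):
        return None              # no DNS entry, no route, timeout, malformed URI

def is_live(uri, fetch=default_fetch):
    """Treat an identifier as live if it resolves to a 2xx or 3xx response."""
    status = fetch(uri)
    return status is not None and 200 <= status < 400

# Offline demonstration with a stand-in fetcher (assumed responses).
fake_web = {"http://specimens.museum.example/1217": 200}
for uri in ("http://specimens.museum.example/1217",
            "http://specimens.museum.example/9999"):
    print(uri, is_live(uri, fetch=lambda u: fake_web.get(u)))
```

Passing the fetcher in as a parameter keeps the check testable without network access; a real harvest would simply call is_live(uri) and use the default.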
Regards
Rod
On 24 Feb 2012, at 06:14, Paul Murray wrote:
On 23/02/2012, at 9:37 PM, Roderic Page wrote:
I've recently written a number of posts on the implications of the lack of specimen-level identifiers, which makes it very hard to link different sources of data together, such as GBIF and Genbank http://iphylo.blogspot.com/2012/02/linking-gbif-and-genbank.html , and is also a factor in creating duplicate records in GBIF http://iphylo.blogspot.com/2012/02/how-many-specimens-does-gbif-really.html
This is definitely an issue. In AFD (which is not a specimen database), we hold a "museum code" and an "accession number" for type specimens. Ideally, I would like to be able to get from these two fields to a URI.
For instance, given the data
nameT | typeTypeT | museumT | museumDesc | accessonNo | materialElement | latLong | locality | comments
Holothuria bivittata Mitsukuri, 1912 | Syntype | TIU | Tokyo Imperial University, Tokyo, Japan | 1217 | | | Okinawa, Riu Kiu and Yayeyana Ils, Japan |
Holothuria bivittata Mitsukuri, 1912 | Syntype | TIU | Tokyo Imperial University, Tokyo, Japan | 1218 | | | |
I would like the AFD type specimen records (which are anonymous nodes in our profile data) to point to "http://collections.tiu.edu.jp/colleciton-X/1217" (or whatever), which could be generated from the data we already have. The key is the individual institutions holding collections.
The only way I can imagine this happening is for each institution with collections to state "you construct URIs from our accession numbers like so". With that declaration, stores exposing data (such as the boa silos) can perform the mapping when the news reaches them. Once this is in place, anyone handling (for instance) TIU accession numbers can publish correct URIs in their RDF. Most particularly, other institutions accepting specimens from TIU could publish that their new URI for the item is "owl:sameAs" the TIU one. And the whole thing begins to knit together.
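A minimal Python sketch of such a declaration and the mapping it enables follows. The template string, the .example domains and the function names are all hypothetical; no institution in this thread has published a formula like this.

```python
from urllib.parse import quote

# Hypothetical declarations: museum code -> URI template announced by the institution.
URI_FORMULAS = {
    "TIU": "http://collections.tiu.example/{accession}",
}

def specimen_uri(museum_code, accession_no, formulas=URI_FORMULAS):
    """Build a specimen URI, or return None if the museum has declared no formula."""
    template = formulas.get(museum_code)
    if template is None:
        return None
    # Percent-encode the accession number so spaces or slashes cannot break the URI.
    return template.format(accession=quote(str(accession_no), safe=""))

def same_as_triple(new_uri, original_uri):
    """Serialise an owl:sameAs statement as a single N-Triples line."""
    return f"<{new_uri}> <http://www.w3.org/2002/07/owl#sameAs> <{original_uri}> ."

tiu_uri = specimen_uri("TIU", "1217")
print(tiu_uri)
# An institution that later receives the specimen links its own (assumed) URI back.
print(same_as_triple("http://specimens.newhome.example/A-42", tiu_uri))
```

A data store holding only the museum code and accession number can apply the mapping as soon as a declaration appears, and simply keeps its anonymous nodes for museums that have not yet declared one.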
Importantly: it is not necessary to actually make these URIs resolvable. Hopefully, one day there *would* be something at that URL which would issue a 303 redirect, but the existence of the identifier as an identifier doesn't rely on it. All that is needed is that commitment to the namespace on the part of the issuer.
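For the day when there is something at that URL, a 303 redirect from the specimen identifier to a document describing it could look roughly like the WSGI sketch below. The paths and target URLs are invented, and a real service would look records up in the collection database rather than in a dictionary.

```python
# Hypothetical mapping from identifier path to a document describing the specimen.
DESCRIPTIONS = {
    "/specimen/1217": "http://data.museum.example/specimen/1217.rdf",
}

def app(environ, start_response):
    """WSGI application: answer known identifiers with 303 See Other."""
    target = DESCRIPTIONS.get(environ.get("PATH_INFO", ""))
    if target is None:
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"Unknown identifier\n"]
    # The identifier names the specimen itself; the Location header names the description.
    start_response("303 See Other", [("Location", target)])
    return [b""]
```

To try it locally, the standard library's wsgiref.simple_server.make_server("", 8000, app).serve_forever() would serve it on port 8000.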
My point is first, that this can be done in stages, and doesn't depend on everybody implementing a big and expensive solution right away or in synchrony; and second, that we don't need a top-down assignment of identifiers. A bottom-up solution can work. Perhaps the main thing missing is a forum on which an institution can announce its creation and assignment of a URI namespace for persistent identifiers.
Having said all that, Rod's point is about identification of individuals. An accession number is put on a "token"; of course, a given individual may have many "tokens". A case in point is this record in AFD:
nameT | typeTypeT | museumT | museumDesc | accessonNo | materialElement | latLong | locality | comments
Bregmaceros pseudolanceolatus Torii, Javonillo & Ozawa, 2004 | Paratype | URM | University of the Ryukyus, Nishihara, Okinawa, Japan | P. 12156, 27508–27511, 29172, 29620, 33056 | | | |
The type specimen has 8 URM accession numbers, and there's really no way around that.
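The count of 8 can be checked by expanding the accession field. The sketch below assumes that "27508–27511" is an inclusive range of consecutive numbers and that the "P." prefix applies to every entry; neither assumption is stated in the AFD record itself.

```python
import re

# A number, optionally followed by an en dash or hyphen and a second number.
RANGE = re.compile(r"(\d+)(?:\s*[\u2013-]\s*(\d+))?")

def expand_accessions(text):
    """Expand a field like 'P. 12156, 27508–27511' into individual numbers."""
    numbers = []
    for match in RANGE.finditer(text):
        start = int(match.group(1))
        end = int(match.group(2)) if match.group(2) else start
        numbers.extend(range(start, end + 1))
    return numbers

field = "P. 12156, 27508\u201327511, 29172, 29620, 33056"
accessions = expand_accessions(field)
print(len(accessions))  # 8
print(accessions)
```

Any URI formula for this collection has to decide whether each of those numbers gets its own URI or whether the type specimen as a whole does, which is the token-versus-individual question raised above.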
Even then, however, the question of identifying the individuals comes down to the same solution: if it's to happen, then it will have to be done by the curators of the collections - it's only the curators who actually know what items are from the same individual. A third party generating UUIDs for all these things just isn't going to work out - they won't get it right. What is needed is for the curator to announce, for instance, "individuals shall be identified by http://specimens.mymuseum.edu/<collection id>/<collector's field number for the individual>". It really doesn't matter how the URIs are done, as long as it's consistent, persistent, and public.
_______________________________________________ tdwg-tag mailing list tdwg-tag@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-tag

Dear Dean,
In essence, yes, so long as we:
a) avoid collisions due to non-unique acronyms (hence we can't automatically generate URIs from specimen codes without some fussing)
b) realise that we can't necessarily unpack a URI and use that to locate the specimen (often we could, sometimes we won't be able to, in this sense the identifiers are "opaque")
c) avoid changing the URI if a specimen moves collection/institution or if the host institution relabels it. Once minted the identifier doesn't change (because that will break any links to it, defeating the point of having the URIs).
Regards
Rod
On 24 Feb 2012, at 17:29, Dean Pentcheff wrote:
This is directly in response to Rod's response to Paul. I think the two of you may have just articulated nearly the same idea, though you seem not to think you did.
Paul envisions institutions each declaring their own URI-creating formula (to resolve down to a specimen at that institution), promulgated at a "forum" location.
Rod envisions URI formulation as happening at a GBIFesque centralized site.
If Paul's forum were GBIF (or similar), with an added function that GBIF (or similar) renegotiates any institutional declaration that collides with a pre-existing declaration, does that map to the same thing for both of you?
-Dean -- Dean Pentcheff pentcheff@gmail.com dpentche@nhm.org
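The renegotiation step could be sketched as a registry that refuses a declaration when it clashes with one already accepted. Everything here is hypothetical: the class names, the rule that a clash means a shared acronym or an identical template, and the second institution in the example.

```python
class FormulaCollision(Exception):
    """Raised when a proposed listing clashes with one already accepted."""

class FormulaRegistry:
    def __init__(self):
        self._listings = {}  # acronym -> (institution, URI template)

    def register(self, acronym, institution, template):
        """Accept a declaration unless another institution already holds it."""
        for known_acronym, (known_inst, known_template) in self._listings.items():
            if known_inst == institution:
                continue  # an institution does not collide with itself
            if known_acronym == acronym:
                raise FormulaCollision(
                    f"acronym {acronym!r} is already listed by {known_inst}")
            if known_template == template:
                raise FormulaCollision(
                    f"template {template!r} is already listed by {known_inst}")
        self._listings[acronym] = (institution, template)

registry = FormulaRegistry()
registry.register("URM", "University of the Ryukyus",
                  "http://specimens.urm.example/{accession}")
try:
    # An invented second institution trying to claim the same acronym.
    registry.register("URM", "Hypothetical Other Museum",
                      "http://specimens.other.example/{accession}")
except FormulaCollision as err:
    print("renegotiate:", err)
```

The rejected institution would then come back with a different acronym or namespace, which is the negotiation a central forum would have to mediate.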

Yes. And I would qualify what you said as follows:
On Sat, Feb 25, 2012 at 4:55 AM, Roderic Page <r.page@bio.gla.ac.uk> wrote:
Dear Dean,
In essence, yes, so long as we:
a) avoid collisions due to non-unique acronyms (hence we can't automatically generate URIs from specimen codes without some fussing)
It's the function of the centralizing agency to ensure this when they accept a "listing formula" from an organization. If the URI-generating formula could result in a collision with an existing listing, the formula would have to be renegotiated before being accepted for the registry.
b) realise that we can't necessarily unpack a URI and use that to locate the specimen (often we could, sometimes we won't be able to, in this sense the identifiers are "opaque")
Yep. I'm all in favor of opaque, non-information-bearing identifiers. The moment you accuse the text of the identifier of having intrinsic meaning, you accept all the ugliness of figuring out how to "update" the identifier when the underlying data are updated. [Real-life case in point: some departments in our institution had a system of minting specimen IDs based on the year of collection plus other digits. With some frequency we discover that the specimens were actually collected in some other year. So either: (a) we change the identifier (unacceptable for all the reasons we know and love); or (b) we know that we cannot trust the year-part of any identifier (so we used this formula why?).]
c) avoid changing the URI if a specimen moves collection/institution or if the host institution relabels it. Once minted the identifier doesn't change (because that will break any links to it, defeating the point of having the URIs).
Yes. It's supposed to be a non-data-bearing opaque identifier. In the worse (but inevitable) case where specimens get additional identifiers, or get subsampled into additional identifiable pieces, there has to be a "synonymy" service that would (perhaps recursively) return the other relevant identifiers. That would be (cough, cough) trivial to implement as long as any subsequent identifier assignment includes a reference to the already-existing identifier.
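A rough Python sketch of such a "synonymy" service follows, assuming every new identifier is registered together with a reference to one that already exists. The class name and identifiers are invented, and treating a subsample as a plain synonym of its parent is a simplification.

```python
from collections import defaultdict, deque

class SynonymyService:
    def __init__(self):
        self._links = defaultdict(set)  # identifier -> directly linked identifiers

    def assign(self, new_id, existing_id):
        """Record that new_id was minted with a reference to existing_id."""
        self._links[new_id].add(existing_id)
        self._links[existing_id].add(new_id)

    def synonyms(self, identifier):
        """Return every identifier reachable from this one, following links transitively."""
        seen = {identifier}
        queue = deque([identifier])
        while queue:
            current = queue.popleft()
            for neighbour in self._links.get(current, ()):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        seen.discard(identifier)
        return sorted(seen)

service = SynonymyService()
service.assign("museumB:77", "museumA:1217")       # specimen moved and was relabelled
service.assign("museumB:77-tissue", "museumB:77")  # a subsample was taken later
print(service.synonyms("museumA:1217"))  # ['museumB:77', 'museumB:77-tissue']
```

The traversal is what makes the lookup "recursive": the original identifier finds the tissue sample even though nothing links the two directly.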

Dear Dean,
On 26 Feb 2012, at 21:39, Dean Pentcheff wrote:
Yep. I'm all in favor of opaque, non-information-bearing identifiers. The moment you accuse the text of the identifier of having intrinsic meaning, you accept all the ugliness of figuring out how to "update" the identifier when the underlying data are updated. [Real-life case in point: some departments in our institution had a system of minting specimen IDs based on the year of collection plus other digits. With some frequency we discover that the specimens were actually collected in some other year. So either: (a) we change the identifier (unacceptable for all the reasons we know and love); or (b) we know that we cannot trust the year-part of any identifier (so we used this formula why?).]
One thing I'd clarify here is that opacity != obscurity. I'm OK with identifiers being generated from metadata so that they appear human-readable (which also means they may be hackable), so long as we accept that our interpretation may be wrong. Apart from human-readable, hackable identifiers being easier to work with in some situations (e.g., when you are harvesting data, navigating a hierarchy of records, or trying to figure out whether an identifier is a typo), I worry that people misinterpret opacity (this identifier might look meaningful, but caveat emptor) as requiring obscurity (e.g., I will use UUIDs because nobody will try to interpret those).
Obviously, over time the ability to interpret an identifier will decay, and it will be impossible to reliably interpret them as carrying meaning, and that's fine. And many of the things non-opaque identifiers are useful for would disappear if we had decent services (e.g., ways to download data), but I think creating deliberately opaque identifiers from the start may be a mistake. Nobody likes UUIDs, and no amount of "yeah, but in an ideal world you'll never see them" softens the fact they look ugly (and are often connected to technologies such as LSIDs that nobody gets).
In other words, let's have identifiers that aren't ugly, and which are connected to some immediately valuable services so we provide value to people, rather than ugly identifiers with no obvious benefits.
Regards
Rod
Yes. And I would qualify what you said as follows:
On Sat, Feb 25, 2012 at 4:55 AM, Roderic Page <r.page@bio.gla.ac.uk> wrote:
Dear Dean,
In essence, yes, so long as we:
a) avoid collisions due to non-unique acronyms (hence we can't automatically generate URIs from specimen codes without some fussing)
It's the function of the centralizing agency to ensure this when they accept a "listing formula" from an organization. If the URI-generating formula could result in a collision with an existing listing, the formula would have to be renegotiated before being accepted for the registry.
b) realise that we can't necessarily unpack a URI and use that to locate the specimen (often we could, sometimes we won't be able to, in this sense the identifiers are "opaque")
Yep. I'm all in favor of opaque, non-information-bearing identifiers. The moment you accuse the text of the identifier of having intrinsic meaning, you accept all the ugliness of figuring out how to "update" the identifier when the underlying data are updated. [Real-life case in point: some departments in our institution had a system of minting specimen IDs based on the year of collection plus other digits. With some frequency we discover that the specimens were actually collected in some other year. So either: (a) we change the identifier (unacceptable for all the reasons we know and love); or (b) we know that we cannot trust the year-part of any identifier (so we used this formula why?).]
c) avoid changing the URI if a specimen moves collection/institution or if the host institution relabels it. Once minted the identifier doesn't change (because that will break any links to it, defeating the point of having the URIs).
Yes. It's supposed to be a non-data-bearing opaque identifier. In the worse (but inevitable) case where specimens get additional identifiers, or get subsampled into additional identifiable pieces, there has to be a "synonymy" service that would (perhaps recursively) return the other relevant identifiers. That would be (cough, cough) trivial to implement as long as any subsequent identifier assignment includes a reference to the already-existing identifier.
Regards
Rod
On 24 Feb 2012, at 17:29, Dean Pentcheff wrote:
This is directly in response to Rod's response to Paul. I think the two of you may have just articulated nearly the same idea, though you seem not to think you did.
Paul envisions institutions each declaring their own URI-creating formula (to resolve down to a specimen at that institution), promulgated at a "forum" location.
Rod envisions URI formulation as happening at a GBIFesque centralized site.
If Paul's forum were GBIF (or similar), with an added function that GBIF (or similar) renegotiates any institutional declaration that collides with a pre-existing declaration, does that map to the same thing for both of you?
-Dean -- Dean Pentcheff pentcheff@gmail.com dpentche@nhm.org
On Fri, Feb 24, 2012 at 12:23 AM, Roderic Page <r.page@bio.gla.ac.uk> wrote:
Dear Paul,
A few quick comments.
Constructing URLs from specimen codes is a nice ideal, but in practice it breaks down because museum acronyms are not globally unique, and specimen codes are not always unique within institutions (this is a big issue for vertebrate collections, where the same code may be used for a fish, a herp, a mammal, and a bird). So we need ways to disambiguate these. The Darwin Core triplet I've been complaining about on my blog is one attempt to do this by using collectionCodes as part of the specimen code. But these are not terribly stable (a lot of the duplication in GBIF is due to museums mucking about with collection codes).
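To make the collision concrete, here is a small sketch with invented records. The institution code "XYZ", the collection codes and the catalog number are all hypothetical. It shows the same number reused across four collections, the triplet telling them apart, and the triplet changing when a collection code is renamed.

```python
from collections import Counter

# Hypothetical records: (institution code, collection code, catalog number).
records = [
    ("XYZ", "Fish", "1234"),
    ("XYZ", "Herps", "1234"),
    ("XYZ", "Mammals", "1234"),
    ("XYZ", "Birds", "1234"),
]


def triplet(institution, collection, catalog_number):
    """Build a Darwin Core style triplet string."""
    return f"{institution}:{collection}:{catalog_number}"


# Without the collection code, four different specimens share one key.
by_number = Counter(f"{inst}:{num}" for inst, _, num in records)

# With the collection code, each specimen gets its own key.
by_triplet = Counter(triplet(*record) for record in records)

# Renaming the collection code silently produces a "new" specimen.
before = triplet("XYZ", "Fish", "1234")
after = triplet("XYZ", "Ichthyology", "1234")
print(by_number["XYZ:1234"], max(by_triplet.values()), before == after)
```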
I personally don't hold out much hope for museums being able to develop and maintain rules for converting specimen codes into URIs. Let's be realistic, most museums have no idea about the web beyond creating pretty public interfaces. There are DiGiR servers at major museums running on machines with no domain name, just an IP address.
I suspect it's going to be easier to delegate resolving specimens to something like GBIF. As a data consumer, I'd much prefer going to one place and getting the codes resolved, rather than having to first figure out where to go to find out the rule. If I want metadata for a scientific article I go to CrossRef, not the individual publisher. Distributed begets centralised.
I think not insisting on resolvable identifiers is a big mistake. It's like saying it's OK to publish source code without bothering to check whether it compiles. If they don't have to resolve I can publish any identifier I want (witness the number of "fake" LSIDs in the wild) and I've made zero commitment that it means anything. And you've taken away the ability of the user to test whether your identifier is meaningful, and thus build any degree of trust. The acid test of whether you are serious is whether your identifiers are "live." The minute we say it's OK for them to be unresolvable we are buggered.
Regards
Rod
On 24 Feb 2012, at 06:14, Paul Murray wrote:
On 23/02/2012, at 9:37 PM, Roderic Page wrote:
I've recently written a number of posts on the implications of the lack of specimen-level identifiers, which makes it very hard to link different sources of data together, such as GBIF and Genbank http://iphylo.blogspot.com/2012/02/linking-gbif-and-genbank.html , and is also a factor in creating duplicate records in GBIF http://iphylo.blogspot.com/2012/02/how-many-specimens-does-gbif-really.html
This is definitely an issue. In AFD (which is not a specimen database), we hold a "museum code" and an "accession number" for type specimens. Ideally, I would like to be able to get from these two fields to a URI.
For instance, given the data
nameT | typeTypeT | museumT | museumDesc | accessonNo | materialElement | latLong | locality | comments
Holothuria bivittata Mitsukuri, 1912 | Syntype | TIU | Tokyo Imperial University, Tokyo, Japan | 1217 | | | Okinawa, Riu Kiu and Yayeyana Ils, Japan |
Holothuria bivittata Mitsukuri, 1912 | Syntype | TIU | Tokyo Imperial University, Tokyo, Japan | 1218 | | | |
I would like the AFD type specimen records (which are anonymous nodes in our profile data) to point to "http://collections.tiu.edu.jp/colleciton-X/1217" (or whatever), which could be generated from the data we already have. The key is the individual institutions holding collections.
The only way I can imagine this happening is for each institution with collections to state "you construct URIs from our accession numbers like so". With that declaration, stores exposing data (such as the boa silos) can perform the mapping when the news reaches them. Once this is in place, anyone handling (for instance) TIU accession numbers can publish correct URIs in their RDF. Most particularly, other institutions accepting specimens from TIU could publish that their new URI for the item is "owl:sameAs" the TIU one. And the whole thing begins to knit together.
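A minimal sketch of that declaration-plus-mapping idea follows. The template registry, the ".example" host names and the function names are all assumptions made for illustration. Only the museum code "TIU", the accession number 1217 and the use of owl:sameAs come from the message.

```python
# Hypothetical registry of templates that institutions have publicly declared.
URI_TEMPLATES = {
    "TIU": "http://collections.tiu.example/{accession}",
}

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def specimen_uri(museum_code, accession_no):
    """Apply the institution's declared template to an accession number."""
    return URI_TEMPLATES[museum_code].format(accession=accession_no)


def same_as_triple(new_uri, original_uri):
    """Return an N-Triples line stating that two URIs denote the same item."""
    return f"<{new_uri}> <{OWL_SAME_AS}> <{original_uri}> ."


tiu_uri = specimen_uri("TIU", "1217")
# A second (hypothetical) institution that has accepted the specimen.
new_uri = "http://specimens.othermuseum.example/H-0042"
print(same_as_triple(new_uri, tiu_uri))
```

Anyone holding only the museum code and accession number can build the same URI without contacting the institution, which is the point of publishing the rule.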
Importantly: it is not necessary to actually make these URIs resolvable. Hopefully, one day there *would* be something at that URL which would issue a 303 redirect, but the existence of the identifier as an identifier doesn't rely on it. All that is needed is that commitment to the namespace on the part of the issuer.
My point is first, that this can be done in stages, and doesn't depend on everybody implementing a big and expensive solution right away or in synchrony; and second, that we don't need a top-down assignment of identifiers. A bottom-up solution can work. Perhaps the main thing missing is a forum on which an institution can announce its creation and assignment of a URI namespace for persistent identifiers.
Having said all that, Rod's point is about identification of individuals. An accession number is put on a "token", and of course a given individual may have many "tokens". A case in point is this record in AFD:
nameT | typeTypeT | museumT | museumDesc | accessonNo | materialElement | latLong | locality | comments
Bregmaceros pseudolanceolatus Torii, Javonillo & Ozawa, 2004 | Paratype | URM | University of the Ryukyus, Nishihara, Okinawa, Japan | P. 12156, 27508–27511, 29172, 29620, 33056 | | | |
The type specimen has 8 URM accession numbers, and there's really no way around that.
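The count of 8 can be checked by expanding the accession field. This sketch assumes that "P." is a prefix applying to the whole list and that "27508–27511" is an inclusive range. Both are readings of the record, not something AFD documents here, and the function name is made up.

```python
import re

# A range is two numbers joined by an en dash or a hyphen.
RANGE = re.compile(r"(\d+)\s*[\u2013-]\s*(\d+)")
SINGLE = re.compile(r"\d+")


def expand_accessions(text):
    """Expand a comma-separated accession list, unrolling inclusive ranges."""
    numbers = []
    for part in text.split(","):
        range_match = RANGE.search(part)
        if range_match:
            start, end = int(range_match.group(1)), int(range_match.group(2))
            numbers.extend(range(start, end + 1))
            continue
        single_match = SINGLE.search(part)
        if single_match:
            numbers.append(int(single_match.group(0)))
    return numbers


accessions = expand_accessions("P. 12156, 27508–27511, 29172, 29620, 33056")
print(len(accessions), accessions)
```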
Even then, however, the question of identifying the individuals comes down to the same solution: if it's to happen, then it will have to be done by the curators of the collections - it's only the curators who actually know what items are from the same individual. A third party generating UUIDs for all these things just isn't going to work out - they won't get it right. What is needed is for the curator to announce, for instance, "individuals shall be identified by http://specimens.mymuseum.edu/<collection id>/<collector's field number for the individual>". It really doesn't matter how the URIs are done, as long as it's consistent, persistent, and public.
_______________________________________________ tdwg-tag mailing list tdwg-tag@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-tag

On 25/02/2012, at 11:55 PM, Roderic Page wrote:
c) avoid changing the URI if a specimen moves collection/institution or if the host institution relabels it. Once minted the identifier doesn't change (because that will break any links to it, defeating the point of having the URIs).
I was suggesting that it be good practice to keep
* the original URI
* the immediately prior URI
* the URI you have assigned to the item
Carrying the original URI means that all facts attached to the specimen can be recovered in a fixed number of "joins", rather than having to traverse a list. Carrying the immediately prior URI means that the chain of provenance can - in principle - be reconstructed. Carrying your own URI means that you can continue to use your existing system for managing your collections.
It's keeping track of this original one that is new, and it's the idea of URIs that makes it possible. Without *globally* unique IDs, an original accession number means nothing without knowing the collection (ie: namespace) that number came from. That in turn means you need a system for identifying all the collections that specimens might *originally* have come from, and that means also that you need to know how the places that you accept specimens from identify those original collections, so that you can translate their ids to your own. It's just impossible. But URIs fix this at a stroke.
Recording three IDs is, I think, not a big ask. I believe that our herbarium data also keeps the number given by an institution which accepts one of our specimens - so it's a doubly-linked list. We'd want the vocabulary to also have a predicate for "provenance record list" which will be an RDF list of provenance record objects, and it would be a nice-to-have for collections to keep track of this, too.
The question is - who has the job of declaring what the "original URI" is for existing specimens that already have a history? And what should that URI be? Perhaps this is where GBIF-issued ids become important. Or perhaps we could ditch the idea of "original URI", and just track the "GBIF URI". It's the responsibility of anyone with a specimen that does not already have a GBIF URI to get one for it.
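The three-URI practice could be sketched as the record below. The class, field and function names and the ".example" URIs are hypothetical. The point is only that the original URI is available in one lookup, while the full chain of provenance needs a walk along the prior-URI links.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class SpecimenRecord:
    uri: str                  # URI assigned by the current holder
    prior_uri: Optional[str]  # URI used by the immediately previous holder
    original_uri: str         # URI first minted for the specimen


def provenance_chain(uri: str, records: Dict[str, SpecimenRecord]) -> List[str]:
    """Follow prior_uri links back from `uri`, newest first."""
    chain: List[str] = []
    current: Optional[str] = uri
    # The membership check guards against accidental cycles in the data.
    while current is not None and current not in chain:
        chain.append(current)
        record = records.get(current)
        current = record.prior_uri if record else None
    return chain


A = "http://museum-a.example/specimen/17"
B = "http://museum-b.example/acc/2009-441"
C = "http://museum-c.example/item/X9"
records = {
    r.uri: r
    for r in [
        SpecimenRecord(A, None, A),
        SpecimenRecord(B, A, A),
        SpecimenRecord(C, B, A),
    ]
}

print(records[C].original_uri)       # one lookup, no traversal
print(provenance_chain(C, records))  # full history needs the walk
```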

On 25/02/2012, at 4:29 AM, Dean Pentcheff wrote:
This is directly in response to Rod's response to Paul. I think the two of you may have just articulated nearly the same idea, though you seem not to think you did.
Paul envisions institutions each declaring their own URI-creating formula (to resolve down to a specimen at that institution), promulgated at a "forum" location.
Rod envisions URI formulation as happening at a GBIFesque centralized site.
If Paul's forum were GBIF (or similar), with an added function that GBIF (or similar) renegotiates any institutional declaration that collides with a pre-existing declaration, does that map to the same thing for both of you?
Well, if institutions are assigning URIs with their own domain names in them, or if GBIF is handing out URI prefixes that the institutions use, then collisions wouldn't be an issue.
As a technical person, perhaps I don't quite see things from the point of view of institutions whose interest in the web stops at having a pretty website, as someone suggested. It seems to me the easiest thing in the world to spark up a server and say "these are our URIs". But if people are outsourcing their web presence, then I can appreciate that creating a SemWeb presence might not seem as easy a thing to do to them. This is also the case for people who live in large institutions with byzantine rules about what may and may not go on the corporate websites.
If there are places where the issuing of ids to specimens is as chaotic as Rod describes, well - I think the flip side of what I was saying earlier, that people that create the numbers can easily create URIs, is that if the people who create the numbers have bits and bobs all over the place, then an external institution like GBIF is not going to be able to sort it out remotely. Someone has to be on the ground, treading the dusty caverns under the museum, their feeble yellowish torch beam counterpoint to the flickering and burned-out bluish fluorescent lights above, flicking the spiders away and copying labels into their iPad and working out what's what, trying not to accidentally kick over the skeletons.
Or the equivalent in cyberspace - the forgotten databases with their cryptic column names distant echoes of those hidden recesses where the specimen boxes are packed.
A start might be:
* GBIF issues URI prefixes to people/institutions that want them. A system for doing this would need to be decided on, and that will involve (shudder) people.
* GBIF advises the institution on setting up the namespace under that, trying to make the point that URIs should be persistent, unique, all those good things
* GBIF acts as a registry for these namespaces, a place to declare "if you have a specimen record from collection X, then for sem-web purposes the URI should look like *this*" - allowing all that legacy data to be knitted together.
The GBIF webserver might manage incoming http requests by
* holding some very basic, minimal data - even just a dcterms:title and nothing else
* or, 303 redirecting to the institution's own webserver (much in the manner of a PURL server) according to rules expressed simply as a regular expression find/replace.
* or, fetching the RDF from the institution's server, and ADDING some RDF facts of its own to the result
This third option means that the GBIF database can serve as a central spot where movements of specimens (ie, the assignment of a new accession number) can be put. Hopefully not the only spot, though. Best practice is always to serve up the initial and the immediately prior URI along with any URI you give to the specimen. (this only makes sense for RDF, though: you can't just "add" things to a nicely formatted HTML page).
To make all this happen, you would want some sort of usable machine-to-machine service, you'll have to manage authentication (Passwords are a bit of a pain - perhaps a cryptographic certificate given out when the namespace prefix is assigned? Easy enough to do.). You'll want a test/staging service and a real service …
It's a fair bit of work, come to think of it, just on the technical side, and this is without starting on the "part-of" issues.
----------------
(Perhaps "uri.gbif.org" as the virtual host name? http://uri.gbif.org/institution-code/collection-id/number. We'd also like a URI for "the list of institutions" and for each institution "the list of collections". Perhaps reserve "meta"? Thus http://uri.gbif.org/uq/meta, http://uri.gbif.org/uq/collectionX/meta as the well-known locations for README information.)
(Allocation of URIs would cover more than just specimens. Here at biodiversity.org.au, we use dotted names rather than slashes for our namespaces, meaning that our URIs have natural LSID equivalents. I think LSID components can have slashes, so urn:lsid:uri.gbif.org:uq/collectionX:12345)
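The "303 redirect by regular expression find/replace" option might look something like this sketch. The rule table, the target host "specimens.uq.example" and the function name are assumptions. Only the path layout (institution code / collection id / number) follows the suggestion in the message, and no real GBIF service is implied.

```python
import re

# Hypothetical rule table: (pattern on the central path, replacement target).
REDIRECT_RULES = [
    (
        re.compile(r"^/uq/collectionX/(\d+)$"),
        r"http://specimens.uq.example/collectionX/\1",
    ),
]


def resolve(path):
    """Return (status, location) for a request path on the central host."""
    for pattern, replacement in REDIRECT_RULES:
        if pattern.match(path):
            # 303 See Other: point the client at the institution's own server.
            return 303, pattern.sub(replacement, path)
    return 404, None


print(resolve("/uq/collectionX/12345"))
print(resolve("/uq/unknown/1"))
```

Because each rule is just a pattern and a replacement string, an institution that moves its server only has to update one registry entry. The central URIs that others have already published stay unchanged.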

In all of this discussion I am surprised that there has been no mention of Biodiversity Collections Index (BCI; http://www.biodiversitycollectionsindex.org/). To my knowledge, it has never been "down" for any significant period of time and has an extremely comprehensive listing of collections. Any collection that isn't there can be added in a matter of a few minutes.
The reason why URLs are globally unique is because a centralized authority (ICANN) makes sure that no two entities can have the same domain name. It is the responsibility of the domain owner to not have two URLs that are the same within that domain. In other words, the domain owner makes sure that they identify their resources using locally unique identifiers which in combination with the domain name creates a globally unique identifier.
BCI essentially performs an analogous function to ICANN in the biodiversity informatics community. It assigns a unique number to each collection and ensures that no two collections can have the same number. It slaps that number onto the end of the string "urn:lsid:biocol.org:col:" to create an LSID and onto the end of "http://biocol.org/urn:lsid:biocol.org:col:" to create an HTTP URI, both of which are globally unique, actionable (in their own ways), and persistent.
All of the hand wringing about people changing their collection codes or institution codes, or about two institutions in different fields (or units within the same institution) having the same institution codes goes away if we simply use the BCI-assigned number to identify the collection. Within a particular collection, it is the institution's responsibility to create and maintain locally unique identifiers for their specimens. BCI has a systematic way to relate subcollections within collections (each with their own identifier) and a large institution with subcollections would just have to delegate at what level the coordination of locally unique identifiers would be done. Nobody outside the institution can do it for them - they just need to bear the responsibility to stick with a system and not change it.
I mention this because there are really three categories of specimen-containing institutions:
1. Those with enough stability and the financial and IT resources to generate and provide dereferencing for their own actionable GUIDs.
2. Those with the ability to generate and maintain a database of non-HTTP-dereferenceable globally unique identifiers (I'm thinking about UUIDs or UUIDs that are part of LSIDs) and to associate them with specimens in their database, but which do not have the IT infrastructure or the inclination to provide actionability for their globally unique identifiers.
3. Those who have a system of assigning locally unique identifiers (I'm thinking bar codes) to their specimens but who because of small size will probably never have sophisticated IT capabilities nor the ability to provide dereferencing for actionable GUIDs.
Either category 2 or 3 would include institutions that do not have control over a stable domain name or which have institutional restrictions on the use of domain names that would preclude use of their domain name as part of an HTTP URI.
Category 1 institutions create HTTP URI GUIDs using their domain names and do whatever they want as far as the locally unique part of their GUID is concerned. Their freedom comes with the responsibility of providing dereferencing under their domain name forever.
Category 2 and 3 institutions create globally unique and persistent, but not (yet) dereferenceable identifiers with the hope of transforming them into HTTP URIs at a later time. Category 2 institutions have this already in the form of their UUIDs. Category 3 institutions create their own globally unique identifiers by means of a simple rule: "place the BCI number for our collection, followed by a slash, in front of our locally unique identifier" (e.g. "15590/" for the LSU herbarium + "LSU00000434" for the barcode to create "15590/LSU00000434" as an identifier for the specimen shown at http://images.cyberfloralouisiana.com/images/specimensheets/lsu/0/0/4/34/LSU...). Category 3 institutions go to BCI and write in the "note" for their collection what their rule is and then anybody who knows the barcode (or accession number or whatever kind of locally unique number they commit to) for the specimen knows the non-actionable globally unique identifier. If the institution already consistently uses a "Darwin Core triple" (institutionID:collectionID:catalogNumber) as a "poor-man's GUID" in their database, they could slap "the BCI number for our collection, followed by a slash" in front of it to guarantee that it didn't clash with any other Darwin Core triples.
As for the transformation of the non-actionable globally unique identifiers created by category 2 and 3 institutions into actionable ones, a benevolent large institution (let us assume GBIF) who is willing to take on the job of providing dereferencing services for the category 2 and 3 institutions acquires "http://purl.org/specimen/" (or some other purl.org name if that's already taken) to use as the means to create the HTTP-proxied forms of the non-actionable globally unique identifiers. I suggest using a purl.org prefix rather than using a subdomain of gbif.org in the event that in the next hundred years GBIF loses its funding or gets tired of providing this service. (See http://www.nbii.gov/termination/index.html for an example of how a big program with a nearly 20 year history can disappear in a puff of political idiocy.) If necessary, the "http://purl.org/specimen/" prefix could get passed over to some other big benevolent institution without requiring GBIF to give control of part of their domain to a non-GBIF entity.
Now we have another simple rule. If we discover an identifier that has http:// at its front end, we dereference it to access its metadata. If we discover an identifier which we think represents a specimen that does not begin with "http://", we try putting "http://purl.org/specimen/" on the front of it. If nothing happens we are no worse off than before. If we are lucky, we get metadata. Preferably the proxy system would get established quickly and we would tell the category 3 institutions to place "http://purl.org/specimen/" + "the BCI number for our collection, followed by a slash" in front of their locally unique identifier. But if in typical TDWG fashion it takes five years to decide to do this, the small institution still has an identifier (in the form of the non-actionable identifier) guaranteed to be globally unique among identifiers generated by institutions who agree to abide by this set of rules. In any case, we don't risk mucking up the Linked Data cloud with a bunch of synonymous URIs that need to be linked with owl:sameAs, since the UUIDs and category 3 globally unique identifiers can't be used as URI references in RDF. One could later write in RDF:
<rdf:Description rdf:about="http://purl.org/specimen/15590/LSU00000434">
  <dc:identifier>15590/LSU00000434</dc:identifier>
</rdf:Description>
to make sure that semantic clients understand that the non-URI globally unique identifier is associated with the proxied version.
There would be technical details to figure out how the information about the specimens would be transferred between the smaller data-providing institution and the benevolent provider of dereferencing, but people are already doing that with GBIF so it doesn't seem so impossible to imagine that this could be worked out.
The unveiling of BCI was done with great fanfare and it is one of the few biodiversity-related resources which actually follows all of the rules about persistent, actionable, and unique identifiers. Yet it rarely gets mentioned any more. Let's leverage it.
Steve
-- Steven J. Baskauf, Ph.D., Senior Lecturer Vanderbilt University Dept. of Biological Sciences postal mail address: VU Station B 351634 Nashville, TN 37235-1634, U.S.A. delivery address: 2125 Stevenson Center 1161 21st Ave., S. Nashville, TN 37235 office: 2128 Stevenson Center phone: (615) 343-4582, fax: (615) 343-6707 http://bioimages.vanderbilt.edu
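The two "simple rules" in Steve's message can be sketched in a few lines. The BCI prefixes, the collection number 15590 and the barcode LSU00000434 are taken from the message. The function names are hypothetical, and "http://purl.org/specimen/" is the prefix proposed there, not a service known to exist.

```python
BCI_LSID_PREFIX = "urn:lsid:biocol.org:col:"
BCI_HTTP_PREFIX = "http://biocol.org/urn:lsid:biocol.org:col:"
PROXY_PREFIX = "http://purl.org/specimen/"  # proposed in the message only


def collection_lsid(bci_number):
    """LSID form of a BCI collection identifier."""
    return f"{BCI_LSID_PREFIX}{bci_number}"


def collection_http_uri(bci_number):
    """HTTP form of a BCI collection identifier."""
    return f"{BCI_HTTP_PREFIX}{bci_number}"


def specimen_identifier(bci_number, local_id):
    """Category 3 rule: BCI number, a slash, then the local identifier."""
    return f"{bci_number}/{local_id}"


def uri_to_try(identifier):
    """Second rule: use http:// identifiers as-is, else try the proxy prefix."""
    if identifier.startswith("http://"):
        return identifier
    return PROXY_PREFIX + identifier


lsu = specimen_identifier(15590, "LSU00000434")
print(lsu)
print(uri_to_try(lsu))
print(collection_lsid(15590), collection_http_uri(15590))
```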

Dear Steve,
I like BCI -- Roger Hyam did a very nice job creating this service. Indeed, I think Roger was offering to set up something rather like what you describe (see http://www.biocol.org/static/bcisgs.html ). BCI would be one way to create a namespace for specimen identifiers.
As always, there's more than one such tool in our community. The Repository of Biological Repositories (http://biorepositories.org/) is a similar service from the barcoding community, and I gather there are moves to try and integrate these two resources (sigh).
The other consideration would be how the BCI identifiers actually map to digital resources at the institutions (for example do the BCI identifiers map onto the dataset ids that GBIF has for each collection?).
Let's hope that implementing resolvable specimen identifiers does not take the typical five years to actually happen...
Regards
Rod
On 27 Feb 2012, at 21:17, Steve Baskauf wrote:
In all of this discussion I am surprised that there has been no mention of Biodiversity Collections Index (BCI; http://www.biodiversitycollectionsindex.org/). To my knowledge, it has never been "down" for any significant period of time and has an extremely comprehensive listing of collections. Any collection that isn't there can be added in a matter of a few minutes.
The reason why URLs are globally unique is because a centralized authority (ICANN) makes sure that no two entities can have the same domain name. It is the responsibility of the domain owner to not have two URLs that are the same within that domain. In other words, the domain owner makes sure that they identify their resources using locally unique identifiers which in combination with the domain name creates a globally unique identifier.
BCI essentially performs an analogous function to ICANN in the biodiversity informatics community. It assigns a unique number to each collection and ensures that no two collections can have the same number. It slaps that number onto the end of the string "urn:lsid:biocol.org:col:" to create an LSID and onto the end of "http://biocol.org/urn:lsid:biocol.org:col:" to create an HTTP URI, both of which are globally unique, actionable (in their own ways), and persistent.
All of the hand wringing about people changing their collection codes or institution codes, or about two institutions in different fields (or units within the same institution) having the same institution codes goes away if we simply use the BCI-assigned number to identify the collection. Within a particular collection, it is the institution's responsibility to create and maintain locally unique identifiers for their specimens. BCI has a systematic way to relate subcollections within collections (each with their own identifier) and a large institution with subcollections would just have to delegate at what level the coordination of locally unique identifiers would be done. Nobody outside the institution can do it for them - they just need to bear the responsibility to stick with a system and not change it.
I mention this because there are really three categories of specimen-containing institutions:
1. Those with enough stability and the financial and IT resources to generate and provide dereferencing for their own actionable GUIDs.
2. Those with the ability to generate and maintain a database of non-HTTP-dereferenceable globally unique identifiers (I'm thinking about UUIDs or UUIDs that are part of LSIDs) and to associate them with specimens in their database, but which do not have the IT infrastructure or the inclination to provide actionability for their globally unique identifiers.
3. Those who have a system of assigning locally unique identifiers (I'm thinking bar codes) to their specimens but who because of small size will probably never have sophisticated IT capabilities nor the ability to provide dereferencing for actionable GUIDs.
Either category 2 or 3 would include institutions that do not have control over a stable domain name or which have institutional restrictions on the use of domain names that would preclude use of their domain name as part of an HTTP URI.
Category 1 institutions create HTTP URI GUIDs using their domain names and do whatever they want as far as the locally unique part of their GUID is concerned. Their freedom comes with the responsibility of providing dereferencing under their domain name forever.
Category 2 and 3 institutions create globally unique and persistent, but not (yet) dereferenceable identifiers with the hope of transforming them into HTTP URIs at a later time. Category 2 institutions have this already in the form of their UUIDs. Category 3 institutions create their own globally unique identifiers by means of a simple rule: "place the BCI number for our collection, followed by a slash, in front of our locally unique identifier" (e.g. "15590/" for the LSU herbarium + "LSU00000434" for the barcode to create "15590/LSU00000434" as an identifier for the specimen shown at http://images.cyberfloralouisiana.com/images/specimensheets/lsu/0/0/4/34/LSU...). Category 3 institutions go to BCI and write in the "note" for their collection what their rule is, and then anybody who knows the barcode (or accession number or whatever kind of locally unique number they commit to) for the specimen knows the non-actionable globally unique identifier. If the institution already consistently uses a "Darwin Core triple" (institutionCode:collectionCode:catalogNumber) as a "poor-man's GUID" in their database, they could slap "the BCI number for our collection, followed by a slash" in front of it to guarantee that it doesn't clash with any other Darwin Core triples.
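The category 3 rule is simple enough to write down directly. This is only a sketch: the function name and the validation checks are assumptions added for illustration, while the example values are the LSU ones given above.

```python
def specimen_identifier(bci_number: int, local_id: str) -> str:
    """Apply the rule: BCI collection number, a slash, then the local identifier.

    The checks below are illustrative assumptions, not part of the proposal.
    """
    local_id = local_id.strip()
    if bci_number <= 0:
        raise ValueError("BCI collection number must be a positive integer")
    if not local_id:
        raise ValueError("locally unique identifier must not be empty")
    return f"{bci_number}/{local_id}"


if __name__ == "__main__":
    # LSU herbarium (15590) plus the barcode LSU00000434
    print(specimen_identifier(15590, "LSU00000434"))
```

The result is globally unique only as long as the institution keeps its local identifiers unique and every participant follows the same rule.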
As for the transformation of the non-actionable globally unique identifiers created by category 2 and 3 institutions into actionable ones, a benevolent large institution (let us assume GBIF) that is willing to take on the job of providing dereferencing services for the category 2 and 3 institutions acquires "http://purl.org/specimen/" (or some other purl.org name if that's already taken) to use as the means to create the HTTP-proxied forms of the non-actionable globally unique identifiers. I suggest using a purl.org prefix rather than a subdomain of gbif.org in case, at some point in the next hundred years, GBIF loses its funding or gets tired of providing this service. (See http://www.nbii.gov/termination/index.html for an example of how a big program with a nearly 20 year history can disappear in a puff of political idiocy.) If necessary, the "http://purl.org/specimen/" prefix could get passed over to some other big benevolent institution without requiring GBIF to give control of part of their domain to a non-GBIF entity.
Now we have another simple rule. If we discover an identifier that has "http://" at its front end, we dereference it to access its metadata. If we discover an identifier which we think represents a specimen but which does not begin with "http://", we try putting "http://purl.org/specimen/" on the front of it. If nothing happens we are no worse off than before. If we are lucky, we get metadata. Preferably the proxy system would get established quickly and we would tell the category 3 institutions to place "http://purl.org/specimen/" plus the BCI number for their collection, followed by a slash, in front of their locally unique identifier. But if in typical TDWG fashion it takes five years to decide to do this, the small institution still has an identifier (in the form of the non-actionable identifier) guaranteed to be globally unique among identifiers generated by institutions who agree to abide by this set of rules. In any case, we don't risk mucking up the Linked Data cloud with a bunch of synonymous URIs that need to be linked with owl:sameAs, since the UUIDs and category 3 globally unique identifiers can't be used as URI references in RDF. One could later write in RDF:
<rdf:Description rdf:about="http://purl.org/specimen/15590/LSU00000434">
  <dc:identifier>15590/LSU00000434</dc:identifier>
</rdf:Description>
to make sure that semantic clients understand that the non-URI globally unique identifier is associated with the proxied version.
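A client-side version of the rule might look something like the sketch below. It assumes the "http://purl.org/specimen/" prefix proposed above (which has not actually been registered as far as this thread says), the function names are invented for the example, and it builds only the bare fragment shown above with the namespace declarations left out.

```python
from xml.sax.saxutils import escape, quoteattr

# Proposed proxy prefix from the message; an assumption, not a live service.
PROXY_PREFIX = "http://purl.org/specimen/"


def actionable_uri(identifier: str) -> str:
    """Leave http:// identifiers alone; put the proxy prefix on anything else."""
    identifier = identifier.strip()
    if identifier.lower().startswith("http://"):
        return identifier
    return PROXY_PREFIX + identifier


def rdf_description(identifier: str) -> str:
    """Build the rdf:Description fragment tying the bare identifier to its proxied URI."""
    uri = actionable_uri(identifier)
    return (
        f"<rdf:Description rdf:about={quoteattr(uri)}>\n"
        f"  <dc:identifier>{escape(identifier)}</dc:identifier>\n"
        f"</rdf:Description>"
    )


if __name__ == "__main__":
    print(actionable_uri("15590/LSU00000434"))
    print(rdf_description("15590/LSU00000434"))
```

Whether the resulting URI returns anything is a separate question; as the message says, if nothing happens we are no worse off than before.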
There would be technical details to figure out how the information about the specimens would be transferred between the smaller data-providing institution and the benevolent provider of dereferencing, but people are already doing that with GBIF so it doesn't seem so impossible to imagine that this could be worked out.
The unveiling of BCI was done with great fanfare and it is one of the few biodiversity-related resources which actually follows all of the rules about persistent, actionable, and unique identifiers. Yet it rarely gets mentioned any more. Let's leverage it.
Steve
On 2/26/2012 9:27 PM, Paul Murray wrote:
On 25/02/2012, at 4:29 AM, Dean Pentcheff wrote:
This is directly in response to Rod's response to Paul. I think the two of you may have just articulated nearly the same idea, though you seem not to think you did.
Paul envisions institutions each declaring their own URI-creating formula (to resolve down to a specimen at that institution), promulgated at a "forum" location.
Rod envisions URI formulation as happening at a GBIFesque centralized site.
If Paul's forum were GBIF (or similar), with an added function that GBIF (or similar) renegotiates any institutional declaration that collides with a pre-existing declaration, does that map to the same thing for both of you?
Well, if institutions are assigning URIs with their own domain names in them, or if GBIF is handing out URI prefixes that the institutions use, then collisions wouldn't be an issue.
As a technical person, perhaps I don't quite see things from the point of view of institutions whose interest in the web stops at having a pretty website, as someone suggested. It seems to me the easiest thing in the world to spark up a server and say "these are our URIs". But if people are outsourcing their web presence, then I can appreciate that creating a SemWeb presence might not seem as easy a thing to do to them. This is also the case for people who live in large institutions with byzantine rules about what may and may not go on the corporate websites.
If there are places where the issuing of ids to specimens is as chaotic as Rod describes, well - I think the flip side of what I was saying earlier, that people that create the numbers can easily create URIs, is that if the people who create the numbers have bits and bobs all over the place, then an external institution like GBIF is not going to be able to sort it out remotely. Someone has to be on the ground, treading the dusty caverns under the museum, their feeble yellowish torch beam counterpoint to the flickering and burned-out bluish fluorescent lights above, flicking the spiders away and copying labels into their iPad and working out what's what, trying not to accidentally kick over the skeletons.
Or the equivalent in cyberspace - the forgotten databases with their cryptic column names distant echoes of those hidden recesses where the specimen boxes are packed.
A start might be:
* GBIF issues URI prefixes to people/institutions that want them. A system for doing this would need to be decided on, and that will involve (shudder) people.
* GBIF advises the institution on setting up the namespace under that, trying to make the point that URIs should be persistent, unique, all those good things
* GBIF acts as a registry for these namespaces, a place to declare "if you have a specimen record from collection X, then for sem-web purposes the URI should look like *this*" - allowing all that legacy data to be knitted together.
The GBIF webserver might manage incoming http requests by
* holding some very basic, minimal data - even just a dcterms:title and nothing else
* or, 303 redirecting to the institution's own webserver (much in the manner of a PURL server) according to rules expressed simply as a regular expression find/replace.
* or, fetching the RDF from the institution's server, and ADDING some RDF facts of its own to the result
This third option means that the GBIF database can serve as a central spot where movements of specimens (i.e., the assignment of a new accession number) can be recorded. Hopefully not the only spot, though. Best practice is always to serve up the initial and the immediately prior URI along with any URI you give to the specimen. (This only makes sense for RDF, though: you can't just "add" things to a nicely formatted HTML page.)
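The second option (a 303 redirect driven by a regular expression find/replace) could be sketched roughly as follows. The rule table, the "uq/collectionX" path, and the specimens.example.org target are all hypothetical; a real registry would load one rule per registered namespace.

```python
import re
from typing import Optional, Tuple

# Hypothetical rule table: (pattern on the incoming path, replacement target).
REDIRECT_RULES = [
    (re.compile(r"^/uq/collectionX/(\d+)$"),
     r"http://specimens.example.org/collectionX/\1"),
]


def resolve(path: str) -> Tuple[int, Optional[str]]:
    """Return (HTTP status, Location) for an incoming request path.

    The first matching rule wins and yields a 303 See Other redirect;
    if no rule matches, the resolver answers 404 with no Location.
    """
    for pattern, target in REDIRECT_RULES:
        if pattern.match(path):
            return 303, pattern.sub(target, path)
    return 404, None


if __name__ == "__main__":
    print(resolve("/uq/collectionX/12345"))
    print(resolve("/unknown/path"))
```

Keeping the rules as plain pattern/replacement pairs means an institution that moves its server only has to update one line in the registry.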
To make all this happen, you would want some sort of usable machine-to-machine service, and you'll have to manage authentication (passwords are a bit of a pain - perhaps a cryptographic certificate given out when the namespace prefix is assigned? Easy enough to do). You'll want a test/staging service and a real service …
It's a fair bit of work, come to think of it, just on the technical side, and this is without starting on the "part-of" issues.
----------------
(Perhaps "uri.gbif.org" as the virtual host name? http://uri.gbif.org/institution-code/collection-id/number. We'd also like a URI for "the list of institutions" and for each institution "the list of collections". Perhaps reserve "meta"? Thus http://uri.gbif.org/uq/meta, http://uri.gbif.org/uq/collectionX/meta as the well-known locations for README information.)
(Allocation of URIs would cover more than just specimens. Here at biodiversity.org.au, we use dotted names rather than slashes for our namespaces, meaning that our URIs have natural LSID equivalents. I think LSID components can have slashes, so urn:lsid:uri.gbif.org:uq/collectionX:12345)
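To make the suggested layout concrete, here is a small sketch that classifies a path under the proposed "uri.gbif.org" host and derives the LSID form mentioned above. The function names and return shapes are invented for this example, and the slash inside the LSID namespace follows the tentative "I think" in the message rather than a checked reading of the LSID specification.

```python
from typing import Tuple


def classify(path: str) -> Tuple[str, ...]:
    """Classify a path of the form /institution-code/collection-id/number.

    "meta" is treated as a reserved final segment, as suggested above.
    """
    parts = [p for p in path.strip("/").split("/") if p]
    if len(parts) == 2 and parts[1] == "meta":
        return ("institution-meta", parts[0])
    if len(parts) == 3 and parts[2] == "meta":
        return ("collection-meta", parts[0], parts[1])
    if len(parts) == 3:
        return ("specimen", parts[0], parts[1], parts[2])
    raise ValueError(f"unrecognised path: {path!r}")


def to_lsid(institution: str, collection: str, number: str) -> str:
    """Derive the LSID equivalent, keeping the slash in the namespace part."""
    return f"urn:lsid:uri.gbif.org:{institution}/{collection}:{number}"


if __name__ == "__main__":
    kind, inst, coll, num = classify("/uq/collectionX/12345")
    print(kind, to_lsid(inst, coll, num))
    print(classify("/uq/meta"))
```

One consequence of reserving "meta" is that no collection or specimen can ever use it as its own identifier.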
_______________________________________________ tdwg-tag mailing list tdwg-tag@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-tag
-- Steven J. Baskauf, Ph.D., Senior Lecturer Vanderbilt University Dept. of Biological Sciences
postal mail address: VU Station B 351634 Nashville, TN 37235-1634, U.S.A.
delivery address: 2125 Stevenson Center 1161 21st Ave., S. Nashville, TN 37235
office: 2128 Stevenson Center phone: (615) 343-4582, fax: (615) 343-6707 http://bioimages.vanderbilt.edu

I'm trying not to get sucked into this discussion but thank you for all the kind words about BCI - flattery will get you almost anywhere!
I'll say my tuppence worth but I have not followed everything so please excuse me if I am out of line.
I am just working on a contribution to a paper that I hope will sum up these thoughts.
Basically I am nervous about any middleman approach to issuing identifiers for specimens. For publications it is different as one may be able to retrieve the actual works from several places. If a DOI resolves to metadata about the work, that is often enough because the metadata can be used to retrieve the actual publication from a library somewhere even if the publisher's site is gone.
If you are reading a paper that talks about a specimen and you want to find out more about the specimen, you invariably already have the key metadata in the paper (location and recent determination); what you want to do is actually see the specimen from the authoritative source. To do this you need to resolve an identifier back to that source. It doesn't matter if you have a middleman running a DOI/LSID/PURL service; you still need the target *data* HTTP URI to be live or the link is "broken".
Collections must maintain live HTTP URIs for each specimen for click through to raw data to work. There are no quick third party fixes.
Most specimens are in big collections and doing this is a matter of education and resource prioritisation, not total lack of resources.
Talking about middleman solutions just clouds the water because managers begin to think they can outsource the solution and it will go away. They can't. Maintaining an online catalogue is now a core curation task.
None of this precludes the fact that we need big indexes and services linking things together, but again this is different from publications - specimens don't contain many links to other specimens whereas it is a major feature of publications. Specimens tend to have pointers to them and not to point to other things.
Enough already. I have some deadlines.
Roger
On 27 Feb 2012, at 21:59, Roderic Page wrote:
Dear Steve,
I like BCI -- Roger Hyam did a very nice job creating this service. Indeed, I think Roger was offering to set up something rather like what you describe (see http://www.biocol.org/static/bcisgs.html ).
BCI would be one way to create a namespace for specimen identifiers. As always, there's more than one such tool in our community. The Repository of Biological Repositories (http://biorepositories.org/) is a similar service from the barcoding community, and I gather there are moves to try and integrate these two resources (sigh). The other consideration would be how the BCI identifiers actually map to digital resources at the institutions (for example do the BCI identifiers map onto the dataset ids that GBIF has for each collection?).
Let's hope that implementing resolvable specimen identifiers does not take the typical five years to actually happen...
Regards
Rod
On 27 Feb 2012, at 21:17, Steve Baskauf wrote:

So I'm going to insist on muddying the waters. I think we are talking about different parts of the same thing, albeit from different perspectives.
I want identifiers for specimens so I can talk about them (i.e., say that this specimen was cited in these publications, and is the source of these sequences, and is shown in these images). I have lots of sources, such as BHL and GenBank, where specimens are listed using various codes (which vary among sources, sigh). To figure out what these are, and whether they are the same specimen, I want a service that tells me what AMNH 146335 is. I'd like it to give me an identifier that I can use to link this stuff together. I want to do this for lots of specimens from multiple sources. The only way I can see this being tenable is if there is a central aggregation of metadata, such as GBIF.
In the same way, if publishers are going to start marking up specimen codes in articles, I'm guessing they want the same kind of service. It would be nice if authors did this themselves, but I doubt that will happen anytime soon (how many people use DOIs in their list of references?). This is one reason CrossRef exists, to take the citation strings from our articles and convert them into links (building a citation network in the process).
I also want some assurance that if I go to the trouble of finding these identifiers (and publishing them on my websites) that they will not simply disappear. Right now I don't trust museums to keep these identifiers alive (hell, I don't trust GBIF to do this because it's clear that millions of their records are duplicates).
If you want to get more data about the specimen, yes, following a link to the primary source would be great. But how do I discover that link? If we had every museum specimen with its own URL, I would still have to find them. Ideally authors will include them, but this assumes authors themselves know them, and that they and publishers trust them sufficiently to use them.
So decentralised begets centralised. I think it's a case of having both, and I'd argue that the kind of benefits we put forward when trying to convince the powers that be that specimen identifiers are a good thing are going to need both parts of the equation. I think this is a case of the genius of "AND". We need both parts to make this useful.
Regards
Rod
On 28 Feb 2012, at 10:23, Roger Hyam wrote:
I'm trying not to get sucked into this discussion but thank you for all the kind words about BCI - flattery will get you almost anywhere!
I'll say my tuppence worth but I have not followed everything so please excuse me if I am out of line.
I am just working on a contribution to a paper that I hope will sum up these thoughts.
Basically I am nervous about any middleman approach to issuing identifiers for specimens. For publications it is different, as one may be able to retrieve the actual works from several places. If a DOI resolves to metadata about the work, that is often enough, because the metadata can be used to retrieve the actual publication from a library somewhere even if the publisher's site is gone.
If you are reading a paper that talks about a specimen and you want to find out more about the specimen, you invariably already have the key metadata in the paper (location and recent determination). What you want to do is actually see the specimen from the authoritative source. To do this you need to resolve an identifier back to that source. It doesn't matter if you have a middleman running a DOI/LSID/PURL service; you still need the target *data* HTTP URI to be live or the link is "broken".
Collections must maintain live HTTP URIs for each specimen for click-through to raw data to work. There are no quick third-party fixes.
Most specimens are in big collections and doing this is a matter of education and resource prioritisation not total lack of resources.
Talking about middleman solutions just clouds the water because managers begin to think they can outsource the solution and it will go away. They can't. Maintaining an online catalogue is now a core curation task.
None of this precludes the fact that we need big indexes and services linking things together but again this is different from publications - specimens don't contain many links to other specimens whereas it is a major feature of publications. Specimens tend to have pointers to them and not to point to other things.
Enough already. I have some deadlines.
Roger
On 27 Feb 2012, at 21:59, Roderic Page wrote:
Dear Steve,
I like BCI -- Roger Hyam did a very nice job creating this service. Indeed, I think Roger was offering to set up something rather like what you describe (see http://www.biocol.org/static/bcisgs.html ).
BCI would be one way to create a namespace for specimen identifiers. As always, there's more than one such tool in our community. The Repository of Biological Repositories (http://biorepositories.org/) is a similar service from the barcoding community, and I gather there are moves to try and integrate these two resources (sigh). The other consideration would be how the BCI identifiers actually map to digital resources at the institutions (for example do the BCI identifiers map onto the dataset ids that GBIF has for each collection?).
Let's hope that implementing resolvable specimen identifiers does not take the typical five years to actually happen...
Regards
Rod
On 27 Feb 2012, at 21:17, Steve Baskauf wrote:
In all of this discussion I am surprised that there has been no mention of Biodiversity Collections Index (BCI; http://www.biodiversitycollectionsindex.org/). To my knowledge, it has never been "down" for any significant period of time and has an extremely comprehensive listing of collections. Any collection that isn't there can be added in a matter of a few minutes.
The reason why URLs are globally unique is because a centralized authority (ICANN) makes sure that no two entities can have the same domain name. It is the responsibility of the domain owner to not have two URLs that are the same within that domain. In other words, the domain owner makes sure that they identify their resources using locally unique identifiers which in combination with the domain name creates a globally unique identifier.
BCI essentially performs an analogous function to ICANN in the biodiversity informatics community. It assigns a unique number to each collection and ensures that no two collections can have the same number. It slaps that number onto the end of the string "urn:lsid:biocol.org:col:" to create an LSID and onto the end of "http://biocol.org/urn:lsid:biocol.org:col:" to create an HTTP URI, both of which are globally unique, actionable (in their own ways), and persistent.
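The construction described above can be sketched in a few lines of Python. This is only an illustration of the string rule as stated in the message, not BCI's actual code; the function names are invented here, and the collection number 15590 is the LSU herbarium example used later in this thread.

```python
# Prefixes quoted in the message above.
LSID_PREFIX = "urn:lsid:biocol.org:col:"
HTTP_PREFIX = "http://biocol.org/urn:lsid:biocol.org:col:"


def bci_lsid(collection_number: int) -> str:
    """Append the BCI-assigned collection number to the LSID prefix."""
    return f"{LSID_PREFIX}{collection_number}"


def bci_http_uri(collection_number: int) -> str:
    """Append the same number to the HTTP prefix to get the HTTP URI form."""
    return f"{HTTP_PREFIX}{collection_number}"


if __name__ == "__main__":
    print(bci_lsid(15590))
    print(bci_http_uri(15590))
```

Both forms differ only in their prefix, so global uniqueness rests entirely on BCI never issuing the same number twice.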
All of the hand-wringing about people changing their collection codes or institution codes, or about two institutions in different fields (or units within the same institution) having the same institution codes, goes away if we simply use the BCI-assigned number to identify the collection. Within a particular collection, it is the institution's responsibility to create and maintain locally unique identifiers for their specimens. BCI has a systematic way to relate subcollections within collections (each with their own identifier), and a large institution with subcollections would just have to decide at what level the coordination of locally unique identifiers would be done. Nobody outside the institution can do it for them - they just need to bear the responsibility to stick with a system and not change it.
I mention this because there are really three categories of specimen-containing institutions:
1. Those with enough stability and the financial and IT resources to generate and provide dereferencing for their own actionable GUIDs.
2. Those with the ability to generate and maintain a database of non-HTTP-dereferenceable globally unique identifiers (I'm thinking about UUIDs or UUIDs that are part of LSIDs) and to associate them with specimens in their database, but which do not have the IT infrastructure or the inclination to provide actionability for their globally unique identifiers.
3. Those who have a system of assigning locally unique identifiers (I'm thinking bar codes) to their specimens but who because of small size will probably never have sophisticated IT capabilities nor the ability to provide dereferencing for actionable GUIDs.
Either category 2 or 3 would include institutions that do not have control over a stable domain name or which have institutional restrictions on the use of domain names that would preclude use of their domain name as part of an HTTP URI.
Category 1 institutions create HTTP URI GUIDs using their domain names and do whatever they want as far as the locally unique part of their GUID is concerned. Their freedom comes with the responsibility of providing dereferencing under their domain name forever.
Category 2 and 3 institutions create globally unique and persistent, but not (yet) dereferenceable, identifiers with the hope of transforming them into HTTP URIs at a later time. Category 2 institutions have this already in the form of their UUIDs. Category 3 institutions create their own globally unique identifiers by means of a simple rule: "place the BCI number for our collection, followed by a slash, in front of our locally unique identifier" (e.g. "15590/" for the LSU herbarium + "LSU00000434" for the barcode to create "15590/LSU00000434" as an identifier for the specimen shown at http://images.cyberfloralouisiana.com/images/specimensheets/lsu/0/0/4/34/LSU...). Category 3 institutions go to BCI and write in the "note" for their collection what their rule is, and then anybody who knows the barcode (or accession number or whatever kind of locally unique number they commit to) for the specimen knows the non-actionable globally unique identifier. If the institution already consistently uses a "Darwin Core triple" (institutionID:collectionID:catalogNumber) as a "poor-man's GUID" in their database, they could slap "the BCI number for our collection, followed by a slash" in front of it to guarantee that it didn't clash with any other Darwin Core triples.
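As a rough sketch of the category 3 rule, assuming nothing beyond what the message states (BCI number, slash, local identifier), the function below builds the non-actionable identifier. The function name and the validation choices are assumptions made for illustration.

```python
def category3_identifier(bci_number: int, local_id: str) -> str:
    """Place the BCI collection number and a slash in front of a local identifier.

    The local identifier may be a barcode, an accession number, or an existing
    Darwin Core triple; the rule treats all of them as opaque strings.
    """
    local_id = local_id.strip()
    if not local_id:
        # An empty local part would yield an identifier for nothing at all.
        raise ValueError("local identifier must not be empty")
    return f"{bci_number}/{local_id}"


if __name__ == "__main__":
    # The LSU herbarium example from the message.
    print(category3_identifier(15590, "LSU00000434"))
```

Because the local part is left untouched, anyone who knows the published rule and the barcode can reconstruct the identifier without contacting the institution.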
As for the transformation of the non-actionable globally unique identifiers created by category 2 and 3 institutions into actionable ones, a benevolent large institution (let us assume GBIF) that is willing to take on the job of providing dereferencing services for the category 2 and 3 institutions acquires "http://purl.org/specimen/" (or some other purl.org name if that's already taken) to use as the means to create the HTTP-proxied forms of the non-actionable globally unique identifiers. I suggest using a purl.org prefix rather than a subdomain of gbif.org in the event that in the next hundred years GBIF loses its funding or gets tired of providing this service. (See http://www.nbii.gov/termination/index.html for an example of how a big program with a nearly 20 year history can disappear in a puff of political idiocy.) If necessary, the "http://purl.org/specimen/" prefix could get passed over to some other big benevolent institution without requiring GBIF to give control of part of their domain to a non-GBIF entity.
Now we have another simple rule. If we discover an identifier that has http:// at its front end, we dereference it to access its metadata. If we discover an identifier which we think represents a specimen that does not begin with "http://", we try putting "http://purl.org/specimen/" on the front of it. If nothing happens we are no worse off than before. If we are lucky, we get metadata. Preferably the proxy system would get established quickly and we would tell the category 3 institutions to place "http://purl.org/specimen/" + "the BCI number for our collection, followed by a slash" in front of their locally unique identifier. But if in typical TDWG fashion it takes five years to decide to do this, the small institution still has an identifier (in the form of the non-actionable identifier) guaranteed to be globally unique among identifiers generated by institutions who agree to abide by this set of rules. In any case, we don't risk mucking up the Linked Data cloud with a bunch of synonymous URIs that need to be linked with owl:sameAs, since the UUIDs and category 3 globally unique identifiers can't be used as URI references in RDF. One could later write in RDF:
<rdf:Description rdf:about="http://purl.org/specimen/15590/LSU00000434">
  <dc:identifier>15590/LSU00000434</dc:identifier>
</rdf:Description>
to make sure that semantic clients understand that the non-URI globally unique identifier is associated with the proxied version.
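The client-side rule and the RDF fragment above could be sketched as follows. This assumes the "http://purl.org/specimen/" proxy proposed in the message, which is a suggestion in this thread and not an existing service; the function names are invented, and the code only builds strings without making any network request.

```python
from xml.sax.saxutils import escape, quoteattr

# Proposed proxy prefix from the message; hypothetical, not a live service.
PROXY_PREFIX = "http://purl.org/specimen/"


def uri_to_dereference(identifier: str) -> str:
    """Apply the rule: HTTP URIs are used as-is, anything else gets the proxy prefix."""
    identifier = identifier.strip()
    if identifier.startswith("http://"):
        return identifier
    return PROXY_PREFIX + identifier


def rdf_description(identifier: str) -> str:
    """Link a non-URI identifier to its proxied URI, as in the fragment above.

    Like the original fragment, this omits the rdf: and dc: namespace
    declarations, so it must be embedded in a document that declares them.
    """
    about = quoteattr(PROXY_PREFIX + identifier)
    return (
        f"<rdf:Description rdf:about={about}>\n"
        f"  <dc:identifier>{escape(identifier)}</dc:identifier>\n"
        "</rdf:Description>"
    )


if __name__ == "__main__":
    print(uri_to_dereference("15590/LSU00000434"))
    print(rdf_description("15590/LSU00000434"))
```

A real client would then attempt an HTTP GET on the returned URI and treat a failure as "no worse off than before", as the message puts it.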
There would be technical details to figure out how the information about the specimens would be transferred between the smaller data-providing institution and the benevolent provider of dereferencing, but people are already doing that with GBIF so it doesn't seem so impossible to imagine that this could be worked out.
The unveiling of BCI was done with great fanfare and it is one of the few biodiversity-related resources which actually follows all of the rules about persistent, actionable, and unique identifiers. Yet it rarely gets mentioned any more. Let's leverage it.
Steve
On 2/26/2012 9:27 PM, Paul Murray wrote:
On 25/02/2012, at 4:29 AM, Dean Pentcheff wrote:
This is directly in response to Rod's response to Paul. I think the two of you may have just articulated nearly the same idea, though you seem not to think you did.
Paul envisions institutions each declaring their own URI-creating formula (to resolve down to a specimen at that institution), promulgated at a "forum" location.
Rod envisions URI formulation as happening at a GBIFesque centralized site.
If Paul's forum were GBIF (or similar), with an added function that GBIF (or similar) renegotiates any institutional declaration that collides with a pre-existing declaration, does that map to the same thing for both of you?
Well, if institutions are assigning URIs with their own domain names in them, or if GBIF is handing out URI prefixes that the institutions use, then collisions wouldn't be an issue.
As a technical person, perhaps I don't quite see things from the point of view of institutions whose interest in the web stops at having a pretty website, as someone suggested. It seems to me the easiest thing in the world to spark up a server and say "these are our URIs". But if people are outsourcing their web presence, then I can appreciate that creating a SemWeb presence might not seem as easy a thing to do to them. This is also the case for people who live in large institutions with byzantine rules about what may and may not go on the corporate websites.
If there are places where the issuing of ids to specimens is as chaotic as Rod describes, well - I think the flip side of what I was saying earlier, that people that create the numbers can easily create URIs, is that if the people who create the numbers have bits and bobs all over the place, then an external institution like GBIF is not going to be able to sort it out remotely. Someone has to be on the ground, treading the dusty caverns under the museum, their feeble yellowish torch beam counterpoint to the flickering and burned-out bluish fluorescent lights above, flicking the spiders away and copying labels into their iPad and working out what's what, trying not to accidentally kick over the skeletons.
Or the equivalent in cyberspace - the forgotten databases with their cryptic column names distant echoes of those hidden recesses where the specimen boxes are packed.
A start might be:
* GBIF issues URI prefixes to people/institutions that want them. A system for doing this would need to be decided on, and that will involve (shudder) people.
* GBIF advises the institution on setting up the namespace under that, trying to make the point that URIs should be persistent, unique, all those good things.
* GBIF acts as a registry for these namespaces, a place to declare "if you have a specimen record from collection X, then for sem-web purposes the URI should look like *this*" - allowing all that legacy data to be knitted together.
The GBIF webserver might manage incoming http requests by
* holding some very basic, minimal data - even just a dcterms:title and nothing else
* or, 303 redirecting to the institution's own webserver (much in the manner of a PURL server) according to rules expressed simply as a regular expression find/replace
* or, fetching the RDF from the institution's server, and ADDING some RDF facts of its own to the result
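The second option (303 redirects driven by regular-expression find/replace rules) might look something like the sketch below. The rule table, the request path layout, and the target host specimens.example.edu are all invented for illustration; nothing here describes an actual GBIF service.

```python
import re

# Hypothetical registry: one (pattern, replacement template) pair per namespace.
# A real registry would be filled in by the institutions that own the prefixes.
REDIRECT_RULES = [
    (
        re.compile(r"/uq/collectionX/(?P<number>\d+)"),
        r"http://specimens.example.edu/collectionX/\g<number>",
    ),
]


def resolve(path: str):
    """Return (status, location) for a request path.

    The first rule whose pattern matches the whole path wins and produces a
    303 See Other pointing at the institution's own server; otherwise 404.
    """
    for pattern, template in REDIRECT_RULES:
        match = pattern.fullmatch(path)
        if match:
            return 303, match.expand(template)
    return 404, None


if __name__ == "__main__":
    print(resolve("/uq/collectionX/12345"))
    print(resolve("/unknown/collection/1"))
```

Keeping the rules as data means an institution that changes servers only has to update its replacement template; the public URIs stay the same.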
This third option means that the GBIF database can serve as a central spot where movements of specimens (i.e., the assignment of a new accession number) can be put. Hopefully not the only spot, though. Best practice is always to serve up the initial and the immediately prior URI along with any URI you give to the specimen. (This only makes sense for RDF, though: you can't just "add" things to a nicely formatted HTML page.)
To make all this happen, you would want some sort of usable machine-to-machine service, and you'll have to manage authentication. (Passwords are a bit of a pain - perhaps a cryptographic certificate given out when the namespace prefix is assigned? Easy enough to do.) You'll want a test/staging service and a real service …
It's a fair bit of work, come to think of it, just on the technical side, and this is without starting on the "part-of" issues.
(Perhaps "uri.gbif.org" as the virtual host name? http://uri.gbif.org/institution-code/collection-id/number. We'd also like a URI for "the list of institutions" and for each institution "the list of collections". Perhaps reserve "meta"? Thus http://uri.gbif.org/uq/meta, http://uri.gbif.org/uq/collectionX/meta as the well-known locations for README information.)
(Allocation of URIs would cover more than just specimens. Here at biodiversity.org.au, we use dotted names rather than slashes for our namespaces, meaning that our URIs have natural LSID equivalents. I think LSID components can have slashes, so urn:lsid:uri.gbif.org:uq/collectionX:12345)
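To make the proposed layout concrete, here is a small sketch that builds the specimen URI, the reserved "meta" locations, and the LSID equivalent. The host uri.gbif.org is only floated as a possibility in the message, the function names are invented, and whether LSID components may contain slashes is left as uncertain here as it is in the text.

```python
BASE = "http://uri.gbif.org"  # suggested virtual host; hypothetical
RESERVED = "meta"             # segment proposed for README-style information


def _check(*segments: str) -> None:
    """Reject codes that would collide with the reserved 'meta' segment."""
    for segment in segments:
        if segment == RESERVED:
            raise ValueError(f"'{RESERVED}' is reserved and cannot be used as a code")


def specimen_uri(institution: str, collection: str, number: str) -> str:
    _check(institution, collection)
    return f"{BASE}/{institution}/{collection}/{number}"


def meta_uri(institution: str, collection: str = "") -> str:
    """Well-known location for an institution, or for one of its collections."""
    _check(institution, collection)
    parts = [institution] + ([collection] if collection else []) + [RESERVED]
    return BASE + "/" + "/".join(parts)


def lsid_equivalent(institution: str, collection: str, number: str) -> str:
    """LSID form from the message; assumes slashes are allowed in the namespace."""
    _check(institution, collection)
    return f"urn:lsid:uri.gbif.org:{institution}/{collection}:{number}"


if __name__ == "__main__":
    print(specimen_uri("uq", "collectionX", "12345"))
    print(meta_uri("uq"))
    print(meta_uri("uq", "collectionX"))
    print(lsid_equivalent("uq", "collectionX", "12345"))
```

Reserving "meta" at both levels does mean that no institution or collection can ever use it as a code, which is the kind of rule the registry would have to enforce when handing out prefixes.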
-- Steven J. Baskauf, Ph.D., Senior Lecturer Vanderbilt University Dept. of Biological Sciences
postal mail address: VU Station B 351634 Nashville, TN 37235-1634, U.S.A.
delivery address: 2125 Stevenson Center 1161 21st Ave., S. Nashville, TN 37235
office: 2128 Stevenson Center phone: (615) 343-4582, fax: (615) 343-6707 http://bioimages.vanderbilt.edu

I'm new to posting on this list, but I've been reading with interest. We are one of the relatively small institutions setting up GUIDs for our specimens. We do intend to assign web-actionable URIs to each of our specimens. Although we are small, our specific collection and institutional umbrella are both old and we (and our IT support) expect to be around for the foreseeable future.
Our interest in a centralized service such as BCI has more to do with our uncertainty about the stability of the 'domain' portion of the URI. We'd like to maintain full control over assigning/editing/updating the particular specimen records, but have the assurance that IF our domain were to require a change (something we'd do everything in our power to avoid, but which, as with every domain, could happen), the central service could redirect our URIs to a new server without requiring any change in the alphanumeric string that makes up the URI itself. I am relatively new to this subject, but it was my understanding that this is what DOIs do for publications.
The example given by Roger, while true for many cases, doesn't seem right for all cases, especially when the referenced publication exists in only a few copies (or one). That is, as the referenced object moves from 'mass produced' to 'singular', the importance of retrieving the data/metadata from the owner of the object increases.
Secondly, I can think of many cases where a person who has the identifier of a specimen might not know the location or the determination of the specimen. An author (or publisher) might wish to cite (and provide a web-actionable identifier for) a specimen that has not been precisely identified (in fact, there is good reason to think that these cases are exactly the ones where access to the specimen itself would be most critical). Molecular studies and ecological studies come to mind: these publications rarely provide the location or determination information. In fact, we'd be lucky IF they'd provide a web-actionable specimen identifier so this information could be retrieved.
I agree with Rod that a centralized approach to retrieving this data is a must (such as the DOI), but the metadata/data should remain with (and be editable/manageable by) the local authorities that control the objects themselves. I think if there were a 'shared' authority/domain namespace in the specimen identifier (like the first component of the DOI), this would be something our collection would consider adopting OVER our own domain/authority (or simply creating our own PURL).
Apologies if what I've written above disregards past posts/comments; as I mentioned, I'm relatively new to the TDWG newsgroup and haven't read through all the archives.
-Chris
On Feb 28, 2012, at 4:32 AM, Roderic Page wrote:
So I'm going to insist on muddying the waters. I think we are talking about different parts of the same thing, albeit from different perspectives.
I want identifiers for specimens so I can talk about them (i.e., say that this specimen was cited in these publications, and is the source of these sequences, and is shown in these images). I have lots of sources, such as BHL and GenBank where specimens are listed using various codes (which vary among sources, sigh). To figure out what these are, and whether they are the same specimen I want a service that tells me what AMNH 146335 is. I'd like it to give me an identifier that I can use to link this stuff together. I want to do this for lots of specimens from multiple sources. The only way I can see this being tenable is if there is a central aggregation of metadata, such as GBIF.
In the same way, if publishers are going to start marking up specimen codes in articles, I'm guessing they want the same kind of service. It would be nice if authors did this themselves, but I doubt that will happen anytime soon (how many people use DOIs in their list of references?). This is one reason CrossRef exists, to take the citation strings from our articles and convert them into links (building a citation network in the process).
I also want some assurance that if I go to the trouble of finding these identifiers (and publishing them on my websites) that they will not simply disappear. Right now I don't trust museums to keep these identifiers alive (hell, I don't trust GBIF to do this because it's clear that millions of their records are duplicates).
If you want to get more data about the specimen, yes following a link to the primary source would be great. But how do I discover that link? If we had every museum specimen with it's own URL, I still have to find them. Ideally authors will include them, but this assumes authors themselves know them, and that they and publishers trust them sufficiently to use them.
So decentralised begats centralised. I think it's a case of having both, and I'd argue that the kind of benefits we put forward when trying to convince the powers that be that specimen identifiers are a good thing are going to need both parts of the equation.
I think this is a case of the genius of "AND". We need both parts to make this useful
Regards
Rod
On 28 Feb 2012, at 10:23, Roger Hyam wrote:
I'm trying not to get sucked into this discussion but thank you for all the kind words about BCI - flattery will get you almost anywhere!
I'll say my tuppence worth but I have not followed everything so please excuse me if I am out of line.
I am just working on a contribution to a paper that I hope will sum up these thoughts.
Basically I am nervous about any middleman approach to issuing identifiers for specimens. For publications it is different as one may be able to retrieve the actual works from several places. If a DOI resolves to metadata about the work that is often enough because the metadata can be used to retrieve the actual publication from a library somewhere even if the publishers site is gone.
If you are reading a paper that talks about a specimen and you want to find out more about the specimen you invariably already have the key metadata in the paper (location and recent determination) what you want to do is actually see the specimen from the authoritative source. To do this you need to resolve an identifier back to that source. It doesn't matter if you have a middle man running a DOI/LSID/PURL service you still need the target *data* HTTP URI to be live or the link is "broken".
Collections must maintain live HTTP URIs for each specimen for click through to raw data to work. There are no quick third party fixes.
Most specimens are in big collections and doing this is a matter of education and resource prioritisation not total lack of resources.
Talking about middleman solutions just clouds the water because managers begin to think they can outsource the solution and it will go away. They can't. Maintaining an online catalogue is now a core curation task.
None of this precludes the fact that we need big indexes and services linking things together but again this is different from publications - specimens don't contain many links to other specimens whereas it is a major feature of publications. Specimens tend to have pointers to them and not to point to other things.
Enough already. I have some deadlines.
Roger
On 27 Feb 2012, at 21:59, Roderic Page wrote:
Dear Steve,
I like BCI -- Roger Hyam did a very nice job creating this service. Indeed, I think Roger was offering to set up something rather like what you describe (see http://www.biocol.org/static/bcisgs.html ).
BCI would be one way to create a namespace for specimen identifiers. As always, there's more than one such tool in our community. The Repository of Biological Repositories (http://biorepositories.org/) is a similar service from the barcoding community, and I gather there are moves to try and integrate these two resources (sigh). The other consideration would be how the BCI identifiers actually map to digital resources at the institutions (for example do the BCI identifiers map onto the dataset ids that GBIF has for each collection?).
Let's hope that implementing resolvable specimen identifiers does not the typical fives years to actually happen...
Regards
Rod
On 27 Feb 2012, at 21:17, Steve Baskauf wrote:
In all of this discussion I am surprised that there has been no mention of Biodiversity Collections Index (BCI; http://www.biodiversitycollectionsindex.org/). To my knowledge, it has never been "down" for any significant period of time and has an extremely comprehensive listing of collections. Any collection that isn't there can be added in a matter of a few minutes.
The reason why URLs are globally unique is because a centralized authority (ICANN) makes sure that no two entities can have the same domain name. It is the responsibility of the domain owner to not have two URLs that are the same within that domain. In other words, the domain owner makes sure that they identify their resources using locally unique identifiers which in combination with the domain name creates a globally unique identifier.
BCI essentially performs an analogous function to ICANN in the biodiversity informatics community. It assigns a unique number to each collection and ensures that no two collections can have the same number. It slaps that number onto the end of the string "urn:lsid:biocol.org:col:" to create an LSID and onto the end of "http://biocol.org/urn:lsid:biocol.org:col:" to create an HTTP URI, both of which are globally unique, actionable (in their own ways), and persistent.
All of the hand wringing about people changing their collection codes or institution codes, or about two institutions in different fields (or units within the same institution) having the same institution codes goes away if we simply use the BCI-assigned number to identify the collection. Within a particular collection, it is the institution's responsibility to create and maintain locally unique identifiers for their specimens. BCI has a systematic way to relate subcollections within collections (each with their own identifier) and a large institution with subcollections would just have to delegate at what level the coordination of locally unique identifiers would be done. Nobody outside the institution can do it for them - they just need to bear the responsibility to stick with a system and not change it.
I mention this because there are really three categories of specimen-containing institutions: 1. Those with enough stability and the financial and IT resources to generate and provide dereferencing for their own actionable GUIDs. 2. Those with the ability to generate and maintain a database of non-HTTP-dereferenceable globally unique identifiers (I'm thinking about UUIDs or UUIDs that are part of LSIDs) and to associate them with specimens in their database, but which do not have the IT infrastructure or the inclination to provide actionability for their globally unique identifiers. 3. Those who have a system of assigning locally unique identifiers (I'm thinking bar codes) to their specimens but who because of small size will probably never have sophisticated IT capabilities nor the ability to provide dereferencing for actionable GUIDs.
Either categories 2 or 3 would include institutions that do not have control over a stable domain name or which have institutional restrictions on the use of domain names that would preclude use of their domain name as part of an HTTP URI.
Category 1 institutions create HTTP URI GUIDs using their domain names and do whatever they want as far as the locally unique part of their GUID is concerned. Their freedom comes with the responsibility of providing dereferencing under their domain name forever.
Category 2 and 3 institutions create globally unique and persistent, but not (yet) dereferenceable identifiers with the hope of transforming them into HTTP URIs at a later time. Category 2 institutions have this already in the form of their UUIDs. Category 3 institutions create their own globally unique identifiers by means of a simple rule: "place the BCI number for our collection, followed by a slash, in front of our locally unique identifier" (e.g. "15590/" for the LSU herbarium + "LSU00000434" for the barcode to create "15590/LSU00000434" as an identifier for the specimen shown at http://images.cyberfloralouisiana.com/images/specimensheets/lsu/0/0/4/34/LSU...). Category 3 institutions go to BCI and write in the "note" for their collection what their rule is and then anybody who knows the barcode (or accession number or whatever kind of locally unique number they commit to) for the specimen knows the non-actionable globally unique identifier. If the institution already consistently uses a "Darwin Core triple" (institutionID:collectionID:catalogNumber) as a "poor-man's GUID" in their database, they could slap "the BCI number for our collection, followed by a slash" in front of it to guarantee that it didn't clash with any others Darwin Core triples.
As for the transformation of the non-actionable globally unique identifiers created by category 2 and 3 institutions into actionable ones, a benevolent large institution (let us assume GBIF) who is willing to take on the job of providing dereferencing services for the category 2 and 3 institutions acquires "http://purl.org/specimen/" (or some other purl.org name) else if that's already taken) to use as the means to create the HTTP-proxied forms of the non-actionable globally unique identifiers. I suggest using a purl.org prefix rather than using a subdomain of gbif.org in the event that in the next hundred years gbif looses their funding or gets tired of providing this service. (See http://www.nbii.gov/termination/index.html for an example of how a big program with a nearly 20 year history can disappear in a puff of political idiocy.) If necessary, the "http://purl.org/specimen/" prefix could get passed over to some other big benevolent institution without requiring GBIF to give control of part of their domain to a non-GBIF entity.
Now we have another simple rule. If we discover an identifier that has http:// at its front end, we dereference it to access its metadata. If we discover an identifier which we think represents a specimen that does not begin with "http://", we try putting "http://purl.org/specimen/" on the front of it. If nothing happens we are no worse off than before. If we are lucky, we get metadata. Preferably the proxy system would get established quickly and we would tell the type 3 institutions to place "http://purl.org/specimen/" + the BCI number for our collection, followed by a slash, in front of our locally unique identifier". But if in typical TDWG fashion it takes five years to decide to do this, the small institution still has an identifier (in the form of the non-actionable identifier) guaranteed to be globally unique among identifiers generated by institutions who agree to abide by this set of rules. In any case, we don't risk mucking up the Linked Data cloud with a bunch of synonymous URIs that need to be linked with owl:sameAs, since the UUIDs and category 3 globally unique identifiers can't be used as URI references in RDF. One could later write in RDF:
<rdf:Description rdf:about="http://purl.org/specimen/15590/LSU00000434">
  <dc:identifier>15590/LSU00000434</dc:identifier>
</rdf:Description>
to make sure that semantic clients understand that the non-URI globally unique identifier is associated with the proxied version.
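The "simple rule" described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the prefix "http://purl.org/specimen/" is the one proposed in this message and is not known to be registered, and the function names are hypothetical.

```python
# Prefix proposed in the message; an assumption, not a registered PURL domain.
PROXY_PREFIX = "http://purl.org/specimen/"


def proxied_uri(bci_number: str, local_id: str) -> str:
    """Prefix + BCI number for the collection + '/' + locally unique identifier."""
    return f"{PROXY_PREFIX}{bci_number}/{local_id}"


def candidate_uri(identifier: str) -> str:
    """Return the URI a client would try to dereference for an identifier.

    Identifiers that already begin with http:// are used as they are;
    anything else gets the proxy prefix put on the front of it.
    """
    if identifier.startswith("http://"):
        return identifier
    return PROXY_PREFIX + identifier


if __name__ == "__main__":
    print(candidate_uri("15590/LSU00000434"))
```

A client would then attempt an HTTP GET on the result; if nothing comes back, it is no worse off than before.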
There would be technical details to figure out how the information about the specimens would be transferred between the smaller data-providing institution and the benevolent provider of dereferencing, but people are already doing that with GBIF so it doesn't seem so impossible to imagine that this could be worked out.
The unveiling of BCI was done with great fanfare and it is one of the few biodiversity-related resources which actually follows all of the rules about persistent, actionable, and unique identifiers. Yet it rarely gets mentioned any more. Let's leverage it.
Steve
On 2/26/2012 9:27 PM, Paul Murray wrote:
On 25/02/2012, at 4:29 AM, Dean Pentcheff wrote:
This is directly in response to Rod's response to Paul. I think the two of you may have just articulated nearly the same idea, though you seem not to think you did.
Paul envisions institutions each declaring their own URI-creating formula (to resolve down to a specimen at that institution), promulgated at a "forum" location.
Rod envisions URI formulation as happening at a GBIFesque centralized site.
If Paul's forum were GBIF (or similar), with an added function that GBIF (or similar) renegotiates any institutional declaration that collides with a pre-existing declaration, does that map to the same thing for both of you?
Well, if institutions are assigning URIs with their own domain names in them, or if GBIF is handing out URI prefixes that the institutions use, then collisions wouldn't be an issue.
As a technical person, perhaps I don't quite see things from the point of view of institutions whose interest in the web stops at having a pretty website, as someone suggested. It seems to me the easiest thing in the world to spark up a server and say "these are our URIs". But if people are outsourcing their web presence, then I can appreciate that creating a SemWeb presence might not seem as easy a thing to do to them. This is also the case for people who live in large institutions with byzantine rules about what may and may not go on the corporate websites.
If there are places where the issuing of ids to specimens is as chaotic as Rod describes, well - I think the flip side of what I was saying earlier, that people that create the numbers can easily create URIs, is that if the people who create the numbers have bits and bobs all over the place, then an external institution like GBIF is not going to be able to sort it out remotely. Someone has to be on the ground, treading the dusty caverns under the museum, their feeble yellowish torch beam counterpoint to the flickering and burned-out bluish fluorescent lights above, flicking the spiders away and copying labels into their iPad and working out what's what, trying not to accidentally kick over the skeletons.
Or the equivalent in cyberspace - the forgotten databases with their cryptic column names distant echoes of those hidden recesses where the specimen boxes are packed.
A start might be:
* GBIF issues URI prefixes to people/institutions that want them. A system for doing this would need to be decided on, and that will involve (shudder) people.
* GBIF advises the institution on setting up the namespace under that, trying to make the point that URIs should be persistent, unique, all those good things.
* GBIF acts as a registry for these namespaces, a place to declare "if you have a specimen record from collection X, then for sem-web purposes the URI should look like *this*" - allowing all that legacy data to be knitted together.
The GBIF webserver might manage incoming HTTP requests by
* holding some very basic, minimal data - even just a dcterms:title and nothing else
* or, 303 redirecting to the institution's own webserver (much in the manner of a PURL server) according to rules expressed simply as a regular expression find/replace
* or, fetching the RDF from the institution's server, and ADDING some RDF facts of its own to the result
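As a rough illustration of the second option (303 redirects driven by regular expression find/replace rules), something like the following might do. The rule table, the target host specimens.example.edu and the function names are all made up for the example; a real registry would hold one rule per registered namespace.

```python
import re
from typing import Dict, List, Optional, Tuple

# Hypothetical registry: (pattern matched against the request path,
# replacement template giving the institution's own URL).
REDIRECT_RULES: List[Tuple[str, str]] = [
    (r"^/uq/collectionX/(\d+)$", r"http://specimens.example.edu/collectionX/\1"),
]


def redirect_target(path: str, rules: List[Tuple[str, str]] = REDIRECT_RULES) -> Optional[str]:
    """Return the rewritten URL for the first rule that matches, else None."""
    for pattern, replacement in rules:
        if re.match(pattern, path):
            return re.sub(pattern, replacement, path)
    return None


def respond(path: str) -> Tuple[int, Dict[str, str]]:
    """Return (status code, headers): 303 with a Location header, or 404."""
    target = redirect_target(path)
    if target is None:
        return 404, {}
    return 303, {"Location": target}
```

Keeping the rules as plain data means a new institution can be added to the registry without touching the server code.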
This third option means that the GBIF database can serve as a central spot where movements of specimens (i.e., the assignment of a new accession number) can be put. Hopefully not the only spot, though. Best practice is always to serve up the initial and the immediately prior URI along with any URI you give to the specimen. (This only makes sense for RDF, though: you can't just "add" things to a nicely formatted HTML page.)
To make all this happen, you would want some sort of usable machine-to-machine service, and you'll have to manage authentication. (Passwords are a bit of a pain - perhaps a cryptographic certificate given out when the namespace prefix is assigned? Easy enough to do.) You'll want a test/staging service and a real service …
It's a fair bit of work, come to think of it, just on the technical side, and this is without starting on the "part-of" issues.
----------------
(Perhaps "uri.gbif.org" as the virtual host name? http://uri.gbif.org/institution-code/collection-id/number. We'd also like a URI for "the list of institutions" and for each institution "the list of collections". Perhaps reserve "meta"? Thus http://uri.gbif.org/uq/meta, http://uri.gbif.org/uq/collectionX/meta as the well-known locations for README information.)
(Allocation of URIs would cover more than just specimens. Here at biodiversity.org.au, we use dotted names rather than slashes for our namespaces, meaning that our URIs have natural LSID equivalents. I think LSID components can have slashes, so urn:lsid:uri.gbif.org:uq/collectionX:12345)
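The URI layout floated above could be generated along these lines. This is a sketch only: "uri.gbif.org" is a suggested host name rather than a live service, the function names are hypothetical, and whether an LSID namespace component may really contain a slash is the guess made above, not something checked against the LSID specification.

```python
HOST = "uri.gbif.org"  # suggested virtual host name; an assumption, not a real service
BASE = f"http://{HOST}"


def specimen_uri(institution: str, collection: str, number: str) -> str:
    """http://uri.gbif.org/institution-code/collection-id/number"""
    return f"{BASE}/{institution}/{collection}/{number}"


def meta_uri(institution: str, collection: str = "") -> str:
    """Well-known README location, using the reserved word "meta"."""
    parts = [institution, collection, "meta"] if collection else [institution, "meta"]
    return f"{BASE}/" + "/".join(parts)


def lsid(institution: str, collection: str, number: str) -> str:
    """LSID equivalent, keeping the slash inside the namespace part (unverified)."""
    return f"urn:lsid:{HOST}:{institution}/{collection}:{number}"


if __name__ == "__main__":
    print(specimen_uri("uq", "collectionX", "12345"))
    print(meta_uri("uq"))
    print(lsid("uq", "collectionX", "12345"))
```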
_______________________________________________ tdwg-tag mailing list tdwg-tag@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-tag
-- Steven J. Baskauf, Ph.D., Senior Lecturer Vanderbilt University Dept. of Biological Sciences
postal mail address: VU Station B 351634 Nashville, TN 37235-1634, U.S.A.
delivery address: 2125 Stevenson Center 1161 21st Ave., S. Nashville, TN 37235
office: 2128 Stevenson Center phone: (615) 343-4582, fax: (615) 343-6707 http://bioimages.vanderbilt.edu
Christopher Marshall Curator & Collections Manager Oregon State Arthropod Collection Zoology - Oregon State University Corvallis OR, 97331-2914 marshach@science.oregonstate.edu

One further reason for centralisation (again, not "instead of" but "as well as") is consistency of metadata. When I'm mapping specimen codes to GBIF I have one query interface and one return format. If I have to go to individual providers then all bets are off. Perhaps I'm lucky and the provider supports something like linked data, so I can figure out how to retrieve data (as opposed to a human-friendly web page). But instead I expect we will have all sorts of formats. For example, today I discovered records in GenBank that are linked to a tissue database with web pages like this: http://collections.nhm.ku.edu/KU_Tissue/detail.jsp?record=367 (from sequence http://www.ncbi.nlm.nih.gov/nuccore/FJ215165 ). So, I have to write code to scrape this page and get the bit I need (the voucher code). Really? In this day and age? On the one hand it's great that this information exists, but if it's not computer readable then it is harder to integrate the data. Even if we use standard vocabularies we can still have problems. BigDig found a whole range of different versions of Darwin Core in the wild (see http://bigdig.ecoforge.net/wiki/SchemaStatus ), and I suspect this is one of the sources of GBIF's problems (whoever decided that catalogNumber and catalogNumberText were a good idea has a lot to answer for). This is one reason I argue that we want both centralisation and decentralisation. Regards Rod
participants (11)
- Bob Morris
- Christopher Marshall
- Dean Pentcheff
- Gregor Hagedorn
- Kevin Richards
- Matt Jones
- Paul Murray
- Richard Pyle
- Roderic Page
- Roger Hyam
- Steve Baskauf