- tdwg-content - lists.tdwg.org

Re: Taxonomic Search Engine - Now with GenBank
by Bob Morris 26 Oct '04

Nice. It would be great if you exposed this as a Web Service. There seem to be several frameworks for doing this with PHP applications.

Bob Morris

Roderic D. M. Page wrote:
> The Taxonomic Search Engine I recently developed
> (http://darwin.zoology.gla.ac.uk/~rpage/portal/ ) now queries the
> GenBank taxonomy, in addition to ITIS, Index Fungorum, uBio, and
> IPNI. Although GenBank is not an authoritative source of taxonomic
> names, they do have many names not in these other databases.
>
> Regards
>
> Rod
>
> --
> --------------------------------------------------------
> Professor Roderic D. M. Page
> Editor Elect, Systematic Biology
> DEEB, IBLS
> Graham Kerr Building
> University of Glasgow
> Glasgow G12 8QP
> United Kingdom
>
> Phone: +44 141 330 4778
> Fax: +44 141 330 2792
> email: r.page(a)bio.gla.ac.uk
> web: http://taxonomy.zoology.gla.ac.uk/rod/rod.html
> reprints: http://taxonomy.zoology.gla.ac.uk/rod/pubs.html
>
> Subscribe to Systematic Biology through the Society of Systematic
> Biologists Website: http://systematicbiology.org

--
Robert A. Morris
Professor of Computer Science
UMASS-Boston
ram(a)cs.umb.edu
http://www.cs.umb.edu/efg
http://www.cs.umb.edu/~ram
phone (+1)617 287 6466

Re: Taxonomic Search Engine - web services and LSIDs
by Joseph Poupin 12 Oct '04

Dear Roderic D. M. Page,

Congratulations on your 'Taxonomic Search Engine' portal. I maintain a database on tropical Crustacea on the Internet (link in my signature), and I would like to know whether your project could include (or point to) that database. For example, when I enter 'Calcinus' in your portal I am redirected to ITIS data. Would it also be possible to get the list of species that is returned when 'Calcinus' is entered in the genus field of the request form at: http://decapoda.free.fr/search_data.php ?

Sincerely.

Roderic D. M. Page wrote:
> As a test of various tools for querying multiple taxonomic name
> databases I've created a "Taxonomic Search Engine"
> (http://darwin.zoology.gla.ac.uk/~rpage/portal/ ) that may be of
> interest.
>
> The site will search external databases (currently ITIS, Index
> Fungorum, IPNI, and uBio) for a name. If it finds the name, you can
> click on the name to get more details, including a link to the
> original web site that provided the name.
>
> Some examples to try are "Morus", "Physeter macrocephalus", and "Apus
> apus".
>
> Behind the scenes the application talks to each database in turn and
> outputs the result in a consistent format (or, as consistent as the
> very different database outputs will allow).
>
> The search engine also outputs Life Science Identifiers (LSIDs) for
> each name. Followers of recent discussions on "GUIDs" (Globally
> Unique Identifiers) for taxonomic names will know that LSIDs are one
> candidate for assigning a GUID to a name. The search engine provides
> one way to explore the utility of LSIDs (including links to tools for
> viewing them).
>
> I'd welcome comments/feedback. Future plans include adding more
> databases and improving search performance using caching.
>
> Regards
>
> Rod Page
>
> --
> --------------------------------------------------------
> Professor Roderic D. M. Page
> Editor Elect, Systematic Biology
> DEEB, IBLS
> Graham Kerr Building
> University of Glasgow
> Glasgow G12 8QP
> United Kingdom
>
> Phone: +44 141 330 4778
> Fax: +44 141 330 2792
> email: r.page(a)bio.gla.ac.uk
> web: http://taxonomy.zoology.gla.ac.uk/rod/rod.html
> reprints: http://taxonomy.zoology.gla.ac.uk/rod/pubs.html
>
> Subscribe to Systematic Biology through the Society of Systematic
> Biologists Website: http://systematicbiology.org

--
____________________________________________________
POUPIN Joseph - Systèmes d'Information Géographiques
Institut de Recherche de l'Ecole Navale
Ecole Navale et GEP, LANVEOC - POULMIC
BP 600, F29240 BREST ARMEES FRANCE
02 98 23 37 57 - Fax (33) 02 98 23 38 57 - poupin(a)ecole-navale.fr
Polynesian Decapoda at: http://decapoda.free.fr/

Taxonomic Search Engine - web services and LSIDs
by Roderic D. M. Page 12 Oct '04

As a test of various tools for querying multiple taxonomic name databases I've created a "Taxonomic Search Engine" (http://darwin.zoology.gla.ac.uk/~rpage/portal/ ) that may be of interest.

The site will search external databases (currently ITIS, Index Fungorum, IPNI, and uBio) for a name. If it finds the name, you can click on the name to get more details, including a link to the original web site that provided the name.

Some examples to try are "Morus", "Physeter macrocephalus", and "Apus apus".

Behind the scenes the application talks to each database in turn and outputs the result in a consistent format (or, as consistent as the very different database outputs will allow).

The search engine also outputs Life Science Identifiers (LSIDs) for each name. Followers of recent discussions on "GUIDs" (Globally Unique Identifiers) for taxonomic names will know that LSIDs are one candidate for assigning a GUID to a name. The search engine provides one way to explore the utility of LSIDs (including links to tools for viewing them).

I'd welcome comments/feedback. Future plans include adding more databases and improving search performance using caching.

Regards

Rod Page

--
--------------------------------------------------------
Professor Roderic D. M. Page
Editor Elect, Systematic Biology
DEEB, IBLS
Graham Kerr Building
University of Glasgow
Glasgow G12 8QP
United Kingdom

Phone: +44 141 330 4778
Fax: +44 141 330 2792
email: r.page(a)bio.gla.ac.uk
web: http://taxonomy.zoology.gla.ac.uk/rod/rod.html
reprints: http://taxonomy.zoology.gla.ac.uk/rod/pubs.html

Subscribe to Systematic Biology through the Society of Systematic Biologists Website: http://systematicbiology.org
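
The pattern described in this message (talk to each database in turn, then emit results in one consistent format with an LSID per name) might look roughly like the sketch below. It is only an illustration in Python, not the portal's actual PHP code: the adapter functions, the raw row shapes, the record fields, and the `example.org` LSID authority are all assumptions, and no real service is contacted. The only thing taken as given is the general LSID layout `urn:lsid:authority:namespace:object`.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class NameRecord:
    """Common output format shared by every source (fields are assumed)."""
    source: str     # which database supplied the name
    name: str       # the taxonomic name as returned
    local_id: str   # the identifier used inside that database
    lsid: str       # urn:lsid:<authority>:<namespace>:<object>


def make_lsid(authority: str, namespace: str, object_id: str) -> str:
    """Build an LSID string in the urn:lsid:authority:namespace:object form."""
    return f"urn:lsid:{authority}:{namespace}:{object_id}"


# Hypothetical stand-ins for per-database adapters. Each returns raw rows in
# its own invented shape; a real adapter would call the remote service.
def fake_itis(query: str) -> List[dict]:
    return [{"tsn": "0001", "combinedName": query}]


def fake_ipni(query: str) -> List[dict]:
    return [{"id": "0002-1", "full_name": query}]


def from_itis(row: dict) -> NameRecord:
    return NameRecord("ITIS", row["combinedName"], row["tsn"],
                      make_lsid("example.org", "itis", row["tsn"]))


def from_ipni(row: dict) -> NameRecord:
    return NameRecord("IPNI", row["full_name"], row["id"],
                      make_lsid("example.org", "ipni", row["id"]))


# source name -> (search function, normaliser for that source's rows)
ADAPTERS: Dict[str, Tuple] = {
    "ITIS": (fake_itis, from_itis),
    "IPNI": (fake_ipni, from_ipni),
}


def search_all(query: str) -> List[NameRecord]:
    """Query each source in turn and return records in one common format."""
    results: List[NameRecord] = []
    for search, normalise in ADAPTERS.values():
        results.extend(normalise(row) for row in search(query))
    return results
```

Adding another database under this design means writing one search function and one normaliser; the caller of `search_all` never sees the source-specific row shapes.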

Re: Globally Unique Identifier
by Richard Pyle 06 Oct '04

Not enough time to respond in full, but one comment:

> I am more reserved about demands to provide a central registry for
> taxon concepts (or "derived/secondary" taxon concepts, if the
> nomenclatural act of creating a name itself is considered a taxon
> concept as well):

My solution to this, which I will describe in more detail in my presentation at TDWG, is to assign the GUID to every "Name+Reference" instance ("Reference" here defined broadly; not restricted to publications). A subset of these instances will be "name-bearing" instances (i.e., "original taxonomic descriptions"), recursively serving as the "Name" part of the Name+Reference instances. Another (overlapping) subset of these would be "concept-bearing" instances. Still others may simply be specimen determination labels. They all represent a documented use of a taxonomic name by a human (or set of humans).

The idea is that taxonomic names do not exist outside of a usage context, and that the usage context is usually objectively discernible and "reusable" (and as such, well-suited for shared universal GUID assignment). Keeping it broad and simple like this allows the same GUID pool to be used for a variety of applications (handles to names and concepts, for the purposes of constructing nomenclatural synonymies, mapping concepts, applying names/concepts to specimens, etc.)

Must....get....some....sleep.....

Aloha,
Rich

Richard L. Pyle, PhD
Natural Sciences Database Coordinator, Bishop Museum
1525 Bernice St., Honolulu, HI 96817
Ph: (808)848-4115, Fax: (808)847-8252
email: deepreef(a)bishopmuseum.org
http://www.bishopmuseum.org/bishop/HBS/pylerichard.html
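
One possible reading of the "Name+Reference" idea above, as a small Python data structure. This is a sketch under stated assumptions, not the model from the TDWG presentation: the class name `Usage`, its fields, the `kind` labels, and the choice of random UUIDs as GUIDs are all hypothetical. The one point it tries to capture is that every documented use gets its own GUID, and that a name-bearing instance serves as the "Name" part of later instances.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class Usage:
    """One documented use of a name in a reference (a Name+Reference instance)."""
    name_string: str
    reference: str                        # broad sense: paper, label, database...
    protonym: Optional["Usage"] = None    # None means this usage is name-bearing
    kind: str = "usage"                   # assumed labels, e.g. "concept", "label"
    guid: str = field(default_factory=lambda: str(uuid.uuid4()))

    def name_bearing(self) -> "Usage":
        """Follow the 'Name' part back to the original description."""
        return self if self.protonym is None else self.protonym.name_bearing()


# Placeholder name and references, purely for illustration.
original = Usage("Aus bus", "original description", kind="name-bearing")
revision = Usage("Aus bus", "later revision", protonym=original, kind="concept")
label = Usage("Aus bus", "specimen determination label", protonym=original,
              kind="label")
```

All three objects draw on the same GUID pool, and both `revision.name_bearing()` and `label.name_bearing()` lead back to `original`.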

Meeting Christchurch
by Gregor Hagedorn 05 Oct '04

This is just a reminder to those who can make it to the TDWG 2004 meeting in Christchurch, NZ: After the main meeting, on Saturday 16th, Sunday 17th, and Monday 18th, there will be an SDD workgroup meeting. The official schedule (http://www.tdwg.org/2004meet/TDWG_2004_ScheduleOverview.htm) is a bit confusing in that it no longer mentions the Monday that was originally listed.

The main topics will be:

1. A discussion of the new Lucid version and what SDD and Lucid can learn from a detailed comparison
2. A discussion of the CIPRES program and how SDD can serve this large-scale program of the phylogenetic community.

The meeting is open to all those interested.

Gregor
----------------------------------------------------------
Gregor Hagedorn (G.Hagedorn(a)bba.de)
Institute for Plant Virology, Microbiology, and Biosafety
Federal Research Center for Agriculture and Forestry (BBA)
Königin-Luise-Str. 19           Tel: +49-30-8304-2220
14195 Berlin, Germany           Fax: +49-30-8304-2203

Re: Globally Unique Identifier
by Gregor Hagedorn 05 Oct '04

> Regarding assignment of GUIDs to electronic records rather than
> physical specimens -- do you feel the same way about taxonomic names?
> I'd hate to have 5-10 ID numbers for every taxon name (e.g., one
> generated by GBIF, one generated by ITIS, one generated by
> Species2000, etc.) My understanding of the whole point of BioGUIDs
> was to get away from this sort of duplication.

Yes and no. I think the question collapses in the case of nomenclatural records. I believe the codes should move to endorse nomenclatural databases as the authoritative providers of name records. Since these are data objects, a GUID is natural to them, and real world (which is abstract in the case of names anyways) and data world are congruent.

I am more reserved about demands to provide a central registry for taxon concepts (or "derived/secondary" taxon concepts, if the nomenclatural act of creating a name itself is considered a taxon concept as well): If somebody publishes a description of a taxon in Germany, printed or digital, perhaps providing a GUID of the nomenclatural data record the description assigns itself to, perhaps providing a DOI for its publication - I see no reason to go to a separate database, create a taxon concept record there, and then cite it back in your own digital publication. Jessie Kennedy and I differ on this point.

Gregor
----------------------------------------------------------
Gregor Hagedorn (G.Hagedorn(a)bba.de)
Institute for Plant Virology, Microbiology, and Biosafety
Federal Research Center for Agriculture and Forestry (BBA)
Königin-Luise-Str. 19           Tel: +49-30-8304-2220
14195 Berlin, Germany           Fax: +49-30-8304-2203

Re: Globally Unique Identifier
by Gregor Hagedorn 05 Oct '04

Richard's points about unifying to observations are very good and relevant to use. Indeed reexamination possibilities are not 100% dependent on collection versus observation-only data. The reexamination of a film or still picture is somewhat intermediate between the reexamination of field notes and of an actual specimen. Certain questions can be asked, others not. However, the main point is that these are properties of the observation and that more cases exist than museum or not.

I basically disagree about assigning GUIDs to physical objects, if these are not unambiguously attached to the object, rather than to one of possibly several data records. I have no problem if a barcode or RFID is a GUID rather than a local ID. I think it is not practical to demand that all museums change their physical system to a GBIF-chosen accession numbering system. It certainly would simplify life - I agree about desirability!

> I'm not sure I understand the question. I guess I would answer with
> another question: How does a Social Security Number (SSN) for a U.S.
> Citizen get attached to an individual person? I don't think anyone
> would think of a SSN as an identifier for a data object -- it is a
> unique identifier for the physical person.

Yes, but the association problem is solved here - and NOT by unreliable secondary data. You can ask a person to show you the government-produced SSN-card, which is at least somewhat difficult to falsify. If you can do the same with a physical specimen object - no problem. If you cannot, and have to "guess" the GUID by maiden name and last known address, SSN would be in trouble.

Gregor
----------------------------------------------------------
Gregor Hagedorn (G.Hagedorn(a)bba.de)
Institute for Plant Virology, Microbiology, and Biosafety
Federal Research Center for Agriculture and Forestry (BBA)
Königin-Luise-Str. 19           Tel: +49-30-8304-2220
14195 Berlin, Germany           Fax: +49-30-8304-2203

Re: Globally Unique Identifier
by Richard Pyle 05 Oct '04

I understand your points here better now, and will consider them carefully this week.

Regarding assignment of GUIDs to electronic records rather than physical specimens -- do you feel the same way about taxonomic names? I'd hate to have 5-10 ID numbers for every taxon name (e.g., one generated by GBIF, one generated by ITIS, one generated by Species2000, etc.) My understanding of the whole point of BioGUIDs was to get away from this sort of duplication.

Clearly this will be a major point of discussion in New Zealand -- I'm looking forward to it!

Aloha,
Rich

> -----Original Message-----
> From: TDWG - Structure of Descriptive Data
> [mailto:TDWG-SDD@LISTSERV.NHM.KU.EDU]On Behalf Of Gregor Hagedorn
> Sent: Monday, October 04, 2004 11:36 PM
> To: TDWG-SDD(a)LISTSERV.NHM.KU.EDU
> Subject: Re: Globally Unique Identifier
>
>
> Richard's points about unifying to observations are very good and
> relevant to use. Indeed reexamination possibilities are not 100%
> dependent on collection versus observation-only data. The
> reexamination of a film or still picture is somewhat intermediate
> between the reexamination of field notes and of an actual specimen.
> Certain questions can be asked, others not. However, the main point
> is that these are properties of the observation and that more cases
> exist than museum or not.
>
> I basically disagree about assigning GUIDs to physical objects, if
> these are not unambiguously attached to the object, rather than to one
> of possibly several data records. I have no problem if a barcode or
> RFID is a GUID rather than a local ID. I think it is not practical to
> demand all museums to change their physical system to a GBIF-chosen
> accession numbering system. It certainly would simplify life - I
> agree about desirability!
>
> > I'm not sure I understand the question. I guess I would answer with
> > another question: How does a Social Security Number (SSN) for a U.S.
> > Citizen get attached to an individual person? I don't think anyone
> > would think of a SSN as an identifier for a data object -- it is a
> > unique identifier for the physical person.
>
> Yes, but the association problem is solved here - and NOT by
> unreliable secondary data. You can ask a person to show you the
> government-produced SSN-card, which is at least somewhat difficult to
> falsify. If you can do the same with a physical specimen object - no
> problem. If you cannot, and have to "guess" the GUID by maiden name
> and last known address, SSN would be in trouble.
>
> Gregor
> ----------------------------------------------------------
> Gregor Hagedorn (G.Hagedorn(a)bba.de)
> Institute for Plant Virology, Microbiology, and Biosafety
> Federal Research Center for Agriculture and Forestry (BBA)
> Königin-Luise-Str. 19           Tel: +49-30-8304-2220
> 14195 Berlin, Germany           Fax: +49-30-8304-2203

Re: Globally Unique Identifier
by Gregor Hagedorn 04 Oct '04

Gregor wrote:
> > Do you mean those GLOPP organism-interaction-data that have
> > specimen voucher information can not be published/referenced in
> > GBIF until I figure out whether a collection has digitized them
> > (most have never digitized elsewhere!)?

Richard wrote:
> Not necessarily. I don't think the issue is whether or not the
> collection has been digitized, but rather whether GUIDs have already
> been assigned to the vouchers you want to document in the GLOPP
> dataset. So, if your question is more along the lines of "do I need to
> check to see if GUIDs have already been issued to voucher specimens
> that I cite, before I issue new GUIDs", then my answer -- in the long
> run, at least -- would be, "well....yes!" That's sort of the
> fundamental point of the GUIDs, isn't it? But I don't see this as
> being necessarily burdensome. For example, if your GLOPP dataset
> included unambiguous pointers to specific voucher specimens (e.g., via
> InstitutionCode+CollectionCode+CatalogNumber), then it *should* be a
> relatively quick and straightforward process to find out if GUIDs have
> already been assigned (if it's not quick & easy, then the GUID service
> would be horribly inadequate!)

I think this is the problem. No single record in our database has this information, and to my knowledge, most if not all of the physical specimen sheets referred to do not yet have a unique catalogue number. Adding truly unique catalogue numbers physically to specimens (as opposed to often non-unique batch accession numbers) has often not been done and is now done either during digitization, or it may in recent years also happen during loan processing.

However, most of the printed literature cites collection name (which may be historic, if collections are merged), taxon name, plus one of:
- simply the information that it is a type (often expressed by an exclamation mark after the collection acronym, indicating that a type has been studied)
- for non-types, 2-4 elements out of: collector, collection date, collector's field number and location.

Having collector plus collector's field number is relatively good (although uniqueness is up to the collector, some assign batch numbers for collection events), but again in my experience it is relatively rarely cited. None of the GLOPP records taken from literature cited a field number. The other fields are normally sufficiently unique if you go into the collection and see what is there, but they are a terrible key for any matching against a data service - at least the location is usually comparable by automatic string matching.

> If, on the other hand, the GLOPP
> dataset does not provide unambiguous pointers to specific voucher
> specimens, then the "vouchered" aspect of those specimen citations
> seems unsupported, in which case your GUIDs would need to be assigned
> to virtual/unvouchered "specimens" (analogous to observation records),
> and hence non-duplicate.

As said, if you go into the collection, it is easy to identify them. If you know that a fragment of the collection is completely digitized - as opposed to random digitization which only digitizes specimens recently loaned - you can manually identify it on the computer. I would guess every specimen identification takes at most 5 minutes and involves one or several queries and picking from the result list manually.

I believe most biologists will consider the specimen citations used in print until now a voucher - I do not agree with the blanket statement that this is something else. It is usually unambiguous - but not very good for machine processing. Contradictions?

> > when the collection starts to digitize
> > them, they would have to create for those that have already been
> > published in GLOPP a new version of the GLOPP LSID?
>
> I would hope that if you assigned GUIDs to GLOPP-relevant voucher
> specimens that belong to a collection that is not-yet digitized, you
> would do the courtesy of providing the manager of that collection with
> a listing of the GUIDs you created for the specific relevant
> specimens. I would further hope that, when that collection is
> eventually digitized, the manager would have the wherewithal to assign
> new GUIDs only to those specimens that did not yet have them.

The problem is: what would the collection manager do with such a list? It would be the problem in reverse: the list of GUIDs is not easy to connect to the physical specimens. Most collections where specimens can be handled attach unique numbers to the specimens during digitization (but perhaps even this is not true in some insect collections, where handling the specimen creates the danger of destroying it). The list delivered by GLOPP would contain information about specimens that have no such barcode/etc. number yet.

> accommodated in any GUID system that is developed. My main point is
> that such "redundant" GUID issuance should be minimized (i.e., never
> done intentionally), and quickly/easily identified as such whenever it
> is discovered.

Certainly not intentionally, but it should be clear that a museum should not start to prohibit the use of laptops when a Ph.D. candidate comes in and "digitizes" some specimens for a taxonomic revision. If the museum system supports it, it is wise to ask to use the museum system, but if the system is too complex and requires long training, rather have the monograph than nothing...

> So....if/when the situation does come up that (for example) GLOPP
> assigns GUIDs to vouchers on behalf of a non-digitized collection, and
> that collection later (inadvertently) re-assigns redundant GUIDs to
> the same set of specimens; that eventual discovery of this duplication
> should be accommodated by a mechanism for "retiring" one of the IDs
> into "objective synonymy" of the other ID, and automated systems
> should be implemented in the resolver service that "auto-forward" the
> retired ID to the active ID.

I think you could rather view this as an optional deduplication layer. Your specification explicitly contradicts at least the LSID specification's requirement that repeated retrieval return exactly the same data. If an analysis of GLOPP is based on some data - e.g. a misidentified host plant - and cites this, it should be recoverable. Being silently forwarded to different information only causes confusion.

So I prefer a view where GUIDs refer to data objects. I still do not see how you propose to attach them to the physical objects for those researchers working in the collection itself. A secondary service can then know about relationships of multiple data objects referring to the same physical object. This service may be able to find cross-references in the data itself, have smart methods to estimate uniqueness based even on location strings, or may have manually created cross-reference tables.

An important point is that different "deduplication" scenarios exist. For example, in culture collections, many strains are cross-preserved in multiple collections. So "CBS 123.88" may be "equal" to "ATCC 1234132" or "BBA 77123". Ideally we may even know the history: "BBA 77123" > "CBS 123.88" > "ATCC 1234132". However, the chance that any of these strains (which are like "versions") has been mixed up (or mutations occurred) is always there. Thus, if I look for duplication of the collection event data, I want to deduplicate. If I want to check a confusing DNA sequence, I may want to know about other derived strains, but I absolutely need to know exactly which strain from which collections was sequenced.

> For the most part, though -- I see these as "growing pains" of a GUID
> system during its first years of existence. I would predict that two
> decades from now, if one were to do an analysis of redundant GUIDs,
> one would find the bulk of those having been issued relatively early
> on.

I agree, but I probably think it is more relevant than you seem to think. I believe the "early days" will last the next 50 years - the time needed until collections are fully digitized *plus* the time it takes to make publication without citing GUIDs unacceptable.

Gregor
----------------------------------------------------------
Gregor Hagedorn (G.Hagedorn(a)bba.de)
Institute for Plant Virology, Microbiology, and Biosafety
Federal Research Center for Agriculture and Forestry (BBA)
Königin-Luise-Str. 19           Tel: +49-30-8304-2220
14195 Berlin, Germany           Fax: +49-30-8304-2203
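
The "secondary service" view in this message, where each identifier keeps pointing at its own data object and a separate layer records how the objects relate, could be sketched along these lines. This is a hypothetical Python illustration only: the class and method names are invented, and the derivation chain uses the strain numbers from the message purely as sample strings.

```python
from typing import Dict, List, Set


class CrossReferenceService:
    """Secondary layer: stores relationships between identifiers without
    changing what any single identifier resolves to."""

    def __init__(self) -> None:
        self._derived_from: Dict[str, str] = {}   # child id -> parent id

    def add_derivation(self, parent: str, child: str) -> None:
        """Record that `child` was derived from (deposited from) `parent`."""
        if child in self.lineage(parent):
            raise ValueError("derivation would create a cycle")
        self._derived_from[child] = parent

    def lineage(self, ident: str) -> List[str]:
        """Chain from the earliest known ancestor down to `ident`."""
        chain = [ident]
        while chain[-1] in self._derived_from:
            chain.append(self._derived_from[chain[-1]])
        return list(reversed(chain))

    def related(self, ident: str) -> Set[str]:
        """Every other identifier that shares an earliest ancestor."""
        root = self.lineage(ident)[0]
        known = set(self._derived_from) | set(self._derived_from.values())
        return {i for i in known if self.lineage(i)[0] == root} - {ident}


service = CrossReferenceService()
service.add_derivation("BBA 77123", "CBS 123.88")
service.add_derivation("CBS 123.88", "ATCC 1234132")
```

Here `service.related("CBS 123.88")` answers the deduplication question (which records trace back to the same collection event), while the identifier itself still names exactly one strain in one collection, which is what the DNA-sequence case needs.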

Re: Globally Unique Identifier
by Richard Pyle 04 Oct '04

> I think this is the problem. No single record in our database has > this information, and to my knowledge, most if not all of the > physical specimen sheets referred to do not yet have a unique > catalogue number. Adding truly unique catalogue numbers physically to > specimens (as opposed to often non-unique batch accession numbers) > has often not been made and is now done either during digitization, > or it may in recent year also happen during loan processing. I guess my point was, if the GLOPP database does not contain enough information to uniquely identify a given voucher specimen (whether it is a catalog number, or some combination of other data), then the GLOPP data record can't really be considered "vouchered", can it? Maybe I am not thinking hard enough about this, but it seems to me that without the ability to re-locate a cited specimen with a fair degree of certainty, then the record would need to be considered "unvouchered". So, if enough information is available to pinpoint the specimen, then it *should* (and I emphasize this word only because I know there will undoubtedly be exceptions) be possible to associate the GUID with the voucher specimen at a later time. If there is not enough information to re-locate the specimen, then it seems to me that the connection with the specimen is broken, and the record becomes a stand-alone "unvouchered" biological instance. > However, most of the printed literature cites collection name (which > may be historic, if collections are merged), taxon name, plus one of: > - simply the information that it is a type (often expressed by > exclamation mark after collection acronym, indicating that a type has > been studied) > - for non-types 2-4 elements out of: collector, collection date, > collectors field number and location. 
Having collector plus > collectors field number is relatively good (although uniqueness is up > to the collector, some assign batch numbers for collection events), > but again in my experience it is relatively rarely cited. None of the > GLOPP records taken from literature cited a field number. The other > fields are normally sufficiently unique if you go into the collection > and see what is there, but is a terrible key to try any matching > against a data service - at least the location is usually comparable > by automatic string matching. O.K., I think I understand now, that a human could likely re-locate the specimen based on some assemblage of data, but that this assemblage of data would certainly require a human to establish the connection. I guess my feeling is this: We should always do our best to avoid the assignment of duplicate GUIDs to a single "biological instance" (of the physical kind), but we should also acknowledge that inadvertent duplication will be inevitable (as it may well be in the scenario you describe), and therefore build-in a system for accommodating "objective GUID synonymies". > As said, if you go into the collection, it is easy to identify them. > If you know that a fragment of the collection is completely digitized > - as opposed to random digitization which only digitizes specimens > recently loaned - you can manually identify it on the computer. I > would guess every specimen identification takes at most 5 minutes and > involves one or several queries and picking from the result list > manually. I imagine it will depend, on a case-by-case-basis, whether the cost of manual match-up of this sort at the time a GUID is to be assigned exceeds the cost of risking duplicate GUID assignment. > I believe most biologists will consider the specimen citations used > in print until now a voucher - I do not agree with the blank > statement that this is something else. It is usually unambiguous - > but not very good for machine processing. 
Contradictions? No contradictions from me. But the concept of "voucher" to me becomes more ambiguous as time goes by. I have thought a lot about how one might pool "evidence" from Museum collections, published data, in-situ images (still-ph otos and video), unpublished sighting reports, etc., and there is no easy answer that I have found. My conceptual approach has been to reduce all of these to reported instances of a particular organism at a particular place and time. This is what I have meant by "biological instance". Where they differ is merely in how well-documented they are (unpublished word-of-mouth, published word-of-mouth, film or electronic image, tissue sample, preserved organism, etc.), and to what extent they can be re-examined by later researchers (in my mind, the distinction between "vouchered" and "unvouchered"). Some uncollected sight records (e.g., rare plants in Hawaii) have a high degree of re-examination potential, whereas some specimens collected and preserved in a Museum, but misplaced, lost, or deteriorated, have a very low degree of re-examination potential. In-situ images, though they may have limited documentation (external appearance only, from one angle only in the case of still photos, limited in resolution at which the image was captured), have a very high degree of re-examination potential. The point (to me, at least) is that in all cases there was some physical "biological instance", and it is that entity to which I think the GUID should be assigned. If a published report cites a specimen preserved in a Museum, then the same GUID should be used for both the Museum specimen, and the published citation thereof. If a published report cites an organism that was never collected or preserved in a Museum, and the publication itself constitutes the only record of that biological instance, then a new GUID should be assigned to it. 
When a record of the latter sort is later discovered to be in reference to a specimen that exists in a museum that already has its own GUID, then those two GUIDs should be branded as "synonyms". > > My main point is > > that such "redundant" GUID issuance should be minimized (i.e., never > > done intentionally), and quickly/easily identified as such whenever it > > is discovered. > > Certainly not intentionally, but is should be clear that a museum > should not start to prohibit the use of laptops when a Ph.D. > candidate comes in and "digitizes" some specimens for a taxonomic > revision. If the museum system supports it, it is wise to ask to use > the museum system, but if the system is too complex and requires long > training, rather have the monography than nothing... Agreed. But I would think the student should provide a listing of all GUIDs assigned to specimens within a collection, including as much information as is necessary to uniquely identify each GUID-assigned specimen. Whether the collection manager ever uses that information or not is a different question, but I think a "culture of respect" for avoiding duplicate GUID assignment should be integral to the whole GUID process. > > So....if/when the situation does come up that (for example) GLOPP > > assigns GUIDs to vouchers on behalf of a non-digitized collection, and > > that collection later (inadvertently) re-assigns redundant GUIDs to > > the same set of specimens; that eventual discovery of this duplication > > should be accommodated by a mechanism for "retiring" one of the IDs > > into "objective synonomy" of the other ID, and automated systems > > should be implemented in the resolver service that "auto-forward" the > > retired ID to the active ID. > > I think you could rather view this as an optional deduplication > layer. I'm not sure I understand exactly what you mean by "optional" (i.e., at whose option), but I think it should be a fundamental component to any resolution service. 
> Your specification explicitly contradicts at least the LSID > specifiction to retrieve repeatedly exactly the same data. Yes, I know -- which is why I'm feeling less cozy about LSIDs. I think the crux of the issue centers on the question that Donald asked in one of his PowerPoint slides: Are these numbers assigned to the physical or "conceptual" (=non-electronic "virtual") objects, or are they assigned to the electronic/digital representations thereof? My feeling, from the point of view of a taxonomist who develops databases for natural history collections, is that the ultimate goal (i.e., seamless transmission and exchange of biodiversity-relevant data) will be better served if: 1) The ID's are assigned to the non-electronic (physical or conceptual) objects; 2) "Static" data associated with those objects be allowed to be changed (as errors are discovered and corrected) without altering the GUID (and that data history logs for these static data attributes be thought of as a secondary function of the data management, not affecting GUIDs); 3) "Dynamic" data associated with those objects (e.g., multiple taxonomic identifications of a specimen) should be handled "robustly" (i.e., not as "versions" of the complete set of data associated with a particular specimen) > So I prefer a view where GUIDs refer to data objects. I still do not > see, how you propose to attach them to the physical objects for those > researchers working in the collection itself. I'm not sure I understand the question. I guess I would answer with another question: How does a Social Security Number (SSN) for a U.S. Citizen (NINO in the UK, SIN in Canada, INSEE in France, TFN in Australia, etc. -- see http://encyclopedia.thefreedictionary.com/Social%20Security%20number) get attached to an individual person? I don't think anyone would think of a SSN as an identifier for a data object -- it is a unique identifier for the physical person. 
I believe that commonly used physical objects in biology (specimens, taxon names/concepts, references, agents, character definitions, etc.) should have equivalents of SSNs assigned to them, for the same reasons that U.S. Citizens have SSNs (i.e., to provide an unambiguously unique identifier useful for managing information associated with a physical person). Obviously, SSNs are not the perfect model for BioGUIDs (for a number of reasons), but the point is that they represent an ID attached to a physical entity.

> A secondary service can then know about relationships of multiple data objects referring to the same physical object. This service may be able to find cross-references in the data itself, have smart methods to estimate uniqueness based even on location strings, or may have manually created cross-reference tables.

This sounds to me like an unnecessary layer of complexity. But I remain open-minded on this issue.

> An important point is that different "deduplication" scenarios exist. For example, in culture collections, many strains are cross-preserved in multiple collections. So "CBS 123.88" may be "equal" to "ATCC 1234132" or "BBA 77123". Ideally we may even know the history: "BBA 77123" > "CBS 123.88" > "ATCC 1234132". However, the chance that any of these strains (which are like "versions") has been mixed up (or mutations occurred) is always there. Thus, if I look for duplication of the collection event data, I want to deduplicate. If I want to check a confusing DNA sequence, I may want to know about other derived strains, but I absolutely need to know exactly which strain from which collection was sequenced.

I'll have to think about this some more, but it comes back to the question of what "unit" a GUID is assigned to. This is not so much a problem for taxonomic objects; a little bit of a problem for Reference objects, and a potentially HUGE problem for specimen/"biological instance" objects.
Again this raises the question: How important is it to use the same GUID scheme for all of these different classes of bio-objects?

> > For the most part, though -- I see these as "growing pains" of a GUID system during its first years of existence. I would predict that two decades from now, if one were to do an analysis of redundant GUIDs, one would find the bulk of those having been issued relatively early on.
>
> I agree, but I probably think it is more relevant than you seem to think. I believe the "early days" will last the next 50 years - the time needed until collections are fully digitized *plus* the time it takes to make publication without citing GUIDs unacceptable.

I imagine that the vast majority of "publications" in science 50 years from now will be electronic. But I see your point -- even if it's only 20 years as I suggested, that's still a lot of headaches to deal with.

Aloha,
Rich

Richard L. Pyle, PhD
Natural Sciences Database Coordinator, Bishop Museum
1525 Bernice St., Honolulu, HI 96817
Ph: (808)848-4115, Fax: (808)847-8252
email: deepreef(a)bishopmuseum.org
http://www.bishopmuseum.org/bishop/HBS/pylerichard.html
