September 2005 - tdwg-tag - lists.tdwg.org

Re: GUIDs, LSIDs, and metadata
by Richard Pyle 11 Sep '05


Thanks, Kevin. I didn't realize that the LSID infrastructure was large compared with other GUID systems that have been suggested. Whenever I've been involved in discussions about GUIDs with people who understand the implications much better than I do, it always seems that the availability of open-source software tools is one of the reasons people tend to favor LSIDs.

My vision of the GUID itself would be the 64-bit integer, which could be wrapped into an LSID package, or used as a DOI number, or in some other GUID system. I also believe the resolution service should be mirrored (via robust and fast synchronization mechanisms) on hundreds or even thousands of servers around the world -- at least for the "data commons" (e.g., names, concepts, literature).

I FULLY agree that it is very important to clearly define what objects should be assigned TDWG-standard GUIDs. In my view, the two object domains most in need of GUIDs for the biological informatics community are taxonomic names and "documentation" instances (~= authored/dated references, publications, etc.), with taxon concepts represented by the intersection of these two domains. Unfortunately, *neither* of these objects has been clearly defined within our community. It would be nice if we could simply adopt an existing literature-based GUID system developed by some other community, but from what I have learned, none quite meets the particular needs of the taxonomic informatics community (hence the emerging TDWG Literature Subgroup).

The reasons I single these two out from other data domains are: 1) they are (or should be) central to virtually all taxonomic data domains; and 2) they are particularly "thorny" in terms of unambiguous natural keys and cross-dataset resolution.
Aloha,
Rich

Richard Pyle

> -----Original Message-----
> From: Taxonomic Databases Working Group GUID Project
> [mailto:TDWG-GUID@LISTSERV.NHM.KU.EDU] On Behalf Of Kevin Richards
> Sent: Sunday, September 11, 2005 9:55 AM
> To: TDWG-GUID(a)LISTSERV.NHM.KU.EDU
> Subject: Re: GUIDs, LSIDs, and metadata
>
> Good points.
> A few comments I have:
>
> I think LSIDs are assumed to solve all conflicts in the various
> datasets of taxonomic data. However, they are JUST resolvable
> IDs; anything else is infrastructure surrounding the LSID
> mechanisms. An LSID refers to a specific set of bytes that
> resides on some computer somewhere. The assumption that an LSID
> will refer to, for example, a global 'taxon concept' that all
> other taxon records should point to, is not correct. This relies
> on a system being in place that provides the functionality for
> this global repository.
>
> Also I feel one argument AGAINST LSIDs is that the initial
> investment in infrastructure is large, i.e. the development and
> setting up of authorities, etc. So I think this would lean
> people away from LSIDs, not towards them. The advantage with the
> LSID mechanism, I think, is that it is flexible enough to not
> rely on existing software and internet configuration.
>
> A GUID really needs to refer to a reasonably basic record, e.g. a
> name object rather than the entire taxon concept (although you
> could have a GUID for either). This allows these individual
> components to be referenced from other systems/datasets without
> having to refer to and accept the entire concept. It is probably
> a good idea to map out which sorts of taxonomic objects should get
> GUIDs and how they relate to other objects.
>
> Kevin Richards
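The idea in this message of a bare 64-bit integer that can be "wrapped into an LSID package, or used as a DOI number" can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration: the function names are invented, the authority `bioregistry.org` and namespace `TaxonNames` are borrowed from Rich's earlier proposal in this thread, and the DOI prefix `10.99999` is a made-up placeholder, not a registered prefix.

```python
import secrets


def new_bare_id() -> int:
    """Draw a random integer in the range [0, 2**64)."""
    return secrets.randbits(64)


def as_lsid(bare_id: int,
            authority: str = "bioregistry.org",   # assumed authority
            namespace: str = "TaxonNames") -> str:  # assumed data domain
    """Wrap the bare integer in the urn:lsid:authority:namespace:object form."""
    return f"urn:lsid:{authority}:{namespace}:{bare_id}"


def as_doi(bare_id: int, prefix: str = "10.99999") -> str:
    """Wrap the same integer as a DOI-style string (hypothetical prefix)."""
    return f"{prefix}/{bare_id}"


def bare_id_of(identifier: str) -> int:
    """Recover the bare integer from either wrapping.

    Relies on the integer being the last component after ':' or '/'.
    """
    tail = identifier.replace("/", ":").rsplit(":", 1)[-1]
    return int(tail)
```

Under these assumptions the wrapper is only packaging: `bare_id_of(as_lsid(n))` and `bare_id_of(as_doi(n))` both return `n`, so the same object could be reached through whichever resolution system is in use.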


GUIDs, LSIDs, and metadata
by Roderic Page 10 Sep '05


Not much is happening on the list side of things, so in the interest of sparking discussion here are a few thoughts.

1. GUIDs by themselves are trivial. We are awash in them (book ISBNs, GenBank accession numbers, etc.). Software developers generate them all the time for things like Windows components, Firefox extensions, web objects, etc. There are tools for making these, e.g. here's one: AAF813DE-21E0-11DA-A940-000D93425524.

2. The key is to link GUIDs to information, and for that information to be in a predictable form. For example, DOIs are widely used GUIDs, but when you resolve a DOI you have no idea what to expect. You might get a PDF or HTML view of a manuscript, or just an abstract, or a page asking for money to view a manuscript. The format of the response varies widely.

3. Of course, GUIDs ARE vital. The DiGIR protocol's biggest weakness, in my opinion, is that it fails to provide GUIDs. Although it does provide information in a standard form (Darwin Core), the user has no way of getting a GUID. I'd briefly toyed with an interim solution for a project I'm working on. A DiGIR GUID would be

   digir.fieldmuseum.org:80/digir/DiGIR.php:MammalsDwC2:158106

   which is the address of the DiGIR provider, the Resource name, and the specimen number (in this case, the specimen is FMNH 158106). This plan was scuppered by the fact that more than one specimen can have the same specimen code. For example, the Museum of Vertebrate Zoology has three specimens with the code MVZ 148946, corresponding to the taxa Chaetodipus baileyi baileyi, Calidris mauri, and Rana cascadae. A DiGIR request for specimen MVZ 148946 returns three totally different specimens!

4. I like LSIDs (despite the overhead of setting them up), but for me the main attraction is their use of metadata in RDF. This opens up a world of tools from the Semantic Web community, such as triple stores (databases for RDF). One can harvest metadata and store it in a "knowledge base." As this knowledge base grows we can uncover new facts. For example, NCBI doesn't know that Gliricidia ehrenbergii and Hybosema ehrenbergii are synonyms, whereas IPNI does. If these databases output RDF we can extract this information. If you have IBM's LaunchPad and Internet Explorer 6, or Firefox with my LSID extension, then this link (lsidres:urn:lsid:ipni.org.lsid.zoology.gla.ac.uk:Id:1108320-2) displays RDF for one of IPNI's records for Gliricidia ehrenbergii (readers without any of these tools can view the raw RDF at http://ipni.org.lsid.zoology.gla.ac.uk/authority/metadata?lsid=urn:lsid:ipni.org.lsid.zoology.gla.ac.uk:Id:1108320-2 ). This RDF has links to LSIDs for nomenclatural synonyms for this name, and if you follow those you encounter Hybosema ehrenbergii. Hence, armed with consistent metadata one can make inferences about names.

5. Another attraction of RDF is that it sidesteps the need for the huge, bloated XML schemas that seem to bedevil the field at the moment. RDF tends to be simple and flat, and there are a number of existing vocabularies we can draw on (e.g., http://www.w3.org/2003/01/geo/).

6. I must confess I regard taxonomic concepts as a potential black hole. I understand the arguments in favour, I just don't buy that this is a tractable problem. I also think it is largely going to be of historical interest as more and more data become linked to specimens and to things like DNA barcodes. The fact that reconciling even two taxonomic classifications can be a major undertaking does not bode well for this project. For some more general thoughts on this issue, see http://shirky.com/writings/ontology_overrated.html (a taxonomic classification is an ontology).

7. I think the first priority for assigning GUIDs is museum specimens. For taxon names (if not concepts) this is trivial, given that most name databases have their own, internally unique ids (but not all -- those databases that use names as primary keys, or which don't expose integer identifiers, will need to rethink their design).

Regards

Rod

Professor Roderic D. M. Page
Editor, Systematic Biology
DEEB, IBLS
Graham Kerr Building
University of Glasgow
Glasgow G12 8QP
United Kingdom

Phone: +44 141 330 4778
Fax: +44 141 330 2792
email: r.page(a)bio.gla.ac.uk
web: http://taxonomy.zoology.gla.ac.uk/rod/rod.html
reprints: http://taxonomy.zoology.gla.ac.uk/rod/pubs.html

Subscribe to Systematic Biology through the Society of Systematic Biologists Website: http://systematicbiology.org
Search for taxon names at http://darwin.zoology.gla.ac.uk/~rpage/portal/
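The DiGIR problem described in point 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `digir_guid` and `collisions` are hypothetical helper names, not part of DiGIR, and the records are limited to the specimen code and taxon names quoted in the message.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def digir_guid(provider: str, resource: str, specimen_number: str) -> str:
    """Join provider address, resource name, and specimen number with colons."""
    return ":".join([provider, resource, specimen_number])


# (specimen code, taxon) pairs for the MVZ example given in the message.
mvz_records: List[Tuple[str, str]] = [
    ("MVZ 148946", "Chaetodipus baileyi baileyi"),
    ("MVZ 148946", "Calidris mauri"),
    ("MVZ 148946", "Rana cascadae"),
]


def collisions(records: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Return every specimen code that is shared by more than one record."""
    taxa_by_code: Dict[str, List[str]] = defaultdict(list)
    for code, taxon in records:
        taxa_by_code[code].append(taxon)
    return {code: taxa for code, taxa in taxa_by_code.items() if len(taxa) > 1}
```

Calling `digir_guid("digir.fieldmuseum.org:80/digir/DiGIR.php", "MammalsDwC2", "158106")` reproduces the identifier shown in point 3, while `collisions(mvz_records)` reports one code mapped to three taxa, which is why a key built from the specimen code alone cannot serve as a GUID.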


Re: GUIDs, LSIDs, and metadata
by Richard Pyle 10 Sep '05


Lots of good discussion points on GUIDs -- thanks, Rod. I need to get two little people to two different soccer (football) games soon, so I have no time for an elaborate response. But I do want to comment on one point, which I have been thinking a great deal about lately:

> 7. I think the first priority for assigning GUIDs is museum specimens.
> For taxon names (if not concepts) this is trivial, given that most name
> databases have their own, internally unique ids (but not all -- those
> databases that use names as primary keys, or which don't expose integer
> identifiers will need to rethink their design).

I think it's critical that, whatever GUID system we establish for taxon names (and concepts), we do it in the context of the next several decades of informatic landscape; not just in the context of immediate needs or current political climate.

As you said at the start of your message, GUIDs by themselves are trivial. So the only real difference between establishing a system that is intuitive for the current needs and a system that will serve longer-term future needs is a little bit of careful forethought.

Official taxon name registration already exists for one of the major Codes of Nomenclature (Bacterial), and within the next fortnight we will see a public announcement of a plan for registration in another of the major Codes. I predict that all Codes of nomenclature will implement mandatory registration for all new names by about 2010, and for all "available" names (i.e., since Linnaeus) within five to ten years thereafter. So the medium-term future landscape in this case will be one in which all names are issued a GUID through their respective Commission of Nomenclature.

Further, it's not unreasonable to predict that sometime within the next few decades we will converge on a unified "BioCode" for all organism names, meaning that the longer-term landscape has a single set of taxon names. Wouldn't it be nice, after that time, if we didn't have to forever maintain legacy GUIDs? In other words, wouldn't it be nice if the established GUID system for all taxon names were the same *now*, at the outset, so it's a non-issue to combine them all as one set of GUIDs later on?

I'm not entirely sold on LSIDs, but it does seem that a lot of smart and knowledgeable people are leaning that way. My hesitation is mainly that one of the main reasons for leaning that way is that all sorts of software already exists for resolving them, so there is less overhead in initial implementation. As long as LSIDs meet long-term needs, that shouldn't be a problem. But 50 years from now, I'm not sure how wise it will seem that the universal GUID system adopted for biological data was influenced strongly by the available software of the time. Imagine being locked in now to a universal system that was designed based on software that was available in 1955!

But, not being able to predict which GUID system will be the best in the context of 2055, we really have no choice but to go with something that makes a lot of sense now (which is justifiable, in that the delicate transition from no universal GUIDs to widespread universal GUIDs will be best supported by keeping it as painless as possible in the context of that transition time).

But I still suggest we do things in a way that maximally keeps our options open. For example, in the context of LSIDs, consider different paradigms for registering the fish name, Mygenus myspecies Hyam (Hi, Roger! :-) )

One paradigm might have each major database create its own LSID:

urn:lsid:catalogoffishes.org:SPNO:123456
urn:lsid:gbif.org:ECAT:876543
urn:lsid:itis.gov:TSN:567890

But then we're burdened with the task of cross-mapping each of these, and also preserving the legacy IDs into perpetuity after we've eventually converged on a single taxon name GUID system.

I was going to illustrate several other paradigms, but soccer departure time approaches, so I'll cut to the chase. In the LSID paradigm, I would propose the following system:

urn:lsid:bioregistry.org:[Data Domain]:[randomly generated 64-bit integer]

The "bioregistry.org" part represents the decoupling of the GUID from the institution that initially created the GUID. It encompasses all domains of biological data (taxon names, concepts, specimens, etc.). It could be "tdwg.org" or "gbif.org", but we're not sure those organizations will be around 50 or 100 years from now. I imagine that GBIF would create and manage the bioregistry.org domain for the near-term.

The "Data Domain" represents a tag for the main domain of data (e.g. "Specimens", or "TaxonNames", or whatever the major information domains end up being).

The randomly generated 64-bit integer would be unique across all data domains, so that it, by itself, is unique within bioregistry.org (no time now to explain the rationale for this...)

Gotta run....more later.

Aloha,
Rich
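One way to picture the proposed scheme is a single registry that issues the integer and refuses to reuse it in any data domain. The Python below is a rough sketch under assumptions, not a description of any existing service: the `Registry` class, its method names, and the injectable `randbits` parameter are all invented for illustration; only the `urn:lsid:bioregistry.org:[Data Domain]:[integer]` layout comes from the message.

```python
import secrets
from typing import Callable, Dict


class Registry:
    """Hypothetical issuer of integers that are unique across all data domains."""

    def __init__(self,
                 authority: str = "bioregistry.org",
                 randbits: Callable[[int], int] = secrets.randbits) -> None:
        self.authority = authority
        self._randbits = randbits                 # source of random integers
        self._domain_by_id: Dict[int, str] = {}   # one table shared by every domain

    def mint(self, domain: str) -> str:
        """Draw 64-bit integers until one is unused anywhere, then register it."""
        while True:
            candidate = self._randbits(64)
            if candidate not in self._domain_by_id:
                break
        self._domain_by_id[candidate] = domain
        return f"urn:lsid:{self.authority}:{domain}:{candidate}"

    def domain_of(self, bare_id: int) -> str:
        """Look up the data domain from the integer alone."""
        return self._domain_by_id[bare_id]
```

Because the uniqueness check runs against one shared table, the integer by itself identifies the object, and the "Data Domain" tag in the LSID becomes a convenience rather than part of the key.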
