- tdwg-content - lists.tdwg.org

Re: Globally Unique Identifier
by Donald Hobern 23 Sep '04

This is precisely one of the key questions we need to address with any
identifier framework we adopt. I think we could easily use LSIDs in a
way that should overcome your concerns, and I think that the built-in
mechanisms for discovery and metadata access within the LSID model are
really exciting.

I have just put together a PowerPoint presentation to explain some of
what I think we could achieve with globally unique identifiers and
particularly with LSIDs. It can be downloaded from:

http://circa.gbif.net/Public/irc/gbif/dadi/library?l=/architecture/globallyuniqueidentifier/

It may be clearest if you go through it as a slide show rather than in
edit mode.

Thanks,

Donald

---------------------------------------------------------------
Donald Hobern (dhobern(a)gbif.org)
Programme Officer for Data Access and Database Interoperability
Global Biodiversity Information Facility Secretariat
Universitetsparken 15, DK-2100 Copenhagen, Denmark
Tel: +45-35321483   Mobile: +45-28751483   Fax: +45-35321480
---------------------------------------------------------------

-----Original Message-----
From: TDWG - Structure of Descriptive Data [mailto:TDWG-SDD@LISTSERV.NHM.KU.EDU] On Behalf Of Wouter Addink
Sent: 23. september 2004 17:38
To: TDWG-SDD(a)LISTSERV.NHM.KU.EDU
Subject: Re: Globally Unique Identifier

It seems that DOI allows for any existing IDs to be used as part of the
unique identifier. That seems to me a fast-to-adopt short-term solution,
but not a good idea for the long term. At first sight I very much liked
the LSID specification, but the longer I think about it, the less I like
some parts. What I think is missing in the LSID specification is that the
unique identifier should be 'meaningless' apart from being an identifier,
so that it becomes time independent (and avoids possible political
problems). Any solution with a URN I can think of has some meaning, which
makes solutions like a MAC-address generated GUID favorable in my opinion.
And any meaning you need (like an authority of an object) can be specified
in metadata instead of using it in the identifier. What is not very clear
to me in the LSID specification is where the LSID generated by an
LSIDAssigningService is actually stored.

Wouter Addink

----- Original Message -----
From: "Gregor Hagedorn" <G.Hagedorn(a)BBA.DE>
To: <TDWG-SDD(a)LISTSERV.NHM.KU.EDU>
Sent: Wednesday, September 08, 2004 6:20 PM
Subject: Re: Globally Unique Identifier

> I am not quite sure, but to me it seems with "GUID" you refer to the
> numeric, MAC-address generated GUID type. I have nothing against
> these. However, any URN in my view is a GUID that has most of the
> properties you mention:
>
>> - it is guaranteed to be unique globally, and can be created anywhere,
>> anytime by any server or client machine - it has no meaning as to
>> where the data is physically located and will therefore not confuse any
>> user about this
>
>> - most id
>> mechanisms, especially URI/URN ids require a 'governing body' to
>> handle namespaces/urls to ensure every URN is unique, whereas a GUID
>> is always unique
>
> The governing body is restricted to the primary web address, and in
> most cases such an address is already available. Being a member of a
> governmental institution that explicitly forbids the use without
> prior consent, and forbids the use of its domain name once you are no
> longer working for them, I realize some potential for problems.
>
>> I do think a URL of some kind would be useful for things such as
>> global searches of multiple databases, as this will allow the search
>> to go directly to the data source where the name, reference, etc comes
>> from. But this should not be part of its ID. Maybe a name/id should
>> have several forms, a GUID for an ID and a URL + a GUID for a fully
>> specified name.
>>
>> What are the current thoughts on these ideas?
>
> A GUID is only part of the problem. The other half of the problem is
> actually getting at the resource. URN schemes like DOI or LSID (I
> prefer the latter) intend to define resolution mechanisms. That makes
> the URN not yet a URL - in my view the good comes with the good,
> location and reorganization independence.
>
> I believe GBIF should install such an LSID resolver, which is why in
> the UBIF proxy model, under Links, I propose to support a general URL
> (including potentially URNs), a typed LSID and a typed DOI. This
> could be simplified to have just a URN (LSID and DOI are URNs), but
> that would then require string parsing to determine and recognize the
> preferred resolvable GUID types. Comments on splitting/not splitting
> this are welcome!
>
> There may be some need to define a non-resolvable URN/numeric GUID as
> well. However, that would not be under the linking question. Is it
> correct that linking requires resolvability, or am I thinking in the
> wrong direction?
>
> Gregor
>
> ----------------------------------------------------------
> Gregor Hagedorn (G.Hagedorn(a)bba.de)
> Institute for Plant Virology, Microbiology, and Biosafety
> Federal Research Center for Agriculture and Forestry (BBA)
> Koenigin-Luise-Str. 19           Tel: +49-30-8304-2220
> 14195 Berlin, Germany            Fax: +49-30-8304-2203
>
> Often wrong but never in doubt!

Re: Globally Unique Identifier
by Wouter Addink 23 Sep '04

It seems that DOI allows for any existing IDs to be used as part of the
unique identifier. That seems to me a fast-to-adopt short-term solution,
but not a good idea for the long term. At first sight I very much liked
the LSID specification, but the longer I think about it, the less I like
some parts. What I think is missing in the LSID specification is that the
unique identifier should be 'meaningless' apart from being an identifier,
so that it becomes time independent (and avoids possible political
problems). Any solution with a URN I can think of has some meaning, which
makes solutions like a MAC-address generated GUID favorable in my opinion.
And any meaning you need (like an authority of an object) can be specified
in metadata instead of using it in the identifier. What is not very clear
to me in the LSID specification is where the LSID generated by an
LSIDAssigningService is actually stored.

Wouter Addink

----- Original Message -----
From: "Gregor Hagedorn" <G.Hagedorn(a)BBA.DE>
To: <TDWG-SDD(a)LISTSERV.NHM.KU.EDU>
Sent: Wednesday, September 08, 2004 6:20 PM
Subject: Re: Globally Unique Identifier

> I am not quite sure, but to me it seems with "GUID" you refer to the
> numeric, MAC-address generated GUID type. I have nothing against
> these. However, any URN in my view is a GUID that has most of the
> properties you mention:
>
>> - it is guaranteed to be unique globally, and can be created anywhere,
>> anytime by any server or client machine - it has no meaning as to
>> where the data is physically located and will therefore not confuse any
>> user about this
>
>> - most id
>> mechanisms, especially URI/URN ids require a 'governing body' to
>> handle namespaces/urls to ensure every URN is unique, whereas a GUID
>> is always unique
>
> The governing body is restricted to the primary web address, and in
> most cases such an address is already available. Being a member of a
> governmental institution that explicitly forbids the use without
> prior consent, and forbids the use of its domain name once you are no
> longer working for them, I realize some potential for problems.
>
>> I do think a URL of some kind would be useful for things such as
>> global searches of multiple databases, as this will allow the search
>> to go directly to the data source where the name, reference, etc comes
>> from. But this should not be part of its ID. Maybe a name/id should
>> have several forms, a GUID for an ID and a URL + a GUID for a fully
>> specified name.
>>
>> What are the current thoughts on these ideas?
>
> A GUID is only part of the problem. The other half of the problem is
> actually getting at the resource. URN schemes like DOI or LSID (I
> prefer the latter) intend to define resolution mechanisms. That makes
> the URN not yet a URL - in my view the good comes with the good,
> location and reorganization independence.
>
> I believe GBIF should install such an LSID resolver, which is why in
> the UBIF proxy model, under Links, I propose to support a general URL
> (including potentially URNs), a typed LSID and a typed DOI. This
> could be simplified to have just a URN (LSID and DOI are URNs), but
> that would then require string parsing to determine and recognize the
> preferred resolvable GUID types. Comments on splitting/not splitting
> this are welcome!
>
> There may be some need to define a non-resolvable URN/numeric GUID as
> well. However, that would not be under the linking question. Is it
> correct that linking requires resolvability, or am I thinking in the
> wrong direction?
>
> Gregor
>
> ----------------------------------------------------------
> Gregor Hagedorn (G.Hagedorn(a)bba.de)
> Institute for Plant Virology, Microbiology, and Biosafety
> Federal Research Center for Agriculture and Forestry (BBA)
> Koenigin-Luise-Str. 19           Tel: +49-30-8304-2220
> 14195 Berlin, Germany            Fax: +49-30-8304-2203
>
> Often wrong but never in doubt!
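
For readers unfamiliar with the two kinds of "numeric" GUID contrasted in this message, the following is a minimal sketch using Python's standard `uuid` module. Python is chosen only for illustration; nothing in the thread prescribes a language. The node value `EXAMPLE_NODE` is a made-up 48-bit number standing in for a network card's MAC address. A version 1 UUID is built from a timestamp plus that node value, so strictly speaking it still carries a little information, whereas a version 4 UUID is built from random bits only.

```python
import uuid

# Hypothetical 48-bit node ID standing in for a real MAC address.
EXAMPLE_NODE = 0x001122334455

# Version 1: timestamp plus node ID ("MAC-address generated GUID").
mac_based = uuid.uuid1(node=EXAMPLE_NODE)

# Version 4: random bits only, so the value says nothing about who issued it.
random_based = uuid.uuid4()

print("v1:", mac_based, "node:", hex(mac_based.node))
print("v4:", random_based)
```

Either kind can be created anywhere without a governing body, which is the property argued for above; neither says anything about how the identified record is to be resolved.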

Re: Globally Unique Identifier
by Richard Pyle 23 Sep '04

> Mahalo for your informative discussion, Rich.
> A few questions. You're pretty active on this so maybe you can help me out.
> What about duplicate specimens? Although a specimen may be MO 1234, K 5678 and P AABB,
> they may in fact all be SMITH 10001 and duplicates of the exact same specimen, not
> different specimens. Is that one GUID or 3?

In my view, we would assign only ONE GUID, which represents the actual, physical specimen. That this one specimen has multiple catalog numbers assigned to it is simply additional information associated with that one specimen (in the same way that a specimen may have more than one taxonomic name applied to it, by different investigators at different times). This is part of the problem with using the "soft" GUID surrogate of [InstitutionCode]+[CollectionCode]+[CatalogNumber].

A simple solution would be to select one of these catalog numbers (e.g., SMITH 10001) as the "current" catalog number, and enter that in the appropriate DarwinCore (DwC) fields (either [CollectionCode]+[CatalogNumber] or [InstitutionCode]+[CatalogNumber], in this case). The MaNIS implementation of DwC included an "OtherCatalogNumbers" element, which would store the other numbers. I imagine two main problems:

1) Data for the single specimen may be represented more than once in an Aggregator, if different providers represent the "soft" GUID for the specimen with two different catalog numbers. For human-viewed search results, it would probably be evident soon by looking at the other data that the two records are the same. For statistical search results, the specimen would be counted more than once, which could cause errors in the numeric results of statistical queries.

2) If the record is only represented by one of its catalog numbers, then how is someone supposed to locate it by one of the other catalog numbers? One way is to include support for an "OtherCatalogNumbers" element, in such a way that it can be searched in addition to the "soft" GUID of [InstitutionCode]+[CollectionCode]+[CatalogNumber]. But that's a bit convoluted.

So, the real solution, in my mind, is to implement a "hard" GUID ("GlobalUniqueIdentifier": http://darwincore.calacademy.org/Documentation/DarwinCore2DraftHTML). That way, the specimen could be represented in four different Provider records, but easily combined into one by an Aggregator via the shared GUID.

> When attempting to use world-wide specimen records via GBIF for biodiversity counts
> and species analyses, these duplicates artificially inflate the counts significantly
> in some cases.

Yes -- that's what I meant by "statistical search results". Presumably, DiGIR Providers should only provide data on specimens that they currently hold. For instance, if BPBM 12345 was donated to Smithsonian, and now has the new catalog number USNM 987654, then Bishop Museum should not include the record in its DiGIR provider under its original catalog number (BPBM 12345). Bishop could either represent it with the current catalog number (USNM 987654), in which case an Aggregator could easily identify it as the same specimen, or Bishop should exclude it from its DiGIR provider altogether.

Of course, none of this is perfect -- there are likely to be all kinds of errors of this sort when institutions wholesale dump their electronic catalogs online in the form of DiGIR providers. But the same is true of "hard" GUIDs. What's to stop Bishop Museum from assigning one GUID to its record of BPBM 12345, and Smithsonian assigning another GUID to its record of USNM 987654? The correct answer is, "nothing, really" -- except to whatever extent the people in charge of assigning these GUIDs to specimens in their charge are careful to avoid making such duplications. But nobody is perfect -- which is why *any* GUID system is going to require some sort of integrated "inadvertent duplication index", to keep a permanent index of "objective" duplications (not to be confused with "subjective" record equivalencies, such as this taxonomic concept is equivalent to that taxonomic concept).

> What about triplicate names? IPNI is often given as the example for a set of name records.
> But, IPNI can have three records for the same exact name and reference--one from IK, APNI
> and Grey Cards. IPNI has no plans to ever deduplicate these records due to the nature of
> the creation of the IPNI collaboration. So, do the three duplicate records get three GUIDs?

Not intentionally -- no (at least not in my view). But I can very easily see how they would inadvertently be assigned different GUIDs -- hence the need to be able to seamlessly deal with objective duplicates when they are discovered.

> Where are the GUIDs actually to be perpetually located after they are assigned?

That's the crux of the question posed in Donald's PowerPoint file. My inclination is to pick a more centralized organization that seems likely to survive in the long run (GBIF seems to me to be a leading candidate; although for taxonomic names, I would still favor the respective nomenclatural Commissions).

> Are all the originating organizations supposed to modify their databases to add
> the GUID attribute and then build a mechanism to send out their records and then
> receive the GUID back from somewhere and finally update their records with it so
> the record+GUID can then in turn be published from their database onto the web?

I would like to think so, yes. Certainly all organizations that set up a DiGIR provider. If you follow the link (above) to the DwC2 draft, you'll see that the first element is "GlobalUniqueIdentifier", which is required in the current draft. A stop-gap solution is to concatenate a "soft" GUID in the form of:

URN:catalog:[InstitutionCode]:[CollectionCode]:[CatalogNumber]

...but personally, I see this only as a temporary solution. I'd rather see the bioinformatics community bite the bullet and commit to a "hard" GUID system.

> Couldn't agree more on the need for a single index/GUIDs to all references,
> but beyond that is needed the single database containing all the GUIDS plus
> the standard abbreviations and descriptions for them. Nobody has this database.
> There are subsets like BPH and TL2. But no single, definitive list of all
> references, online, in one place with GUIDs. This science needs that in the worst way.

I agree on all counts. Which is why I think someone (GBIF?) needs to build it. It won't suddenly materialize out of nothing -- it will have to be built over time. If you want to assign a GUID to a Taxon Name, you must first enter the citation details for its original description Reference in the Reference GUID issuer.

> If a concept is Name+Reference, then don't IPNI and Tropicos contain millions
> of concept records?

It depends on what you mean by "Concept". Note in my last email that I explicitly identified "Concepts" as a *subset* of Name+Reference instances. Who decides which Name+Reference instances are "concept-bearers" and which are not? Tough question -- but one that is being thought about by the SEEK folks. Similarly, who decides which Name+Reference instances are "Name-bearers"? That's easier to answer: the respective Code of Nomenclature.

Will there be millions of concept records? Well, given that there are millions of names, I imagine there will probably be tens of millions of Concepts to which those names have been, or will be, applied. There will, of course, be BILLIONS of Name+Reference instances. I say this with confidence because, in my view, every identification label of every specimen in the world could potentially be considered as a "Reference", and there are presumably billions of specimens out there.

But I'm not terribly concerned about such large numbers. As of this moment, there are 4,285,199,774 web pages indexed by Google, yet it can find what I'm looking for with AMAZING speed and efficiency -- and that's without any semantic context. What we're talking about here is highly structured data in a tightly controlled semantic context. Computers are exceedingly good at managing vast quantities of data very quickly -- and they're getting better and faster all the time. By the time we (the Bioinformatics community) get around to digitizing billions of specimens and Name+Reference instances, the hard drive on my laptop will be measured in Terabytes.

Aloha,
Rich

Richard L. Pyle, PhD
Natural Sciences Database Coordinator, Bishop Museum
1525 Bernice St., Honolulu, HI 96817
Ph: (808)848-4115, Fax: (808)847-8252
email: deepreef(a)bishopmuseum.org
http://www.bishopmuseum.org/bishop/HBS/pylerichard.html
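
The counting problem discussed in this message (one specimen, several catalog numbers, inflated totals) can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the record layout, the "Herbarium" and "Fish" collection codes, the `example-guid-*` values and the function names are all hypothetical, and the "inadvertent duplication index" is modelled as a plain dictionary from a duplicate GUID to the GUID it duplicates.

```python
def soft_guid(record):
    """Build the stop-gap identifier URN:catalog:[InstitutionCode]:[CollectionCode]:[CatalogNumber]."""
    return "URN:catalog:{InstitutionCode}:{CollectionCode}:{CatalogNumber}".format(**record)

def canonical(guid, duplicate_index):
    """Follow the duplication index until a GUID with no recorded replacement is reached."""
    seen = set()
    while guid in duplicate_index and guid not in seen:
        seen.add(guid)  # guards against an accidental cycle in the index
        guid = duplicate_index[guid]
    return guid

# Hypothetical provider records: the first two describe the same physical specimen,
# and the last two are the donated specimen that was given a second GUID by mistake.
records = [
    {"GlobalUniqueIdentifier": "example-guid-A", "InstitutionCode": "MO",
     "CollectionCode": "Herbarium", "CatalogNumber": "1234"},
    {"GlobalUniqueIdentifier": "example-guid-A", "InstitutionCode": "K",
     "CollectionCode": "Herbarium", "CatalogNumber": "5678"},
    {"GlobalUniqueIdentifier": "example-guid-B", "InstitutionCode": "BPBM",
     "CollectionCode": "Fish", "CatalogNumber": "12345"},
    {"GlobalUniqueIdentifier": "example-guid-C", "InstitutionCode": "USNM",
     "CollectionCode": "Fish", "CatalogNumber": "987654"},
]
duplicate_index = {"example-guid-C": "example-guid-B"}

by_soft = {soft_guid(r) for r in records}
by_hard = {r["GlobalUniqueIdentifier"] for r in records}
by_canonical = {canonical(g, duplicate_index) for g in by_hard}

print(len(by_soft), len(by_hard), len(by_canonical))
```

Counting by soft GUID treats every catalog number as a separate specimen, counting by hard GUID merges the records that share an identifier, and applying the duplication index also merges the pair that was inadvertently issued two GUIDs.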

Re: Globally Unique Identifier
by Richard Pyle 23 Sep '04

I want to start by wholeheartedly endorsing Wouter's plea for non-information-bearing (meaningless) GUIDs. This feature is CRITICAL to the long-term success of any GUID system. It is absolutely imperative that there NEVER be any motivation to change the content of a GUID (i.e., it should be permanent). If the GUID itself contains any information whatsoever, there may be motivation to change that information at a later time. For this reason, I had initially preferred the DOI approach, but over time, I am gradually warming up to the LSID approach.

While components of an LSID do, indeed, represent information, they represent the one piece of information that I think may legitimately belong embedded within a GUID: context. That is, the context, or domain, of the GUID itself. The context in this case would be the "issuer" of the GUID -- not necessarily the current "owner" of the GUID (see more discussion on this below). Though the organization that issued a GUID may eventually disappear, the fact that the organization was the one to issue the GUID in the first place will never change, and thus represents a permanent and unchanging component of the GUID. Without the context portion, the GUID itself is really nothing more than a random string of characters. In summary, I'm warming up to the LSID approach because it represents embedded context, without the risk of temptation to change the content of a GUID after it has been issued.

Regarding Donald's PPT file, I have a couple of comments and questions (this assumes the title slide is "Slide 1"):

Slide 2: You note there is "No reliable mechanism" to relate the same record from different providers to each other. But in the context of DarwinCore, the combination of [InstitutionCode]+[CollectionCode]+[CatalogNumber] should represent a virtual GUID (provided that the Global Provider Registry ensures no duplication of [InstitutionCode]). I do realize that words like "should" and "reliable" are critical here. Perhaps the DarwinCore implementation should enforce the requirement of uniqueness of [CollectionCode]+[CatalogNumber] within a single [InstitutionCode], and further ensure globally unique [InstitutionCode] values via the Global Provider Registry.

Slide 3: Wouldn't most of the problems indicated in the first four bulleted points be largely solved by the Global Provider Registry? Using the [InstitutionCode] would allow lookup in the registry for a (current/active) metadata URL, and the metadata URL would provide information on where to access a particular [CollectionCode]+[CatalogNumber] piece of data. The issue of specimens changing numbers and/or collections is problematic, of course. The issue of versioning is a bit dicey, in my mind (e.g., at what resolution of information change?). Some things, like changing taxonomic determinations (i.e., "real" changes), need to be handled in a robust way. Other things, like the correction of typos and different styles of representing the exact same information (e.g., R.L. Pile==>R.L. Pyle; or R.L. Pyle==>Pyle, R.L.), probably don't need to be versioned. Other sorts of changes (e.g., the elaboration of previously existing information, such as the addition of retroactively-generated georeference coordinates) fall somewhere in-between.

Slide 4: We should all get behind SEEK in addressing these issues (Taxon concept mapping). Ultimately, we minimally need a GUID pool for References (inclusive of unpublished works), and a GUID pool for what I call "Protonyms" (original creations of IC_N Code-compliant names). The union of these two GUIDs (what I would call "Assertions") would itself represent a GUID to a "potential concept" (Berendsohn). (Note: my preference would be to define Protonyms as a subtype of Assertions, and therefore Protonym GUIDs would be a subset drawn from the same pool as Assertion GUIDs -- but this is a technical discussion for another time.)

Slide 5: Nice summary!!

Slide 6: Good stuff here, but I'll respond with some of my personal opinions:

- RevisionID: see points of concern already expressed above.

- Specimen Record LSIDs: I gather from subsequent slides that you recognize two alternative approaches: having the "owner" of a specimen assign the LSID within the context of their own <domainName>, or adopting GBIF as the international standard issuer for ALL specimen GUIDs. In other words, GBIF would represent the centralized issuer of GUIDs for all biological specimens, and the biological specimen community would/should rally around GBIF for this purpose, and adopt GBIF specimen GUIDs as their own. I personally have no problem with this (I do not live in fear of "Big Brother" centralization when it serves the benefit of all, as I believe it would in this case) -- but I know there are many who might have a problem with it, and therefore it might not garner widespread adoption without large volumes of "fuss". If, on the other hand, each organization issues its own GUIDs for its own set of specimens, then the question is when, if ever, GBIF would assign a specimen GUID? Perhaps as a surrogate for institutions that lack the technological ability to assign their own LSIDs? But I wonder, how many institutions that could serve electronic data of their holdings to the internet would lack the ability to assign their own LSIDs? As you've outlined in subsequent slides, I see two alternative paths: A) Get the biological world to rally around GBIF as the centralized provider of GUIDs for specimens for all collections; or B) Have each collection/institution issue its own set of LSIDs for its own specimens, and have GBIF adopt those LSIDs for its own internal purposes. I could get behind either approach, but I see danger in the adoption of a mixture of these two approaches. I'll defer elaboration, but a lot of it has to do with potential confusion about whether the GUID applies fundamentally to the physical specimen, or the electronic conglomeration of data associated with the specimen. Also, I think we should avoid the risk of assigning two separate GUIDs for the same "single data element" (sensu your Slide 5).

- Name record LSIDs: I understand the example of an IPNI LSID for a plant name, and presumably there would be analogous "Catalog of Fishes" LSIDs for each fish name, etc. But I don't think that would be a wise approach. Unlike specimen records, where there are fairly unambiguous "owner" institutions (or at least "original owner" institutions that issued a GUID), taxonomic aggregators (IPNI, ITIS, Species2000, GBIF, uBio, etc.) are most certainly not owners of the taxonomic names that they include in their databases. We would want to avoid the risk of duplicate GUIDs for the same name, and thus the need for mapping, e.g., an IPNI GUID for a name to its ITIS equivalent. Again, I can't help but think that the world will be a better place if we can avoid assigning multiple GUIDs to the same "single data element". One approach would be to rally around GBIF, and rely on them to issue GUIDs for all taxon names. However, I also recognize that we do not exist in a political/personality vacuum with regards to "ownership" of taxonomic names, or the electronic representations thereof. Therefore, the closest thing that exists to an "owner" of a taxonomic name is the Commission of Nomenclature (and its respective Code of Nomenclature) under which the name was established. Thus, when it comes to assigning GUIDs for names (not concepts), I would propose the following:

urn:lsid:ICZN.org:TaxonName:XXXXXX (all zoological names)
urn:lsid:ICBN.org:TaxonName:XXXXXX (all botanical names)
urn:lsid:ICNB[or LBSN??].org:TaxonName:XXXXXX (all bacteriological names)
urn:lsid:ICTV[or ICVCN??].org:TaxonName:XXXXXX (all virus names)

In an ideal world, we'd get to the point where there would be a need for only one registrar of nomenclature, e.g.:

urn:lsid:BioCode.org:TaxonName:XXXXXXX

Or, perhaps:

urn:lsid:gbif.net:TaxonName:XXXXXXX

But I don't think we're quite there yet. In any case, the idea would be for the taxon name aggregators to adopt the unambiguously unique GUID for each taxon name. Taxonomic concepts are a whole 'nother ball of wax....

Slide 8: I actually prefer this approach (GBIF as the central issuer of specimen GUIDs), for a variety of reasons. One of the main reasons is that it would assure uniqueness of an integer within a given <namespace> (e.g., Specimens), which would make things a bit easier for those of us who like to use integers as primary keys in databases. In other words, it avoids the possibility of urn:lsid:bishopmuseum.org:Specimen:1234567 colliding with urn:lsid:usnm.gov:Specimen:1234567, when reducing the GUID to just its integer component for local application purposes (where context can be enforced by other means).

However, I should point something out regarding the "Advantage" part of this slide, which is that the "problem" of transferring record locations doesn't exist, provided that the <domainName> component of the LSID is taken as the issuer of the GUID, not as the current owner of the specimen. In other words, if Bishop Museum assigned GUID urn:lsid:bishopmuseum.org:Specimen:1234567 to a specimen, and then gave that specimen to Smithsonian, then Smithsonian would retain the complete GUID intact as: urn:lsid:bishopmuseum.org:Specimen:1234567.
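
The integer-collision point can be made concrete with a minimal Python sketch. It assumes the usual LSID layout urn:lsid:<authority>:<namespace>:<object>[:<revision>], with the authority playing the role of the <domainName> component discussed above; the names `LSID` and `parse_lsid` are hypothetical and are not taken from the LSID specification or from Donald's slides.

```python
from typing import NamedTuple, Optional

class LSID(NamedTuple):
    domain_name: str          # the issuer of the GUID, not necessarily the current owner
    namespace: str
    object_id: str
    revision: Optional[str]   # None when the LSID carries no revision component

def parse_lsid(text):
    """Split an LSID string into its components, rejecting anything that is not an LSID."""
    parts = text.split(":")
    if len(parts) not in (5, 6) or parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError("not an LSID: " + text)
    revision = parts[5] if len(parts) == 6 else None
    return LSID(parts[2], parts[3], parts[4], revision)

bishop = parse_lsid("urn:lsid:bishopmuseum.org:Specimen:1234567")
usnm = parse_lsid("urn:lsid:usnm.gov:Specimen:1234567")

# The full identifiers differ, but the integer components collide.
print(bishop != usnm, bishop.object_id == usnm.object_id)
```

Under a single central issuer the object component alone would be unique within a namespace; under distributed issuers a local database would have to key on the domain name and the object component together.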
The danger comes when you try to use the <domainName> component as metadata to represent the current location of the specimen and/or its electronically represented data. This is where Wouter's original point about 'meaningless' GUIDs comes into play. If the whole point of using LSIDs is to embed the "current location" information within the ID itself so that applications can retrieve additional data associated with the GUID directly, then I have some concerns (mostly addressed already). Why is there a reference to urn:lsid:gbif.net:TaxonConcept:106734 at the top of this slide???

Slide 9: Again, I'm not sure I understand why there is a reference to urn:lsid:ipni.org:TaxonName:82090-3:1.1 on this slide. Also, in this model, what function does the LSID serve that is not met by the concatenated [InstitutionCode]+[CollectionCode]+[CatalogNumber] (in the context of the Global Provider Registry)?

Slide 10 (taxon concepts and literature): This message is already getting too long... :-) I already touched on this above under "Slide 4". I definitely agree that we need a GUID system for References. This should include more than just published references. As far as I can tell, none of the existing Reference registrars yet accommodates the specific needs of taxonomists (e.g., referring to a subsection of a reference as representing an original taxonomic description), so I do see a need to create a Reference GUID system specific to biology. I could rant for pages on this, but I'll summarize simply with a plea to *DEFINE* a Concept GUID as an intersection between a Name GUID and a Reference GUID (i.e., what I would call an "Assertion"). Not all Name-Reference combinations will be worthy of recognition as a distinct "Concept", but all are *potentially* representative of a concept (Berendsohn), and thus all should be drawn from the same pool of GUIDs as Concept GUIDs. In other words, "Concepts" should be thought of as a subtype of Name-Reference instances.
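
One way to picture the Reference/Assertion model described here is the following Python sketch. It is an illustration only: the class and field names, the `ref-*` and `asr-*` identifiers, and the placeholder name "Aus bus" are all hypothetical, and flagging protonyms and concepts with booleans is just one possible way to express "subtype of Name-Reference instances".

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Reference:
    guid: str
    citation: str

@dataclass(frozen=True)
class Assertion:
    """A Name-Reference intersection: one name as it is used in one reference."""
    guid: str
    name: str
    reference_guid: str
    is_protonym: bool = False   # the Code-recognized original description of the name
    is_concept: bool = False    # judged worthy of recognition as a distinct concept

references = [
    Reference("ref-1", "hypothetical original description"),
    Reference("ref-2", "hypothetical later revision"),
    Reference("ref-3", "hypothetical specimen identification label"),
]
assertions = [
    Assertion("asr-1", "Aus bus", "ref-1", is_protonym=True, is_concept=True),
    Assertion("asr-2", "Aus bus", "ref-2", is_concept=True),
    Assertion("asr-3", "Aus bus", "ref-3"),
]

# Name GUIDs and Concept GUIDs are both drawn from the single pool of Assertion GUIDs.
name_guids = [a.guid for a in assertions if a.is_protonym]
concept_guids = [a.guid for a in assertions if a.is_concept]
print(name_guids, concept_guids)
```

In this sketch only two identifier pools exist, one for References and one for Assertions, and the same Assertion can serve both as the handle to a name and as a concept.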
I would go further to suggest (as I did above) that "Name" GUIDs should also be a subtype of Name-Reference instances (non-exclusive of Concept subtype instances), using the Name-Reference instance that represents the Code-recognized original description of the name as the "handle" to the Name. By this approach, you need only two GUID object classes <objectClass>: one for References, and one for Name-Reference intersections (Assertions). The latter of these could serve as the source for both Concept GUIDs and Name GUIDs. Last Slide: My own answers to your questions: 1) Are LSIDs the most appropriate technology? I'm increasingly coming to that conclusion. 2) Should identifiers be assigned and resolved centrally or via a fully distributed model (or should providers have the option of using either model)? I think the best option would be central. The next option would be full distributed. Leaving it as an option would, in my opinion, be a BIG mistake. 3) Which objects should receive identifiers? Specimens, References, Name-Reference intersections (Assertions), and perhaps Agents. [TaxonNames and Concepts can be subsets of Name-Reference intersections]. 3a) Should we develop a set of object classes for biodiversity informatics and assign identifiers to instances of all of these? I think so, yes. Of course, it depends a bit on who you mean by "we". I'm thinking sensu lato. 3b) Should identifiers be associated with real world objects (e.g. specimens), or with digitised records representing them (e.g. perhaps multiple records representing different digitisation attempts by different researchers for the same specimen), or both? I would say definitely real-world objects (treating things like Code-recognized original descriptions of taxon names, and citable references as "real-world objects"). I do NOT think we should have separate GUIDs for digital representations thereof. 
Alternative digital representations are simply clutter that will eventually be weeded out of the system, once we all get organized on this stuff, and harness the power of the internet to implement a global editing/QA system. 4) What should be done about existing records without identifiers? As far as I know, ALL records are currently without identifiers (unless someone established a widely accepted GUID system and I missed the announcement...) 4a) Should they be left alone? Ultimately, no. 4b) Should they all be updated with identifiers? Ultimately, yes. 4c) Should the provider software be modified to generate "soft" identifiers (ones which we cannot guarantee in all cases to be unique) based e.g. on the combination of InstitutionCode, CollectionCode and CatalogNumber? As an interim solution, perhaps. See my comments under "Slide 2" above. 5) Are revision identifiers a useful feature? I would like to think not. If the information is truly dynamic over time (e.g., re-determinations of taxonomic identity of specimens), then individual instances should probably receive their own set of GUIDs (as opposed to versions of the "parent" GUID). If the information is static over time, and changes represent objective corrections, then I don't see a real need to track that within the context of a GUID (record edit history may or may not need to be tracked, but this seems to me to be a separate issue from GUIDs). 5b) How many providers will be able to provide and handle them? If versioning is incorporated, then it should be designed such that a "default" version is provided automatically when versioning is not handled. Sorry for the long post, but I feel that this issue is extremely important at this point in bioinformatics history. Aloha, Rich Richard L. 
Pyle, PhD
Natural Sciences Database Coordinator, Bishop Museum
1525 Bernice St., Honolulu, HI 96817
Ph: (808)848-4115, Fax: (808)847-8252
email: deepreef(a)bishopmuseum.org
http://www.bishopmuseum.org/bishop/HBS/pylerichard.html
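
The "soft" identifier raised under question 4c can be illustrated with a short Python sketch. It only shows the idea of joining InstitutionCode, CollectionCode and CatalogNumber into one key; the function name, the colon separator and the sample values are assumptions made for illustration and are not taken from any provider software.

```python
def soft_identifier(institution_code: str, collection_code: str, catalog_number: str) -> str:
    """Join the three codes into one colon-separated key (hypothetical format)."""
    parts = [institution_code, collection_code, catalog_number]
    cleaned = [part.strip() for part in parts]
    if any(not part for part in cleaned):
        raise ValueError("all three codes are required")
    return ":".join(cleaned)


# Hypothetical sample values, not real collection codes.
first_record = soft_identifier("INST", "COLL", "12345")
second_record = soft_identifier("INST", "COLL", "12345 ")

print(first_record)                   # INST:COLL:12345
print(first_record == second_record)  # True
```

Two independently digitised records of the same specimen produce the same key here, which is the useful part. Nothing prevents two different specimens that happen to share a catalogue number within one collection from colliding as well, which is why such a key cannot be guaranteed unique.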


Re: Globally Unique Identifier
by Donald Hobern 16 Sep '04


Dear Gregor,

Like you, I really like the potential of LSIDs. They seem to offer exactly the characteristics we need (and everyone I have met who has looked closely at the alternatives seems to have come to the same conclusion). I want to hold a meeting as soon as possible to get input (and buy-in) from the widest possible community and to resolve any outstanding issues. Then I want to get going with this as a GBIF-supported model.

Thanks,

Donald

---------------------------------------------------------------
Donald Hobern (dhobern(a)gbif.org)
Programme Officer for Data Access and Database Interoperability
Global Biodiversity Information Facility Secretariat
Universitetsparken 15, DK-2100 Copenhagen, Denmark
Tel: +45-35321483 Mobile: +45-28751483 Fax: +45-35321480
---------------------------------------------------------------


Re: Globally Unique Identifier
by Gregor Hagedorn 08 Sep '04


I am not quite sure, but to me it seems with "GUID" you refer to the numeric, MAC-address generated GUID type. I have nothing against these. However, any URN in my view is a GUID that has most of the properties you mention:

> - it is guaranteed to be unique globally, and can be created anywhere,
>   anytime by any server or client machine
> - it has no meaning as to where the data is physically located and will
>   therefore not confuse any user about this

> - most id mechanisms, especially URI/URN ids require a 'governing body' to
>   handle namespaces/urls to ensure every URN is unique, whereas a GUID
>   is always unique

The governing body is restricted to the primary web address, and in most cases such an address is already available. Being a member of a governmental institution that explicitly forbids the use without prior consent, and forbids the use of its domain name once you are no longer working for them, I realize some potential for problems.

> I do think a URL of some kind would be useful for things such as
> global searches of multiple databases, as this will allow the search
> to go directly to the data source where the name, reference, etc comes
> from. But this should not be part of its ID. Maybe a name/id should
> have several forms, a GUID for an ID and a URL + a GUID for a fully
> specified name.
>
> What are the current thoughts on these ideas?

A GUID is only part of the problem. The other half of the problem is actually getting at the resource. URN schemes like DOI or LSID (I prefer the latter) intend to define resolution mechanisms. That makes the URN not yet a URL - in my view the good comes with the good, location and reorganization independence.

I believe GBIF should install such an LSID resolver, which is why in the UBIF proxy model, under Links, I propose to support a general URL (including potentially URNs), a typed LSID and a typed DOI. This could be simplified to have just a URN (LSID and DOI are URNs), but that would then require string parsing to determine and recognize the preferred resolvable GUID types. Comments on splitting/not splitting this are welcome!

There may be some need to define a non-resolvable URN/numeric GUID as well. However, that would not be under the linking question. Is it correct that linking requires resolvability, or am I thinking in the wrong direction?

Gregor

----------------------------------------------------------
Gregor Hagedorn (G.Hagedorn(a)bba.de)
Institute for Plant Virology, Microbiology, and Biosafety
Federal Research Center for Agriculture and Forestry (BBA)
Koenigin-Luise-Str. 19           Tel: +49-30-8304-2220
14195 Berlin, Germany            Fax: +49-30-8304-2203

Often wrong but never in doubt!
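
The string parsing mentioned above, needed if a single untyped link field has to carry LSIDs, DOIs and plain URLs, might look roughly like the following Python sketch. It assumes the LSID layout `urn:lsid:authority:namespace:object` with an optional trailing revision, and it assumes DOIs are written with a `doi:` prefix; the function name and the returned labels are hypothetical.

```python
import re

# LSID layout: urn:lsid:<authority>:<namespace>:<object>[:<revision>]
LSID_PATTERN = re.compile(
    r"^urn:lsid:(?P<authority>[^:]+):(?P<namespace>[^:]+):(?P<object>[^:]+)"
    r"(?::(?P<revision>[^:]+))?$",
    re.IGNORECASE,
)

# Any scheme followed by "://", e.g. http://...
URL_PATTERN = re.compile(r"^[a-z][a-z0-9+.-]*://", re.IGNORECASE)


def classify_link(value: str) -> str:
    """Return a coarse type label for an identifier string (labels are illustrative)."""
    text = value.strip()
    lowered = text.lower()
    if LSID_PATTERN.match(text):
        return "LSID"
    if lowered.startswith("doi:"):  # assumed textual form of a DOI
        return "DOI"
    if lowered.startswith("urn:"):
        return "URN"
    if URL_PATTERN.match(text):
        return "URL"
    return "unknown"


for sample in [
    "urn:lsid:example.org:names:12345",
    "doi:10.1000/182",
    "urn:isbn:0451450523",
    "http://example.org/names/12345",
]:
    print(sample, "->", classify_link(sample))
```

Keeping separate typed fields for LSID and DOI, as proposed in the message, avoids this kind of guessing on the consumer side; a single URN field pushes the same logic into every client.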


Re: Globally Unique Identifier
by Dave Vieglais 08 Sep '04


Hi Kevin,

In addition to other locations, there has been a lot of discussion about the use of GUIDs in the SEEK project (seek.ecoinformatics.org), which is also implementing a taxonomic name service and supporting the development of a Taxonomic Information Exchange Standard (TIES?). Some relevant discussion appears in the email archives of the taxon sub-group of the SEEK project, e.g.

http://www.ecoinformatics.org/pipermail/seek-taxon/2004-May/thread.html

and also in the wiki pages for that part of the project:

http://seek.ecoinformatics.org/Wiki.jsp?page=SEEKTaxonCommunity

Additional locations to look are the wiki pages for the taxonomic transfer schema:

http://www.soc.napier.ac.uk/tdwg/index.php

and pages within that site such as

http://www.soc.napier.ac.uk/tdwg/index.php?pagename=GUID

The LSID (http://www-124.ibm.com/developerworks/oss/lsid/) is currently favored for implementing GUIDs for services such as this.

regards,
Dave Vieglais


Globally Unique Identifier
by Kevin Richards 07 Sep '04


I am a software developer at Landcare Research in New Zealand, where we are developing a names taxonomic database and web service. We are at present deciding on a unique identifier that will identify a specific Name, Specimen, Vernacular, Reference etc, and would like to add my comments.

We currently favour a GUID identifier for several reasons:

- it is guaranteed to be unique globally, and can be created anywhere, anytime by any server or client machine
- it has no meaning as to where the data is physically located and will therefore not confuse any user about this
- it is not as usable by the user as ids such as integers, and a meaningful id can lead to confusion/errors
- most id mechanisms, especially URI/URN ids require a 'governing body' to handle namespaces/urls to ensure every URN is unique, whereas a GUID is always unique

I do think a URL of some kind would be useful for things such as global searches of multiple databases, as this will allow the search to go directly to the data source where the name, reference, etc comes from. But this should not be part of its ID. Maybe a name/id should have several forms, a GUID for an ID and a URL + a GUID for a fully specified name.

What are the current thoughts on these ideas?
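
As a rough illustration of the two forms suggested above (a bare GUID as the ID, and a URL plus GUID as the fully specified name), here is a minimal Python sketch using the standard `uuid` module. `uuid.uuid1()` produces a version 1 UUID built from a timestamp and a node value that is normally the machine's network address, which corresponds to the MAC-address generated GUID discussed in this thread. The class name, the example URL and the `#` separator are assumptions made for the sketch.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class NameRecord:
    """A record whose ID is a bare GUID; the source URL is kept separately."""
    guid: uuid.UUID
    source_url: str

    def fully_specified(self) -> str:
        # Hypothetical convention: data source URL, '#', then the GUID.
        return f"{self.source_url}#{self.guid}"


def new_name_record(source_url: str) -> NameRecord:
    # uuid1() needs no registry or governing body; any machine can call it.
    return NameRecord(guid=uuid.uuid1(), source_url=source_url)


# Hypothetical service address, used only for illustration.
record = new_name_record("http://names.example.org/service")
print(record.guid)
print(record.fully_specified())
```

If the data later moves to another server, only `source_url` changes; the GUID, and therefore the identity of the record, stays the same.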


New version / call for contributions
by Gregor Hagedorn 19 Aug '04


The new version of the Structured Descriptive Data schema (SDD 1.0 beta 2) is released and documented (... partially, primer and some manual documents still need updating) on:

http://160.45.63.11/Projects/TDWG-SDD/Minutes/2004NZ_schema/DocuOverview.html

We hope to get some feedback before the TDWG meeting in New Zealand in autumn. Please feel free to use the WIKIs (linked from the page above) at your convenience to start discussions - no need to ask permission for new topics.

It would perhaps be useful if as many people as possible could look at the minimal example file for coded descriptions:

http://160.45.63.11/Projects/TDWG-SDD/Minutes/2004NZ_schema/SDD-Test-Min1.xml

Aside: For a long time I was very reluctant to ignore the depth of the topic in favor of creating something that would only replace basic DELTA/Lucid/Nexus but not provide for at least part of the requirements that have already been identified by DeltaAccess and the DELTA 2 proposal. However, I now believe that the proposed structures are able to deal with these requirements. This to me means that the basic structures are hopefully ok, and I see no problems in accepting proposals to leave out some parts of the schema in version 1.0, with a perspective to add them later on - perhaps after more experience has been gained with the schema. We would need proposals and discussions to guide us in what should have 1.0 priority and what not, however.

Thanks!

Gregor

----------------------------------------------------------
Gregor Hagedorn (G.Hagedorn(a)bba.de)
Institute for Plant Virology, Microbiology, and Biosafety
Federal Research Center for Agriculture and Forestry (BBA)
Koenigin-Luise-Str. 19           Tel: +49-30-8304-2220
14195 Berlin, Germany            Fax: +49-30-8304-2203

Often wrong but never in doubt!


Re: SEEK Project and TDWG-SDD
by Donald Hobern 16 Apr '04


After reading this thread, I'd like to return to the key issue raised by Jim in his initial post:

> Bryan described to us how your schema is intended to include all of the
> data and metadata associated with a diagnostic character set for a group
> of taxa. My understanding is thus incomplete and based only on our
> single discussion, but it struck me that TDWG-SDD has an opportunity to
> have much broader acceptance and support if your schema was not designed
> as a single data object--to contain both the metadata about the package
> (or work or whatever you refer to it as) *and* the descriptive data that
> describe the individual concepts.
>
> If the taxa/concepts had their own schemas and were linked to the
> package metadata with a GUID, maybe a DOI or some other globally unique
> identifier, then the XML concept data sets could be used for other
> systems like concept based classification or database management
> systems. This would in theory, and in my view, give the work of your
> group much more leverage, exposure and relevance to a broader group of
> scientists and users of names and concepts.

I know that the SDD team have already been considering these issues (in conjunction with the ABCD team and others), but I would like to make the following points reflecting my priorities here in GBIF.

1. It will certainly be beneficial to model the data elements from SDD in such a way that they can be reused in other documents. I would think that this should ideally allow at least for the Terminology, Entities, Descriptions and Keys elements to form independent schema elements which could be used in other schemas. We do not want the use of these elements to be restricted just because they have been tightly bound to a specific document schema. The intention with the ABCD schema was in part to start the development of a library of reusable XML biodiversity data types (not just a fixed document structure). I see much of the current work of TDWG (and of GBIF) to be developing just such a library, from which we can seek to compose a wide variety of top level document structures (specimen/observation data, character tables, diagnostic keys, taxonomic revisions, markup of legacy taxonomic literature like the Biologia Centrali-Americana).

2. On the other hand, it does seem sensible to provide for the metadata to be transferred with each data set so that ownership information, usage restrictions, known limitations, etc. are not lost. I would therefore like to put effort into adopting or developing a top-level document envelope suitable for all classes of biodiversity data exchange. This should include information on origin and ownership, data transformation history, taxonomic, geographic and temporal coverage (as appropriate) and any metadata necessary to allow processors to identify the schema(s) in use for the actual data within the document. The real content should be separable for re-use in other contexts, but such a metadata wrapper standard would bring us closer to automating the manipulation of a wide range of content.

I have in mind here something like the ABCD structure, with a top level DataSets wrapper containing a number of different DataSet objects, each of which is made up of a set of Units. In effect the Units element would be a container for data elements from SDD, ABCD, TDWG-Names, etc. The DataSet-level elements would provide a common metadata model for all of these documents.

I feel that we need a grand vision for how we will unify all of these different schemas into an overall information model. Consider the task of developing a full taxonomic revision for a group using XML documents. This would naturally include references to specimens underlying a concept (external references to ABCD documents?), character data (SDD), and nomenclatural data (TDWG-Names). The goal should be for a processor to be able to take such a document and treat it as an element within a comprehensive electronic library of biodiversity data. We will need registries of the locations of different documents (with some form of GUID for each document) and mechanisms for managing the cross-references (taxon names, catalog numbers for specimens, author names, character definitions, etc.). Such an infrastructure would allow us to populate our taxon concepts with all of the relevant information.

Donald

---------------------------------------------------------------
Donald Hobern (dhobern(a)gbif.org)
Programme Officer for Data Access and Database Interoperability
Global Biodiversity Information Facility Secretariat
Universitetsparken 15, DK-2100 Copenhagen, Denmark
Tel: +45-35321483 Mobile: +45-28751483 Fax: +45-35321480
---------------------------------------------------------------
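
The envelope idea described above (a DataSets wrapper, DataSet objects carrying common metadata, and a Units container for content from SDD, ABCD, TDWG-Names and so on) can be sketched in Python with the standard `xml.etree.ElementTree` module. Only the element names DataSets, DataSet and Units come from the message; `Metadata`, `Owner`, `ContentSchema`, `Description` and the sample values are hypothetical placeholders, not part of any published schema.

```python
import xml.etree.ElementTree as ET


def build_envelope(datasets):
    """Wrap already-built unit elements in a DataSets/DataSet/Units envelope."""
    root = ET.Element("DataSets")
    for dataset in datasets:
        dataset_el = ET.SubElement(root, "DataSet")
        # Common metadata block (element names here are assumptions).
        metadata_el = ET.SubElement(dataset_el, "Metadata")
        ET.SubElement(metadata_el, "Owner").text = dataset["owner"]
        ET.SubElement(metadata_el, "ContentSchema").text = dataset["schema"]
        # Units holds the payload unchanged, so it stays separable for re-use.
        units_el = ET.SubElement(dataset_el, "Units")
        for unit in dataset["units"]:
            units_el.append(unit)
    return root


# Placeholder payload standing in for real SDD content.
description = ET.Element("Description")
description.text = "placeholder for SDD content"

envelope = build_envelope(
    [{"owner": "Example Herbarium", "schema": "SDD", "units": [description]}]
)
print(ET.tostring(envelope, encoding="unicode"))
```

A processor that only understands the envelope can read `ContentSchema` to decide how to handle what is inside `Units`, without needing to know every payload schema in advance.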
