Re: [tdwg-content] most GUIDs/URIs for names/taxon stuff not ready for prime time

Tony.Rees＠csiro.au

7 Jan 2011

20:11

Dear all,

...

From where I sit (very much on the sideline of this debate, waiting to see what happens), the main trouble I see is that (1) anyone and his dog can mint yet another unique identifier for the same taxon name, leading to uncontrolled proliferation and never ending ID reconciliation issues, and (2) there are always some names not on any particular external "identifier assigning" list which therefore lack an identifier (however have a scientific name) just when you want one. No problem, just mint your own, however that feeds back into (1) again...

Peter DeVries

7 Jan

22:00

Hi Tony,

That is why I think everyone should get behind the Global Names Index (GNI), which is an EoL.org / GBIF.org project. It includes the names from ITIS et al.

The work I am doing with Dima is still experimental and in development, but it demonstrates how two independent databases can autogenerate a shared URI.

That idea, in itself, is interesting even if you don't like UUIDs etc. or the particular way the RDF is implemented now.

I find ITIS very valuable, but it has different IDs for the different names for what many would consider the same concept.

So if a given species name changes from *Aedes triseriatus* to *Ochlerotatus triseriatus*, a new ID is generated.

This is different from how NCBI does it, but ITIS has more names.

Also, NCBI does not tell you anything about what is or is not an instance of a given species.

Since I think ITIS and NCBI are useful resources, I link to them when I find an appropriate ID to match to. You can see this in my RDF.

I would encourage ITIS to continue, and to think about exposing at least some of the data as RDF using Cool URIs:

http://www.w3.org/TR/cooluris/
http://www.w3.org/Provider/Style/URI

(i.e. do the best you can :-)

There are LOD-compliant URIs for the NCBI IDs via bio2rdf and uniprot.

One of the major advantages of the Linked Open Data approach is that there does not have to be one central place for everything.

Data sets can be distributed and each group can focus on its core competencies.

Even things like species concepts could be distributed, but I think it would be best to first get a common understanding of how they will work.

Or at least a couple different "kinds" of standard species concepts.

I see several kinds of species-like resources out there now: some are name-based (ITIS), others are more like concepts (NCBI). Some entail a particular classification (NCBI, CoL, etc.). Others coin a species concept to which various classifications are associated (TaxonConcept.org).

We are at the start of trying to untangle this mess, and a good place to start is one resource that contains all the name uses.

Besides, is there anyone else willing to take on the responsibility to collect and curate the 400 name variants that can exist for one species?

...

From this we can begin to connect those names to each other, as well as to related data sets like publications and occurrences.

I think it is good to have a diversity of projects even if there is some overlap. Each group adds some interesting ideas and perspective.

Respectfully,

- Pete

P.S. Another thing we need is a shared set of URIs for attribution so that they can be easily and efficiently incorporated and tracked.

e.g. dataprovidedBy <http://some.shared.org/providers#ITIS>

A simple URI rather than a huge glob of text and images for each little thing.

Perhaps using the VoID vocabulary: http://vocab.deri.ie/void/guide

On Fri, Jan 7, 2011 at 2:11 PM, <Tony.Rees@csiro.au> wrote:

...

Dear all,

From where I sit (very much on the sideline of this debate, waiting to see what happens), the main trouble I see is that (1) anyone and his dog can mint yet another unique identifier for the same taxon name, leading to uncontrolled proliferation and never ending ID reconciliation issues, and (2) there are always some names not on any particular external "identifier assigning" list which therefore lack an identifier (however have a scientific name) just when you want one. No problem, just mint your own, however that feeds back into (1) again...

Just curious - ITIS TSNs would have to be one of the longest established and promoted systems of "non-name" identifiers for taxon names - have they been successful in anyone's view, or if not, why not...

Any comments appreciated.

Regards - Tony

-- --------------------------------------------------------------- Pete DeVries Department of Entomology University of Wisconsin - Madison 445 Russell Laboratories 1630 Linden Drive Madison, WI 53706 TaxonConcept Knowledge Base <http://www.taxonconcept.org/> / GeoSpecies Knowledge Base <http://lod.geospecies.org/> About the GeoSpecies Knowledge Base <http://about.geospecies.org/> ------------------------------------------------------------
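Editorial sketch: Pete's remark that two independent databases can autogenerate a shared URI can be illustrated with name-based (version 5) UUIDs, which are a deterministic hash of a namespace plus a string. The namespace below (derived from the DNS name "globalnames.org") and the function names are assumptions made for illustration; the thread does not say which namespace or algorithm GNI actually uses. The URI pattern is the one quoted later in this thread.

```python
import uuid

# Assumption: a namespace derived from the DNS name "globalnames.org".
# The thread does not state the namespace GNI really uses.
ASSUMED_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "globalnames.org")


def name_string_uuid(name_string: str) -> uuid.UUID:
    """Return a deterministic version 5 UUID for an exact name string."""
    return uuid.uuid5(ASSUMED_NAMESPACE, name_string)


def name_string_uri(name_string: str) -> str:
    """Build a URI in the pattern quoted later in the thread (not verified to resolve)."""
    return f"http://gni.globalnames.org/name_strings/{name_string_uuid(name_string)}"


# Two "databases" that never talk to each other derive the same identifier
# from the same string...
database_a = name_string_uri("Aedes triseriatus")
database_b = name_string_uri("Aedes triseriatus")
print(database_a == database_b)

# ...but any difference in the string, including a genus change, gives an
# unrelated identifier.
print(name_string_uuid("Aedes triseriatus"))
print(name_string_uuid("Ochlerotatus triseriatus"))
```

The determinism is what makes the shared URI possible; the same property means that the identifier is only as stable as the exact spelling and formatting of the name string.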

Tony.Rees＠csiro.au

8 Jan

08:52

Dear Pete,

I don't think you have really addressed the question I was attempting to ask, so I will try again...

Let's take as an example the sperm whale, Physeter macrocephalus, syn. Physeter catodon (or vice versa in some sources, e.g. Mammal Species of the World, 3rd edition).

P. catodon Linnaeus 1758 is given the taxonomic serial number (TSN) 180489 in ITIS, while P. macrocephalus has the TSN 180488. These usages are then picked up by Cat. of Life; however, rather than re-using the ITIS TSNs, they are allocated the LSIDs ??? (for P. catodon) and urn:lsid:catalogueoflife.org:taxon:415df5cc-52c2-102c-b3cd-957176fb88b9:col20101221 for P. macrocephalus. (Also, my understanding is that these change every year with a new release of CoL.)

Meanwhile over at uBio, P. catodon has the LSID urn:lsid:ubio.org:namebank:105910 while P. macrocephalus has the LSID urn:lsid:ubio.org:namebank:111731.

Of course, both being Linnaean taxa, these also have ZooBank LSIDs, i.e. P. catodon is urn:lsid:zoobank.org:act:046FA756-3A20-454E-8351-12EDE16574B4 while P. macrocephalus is urn:lsid:zoobank.org:act:A2F39087-C7A1-476F-88F6-B7C7B61D86AB.

Meanwhile over in AFD we find the LSID urn:lsid:biodiversity.org.au:afd.taxon:587e6872-512b-402e-9c5e-f098c6495275 for P. macrocephalus (not sure about catodon); in ION, P. catodon has the LSID urn:lsid:organismnames.com:name:553123 while P. macrocephalus has urn:lsid:organismnames.com:name:553124; and so on we go (GenBank IDs, WoRMS IDs, Fauna Europaea IDs, etc. etc.).

Now if you look in GNI (which indexes namestrings) you will find the variants as follows:

Physeter catodon (Linnaeus, 1758)
Physeter catodon L.
Physeter Catodon Linnaeus 1758
Physeter catodon Linnaeus, 1758
Physeter catodon Linnaeus, 1766

(and similar for P. macrocephalus)

each of which no doubt also has its own unique ID somewhere behind the scenes as well, all presumably awaiting reconciliation.

My question is simply how this apparently unregulated minting of LSIDs and other unique identifiers is contributing to a solution rather than becoming a new problem requiring additional resources to reconcile (bearing in mind that we do not even have a reliable list of all named taxa at this time).

I am sure there is an answer somewhere, it's just that I cannot see it as yet :) - maybe someone will enlighten me however...

Regards - Tony
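Editorial sketch: the reconciliation burden Tony describes can be made concrete by laying out the identifiers from his sperm whale example as a small cross-reference table. The identifier values are copied from his message; the table layout and the function names are hypothetical, and `None` marks identifiers he did not supply (the CoL LSID for P. catodon, shown as "???", and the AFD entry he was unsure about).

```python
# Identifiers as quoted in Tony's message; None marks ones he did not give.
IDENTIFIERS = {
    "Physeter catodon": {
        "ITIS": "180489",
        "CoL": None,
        "uBio": "urn:lsid:ubio.org:namebank:105910",
        "ZooBank": "urn:lsid:zoobank.org:act:046FA756-3A20-454E-8351-12EDE16574B4",
        "AFD": None,
        "ION": "urn:lsid:organismnames.com:name:553123",
    },
    "Physeter macrocephalus": {
        "ITIS": "180488",
        "CoL": "urn:lsid:catalogueoflife.org:taxon:"
               "415df5cc-52c2-102c-b3cd-957176fb88b9:col20101221",
        "uBio": "urn:lsid:ubio.org:namebank:111731",
        "ZooBank": "urn:lsid:zoobank.org:act:A2F39087-C7A1-476F-88F6-B7C7B61D86AB",
        "AFD": "urn:lsid:biodiversity.org.au:afd.taxon:"
               "587e6872-512b-402e-9c5e-f098c6495275",
        "ION": "urn:lsid:organismnames.com:name:553124",
    },
}


def known_identifiers(name: str) -> dict:
    """Return only the identifiers that were actually given for a name."""
    return {src: ident for src, ident in IDENTIFIERS[name].items() if ident is not None}


def reverse_index() -> dict:
    """Map every known identifier back to (source, name) for reconciliation."""
    index = {}
    for name, ids in IDENTIFIERS.items():
        for source, ident in ids.items():
            if ident is not None:
                index[ident] = (source, name)
    return index


for name in IDENTIFIERS:
    print(name, "has", len(known_identifiers(name)), "known identifiers")
print("distinct identifiers to reconcile:", len(reverse_index()))
```

Nothing in the identifiers themselves links one source's entry to another's; the only join key in this table is the name string, which is the point of Tony's question.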

Peter DeVries

15:54

Hi Tony,

I see what you are getting at. I think this has happened for several reasons; below is a partial list:

1) Human nature

2) The way these projects are funded

3) Many of these projects specialize in, or have subsets of, all taxa. Neither NCBI nor ITIS has all the North American insects, not even some common North American mosquitoes. So you need to make your own IDs for these.

4) Data use restrictions, which force people to create and curate their own lists. It is much easier to reconcile between the TaxonConcept.org and EUNIS lists because they are both available as linked data with RDF dump files.

5) Too many different APIs and formats, including download formats. You should be able to go to one of these sites and get an unencumbered list of what names they have for North American mosquitoes etc.

6) For several of these projects it is not clear to me how one would add a species to their list, unlike Wikipedia / Wikispecies. It would seem to me that a simple contribute form page, which would allow a species to be added to a queue for possible incorporation, is strangely missing on most of these. I would not have needed to do what I have done if I could easily add a name to ITIS, get back an ID that stayed the same if the species was moved to a different genus, and be sent an email if a name I submitted changed or was somehow "flagged".

7) The fundamentally flawed co-mingling of the idea of a name being a unique stable species identifier and a phylogenetic hypothesis for that species: *Ochlerotatus triseriatus* = *Aedes triseriatus*, *Felis concolor* = *Puma concolor*, or your whale example.

8) Too many different standards on how a name should be formatted, and the fact that you need to know what "kind" of thing something is before you can format it correctly. Resistance by some to a solution for how to properly format a name when you don't know what kind of thing it is.

Respectfully,

- Pete
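Editorial sketch: the name-formatting problem in point 8, and the GNI name-string variants Tony listed, can be illustrated with a deliberately naive normalizer that keeps only the genus and the species epithet. The function name and the rule are hypothetical and are not how GNI or any real name parser works; the rule ignores subgenera, hybrids, infraspecific ranks and other cases a real parser has to handle.

```python
# The five name-string variants Tony quoted from GNI.
VARIANTS = [
    "Physeter catodon (Linnaeus, 1758)",
    "Physeter catodon L.",
    "Physeter Catodon Linnaeus 1758",
    "Physeter catodon Linnaeus, 1758",
    "Physeter catodon Linnaeus, 1766",
]


def naive_canonical(name_string: str) -> str:
    """Keep the first two words: capitalized genus plus lower-cased epithet.

    Illustrative only: authorship and year are simply discarded.
    """
    parts = name_string.split()
    if len(parts) < 2:
        return name_string.strip()
    genus = parts[0].capitalize()
    epithet = parts[1].lower()
    return f"{genus} {epithet}"


canonical_forms = {naive_canonical(v) for v in VARIANTS}
print(canonical_forms)
```

Collapsing the variants this way also discards the difference between "Linnaeus, 1758" and "Linnaeus, 1766", which may or may not be a genuine difference in name usage; deciding that still takes curation rather than string handling.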

Steve Baskauf

22:37

I think that both Pete and Tony bring up good but different points. Tony seems to be making a point about non-reuse of identifiers for taxon names, while Pete seems to be making points about the desirability of also having identifiers for concepts (which he might or might not intend to be the same thing as taxon name usages) and the need for identifiers that will work in the Linked Open Data world.

As far as Tony's point is concerned, I would like to have an answer to the very direct question that Tony raised. Why aren't ITIS TSNs, which are well-known and often-used unique identifiers within their system, being used as part of a Global Names Index GUID? Since they are locally unique within ITIS, they could easily be made globally unique by concatenating them to http://gni.globalnames.org/ . For example, if TSN=778049, the URI would be http://gni.globalnames.org/778049 . That would be short, simple, globally unique, and a form that could easily be used with content negotiation and therefore be fine for use in the LOD world. Databases that are already using ITIS TSNs could easily and reliably construct the URI. I suppose one objection to this might be that ITIS TSNs imply TNUs and not simply names, but to my knowledge there is only a single name/author combination per TSN and so it could be used as part of a name identifier. Am I correct about this?

I do not believe that the suggestion of autogenerating URIs by creating UUIDs from the name strings is a good idea. In the examples that Tony gave:

Physeter catodon (Linnaeus, 1758)
Physeter catodon L.
Physeter Catodon Linnaeus 1758
Physeter catodon Linnaeus, 1758
Physeter catodon Linnaeus, 1766

I'm assuming that they all (except maybe the last one) represent formatting/spelling differences of what is actually the same name. In an earlier email, one of Pete's rationales for why a URI should be used instead of an actual name string was so that the various misspellings and differently formatted versions of the same name would be represented by a single URI. Yet this system of generating the URI from a UUID through an algorithm based on the name string will allow someone who is writing software to generate the URI based on a misspelled name without requiring the software to check against a list of correct names/identifiers. How is that better than just allowing people to enter misspelled names as string literals??? The only thing it accomplishes is replacing something intelligible like "Junonia coenia" with something unintelligible like 3a70f04d-fd29-5570-ba91-52dae0c3d07f

I really don't understand this infatuation with UUIDs. I would have thought that after the LSID debacle, this community would have learned something about the negative effect on progress of promoting an unnecessarily complicated technical solution when one is not required to get the job done.

Imagine you are an herbarium curator who is somewhat behind the times technologically and a bit confused about GUIDs. This curator has an Excel spreadsheet with TSN IDs in it. Do we ask him to get one of his undergrad helpers to create a formula in an adjacent cell to the TSN that concatenates "http://gni.globalnames.org/tsn/" to the TSN to create the GUID that we tell him that he should be using? Or do we ask him to download some special software that isn't quite ready to be used that will install a UUID generator on his computer, and then spend an hour with him on the phone walking him through the installation, which won't work because he'll have to install LUNIX, a Java virtual machine, a MySQL database, and a proxy localhost http xyz server thingamajig that he has never heard of and doesn't understand but which seems so simple to the TDWG tech crew?

You can ask him to use

http://gni.globalnames.org/tsn/778049

or you can ask him to use

http://gni.globalnames.org/name_strings/3a70f04d-fd29-5570-ba91-52dae0c3d07f

Which one do you think is going to confuse him? Which one can he type? Which one can he cite in a paper and expect people to write down?

I complained about this in the draft of the Beginner's Guide to GUIDs, which relied heavily on UUIDs in its examples. Let's get real here. We should be designing systems with the users in mind, not the programmers.

Steve

-- Steven J. Baskauf, Ph.D., Senior Lecturer Vanderbilt University Dept. of Biological Sciences

postal mail address: VU Station B 351634 Nashville, TN 37235-1634, U.S.A.

delivery address: 2125 Stevenson Center 1161 21st Ave., S. Nashville, TN 37235

office: 2128 Stevenson Center phone: (615) 343-4582, fax: (615) 343-6707 http://bioimages.vanderbilt.edu

Tony.Rees＠csiro.au

9 Jan 9 Jan

00:52

New subject: most GUIDs/URIs for names/taxon stuff not ready for prime time

Steve, thanks for the reply / elaboration. I guess I am indeed talking about non-reuse of identifiers for taxon names, but at an even more basic level I am asking whether the "cure" (LSIDs) for the problem (non-unique or variably cited taxon names) is actually better or worse than the original problem. I brought up the ITIS TSNs as a worked example more than as a perfect model to follow, to see if there are lessons to be learned from the extent to which they have or have not been successful as a numeric alternative to taxon names in a data interchange format, and from their degree of re-use by subsequent players in the same area (apparently zero in the LSID world at least).

...

From my point of view "an" integrated taxonomic information system is a worthwhile goal. What we actually need to discuss is whether the present ITIS, CoL, uBio, ION, GNI, ZooBank etc. etc. have any prospect of reaching the degree of completion and responsiveness to new incoming data that such a vision requires, while eliminating the duplication of effort / proliferation of taxon IDs indicated in my previous post. At the same time we need to see whether there is in fact a real-world need in the area of data interoperability that would use such a system of alternative identifiers if it existed. To my mind, a set of ever-expanding "standards" that are used only by their respective originators is not a set of standards at all; they are just local system identifiers and so should be relegated to that role (which is perfectly valid of course), not promoted as the way forward in solving biodiversity informatics data interchange issues.

Regards - Tony

________________________________________ From: Steve Baskauf [steve.baskauf@vanderbilt.edu] Sent: Sunday, 9 January 2011 9:37 AM To: Rees, Tony (CMAR, Hobart) Cc: pete.devries@gmail.com; tdwg-content@lists.tdwg.org Subject: Re: [tdwg-content] most GUIDs/URIs for names/taxon stuff not ready for prime time

...


Peter DeVries

00:53

New subject: most GUIDs/URIs for names/taxon stuff not ready for prime time

I think people are reading too much into the globalnames RDF experiment.

You need to look at what the GNI is: a store of the names that have been used and where they are used.

Also, the ITIS numbers Steve mentions will clash with other numbers used by the other providers. ITIS consists of a relatively small set of names. See http://gni.globalnames.org/data_sources

* The TaxonConcept list number is a bit out of date and was mainly to add names I had that were not in the list.

The ITIS record entails much more than a simple string of characters; it has a classification and much more.

The reason to use UUID's for some things (maybe not this) is that they allow you to generate a globally unique identifier, and there is support for them built in to every modern operating system, even Windows. Just look at your registry.

I can make one on my Mac via the command line command *uuidgen*. There is probably something similar built in to Windows.

I have no formal role in the GNI; I was just asked to look into whether a triple/quadstore approach might help solve the name problem. But I do think that having all the names used in one place is a good idea.

Respectfully,

- Pete

On Sat, Jan 8, 2011 at 4:37 PM, Steve Baskauf <steve.baskauf@vanderbilt.edu> wrote:

...

I think that both Pete and Tony bring up good but different points. Tony seems to be making a point about non-reuse of identifiers for taxon names, while Pete seems to be making points about the desirability of also having identifiers for concepts (which he might or might not intend to be the same thing as taxon name usages) and the need for identifiers that will work in the Linked Open Data world. As far as Tony's point is concerned, I would like to have an answer to the very direct question that Tony raised. Why aren't ITIS TSNs, which are well-known and often-used unique identifiers within their system, being used as part of a Global Names Index GUID? Since they are locally unique within ITIS, they could easily be made globally unique by concatenating them to http://gni.globalnames.org/ . For example, if TSN=778049, the URI would be http://gni.globalnames.org/778049 . That would be short, simple, globally unique, and a form that could easily be used with content negotiation and therefore be fine for use in the LOD world. Databases that are already using ITIS TSNs could easily and reliably construct the URI. I suppose one objection to this might be that ITIS TSNs imply TNUs and not simply names, but to my knowledge there is only a single name/author combination per TSN and so it could be used as part of a name identifier. Am I correct about this?
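To make the concatenation idea concrete, here is a minimal sketch of how a database that already holds TSNs might build such a URI. The base URI and the TSN are the ones quoted in the paragraph above; the function name `tsn_to_uri` and the optional `source` path segment are assumptions made for illustration (a provider segment such as "tsn" is one way to keep numeric IDs from different providers apart), not a description of anything GNI actually serves.

```python
from typing import Optional, Union

# Base URI quoted in the message; whether GNI resolves such URIs is not assumed here.
BASE = "http://gni.globalnames.org/"


def tsn_to_uri(tsn: Union[int, str], source: Optional[str] = None) -> str:
    """Build a URI by appending a numeric TSN to the base URI.

    `source` is a hypothetical path segment naming the provider, so that
    the same number issued by two providers yields two different URIs.
    """
    text = str(tsn).strip()
    if not (text.isascii() and text.isdigit()):
        raise ValueError(f"not a numeric TSN: {tsn!r}")
    prefix = f"{source}/" if source else ""
    return f"{BASE}{prefix}{text}"


print(tsn_to_uri(778049))                  # http://gni.globalnames.org/778049
print(tsn_to_uri("778049", source="tsn"))  # http://gni.globalnames.org/tsn/778049
```

The same result is a one-cell string concatenation in a spreadsheet, which is the point being argued: no lookup service or extra software is needed to go from a stored TSN to the URI.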

I do not believe that the suggestion of autogenerating URIs by creating UUIDs from the name strings is a good idea. In the examples that Tony gave:

Physeter catodon (Linnaeus, 1758)
Physeter catodon L.
Physeter Catodon Linnaeus 1758
Physeter catodon Linnaeus, 1758
Physeter catodon Linnaeus, 1766

I'm assuming that they all (except maybe the last one) represent formatting/spelling differences of what is actually the same name. In an earlier email, one of Pete's rationales for why a URI should be used instead of an actual name string was so that the various misspellings and differently formatted versions of the same name would be represented by a single URI. Yet this system of generating the URI from a UUID through an algorithm based on the name string will allow someone who is writing software to generate the URI based on a misspelled name without requiring the software to check against a list of correct names/identifiers. How is that better than just allowing people to enter misspelled names as string literals??? The only thing it accomplishes is replacing something intelligible like "Junonia coenia" with something unintelligible like 3a70f04d-fd29-5570-ba91-52dae0c3d07f
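The behaviour described here can be shown with a small sketch using name-based (version 5) UUIDs from the Python standard library. The namespace below is a hypothetical one derived from "example.org" purely for illustration; it is not claimed to be the namespace or algorithm that GNI uses, and the misspelling "Junonia coenea" is invented for the example.

```python
import uuid

# Hypothetical namespace for illustration only; not claimed to be GNI's.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.org")


def name_uuid(name_string: str) -> uuid.UUID:
    """Derive a deterministic version 5 UUID from a name string."""
    return uuid.uuid5(NAMESPACE, name_string)


correct = name_uuid("Junonia coenia")
again = name_uuid("Junonia coenia")        # computed independently from the same string
misspelled = name_uuid("Junonia coenea")   # invented misspelling

print(correct == again)       # True: same string gives the same UUID
print(correct == misspelled)  # False: a misspelling silently gets its own UUID
```

Two independent databases do arrive at the same identifier without talking to each other, but only when their strings match character for character; nothing in the derivation checks the string against a list of known names.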

I really don't understand this infatuation with UUIDs. I would have thought that after the LSID debacle, this community would have learned something about the negative effect on progress of promoting an unnecessarily complicated technical solution when one is not required to get the job done.

Imagine you are an herbarium curator who is somewhat behind the times technologically and a bit confused about GUIDs. This curator has an Excel spreadsheet with TSN IDs in it. Do we ask him to get one of his undergrad helpers to create a formula in an adjacent cell to the TSN that concatenates "http://gni.globalnames.org/tsn/" to the TSN to create the GUID that we tell him that he should be using? Or do we ask him to download some special software that isn't quite ready to be used that will install a UUID generator on his computer and then spend an hour with him on the phone walking him through the installation which won't work because he'll have to install LUNIX, a Java virtual machine, a MySQL database, a proxy localhost http xyz server thingamajig that he has never heard of and doesn't understand but which seems so simple to the TDWG tech crew? You can ask him to use http://gni.globalnames.org/tsn/778049 or you can ask him to use

http://gni.globalnames.org/name_strings/3a70f04d-fd29-5570-ba91-52dae0c3d07f

Which one do you think is going to confuse him? Which one can he type? Which one can he cite in a paper and expect people to write down? I complained about this in the draft of the Beginner's Guide to GUIDs which relied heavily on UUIDs in its examples. Let's get real here. We should be designing systems with the users in mind, not the programmers.

Steve

Tony.Rees@csiro.au wrote:

...
Dear Pete,

I don't think you have really addressed the question I was attempting to ask - so I will try again...

Let's take as an example the sperm whale, Physeter macrocephalus, syn. Physeter catodon (or vice versa in some sources e.g. Mammal Species of the World 3rd edition).

P. catodon Linnaeus 1758 is given the taxonomic serial number (TSN) of 180489 in ITIS, while P. macrocephalus has the TSN 180488. These usages are then picked up by Cat. of Life, however rather than re-using the ITIS TSNs, they are allocated the LSIDs ??? (for P. catodon) and urn:lsid:catalogueoflife.org:taxon:415df5cc-52c2-102c-b3cd-957176fb88b9:col20101221 for P. macrocephalus. (Also my understanding is that these change every year with a new release of CoL). Meanwhile over at uBio P. catodon has the LSID urn:lsid:ubio.org:namebank:105910 while P. macrocephalus has the LSID urn:lsid:ubio.org:namebank:111731 . Of course being both Linnaean taxa, these also have ZooBank LSIDs i.e. P. catodon is urn:lsid:zoobank.org:act:046FA756-3A20-454E-8351-12EDE16574B4 while P. macrocephalus is urn:lsid:zoobank.org:act:A2F39087-C7A1-476F-88F6-B7C7B61D86AB . Meanwhile over in AFD we find the LSID urn:lsid:biodiversity.org.au:afd.taxon:587e6872-512b-402e-9c5e-f098c6495275 for P. macrocephalus (not sure about catodon); in ION P. catodon has the LSID urn:lsid:organismnames.com:name:553123 while P. macrocephalus has urn:lsid:organismnames.com:name:553124 and so on we go (GenBank IDs, WoRMS IDs, Fauna Europaea IDs, etc. etc.).
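As a rough illustration of what reconciling these involves, the sketch below splits the P. macrocephalus LSIDs quoted above into their parts, following the general LSID layout urn:lsid:authority:namespace:object with an optional revision. The `Lsid` type and `parse_lsid` helper are names made up for this example, and the parser is deliberately simple rather than a full implementation of the LSID specification.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(text: str) -> Lsid:
    """Split an LSID of the form urn:lsid:authority:namespace:object[:revision]."""
    parts = text.strip().split(":")
    if len(parts) not in (5, 6) or [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
        raise ValueError(f"not an LSID: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return Lsid(parts[2], parts[3], parts[4], revision)


# LSIDs quoted in the message for Physeter macrocephalus.
macrocephalus = [
    "urn:lsid:catalogueoflife.org:taxon:415df5cc-52c2-102c-b3cd-957176fb88b9:col20101221",
    "urn:lsid:ubio.org:namebank:111731",
    "urn:lsid:zoobank.org:act:A2F39087-C7A1-476F-88F6-B7C7B61D86AB",
    "urn:lsid:biodiversity.org.au:afd.taxon:587e6872-512b-402e-9c5e-f098c6495275",
    "urn:lsid:organismnames.com:name:553124",
]

parsed = [parse_lsid(text) for text in macrocephalus]
for lsid in parsed:
    print(lsid.authority, lsid.namespace, lsid.object_id, lsid.revision)
```

Parsing is the easy part: five authorities yield five unrelated object identifiers for one name, and nothing in the strings themselves says they refer to the same thing, so the mapping between them has to be asserted and maintained somewhere.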

Now if you look in GNI (which indexes namestrings) you will find the variants as follows:

Physeter catodon (Linnaeus, 1758)
Physeter catodon L.
Physeter Catodon Linnaeus 1758
Physeter catodon Linnaeus, 1758
Physeter catodon Linnaeus, 1766

(and similar for P. macrocephalus)

each of which no doubt also has its own unique ID somewhere behind the scenes as well, all presumably awaiting reconciliation.
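A small sketch of why this reconciliation is not purely mechanical: the code below groups the five variants listed above after a naive normalisation (lower-casing and stripping punctuation). The `normalize` function is an assumption made for illustration, not how GNI or any other index actually matches names.

```python
import re
from collections import defaultdict

# The five namestring variants listed in the message.
variants = [
    "Physeter catodon (Linnaeus, 1758)",
    "Physeter catodon L.",
    "Physeter Catodon Linnaeus 1758",
    "Physeter catodon Linnaeus, 1758",
    "Physeter catodon Linnaeus, 1766",
]


def normalize(name_string: str) -> str:
    """Lower-case and replace every run of non-alphanumeric characters with one space.

    Note: this ASCII-only rule would also strip accented letters, so it is
    only suitable for this illustration.
    """
    return re.sub(r"[^a-z0-9]+", " ", name_string.lower()).strip()


groups = defaultdict(list)
for variant in variants:
    groups[normalize(variant)].append(variant)

for key, members in groups.items():
    print(f"{key!r}: {len(members)} variant(s)")
```

Case and punctuation differences collapse, leaving three groups out of five strings. The abbreviation "L." and the differing year 1766 survive, and deciding whether those are the same name takes taxonomic knowledge rather than string handling.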

My question is simply how this apparently unregulated minting of LSIDs and other unique identifiers is contributing to a solution rather than becoming a new problem requiring additional resources to reconcile (bearing in mind that we do not even have a reliable list of all named taxa at this time).

I am sure there is an answer somewhere, it's just that I cannot see it as yet :) - maybe someone will enlighten me however...

Regards - Tony

________________________________ From: Peter DeVries [pete.devries@gmail.com] Sent: Saturday, 8 January 2011 9:00 AM To: Rees, Tony (CMAR, Hobart) Cc: jsachs@csee.umbc.edu; tdwg-content@lists.tdwg.org; pmurray@anbg.gov.au; pleary@mbl.edu; dpatterson@eol.org; dmozzherin; Nathan Wilson Subject: Re: [tdwg-content] most GUIDs/URIs for names/taxon stuff not ready for prime time

Hi Tony,

That is why I think everyone should get behind the GlobalName Index which is a EoL.org / GBIF.org project. It includes the names from ITIS et al.

The work I am doing with Dima is still experimental and in development, but it demonstrates how two independent databases can autogenerate a shared URI.

That idea, in itself, is interesting even if you don't like UUID's etc. or the particular way the RDF is implemented now.

I find ITIS very valuable, but it has different IDs for the different names for what many would consider the same concept.

So if a given species name changes from Aedes triseriatus to Ochlerotatus triseriatus a new ID is generated.

This is different than how NCBI does it, but ITIS has more names.

Also NCBI does not tell you anything about what is or is not an instance of a given species.

Since I think ITIS and NCBI are useful resources I link to them when I find an appropriate ID to match to. You can see this in my RDF.

I would encourage ITIS to continue and think about exposing at least some of the data as RDF using CoolURI's.

http://www.w3.org/TR/cooluris/

http://www.w3.org/Provider/Style/URI (i.e. do the best you can :-)

There are LOD compliant URI's for the NCBI ID's via bio2rdf and uniprot.

One of the major advantages of the Linked Open Data approach is that there does not have to be one central place for everything.

Data sets can be distributed and each group can focus on its core competencies.

Even things like species concepts could be distributed, but I think it would be best to first get a common understanding of how they will work.

Or at least a couple different "kinds" of standard species concepts.

I see several kinds of species-like resources out there now, some are name-based (ITIS), others are more like concepts (NCBI). Some entail a particular classification (NCBI, CoL, etc.). Others coin a species concept to which various classifications are associated (TaxonConcept.org)

We are at the start of trying to untangle this mess and a good place to start is one resource that contains all the name uses.

Besides, is there anyone else willing to take on the responsibility to collect and curate the 400 name variants that can exist for one species?

...

Roderic Page

10:24

New subject: most GUIDs/URIs for names/taxon stuff not ready for prime time

Why aren't identifiers reused?
----------------------------------------

Because in most cases they offer no added value. If I have an ITIS TSN there's not much I can do with it. I can get a name from ITIS (with some vague assurance that this name is accepted - or not - with no evidence for this assertion).

I can think of only two taxonomic identifiers that have real value and get much reuse, NCBI taxonomy ids and uBio NameBankIDs. NCBI ids are reused because they underpin the genomics databases, and genomics does real computational biology, and makes extensive reuse of data (as exemplified by the annual Nucleic Acids Research database issue). uBio NameBankIDs get reused because uBio has lots of names, and provides services for discovering those names in text (see, e.g., their use by BHL).

Few taxonomic name databases provide a compelling reason for anyone to use their identifiers, most being digital sinks (you go there, get an identifier for a name, and nothing else).

Why UUIDS?
----------------

UUIDs are ugly, and solve a problem that for the most part we don't have. They are ideal for minting globally unique identifiers in a distributed system, but we don't have distributed systems. Catalogue of Life uses UUIDs, but these are centrally created (I suspect using MySQL's UUID function, given how similar the UUIDs are to each other). ZooBank uses them, but it is not (yet) a distributed system. If the Catalogue of Life were genuinely a distributed system UUIDs would make sense, but that's not actually how it works.

I think users would cope with UUIDs if the databases using them provide clear value. For example, MusicBrainz uses UUIDs http://musicbrainz.org/artist/563201cb-721c-4cfb-acca-c1ba69e3d1fb.html , as does Mendeley, the latter hiding them from users via human-readable URLs.

Given that we have obvious user-friendly candidates for URLs (taxonomic names), it would be trivial to hide UUIDs in names (making homonyms distinct by adding authorship, or whatever it took to make them unique as strings).

What, if anything, is a taxonomic name?
----------------------------------------------------

In my experience, when non-taxonomists meet taxonomists things get ugly. For example, a publisher wanting to mark up taxonomic names in text might ask taxonomists how to do this, and within minutes the taxonomists are off into discussions of namestrings versus usages versus concepts and pretty soon the publisher deeply regrets ever asking the question. I've been at meetings where the look in publishers' eyes said "run away, run away".

Part of the reason we have multiple databases is because different projects are capturing different things (roughly speaking, uBio is mostly about namestrings, Catalogue of Life is about concepts, IPNI and ZooBank are about first usage of a name, etc.)

Most users outside our field won't give a damn about the niceties of these distinctions, yet we persist in discussing them ad nauseam. Until we provide a single, very simple service that takes a name string and hides all this complexity (unless the user chooses to see the gory details) while still providing useful information, we will be stuck in multiple identifier hell. The tragedy is we've never had more people genuinely interested in linking to names than at present -- publishers are desperately trying to add "semantic value" to their content, and we are spectacularly ill-equipped to deliver this (and it's our own fault).

I rather suspect we're rapidly approaching the point where users outside taxonomy will simply say "to hell with these taxonomists, let's just use Wikipedia and be done with it."

Regards

Rod

--------------------------------------------------------- Roderic Page Professor of Taxonomy Institute of Biodiversity, Animal Health and Comparative Medicine College of Medical, Veterinary and Life Sciences Graham Kerr Building University of Glasgow Glasgow G12 8QQ, UK

Email: r.page@bio.gla.ac.uk Tel: +44 141 330 4778 Fax: +44 141 330 2792 AIM: rodpage1962@aim.com Facebook: http://www.facebook.com/profile.php?id=1112517192 Twitter: http://twitter.com/rdmpage Blog: http://iphylo.blogspot.com Home page: http://taxonomy.zoology.gla.ac.uk/rod/rod.html

Steve Baskauf

12:17

New subject: most GUIDs/URIs for names/taxon stuff not ready for prime time

Well, Rod is pretty much describing me. I'm not a taxonomist and mostly find taxonomic discussions annoying. I pay attention to the byzantine taxonomy stuff only because I feel like I have to. I would like to use URIs that link to some source that takes care of the taxonomic stuff so that I don't have to provide the information myself. But I would like to only use URIs that are going to be in widespread use in the future.

To use an analogy, I'm shopping for a video player. I want to make sure that I don't end up buying a Betamax and then find out later that everybody else is using VHS (OK, that dates me - I don't want to buy HD-DVD and find out that everybody else is using Blu-ray). Is uBio a Betamax/HD-DVD? I was thinking that the Global Names Index looked like VHS (or Blu-ray), but when I look in the box I only see the instruction book - the actual video player seems to be missing. The conclusion that I'm reaching at this point is that it is too early for me to buy.

Am I wrong about this? What is the likelihood that this situation will change in the next year (i.e. that there will be usable URIs with minimal metadata for GNI names)?

Steve

...

