Re: [tdwg-content] ITIS TSNID to uBio NamebankIDs mapping
Hi Steve (and Dave),
[NB: After having composed the email below, just before sending it, I re-read your initial email more carefully and realized that you said you already had the ITIS TSNs, and were looking to add the NamebankIDs! Doh! Well, in case you (or anyone else) are interested in methods of matching names to get TSNs, I'll go ahead and send this anyway. But do note the comments below about the ITIS "versions" and ongoing overhaul of the vascular plant data in ITIS!!! -Dave]
I noticed this just before leaving work last week, and was out yesterday, but I wanted to chime in on this. I'm glad the uBio tools are meeting your needs (they do have some cool stuff!), but it should be noted that those tools are using a static snapshot of ITIS data from January 2009, and we have added about 50,000 additional scientific names, and updated tens of thousands of names beyond that (most of that in the last 6 months, as the frequency of loads dropped off in 2009-2010 due to technical issues).
I also want to note that ITIS is right in the middle of a full update of the vascular plant data in ITIS, and we're loading updated families on a monthly basis... and at long last we are tackling all the leftover issues from several bulk loads from USDA PLANTS data that left unreconciled bits of ITIS' older vascular plant data in various confusing states... so it is a VAST improvement that is underway.
There are several options for bouncing your names off the current version of ITIS.
One is to automate a matching process using the live ITIS data, based on the existing ITIS Web Services. I am CC'ing Alan Hampson, our IT fellow who built the Web Services ( http://www.itis.gov/web_service.html ), in case you'd like to follow up with him on that option. The advantage is that once you have a process in place it is completely self-serve and can always utilize the current ITIS data. If you have the resources to do this I think it would be greatly to your advantage to use this approach.
You can explore some ideas for client software to use the services at: http://www.itis.gov/ws_develop.html
And for more information on ITIS web services try http://www.itis.gov/ws_description.html http://www.itis.gov/ITISWebService.xml
The ability to flag multiply-matched names (as you noted) should probably be considered, so that appropriate manual steps can be taken. This solution will allow you to take advantage of subsequent updates to ITIS with a minimum of additional effort, and given that the plant data are in the middle of a major overhaul, this bears consideration!
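To make the point about multiply-matched names concrete, here is a minimal sketch in Python of how a matching run might sort names into three piles. It deliberately does not call the ITIS Web Services: the `lookup` argument is a hypothetical stand-in for whatever function you write against the real service, and the names and TSNs in the demo are made up.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def classify_matches(
    names: Iterable[str],
    lookup: Callable[[str], List[int]],
) -> Tuple[Dict[str, int], Dict[str, List[int]], List[str]]:
    """Split names into unique matches, multiple matches, and no match.

    `lookup` is a hypothetical callable returning every TSN found for a name.
    """
    unique: Dict[str, int] = {}
    multiple: Dict[str, List[int]] = {}
    unmatched: List[str] = []
    for name in names:
        tsns = sorted(set(lookup(name)))  # drop repeated hits for the same TSN
        if len(tsns) == 1:
            unique[name] = tsns[0]
        elif tsns:
            multiple[name] = tsns  # set aside for a manual decision
        else:
            unmatched.append(name)
    return unique, multiple, unmatched


# Stand-in for a real web-service call; the names and TSNs are invented.
_FAKE_INDEX = {
    "Aus bus": [101],
    "Aus cus": [102, 203],
}


def fake_lookup(name: str) -> List[int]:
    return _FAKE_INDEX.get(name, [])


unique, multiple, unmatched = classify_matches(
    ["Aus bus", "Aus cus", "Aus dus"], fake_lookup
)
print("unique:", unique)
print("needs review:", multiple)
print("unmatched:", unmatched)
```

Keeping the lookup separate from the bookkeeping means the same loop can be re-run unchanged against a later version of the data.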
Another possibility is to grab a full snapshot of the ITIS data, and load it into a database so you can do what you wish. The obvious drawback is that it goes out of date, as with the ITIS snapshot uBio is currently using. But it puts you in the driver's seat re what to do & getting new versions of ITIS. Some general information about the full exports is in the following page, although conspicuously absent is any mention of the MySQL version which (assuming you have the free MySQL properly installed & configured) can be loaded with just a few clicks or a few command lines (depending on your platform): http://www.itis.gov/ftp_download.html And the current ITIS data are all here for downloading: http://www.itis.gov/downloads/
A third option, which I note with some trepidation, is the old "Compare Nomenclature/Taxonomy" function on the ITIS site: http://www.itis.gov/taxmatch_ftp.html This is a VERY old function that we do plan on replacing (timeframe not yet certain), and it is vulnerable to timeouts, etc., which is why it notes to limit the number of names per pass. But with smaller chunks of names it does work quite well. The caveat is that I would make sure to choose the 4th option in Step 4, as it is at least aware (unlike the 3 other options) of multiply-matched name cases, and lists them separately at the bottom of the report. Just a bare listing of the scientific names, with the word "name" at the top, saved as plain text, is all that is needed for input.
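For anyone preparing input for that function, the following is a rough Python sketch of splitting a long list into smaller plain-text files, each a bare listing with the word "name" on the first line as described above. The chunk size and the file naming are assumptions for illustration; check the limit stated on the page itself.

```python
from pathlib import Path
from typing import List, Sequence


def write_name_chunks(
    names: Sequence[str],
    out_dir: Path,
    chunk_size: int = 1000,  # assumed value; use the limit the page states
) -> List[Path]:
    """Write names to plain-text files, each headed by the word 'name'."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    out_dir.mkdir(parents=True, exist_ok=True)
    paths: List[Path] = []
    for start in range(0, len(names), chunk_size):
        chunk = names[start:start + chunk_size]
        # File names like names_001.txt are a hypothetical convention.
        path = out_dir / f"names_{start // chunk_size + 1:03d}.txt"
        path.write_text("name\n" + "\n".join(chunk) + "\n", encoding="utf-8")
        paths.append(path)
    return paths
```

Calling `write_name_chunks(all_names, Path("itis_input"))` would leave one file per pass in the `itis_input` directory.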
A final option would be to ask someone at ITIS to handle the matching for you (leaving you to decide re the multiply-matched names). This might be simple from your end, but is suboptimal as it leaves you in the same position as you are now should you want or need to compare names again in the future (whether due to acquiring new names in your system, or wanting to check against a later updated version of ITIS), and it pulls someone here (probably me) off of the push to get more updates into ITIS. But in a pinch, I'm certainly willing to try to help you, should it come down to that! I would just ask that you seriously consider the web services option (in particular) or the others above first.
I hope this helps some. If you have already run all your matches against the old "ITIS" data via uBio then you might consider re-running (against the current ITIS data) at least the leftover names that you did not yet get matched. Let us know if you have questions (the itiswebmaster@itis.gov address goes to myself and Alan and several others, so that might be the best bet for a follow-up unless you have a question specifically for me).
Regards, Dave
David Nicolson Data Development Coordinator, Integrated Taxonomic Information System Biologist, USGS Core Science Systems, Biological Informatics Program nicolsod@si.edu Office 202-633-2149 Fax 202-786-2934 http://www.itis.gov/ http://www.cbif.gc.ca/itis/ "Nihil sumas necesse est..."
-----Original Message-----
Date: Fri, 20 May 2011 05:42:03 -0500
From: Steve Baskauf steve.baskauf@vanderbilt.edu
Subject: Re: [tdwg-content] ITIS TSNID to uBio NamebankIDs mapping
To: "David Remsen (GBIF)" dremsen@gbif.org
Cc: "tdwg-content@lists.tdwg.org" tdwg-content@lists.tdwg.org
Message-ID: 4DD6457B.2080204@vanderbilt.edu
Content-Type: text/plain; charset="iso-8859-1"
Thanks, all, for the responses. The "Compare to ITIS" function does just what I want. I did a test run of 1000 names and it worked like a charm. I will need to do a little massaging because sometimes two or more ITIS IDs come back for each uBio ID. But I can handle that. Steve
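The "massaging" step could look something like the Python sketch below, which takes (NamebankID, TSN) pairs from a matching run and adds a NamebankID column to a table keyed on TSN. The column names `tsn` and `namebank_id`, the `|` separator, and all of the sample values are assumptions, not anything the uBio tool is known to produce.

```python
import csv
import io
from collections import defaultdict
from typing import Iterable, Tuple


def add_namebank_column(
    table_csv: str,
    pairs: Iterable[Tuple[int, int]],
    tsn_col: str = "tsn",            # hypothetical column name
    new_col: str = "namebank_id",    # hypothetical column name
) -> str:
    """Append a column of NamebankIDs to CSV text keyed on an ITIS TSN column."""
    by_tsn = defaultdict(set)
    for namebank_id, tsn in pairs:
        by_tsn[str(tsn)].add(str(namebank_id))

    reader = csv.DictReader(io.StringIO(table_csv))
    fields = list(reader.fieldnames or []) + [new_col]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        ids = sorted(by_tsn.get(row[tsn_col], ()))
        # More than one ID in the cell means the row needs a manual look.
        row[new_col] = "|".join(ids)
        writer.writerow(row)
    return out.getvalue()


table = "name,tsn\nAus bus,101\nAus cus,102\nAus dus,103\n"
pairs = [(9001, 101), (9002, 102), (9003, 102)]  # made-up IDs
result = add_namebank_column(table, pairs)
print(result)
```

Rows with an empty cell had no match, and rows with a `|` in the cell are the ambiguous ones.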
David Remsen (GBIF) wrote:
Steve
Have you tried this? http://www.ubio.org/clients/ITIS/index.php
or this? http://www.ubio.org/services/mapper/index2.php
All this uBio talk makes me think we were on to something. Worth a thought about adopting the new standards and tools and making it really smooth.
DR
On 20 May 2011, at 04:46, Steve Baskauf wrote:
I have generated a csv spreadsheet of about 39 000 plant names for the U.S. which has the ITIS TSNIDs for the names in a column. I would like to have the uBio Namebank IDs in another column of the table. I have been looking them up on the uBio website by typing in the names as I need to know the IDs, but after doing about 300 of them, I'm getting tired of it. Does anybody have a clever idea of a way to get the other 38 000 Namebank IDs without looking them up? I'm sure that it would be possible to find this out because uBio gets names from ITIS. However, I haven't seen any clues about how to do it in an automated fashion. I'm guessing that there might be some way to use the uBio web services, but if so, it isn't obvious and I probably don't have the skills to carry it out anyway.
Any ideas? Steve
-- Steven J. Baskauf, Ph.D., Senior Lecturer Vanderbilt University Dept. of Biological Sciences
postal mail address: VU Station B 351634 Nashville, TN 37235-1634, U.S.A.
delivery address: 2125 Stevenson Center 1161 21st Ave., S. Nashville, TN 37235
office: 2128 Stevenson Center phone: (615) 343-4582, fax: (615) 343-6707 http://bioimages.vanderbilt.edu
tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content
I had actually written a response to this thread about a week ago in which I tried to clarify why I wanted to connect the ITIS and uBio identifiers. However, I decided that the email was too cynical and not helpful, so I erased it. However, I think that a couple of the points I had in that email probably should have been made, so I will try to state them again in a more constructive manner.
My reason for wanting to connect the uBio and ITIS identifiers really had nothing to do with making use of any of the tools or services that either group provides. Rather it has to do with my desire to follow the best practices for GUIDs as laid out in the TDWG GUID Applicability Statement (now an official standard). In particular, I have in mind Recommendations 2 and 8, which I paraphrase here as: "make HTTP URIs out of your identifiers" and "stop making up new identifiers when somebody else already has one for the thing you are talking about". I suppose Recommendation 10 should also be mentioned, which I paraphrase as "provide RDF/XML to users that want it".
I am actually using ITIS TSNs internally in my database. However, last time I checked there were no GUIDs based on TSNs that met the recommendations I've paraphrased above. (The ITIS website does mention "LSIDs" in the context of web services, but they don't follow either recommendation 2 or 10.) However outdated they are, uBio identifiers do actually meet recommendations 2 and 10 and that is why I wanted to use them (although the http proxied forms are unnecessarily ugly and long). So that explains in a nutshell the reason for my request. If ITIS would provide a simple http URI form of their TSNs which could resolve via content negotiation to either HTML or RDF/XML, it would be much easier for me to just use them.
OK, here is where I risk stepping on people's toes. So I'll try to stomp gently. I think that the area of taxon names is one where the TDWG community fails miserably at recommendation 8. I've lost count of the number of different kinds of identifiers that are available for referring to taxon names (this issue was discussed previously in the thread that starts with http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002231.html so I won't repeat it here). I don't know about (nor particularly care about) "turf" in this area, but I would challenge the community to get serious about recommendation 8 and come up with some consensus about a single, universal set of GUIDs for taxon names. Those identifiers should (in my opinion which stems from the GUID recommendations):
- be http URIs (rec 2)
- be based on an existing identifier (rec 8)
- return RDF/XML when a client requests it (rec 10)
- not change (rec 4)
I do not like proxied LSIDs (unnecessarily long with many useless characters) and I despise UUIDs (what is the point of creating a long, un-typeable string to replace a serial number that is already globally unique if appended to a domain name?). Why not just register something like "http://purl.org/tn/" (with "tn" representing "taxon name") and stick one of the existing serial numbers onto it? The domain name would be "turf-neutral" and anybody (GBIF, TDWG, or another organization) could manage the actual resolution through redirection from that domain. Somebody else could take over the management of the GUIDs if the first group got tired of it or ran out of money. The result would be a short and simple URI like "http://purl.org/tn/12345". What would be wrong with that? This is not rocket science and could be easily accomplished by a few tech-savvy people if the will were there.
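As a rough illustration of how little is involved on the client side, here is a Python sketch that builds such a URI from a serial number and prepares a request asking for RDF/XML through the Accept header. The "http://purl.org/tn/" base is only the hypothetical namespace suggested above, so the sketch builds the request without sending it.

```python
import urllib.request

# Hypothetical namespace from the proposal above; it may not exist or resolve.
BASE = "http://purl.org/tn/"


def taxon_name_uri(serial: int, base: str = BASE) -> str:
    """Append an existing serial number to the turf-neutral base URI."""
    if serial <= 0:
        raise ValueError("serial number must be a positive integer")
    return f"{base}{serial}"


def rdf_request(uri: str) -> urllib.request.Request:
    """Prepare (but do not send) a request asking for RDF/XML."""
    return urllib.request.Request(uri, headers={"Accept": "application/rdf+xml"})


req = rdf_request(taxon_name_uri(12345))
print(req.full_url)
print(req.get_header("Accept"))
```

A browser asking for text/html at the same URI would be redirected to a web page instead; that choice is made by the server, which is the part somebody would have to volunteer to run.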
Steve
Hi Steve et al.,
I agree. This was one of the reasons that I set up TaxonConcept the way I did. It attempts to connect both the LOD entities and the foreign key based entities.
For example:
The Raccoon http://lod.taxonconcept.org/ses/CTZ8z.html
Has links to many other URLs and URIs as well as the integer IDs for:
EoL NCBI ITIS BOLD
* For some of these it might be best to represent these as a one to many since there are often many names for each concept.
I have uBio ID's in GeoSpecies but I thought that would be eventually pulled in via the GNI.
I also have a small set of other foreign keys for things like the Hymenoptera name server, FishBase, Mushroom Observer and Tropicos.
Since these are specific to specific subsets of organisms, and came on later in my project I thought it would be best to use a separate RDF file to map to those.
For instance with Fishbase http://assets.taxonconcept.org/fb/index.rdf
Insects like this one http://lod.taxonconcept.org/ses/ICmLC.html also have the id for bugguide if it exists there and I have found it under the same name or a synonym.
Of the ~105,000 concepts I have about 47,000 with ITIS ID's. This may be useful for your plant list and I can send you a spreadsheet if that is easier.
Most of the plants also have the USDA Plants identifier. In fact you might be able to get the ITIS numbers via the USDA Plants Database.
I have come to realize that many other groups see a custom API as the solution to data access, but this requires understanding and debugging your code for each API.
Once the data is available in RDF it is one API for everything. Some issues like what to call each field can be overcome by simply rewriting (converting) the RDF.
This is easy as long as you have equivalent semantics in the meaning of the field.
For instance, it does not really matter if this name is represented as
<txn:hasScientificName>Procyon lotor</txn:hasScientificName> or <dwc:scientificName>Procyon lotor</dwc:scientificName>
The important thing to understand is that in my model this field does not include the authorship string.
This makes it easier to map this to other datasets and publications that don't include the authorship string.
<txn:scientificNameAuthorship>(Linnaeus 1758)</txn:scientificNameAuthorship>
* The scientificNameAuthorship should eventually be mapped to a publication or a list of probable publications. It is too ambiguous.
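The rewriting step mentioned above can be sketched in a few lines of Python using only the standard library. The dwc namespace is the Darwin Core terms namespace; the txn namespace URI and the subject URI below are placeholders chosen for the example, not the ones TaxonConcept actually serves.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DWC_NS = "http://rs.tdwg.org/dwc/terms/"
TXN_NS = "http://example.org/txn#"  # placeholder, not the real txn namespace

# Map source property names to the equivalent names in the target vocabulary.
RENAMES = {
    f"{{{TXN_NS}}}hasScientificName": f"{{{DWC_NS}}}scientificName",
}

ET.register_namespace("dwc", DWC_NS)  # keeps a readable prefix on output


def rename_properties(rdf_xml: str, renames=RENAMES) -> str:
    """Return RDF/XML with each mapped property element renamed."""
    root = ET.fromstring(rdf_xml)
    for elem in root.iter():
        if elem.tag in renames:
            elem.tag = renames[elem.tag]
    return ET.tostring(root, encoding="unicode")


SAMPLE = f"""<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:txn="{TXN_NS}">
  <rdf:Description rdf:about="http://example.org/concept/raccoon">
    <txn:hasScientificName>Procyon lotor</txn:hasScientificName>
  </rdf:Description>
</rdf:RDF>"""

converted = rename_properties(SAMPLE)
print(converted)
```

This only works because the two fields mean the same thing (a name without the authorship string); where the semantics differ, renaming the element would just hide the mismatch.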
There was a debate about <scientificName> earlier on the list which seemed to go back and forth.
I got tired of rewriting my examples each time and decided to use my own vocabulary that works in my example queries and has fields that map as closely to dwc as possible.
- Pete
On Tue, May 31, 2011 at 7:07 AM, Steve Baskauf <steve.baskauf@vanderbilt.edu> wrote:
I had actually written a response to this thread about a week ago in which I tried to clarify why I wanted to connect the ITIS and uBio identifiers. However, I decided that the email was too cynical and not helpful, so I erased it. However, I think that a couple of the points I had in that email probably should have been made, so I will try to state them again in a more constructive manner.
My reason for wanting to connect the uBio and ITIS identifiers really had nothing to do with making use of any of the tools or services that either group provides. Rather it has to do with my desire to follow the best practices for GUIDs as laid out in the TDWG GUID Applicability Statement (now an official standard). In particular, I have in mind Recommendations 2 and 8, which I paraphrase here as: "make HTTP URIs out of your identifiers" and "stop making up new identifiers when somebody else already has one for the thing you are talking about". I suppose Recommendation 10 should also be mentioned, which I paraphrase as "provide RDF/XML to users that want it".
I am actually using ITIS TSNs internally in my database. However, last time I checked there were no GUIDs based on TSNs that met the recommendations I've paraphrased above. (The ITIS website does mention "LSIDs" in the context of web services, but they don't follow either recommendation 2 or 10.) However outdated they are, uBio identifiers do actually meet recommendations 2 and 10 and that is why I wanted to use them (although the http proxied forms are unnecessarily ugly and long). So that explains in a nutshell the reason for my request. If ITIS would provide a simple http URI form of their TSNs which could resolve via content negotiation to either HTML or RDF/XML, it would be much easier for me to just use them.
OK, here is where I risk stepping on people's toes. So I'll try to stomp gently. I think that the area of taxon names is one where the TDWG community fails miserably at recommendation 8. I've lost count of the number of different kinds of identifiers that are available for referring to taxon names (this issue was discussed previously in the thread that starts with http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002231.html so I won't repeat it here). I don't know about (nor particularly care about) "turf" in this area, but I would challenge the community to get serious about recommendation 8 and come up with some consensus about a single, universal set of GUIDs for taxon names. Those identifiers should (in my opinion which stems from the GUID recommendations):
- be http URIs (rec 2)
- be based on an existing identifier (rec 8)
- return RDF/XML when a client requests it (rec 10)
- not change (rec 4)
I do not like proxied LSIDs (unnecessarily long with many useless characters) and I despise UUIDs (what is the point of creating a long, un-typeable string to replace a serial number that is already globally unique if appended to a domain name?). Why not just register something like "http://purl.org/tn/" (with "tn" representing "taxon name") and stick one of the existing serial numbers onto it? The domain name would be "turf-neutral" and anybody (GBIF, TDWG, or another organization) could manage the actual resolution through redirection from that domain. Somebody else could take over the management of the GUIDs if the first group got tired of it or ran out of money. The result would be a short and simple URI like "http://purl.org/tn/12345". What would be wrong with that? This is not rocket science and could be easily accomplished by a few tech-savvy people if the will were there.
Steve
Nicolson, David wrote:
Hi Steve (and Dave),
[NB: After having composed the email below, just before sending it, I re-read your initial email more carefully and realized that you said you already had the ITIS TSNs, and were looking to add the NamebankIDs! Doh! Well, in case you (or anyone else) is interested in methods of matching names to get TSNs, I'll go ahead and send this anyway. But do note the comments below about the ITIS "versions" and ongoing overhaul of the vascular plant data in ITIS!!! -Dave]
I noticed this just before leaving work last week, and was out yesterday, but I wanted to chime in on this. I'm glad the uBio tools are meeting your needs (they do have some cool stuff!), but it should be noted that those tools are using a static snapshot of ITIS data from January 2009, and we have added about 50,000 additional scientific names, and updated tens of thousands of names beyond that (most of that in the last 6 months, as the frequency of loads dropped off in 2009-2010 due to technical issues).
I also want to note that ITIS is right in the middle of a full update of the vascular plant data in ITIS, and we're loading updated families on a monthly basis... and at long last we are tackling all the leftover issues from several bulk loads from USDA PLANTS data that left unreconciled bits of ITIS' older vascular plant data in various confusing states... so it is a VAST improvement that is underway.
There are several options for bouncing your names off the current version of ITIS.
One is to automate a matching process using the live ITIS data, based on the existing ITIS Web Services. I am CC'ing Alan Hampson, our IT fellow who built the Web Services ( http://www.itis.gov/web_service.html ), in case you'd like to follow up with him on that option. The advantage is that once you have a process in place it is completely self-serve and can always utilize the current ITIS data. If you have the resources to do this I think it would be greatly to your advantage to use this approach.
You can explore some ideas for client software to use the services at: http://www.itis.gov/ws_develop.html
And for more information on ITIS web services try http://www.itis.gov/ws_description.htmlhttp://www.itis.gov/ITISWebService.xm...
The ability to flag multiply-matched names (as you noted) should probably be considered, so that appropriate manual steps can be taken. This solution will allow you to take advantage of subsequent updates to ITIS with a minimum of additional effort, and given that the plant data are in the middle of a major overhaul, this bears consideration!
Another possibility is to grab a full snapshot of the ITIS data, and load it into a database so you can do what you wish. The obvious drawback is that it goes out of date, as with the ITIS snapshot uBio is currently using. But it puts you in the driver's seat re what to do & getting new versions of ITIS. Some general information about the full exports is in the following page, although conspicuously absent is any mention of the MySQL version which (assuming you have the free MySQL properly installed & configured) can be loaded with just a few clicks or a few command lines (depending on your platform):http://www.itis.gov/ftp_download.html And the current ITIS data are all here for downloading:http://www.itis.gov/downloads/
A third option, which I note with some trepidation, is the old "Compare Nomenclature/Taxonomy" function on the ITIS site:http://www.itis.gov/taxmatch_ftp.html This is a VERY old function that we do plan on replacing (timeframe not yet certain), and it is vulnerable to timeouts, etc., which is why it notes to limit the number of names per pass. But with smaller chunks of names it does work quite well. The caveat is that I would make sure to choose the 4th option in Step 4, as it is at least aware (unlike the 3 other options) of multiply-matched name cases, and lists them separately at the bottom of the report. Just a bare listing of the scientific names, with the word "name" at the top, saved as plain text, is all that is needed for input.
A final option would be to ask someone at ITIS to handle the matching for you (leaving you to decide re the multiply-matched names). This might be simple from your end, but is suboptimal as it leaves you in the same position as you are now should you want or need to compare names again in the future (whether due to acquiring new names in your system, or wanting to check against a later updated version of ITIS), and it pulls someone here (probably me) off of the push to get more updates into ITIS. But in a pinch, I'm certainly willing to try to help you, should it come down to that! I would just ask that you seriously consider the web services option (in particular) or the others above first.
I hope this helps some. If you have already run all your matches against the old "ITIS" data via uBio then you might consider re-running (against the current ITIS data) at least the leftover names that you did not yet get matched. Let us know if you have questions (the itiswebmaster@itis.gov address goes to myself and Alan and several others, so that might be the best bet for a follow-up unless you have a question specifically for me).
Regards, Dave
David Nicolson Data Development Coordinator, Integrated Taxonomic Information System Biologist, USGS Core Science Systems, Biological Informatics Programnicolsod@si.edu Office 202-633-2149 Fax 202-786-2934http://www.itis.gov/http://www.cbif.gc.ca/itis/ "Nihil sumas necesse est..."
-----Original Message----- Date: Fri, 20 May 2011 05:42:03 -0500 From: Steve Baskauf steve.baskauf@vanderbilt.edu steve.baskauf@vanderbilt.edu Subject: Re: [tdwg-content] ITIS TSNID to uBio NamebankIDs mapping To: "David Remsen (GBIF)" dremsen@gbif.org dremsen@gbif.org Cc: "tdwg-content@lists.tdwg.org" tdwg-content@lists.tdwg.org tdwg-content@lists.tdwg.org tdwg-content@lists.tdwg.org Message-ID: 4DD6457B.2080204@vanderbilt.edu 4DD6457B.2080204@vanderbilt.edu Content-Type: text/plain; charset="iso-8859-1"
Thanks, all, for the responses. The "Compare to ITIS" function does just what I want. I did a test run of 1000 names and it worked like a charm. I will need to do a little massaging because sometimes two or more ITIS IDs come back for each uBio ID. But I can handle that. Steve
David Remsen (GBIF) wrote:
Steve
Have you tried this? http://www.ubio.org/clients/ITIS/index.php
or this? http://www.ubio.org/services/mapper/index2.php
All this uBio talk makes me think we were on to something. Worth a thought about adopting the new standards and tools and making it really smooth.
DR
On 20 May 2011, at 04:46, Steve Baskauf wrote:
I have generated a CSV spreadsheet of about 39 000 plant names for the U.S. which has the ITIS TSNIDs for the names in a column. I would like to have the uBio Namebank IDs in another column of the table. I have been looking them up on the uBio website by typing in the names as I need to know the IDs, but after doing about 300 of them, I'm getting tired of it. Does anybody have a clever idea of a way to get the other 38 000 Namebank IDs without looking them up? I'm sure that it would be possible to find this out because uBio gets names from ITIS. However, I haven't seen any clues about how to do it in an automated fashion. I'm guessing that there might be some way to use the uBio web services, but if so, it isn't obvious and I probably don't have the skills to carry it out anyway.
Any ideas? Steve
-- Steven J. Baskauf, Ph.D., Senior Lecturer Vanderbilt University Dept. of Biological Sciences
postal mail address: VU Station B 351634 Nashville, TN 37235-1634, U.S.A.
delivery address: 2125 Stevenson Center 1161 21st Ave., S. Nashville, TN 37235
office: 2128 Stevenson Center
phone: (615) 343-4582, fax: (615) 343-6707
http://bioimages.vanderbilt.edu
tdwg-content mailing list
tdwg-content@lists.tdwg.org
http://lists.tdwg.org/mailman/listinfo/tdwg-content
Dear Pete and Steve,
I cannot comment on the technical content of your emails (sorry, I'm a content guy!), but I do note this comment by Pete: "Most of the plants also have the USDA Plants identifier. In fact you might be able to get the ITIS numbers via the USDA Plants Database."
I would not recommend getting ITIS TSNs from any other source than ITIS (see my prior email for some how-to ideas).
Firstly, the PLANTS Symbol-to-TSN matches were not always well managed due to some technical issues (involving early changes to ITIS, some due to problematic bulk-updates of ITIS from the PLANTS data, or sometimes due to other artifacts). In MOST cases the TSNs they list will be fine, but a silent subset will not.
Secondly, as I noted, we are mid-stream in a full overhaul of the vascular plant data in ITIS, in almost every case using cooperatively-produced data sets that have also been made available to PLANTS. When/whether they use them to update that database is another question, but the ITIS updates are proceeding full-steam, with additional improvements where needed.
Finally, at least when dealing with non-static data sets, I feel it is just 'best practice' to get them from the source wherever feasible, rather than from other places.
Best, Dave
David Nicolson Data Development Coordinator, Integrated Taxonomic Information System Biologist, USGS Core Science Systems, Biological Informatics Program nicolsod@si.edu Office 202-633-2149 Fax 202-786-2934 http://www.itis.gov/ http://www.cbif.gc.ca/itis/ "Nihil sumas necesse est..."
From: Peter DeVries [mailto:pete.devries@gmail.com] Sent: Tuesday, May 31, 2011 2:48 PM To: Steve Baskauf Cc: Nicolson, David; tdwg-content@lists.tdwg.org; Gerald Guala; Orrell, Thomas; Alan J Hampson Subject: Re: [tdwg-content] ITIS TSNID to uBio NamebankIDs mapping
Hi Steve et al.,
I agree. This was one of the reasons that I set up TaxonConcept the way I did. It attempts to connect both the LOD entities and the foreign key based entities.
For example:
The Raccoon http://lod.taxonconcept.org/ses/CTZ8z.html
Has links to many other URL's and URI's as well as the integer id's for:
EoL
NCBI
ITIS
BOLD
* For some of these it might be best to represent these as a one to many since there are often many names for each concept.
I have uBio ID's in GeoSpecies but I thought that would be eventually pulled in via the GNI.
I also have a small set of other foreign keys for things like the Hymenoptera name server, FishBase, Mushroom Observer and Tropicos.
Since these are specific to specific subsets of organisms, and came on later in my project I thought it would be best to use a separate RDF file to map to those.
For instance with Fishbase http://assets.taxonconcept.org/fb/index.rdf
Insects like this one http://lod.taxonconcept.org/ses/ICmLC.html also have the id for bugguide if it exists there and I have found it under the same name or a synonym.
Of the ~105,000 concepts I have about 47,000 with ITIS ID's. This may be useful for your plant list and I can send you a spreadsheet if that is easier.
Most of the plants also have the USDA Plants identifier. In fact you might be able to get the ITIS numbers via the USDA Plants Database.
I have come to realize that many other groups see a custom API as the solution to data access, but this requires understanding and debugging your code for each API.
Once the data is available in RDF it is one API for everything. Some issues like what to call each field can be overcome by simply rewriting (converting) the RDF.
This is easy as long as you have equivalent semantics in the meaning of the field.
For instance, it does not really matter if this name is represented as
<txn:hasScientificName>Procyon lotor</txn:hasScientificName> or <dwc:scientificName>Procyon lotor</dwc:scientificName>
The important thing to understand is that in my model this field does not include the authorship string.
This makes it easier to map this to other datasets and publications that don't include the authorship string.
<txn:scientificNameAuthorship>(Linnaeus 1758)</txn:scientificNameAuthorship>
* The scientificNameAuthorship should eventually be mapped to a publication or a list of probable publications. It is too ambiguous.
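The field rewriting described above could look roughly like the following. This is only a minimal sketch: the `txn` namespace URI is a placeholder assumption, the `<record>` wrapper element and the `FIELD_MAP` table are hypothetical, and a real conversion would work on full RDF rather than a bare XML fragment.

```python
import xml.etree.ElementTree as ET

# The dwc namespace is the published Darwin Core terms namespace.
# The txn namespace URI below is a placeholder assumption for this sketch.
TXN = "http://example.org/txn#"
DWC = "http://rs.tdwg.org/dwc/terms/"

# Hypothetical mapping between fields assumed to have equivalent semantics.
FIELD_MAP = {
    f"{{{TXN}}}hasScientificName": f"{{{DWC}}}scientificName",
    f"{{{TXN}}}scientificNameAuthorship": f"{{{DWC}}}scientificNameAuthorship",
}


def rewrite_fields(xml_text: str) -> ET.Element:
    """Rename every element whose tag appears in FIELD_MAP; text is untouched."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        if elem.tag in FIELD_MAP:
            elem.tag = FIELD_MAP[elem.tag]
    return root


# Illustrative fragment built from the two fields quoted in the email.
SAMPLE = f"""<record xmlns:txn="{TXN}">
  <txn:hasScientificName>Procyon lotor</txn:hasScientificName>
  <txn:scientificNameAuthorship>(Linnaeus 1758)</txn:scientificNameAuthorship>
</record>"""

converted = rewrite_fields(SAMPLE)
name = converted.find(f"{{{DWC}}}scientificName")
```

Only the element names change here; the name string itself stays free of the authorship string, as in the model described above.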
There was a debate about <scientificName> earlier on the list which seemed to go back and forth.
I got tired of rewriting my examples each time and decided to use my own vocabulary that works in my example queries and has fields that map as closely to dwc as possible.
- Pete
On Tue, May 31, 2011 at 7:07 AM, Steve Baskauf <steve.baskauf@vanderbilt.edu> wrote:
I had actually written a response to this thread about a week ago in which I tried to clarify why I wanted to connect the ITIS and uBio identifiers. However, I decided that the email was too cynical and not helpful, so I erased it. However, I think that a couple of the points I had in that email probably should have been made, so I will try to state them again in a more constructive manner.
My reason for wanting to connect the uBio and ITIS identifiers really had nothing to do with making use of any of the tools or services that either group provides. Rather it has to do with my desire to follow the best practices for GUIDs as laid out in the TDWG GUID Applicability Statement (now an official standard). In particular, I have in mind Recommendations 2 and 8, which I paraphrase here as: "make HTTP URIs out of your identifiers" and "stop making up new identifiers when somebody else already has one for the thing you are talking about". I suppose Recommendation 10 should also be mentioned, which I paraphrase as "provide RDF/XML to users that want it".
I am actually using ITIS TSNs internally in my database. However, last time I checked there were no GUIDs based on TSNs that met the recommendations I've paraphrased above. (The ITIS website does mention "LSIDs" in the context of web services, but they don't follow either recommendation 2 or 10.) However outdated they are, uBio identifiers do actually meet recommendations 2 and 10 and that is why I wanted to use them (although the http proxied forms are unnecessarily ugly and long). So that explains in a nutshell the reason for my request. If ITIS would provide a simple http URI form of their TSNs which could resolve via content negotiation to either HTML or RDF/XML, it would be much easier for me to just use them.
OK, here is where I risk stepping on people's toes. So I'll try to stomp gently. I think that the area of taxon names is one where the TDWG community fails miserably at recommendation 8. I've lost count of the number of different kinds of identifiers that are available for referring to taxon names (this issue was discussed previously in the thread that starts with http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002231.html so I won't repeat it here). I don't know about (nor particularly care about) "turf" in this area, but I would challenge the community to get serious about recommendation 8 and come up with some consensus about a single, universal set of GUIDs for taxon names. Those identifiers should (in my opinion which stems from the GUID recommendations):
- be http URIs (rec 2)
- be based on an existing identifier (rec 8)
- return RDF/XML when a client requests it (rec 10)
- not change (rec 4)
I do not like proxied LSIDs (unnecessarily long with many useless characters) and I despise UUIDs (what is the point of creating a long, un-typeable string to replace a serial number that is already globally unique if appended to a domain name?). Why not just register something like "http://purl.org/tn/" (with "tn" representing "taxon name") and stick one of the existing serial numbers onto it? The domain name would be "turf-neutral" and anybody (GBIF, TDWG, or another organization) could manage the actual resolution through redirection from that domain. Somebody else could take over the management of the GUIDs if the first group got tired of it or ran out of money. The result would be a short and simple URI like "http://purl.org/tn/12345". What would be wrong with that? This is not rocket science and could be easily accomplished by a few tech-savvy people if the will were there.
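To make the proposal concrete, here is a small sketch of how such an identifier might be minted and how a client might ask for RDF/XML through content negotiation. The base "http://purl.org/tn/" is only the suggestion made in the email, not a registered resolver, and the helper names are hypothetical; the code builds the request but deliberately does not send it.

```python
import urllib.request

# Base URI as proposed in the email; assumed for illustration, not known to resolve.
BASE = "http://purl.org/tn/"


def taxon_name_uri(serial: int) -> str:
    """Append an existing serial number to the turf-neutral base (recs 2 and 8)."""
    if serial <= 0:
        raise ValueError("serial number must be a positive integer")
    return f"{BASE}{serial}"


def rdf_request(uri: str) -> urllib.request.Request:
    """Build (but do not send) a request asking for RDF/XML (rec 10)."""
    return urllib.request.Request(uri, headers={"Accept": "application/rdf+xml"})


# 12345 is the example serial number used in the email.
req = rdf_request(taxon_name_uri(12345))
```

A browser would send an HTML-oriented Accept header to the same URI, so one identifier could serve both people and machines.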
Steve
Nicolson, David wrote:
Hi Steve (and Dave),
[NB: After having composed the email below, just before sending it, I re-read your initial email more carefully and realized that you said you already had the ITIS TSNs, and were looking to add the NamebankIDs! Doh! Well, in case you (or anyone else) is interested in methods of matching names to get TSNs, I'll go ahead and send this anyway. But do note the comments below about the ITIS "versions" and ongoing overhaul of the vascular plant data in ITIS!!! -Dave]
I noticed this just before leaving work last week, and was out yesterday, but I wanted to chime in on this. I'm glad the uBio tools are meeting your needs (they do have some cool stuff!), but it should be noted that those tools are using a static snapshot of ITIS data from January 2009, and we have added about 50,000 additional scientific names, and updated tens of thousands of names beyond that (most of that in the last 6 months, as the frequency of loads dropped off in 2009-2010 due to technical issues).
I also want to note that ITIS is right in the middle of a full update of the vascular plant data in ITIS, and we're loading updated families on a monthly basis... and at long last we are tackling all the leftover issues from several bulk loads from USDA PLANTS data that left unreconciled bits of ITIS' older vascular plant data in various confusing states... so it is a VAST improvement that is underway.
There are several options for bouncing your names off the current version of ITIS.
One is to automate a matching process using the live ITIS data, based on the existing ITIS Web Services. I am CC'ing Alan Hampson, our IT fellow who built the Web Services ( http://www.itis.gov/web_service.html ), in case you'd like to follow up with him on that option. The advantage is that once you have a process in place it is completely self-serve and can always utilize the current ITIS data. If you have the resources to do this I think it would be greatly to your advantage to use this approach.
You can explore some ideas for client software to use the services at:
http://www.itis.gov/ws_develop.html
And for more information on ITIS web services try
http://www.itis.gov/ws_description.html
http://www.itis.gov/ITISWebService.xml
The ability to flag multiply-matched names (as you noted) should probably be considered, so that appropriate manual steps can be taken. This solution will allow you to take advantage of subsequent updates to ITIS with a minimum of additional effort, and given that the plant data are in the middle of a major overhaul, this bears consideration!
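As a rough illustration of flagging multiply-matched names, the sketch below assumes the matching step (whatever service or file it comes from) has already produced (name, TSN) pairs. The function name and the sample values are made up for the example and are not real ITIS records.

```python
from collections import defaultdict


def split_matches(pairs):
    """Separate names with exactly one TSN from names with several.

    `pairs` is any iterable of (scientific_name, tsn) tuples.
    """
    by_name = defaultdict(set)
    for name, tsn in pairs:
        by_name[name].add(tsn)
    unique = {n: next(iter(t)) for n, t in by_name.items() if len(t) == 1}
    ambiguous = {n: sorted(t) for n, t in by_name.items() if len(t) > 1}
    return unique, ambiguous


# Placeholder names and TSNs, purely illustrative.
matches = [("Aus bus", 1), ("Aus cus", 2), ("Aus cus", 3), ("Aus bus", 1)]
unique, ambiguous = split_matches(matches)
```

The `ambiguous` dictionary is the list that would go to a person for the manual steps mentioned above.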
Another possibility is to grab a full snapshot of the ITIS data, and load it into a database so you can do what you wish. The obvious drawback is that it goes out of date, as with the ITIS snapshot uBio is currently using. But it puts you in the driver's seat re what to do & getting new versions of ITIS. Some general information about the full exports is in the following page, although conspicuously absent is any mention of the MySQL version which (assuming you have the free MySQL properly installed & configured) can be loaded with just a few clicks or a few command lines (depending on your platform):
http://www.itis.gov/ftp_download.html
And the current ITIS data are all here for downloading:
http://www.itis.gov/downloads/
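Once a snapshot is loaded, the matching itself can be a single join. The sketch below uses Python's built-in sqlite3 with an in-memory database instead of MySQL so that it is self-contained; the table names, column names, and rows are all assumptions for illustration and should be replaced with whatever the actual download provides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Stand-in for a table of names from the snapshot (hypothetical schema and rows).
conn.execute("CREATE TABLE itis_names (tsn INTEGER, complete_name TEXT)")
conn.executemany(
    "INSERT INTO itis_names VALUES (?, ?)",
    [(1, "Aus bus"), (2, "Aus cus"), (3, "Aus cus")],
)

# Stand-in for your own list of names to be matched.
conn.execute("CREATE TABLE my_names (name TEXT)")
conn.executemany(
    "INSERT INTO my_names VALUES (?)",
    [("Aus bus",), ("Aus cus",), ("Aus dus",)],
)

# LEFT JOIN keeps unmatched names; the count separates 0, 1, and many matches.
rows = conn.execute(
    """
    SELECT m.name, COUNT(i.tsn) AS n_matches
    FROM my_names AS m
    LEFT JOIN itis_names AS i ON i.complete_name = m.name
    GROUP BY m.name
    ORDER BY m.name
    """
).fetchall()

match_counts = dict(rows)
```

Names with a count of 0 are the leftovers to re-check later, and names with a count above 1 are the multiply-matched cases.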
A third option, which I note with some trepidation, is the old "Compare Nomenclature/Taxonomy" function on the ITIS site:
http://www.itis.gov/taxmatch_ftp.html
This is a VERY old function that we do plan on replacing (timeframe not yet certain), and it is vulnerable to timeouts, etc., which is why it notes to limit the number of names per pass. But with smaller chunks of names it does work quite well. The caveat is that I would make sure to choose the 4th option in Step 4, as it is at least aware (unlike the 3 other options) of multiply-matched name cases, and lists them separately at the bottom of the report. Just a bare listing of the scientific names, with the word "name" at the top, saved as plain text, is all that is needed for input.
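Preparing input in that shape, split into smaller chunks, could be scripted along these lines. This is a sketch only: the chunk size of 500 is an arbitrary assumption (use whatever limit the page itself states), and the file naming is hypothetical.

```python
from pathlib import Path


def write_chunks(names, out_dir, chunk_size=500):
    """Write plain-text files, each starting with the header line "name".

    chunk_size is an assumed value, not the tool's documented limit.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for start in range(0, len(names), chunk_size):
        chunk = names[start:start + chunk_size]
        path = out / f"names_{start // chunk_size + 1:03d}.txt"
        path.write_text("name\n" + "\n".join(chunk) + "\n", encoding="utf-8")
        paths.append(path)
    return paths
```

For example, `write_chunks(my_names, "itis_input")` would produce names_001.txt, names_002.txt, and so on, each ready to submit in a separate pass.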
A final option would be to ask someone at ITIS to handle the matching for you (leaving you to decide re the multiply-matched names). This might be simple from your end, but is suboptimal as it leaves you in the same position as you are now should you want or need to compare names again in the future (whether due to acquiring new names in your system, or wanting to check against a later updated version of ITIS), and it pulls someone here (probably me) off of the push to get more updates into ITIS. But in a pinch, I'm certainly willing to try to help you, should it come down to that! I would just ask that you seriously consider the web services option (in particular) or the others above first.
I hope this helps some. If you have already run all your matches against the old "ITIS" data via uBio then you might consider re-running (against the current ITIS data) at least the leftover names that you did not yet get matched. Let us know if you have questions (the itiswebmaster@itis.gov address goes to myself and Alan and several others, so that might be the best bet for a follow-up unless you have a question specifically for me).
--
Pete DeVries
Department of Entomology
University of Wisconsin - Madison
445 Russell Laboratories
1630 Linden Drive
Madison, WI 53706
Email: pdevries@wisc.edu
TaxonConcept (http://www.taxonconcept.org/) & GeoSpecies (http://about.geospecies.org/) Knowledge Bases
A Semantic Web, Linked Open Data (http://linkeddata.org/) Project
Hi David,
Thanks for this heads up. The ITIS ID's that I have are either from checking manually or mapping on names from one of your downloads.
In the database that feeds TaxonConcept I try to store both the ITIS ID and the "Scientific Name" that is tied to that ID.
I do this so I know whether it uses the same or a different name for what I think is the same concept.
Here is an example of a species concept where my name and the ITIS name differ.
http://lod.taxonconcept.org/ses/sDHAp.html
The lexical groups are there to indicate that there is some association between a given namestring and the concept.
It does not mean that these are true synonyms, these are just to allow pattern matching between namestrings and data that might possibly pertain to the species concept.
Along the lines of Steve's earlier email it would be "relatively easy" or perhaps relatively straightforward to create LOD URI's from your data set with a separate controller that outputs either RDF or RDFa.
URI's similar to http://www.itis.gov/tsn/113839 (Ruby on Rails likes plural controller names, so an alternative would be http://www.itis.gov/tsns/113839)
I could markup an RDF / RDFa example of one of your records in DarwinCore RDF if that would help people get their head around this.
RDFa is a format in which the RDF markup exists within the HTML page. Here is a page with links to RDFa examples http://rdfa.info/wiki/Examples-in-the-wild
and a Wikipedia page on RDFa http://en.wikipedia.org/wiki/RDFa
Respectfully,
- Pete
On Tue, May 31, 2011 at 2:16 PM, Nicolson, David NICOLSOD@si.edu wrote:
Dear Pete and Steve,
I cannot comment on the technical content of your emails (sorry, I'm a content guy!), but I do note this comment by Pete:
"Most of the plants also have the USDA Plants identifier. In fact you might be able to get the ITIS numbers via the USDA Plants Database."
I would not recommend getting ITIS TSNs from any other source than ITIS (see my prior email for some how-to ideas).
Firstly, the PLANTS Symbol-to-TSN matches were not always well managed due to some technical issues (involving early changes to ITIS, some due to problematic bulk-updates of ITIS from the PLANTS data, or sometimes due to other artifacts). In MOST cases the TSNs they list will be fine, but a silent subset will not.
Secondly, as I noted, we are mid-stream in a full overhaul of the vascular plant data in ITIS, in almost every case using cooperatively-produced data sets that have also been made available to PLANTS as well. When/whether they use them to update that database is another question, but the ITIS updates are proceeding full-steam, with additional improvements where needed.
Finally, at least when dealing with non-static data sets, I feel it is just 'best practice' to get them from the source wherever feasible, rather than from other places.
Best,
Dave
David Nicolson Data Development Coordinator, Integrated Taxonomic Information System Biologist, USGS Core Science Systems, Biological Informatics Program nicolsod@si.edu Office 202-633-2149 Fax 202-786-2934 http://www.itis.gov/ http://www.cbif.gc.ca/itis/ "Nihil sumas necesse est..."
*From:* Peter DeVries [mailto:pete.devries@gmail.com] *Sent:* Tuesday, May 31, 2011 2:48 PM *To:* Steve Baskauf *Cc:* Nicolson, David; tdwg-content@lists.tdwg.org; Gerald Guala; Orrell, Thomas; Alan J Hampson
*Subject:* Re: [tdwg-content] ITIS TSNID to uBio NamebankIDs mapping
Hi Steve et al.,
I agree. This was one of the reasons that I setup TaxonConcept the way I did. It attempts to connect both the LOD entities and the foreign key based entities.
For example:
The Racoon http://lod.taxonconcept.org/ses/CTZ8z.html
Has links to many other URL's and URI's as well as the integer id's for:
EoL
NCBI
ITIS
BOLD
- For some of these it might be best to represent these as a one to many
since there are often many names for each concept.
I have uBio ID's in GeoSpecies but I thought that would be eventually pulled in via the GNI.
I also have a small set of other foreign keys for things like the Hymenoptera name server, FishBase, Mushroom Observer and Tropicos.
Since these are specific to specific subsets of organisms, and came on later in my project I thought it would be best to use a separate RDF file to map to those.
For instance with Fishbase http://assets.taxonconcept.org/fb/index.rdf
Insects like this one http://lod.taxonconcept.org/ses/ICmLC.html also have the id for bugguide if it exists there and I have found it under the same name or a synonym.
Of the ~105,000 concepts I have about 47,000 with ITIS ID's. This may be useful for your plant list and I can send you a spreadsheet if that is easier.
Most of the plants also have the USDA Plants identifier. In fact you might be able to get the ITIS numbers via the USDA Plants Database.
I have come to realize that many other groups see the solution to data access is with a custom API, but this requires understanding and debugging your code for each API.
Once the data is available in RDF it is one API for everything. Some issues like what to call each field can be overcome by simply rewriting (converting) the RDF.
This is easy as long as you have equivalent semantics in the meaning of the field.
For instance, it does not really matter if this name is represented as
txn:hasScientificNameProcyon lotor</txn:hasScientificName> or dwc:scientificNameProcyon lotor</dwc:scientificName>
The important thing to understand is that in my model this field does not include the authorship string.
This makes it easier to map this to other datasets and publications that don't include the authorship string.
txn:scientificNameAuthorship(Linnaeus 1758)</txn:scientificNameAuthorship>
- The scientificNameAuthorship should eventually be mapped to a
publication or a list of probable publications. It is too ambiguous.
There was a debate about <scientificName> earlier on the list which seemed to go back and forth.
I got tired of rewriting my examples each time and decided to use my own vocabulary that works in my example queries and has fields that map as closely to dwc as possible.
- Pete
On Tue, May 31, 2011 at 7:07 AM, Steve Baskauf < steve.baskauf@vanderbilt.edu> wrote:
I had actually written a response to this thread about a week ago in which I tried to clarify why I wanted to connect the ITIS and uBio identifiers. However, I decided that the email was too cynical and not helpful, so I erased it. However, I think that a couple of the points I had in that email probably should have been made, so I will try to state them again in a more constructive manner.
My reason for wanting to connect the uBio and ITIS identifiers really had nothing to do with making use of any of the tools or services that either group provides. Rather it has to do with my desire to follow the best practices for GUIDs as laid out in the TDWG GUID Applicability Statement (now an official standard). In particular, I have in mind Recommendations 2 and 8, which I paraphrase here as: "make HTTP URIs out of your identifiers" and "stop making up new identifiers when somebody else already has one for the thing you are talking about". I suppose Recommendation 10 should also be mentioned, which I paraphrase as "provide RDF/XML to users that want it".
I am actually using ITIS TSNs internally in my database. However, last time I checked there were no GUIDs based on TSNs that met the recommendations I've paraphrased above. (The ITIS website does mention "LSIDs" in the context of web services, but they don't follow either recommendation 2 or 10.) However outdated they are, uBio identifiers do actually meet recommendations 2 and 10 and that is why I wanted to use them (although the http proxied forms are unnecessarily ugly and long). So that explains in a nutshell the reason for my request. If ITIS would provide a simple http URI form of their TSNs which could resolve via content negotiation to either HTML or RDF/XML, it would be much easier for me to just use them.
OK, here is where I risk stepping on people's toes. So I'll try to stomp gently. I think that the area of taxon names is one where the TDWG community fails miserably at recommendation 8. I've lost count of the number of different kinds of identifiers that are available for referring to taxon names (this issue was discussed previously in the thread that starts with http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002231.htmlso I won't repeat it here). I don't know about (nor particularly care about) "turf" in this area, but I would challenge the community to get serious about recommendation 8 and come up with some consensus about a single, universal set of GUIDs for taxon names. Those identifiers should (in my opinion which stems from the GUID recommendations):
- be http URIs (rec 2)
- be based on an existing identifier (rec 8)
- return RDF/XML when a client requests it (rec 10)
- not change (rec 4)
I do not like proxied LSIDs (unnecessarily long with many useless characters) and I despise UUIDs (what is the point of creating a long, un-typeable string to replace a serial number that is already globally unique if appended to a domain name?). Why not just register something like "http://purl.org/tn/" http://purl.org/tn/ (with "tn" representing "taxon name") and stick one of the existing serial numbers onto it? The domain name would be "turf-neutral" and anybody (GBIF, TDWG, or another organization) could manage the actual resolution through redirection from that domain. Somebody else could take over the management of the GUIDs if the first group got tired of it or ran out of money. The result would be a short and simple URI like "http://purl.org/tn/12345"http://purl.org/tn/12345. What would be wrong with that? This is not rocket science and could be easily accomplished by a few tech-savvy people if the will were there.
Steve
Nicolson, David wrote:
Hi Steve (and Dave),
[NB: After having composed the email below, just before sending it, I re-read your initial email more carefully and realized that you said you already had the ITIS TSNs, and were looking to add the NamebankIDs! Doh! Well, in case you (or anyone else) is interested in methods of matching names to get TSNs, I'll go ahead and send this anyway. But do note the comments below about the ITIS "versions" and ongoing overhaul of the vascular plant data in ITIS!!! -Dave]
I noticed this just before leaving work last week, and was out yesterday, but I wanted to chime in on this. I'm glad the uBio tools are meeting your needs (they do have some cool stuff!), but it should be noted that those tools are using a static snapshot of ITIS data from January 2009, and we have added about 50,000 additional scientific names, and updated tens of thousands of names beyond that (most of that in the last 6 months, as the frequency of loads dropped off in 2009-2010 due to technical issues).
I also want to note that ITIS is right in the middle of a full update of the vascular plant data in ITIS, and we're loading updated families on a monthly basis... and at long last we are tackling all the leftover issues from several bulk loads from USDA PLANTS data that left unreconciled bits of ITIS' older vascular plant data in various confusing states... so it is a VAST improvement that is underway.
There are several options for bouncing your names off the current version of ITIS.
One is to automate a matching process using the live ITIS data, based on the existing ITIS Web Services. I am CC'ing Alan Hampson, our IT fellow who built the Web Services ( http://www.itis.gov/web_service.html ), in case you'd like to follow up with him on that option. The advantage is that once you have a process in place it is completely self-serve and can always utilize the current ITIS data. If you have the resources to do this I think it would be greatly to your advantage to use this approach.
You can explore some ideas for client software to use the services at:
http://www.itis.gov/ws_develop.html
And for more information on ITIS web services try
http://www.itis.gov/ws_description.html
http://www.itis.gov/ITISWebService.xml
The ability to flag multiply-matched names (as you noted) should probably be considered, so that appropriate manual steps can be taken. This solution will allow you to take advantage of subsequent updates to ITIS with a minimum of additional effort, and given that the plant data are in the middle of a major overhaul, this bears consideration!
Another possibility is to grab a full snapshot of the ITIS data, and load it into a database so you can do what you wish. The obvious drawback is that it goes out of date, as with the ITIS snapshot uBio is currently using. But it puts you in the driver's seat re what to do & getting new versions of ITIS. Some general information about the full exports is in the following page, although conspicuously absent is any mention of the MySQL version which (assuming you have the free MySQL properly installed & configured) can be loaded with just a few clicks or a few command lines (depending on your platform):
http://www.itis.gov/ftp_download.html
And the current ITIS data are all here for downloading:
http://www.itis.gov/downloads/
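Once a snapshot is loaded, the matching itself can be a single join. The sketch below uses Python's built-in SQLite in place of MySQL purely so it runs anywhere, and it assumes the snapshot exposes a `taxonomic_units` table with `tsn` and `complete_name` columns (check the actual export schema). The `my_names` table and all rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Stand-in for the loaded snapshot; table and column names are assumptions.
conn.execute("CREATE TABLE taxonomic_units (tsn INTEGER, complete_name TEXT)")
conn.executemany(
    "INSERT INTO taxonomic_units VALUES (?, ?)",
    [(1001, "Acer rubrum"), (2001, "Quercus alba"), (2002, "Quercus alba")],
)

# Hypothetical table holding the names you want to match.
conn.execute("CREATE TABLE my_names (name TEXT)")
conn.executemany(
    "INSERT INTO my_names VALUES (?)",
    [("Acer rubrum",), ("Quercus alba",), ("Nomen dubium",)],
)

# LEFT JOIN keeps unmatched names (count 0); a count above 1 marks a
# multiply-matched name that needs a manual decision.
rows = conn.execute(
    """
    SELECT m.name, COUNT(t.tsn) AS n_matches, GROUP_CONCAT(t.tsn) AS tsns
    FROM my_names AS m
    LEFT JOIN taxonomic_units AS t ON t.complete_name = m.name
    GROUP BY m.name
    ORDER BY m.name
    """
).fetchall()

for name, n_matches, tsns in rows:
    print(name, n_matches, tsns)
```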
A third option, which I note with some trepidation, is the old "Compare Nomenclature/Taxonomy" function on the ITIS site:
http://www.itis.gov/taxmatch_ftp.html
This is a VERY old function that we do plan on replacing (timeframe not yet certain), and it is vulnerable to timeouts, etc., which is why it notes to limit the number of names per pass. But with smaller chunks of names it does work quite well. The caveat is that I would make sure to choose the 4th option in Step 4, as it is at least aware (unlike the 3 other options) of multiply-matched name cases, and lists them separately at the bottom of the report. Just a bare listing of the scientific names, with the word "name" at the top, saved as plain text, is all that is needed for input.
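Preparing the input in small passes is easy to script. This is a rough Python sketch of the format described above (the word "name" on the first line, then one scientific name per line, as plain text). The helper names and the chunk size of 2 are hypothetical; pick a real chunk size from the limit stated on the page.

```python
def chunk_names(names, chunk_size):
    """Yield successive lists of at most chunk_size names."""
    for start in range(0, len(names), chunk_size):
        yield names[start:start + chunk_size]


def format_input(names):
    """Return the text of one input file: header line 'name', then the names."""
    return "\n".join(["name"] + list(names)) + "\n"


names = ["Acer rubrum", "Quercus alba", "Pinus strobus"]

# One block of text per pass; each could be saved as its own .txt file.
files = [format_input(chunk) for chunk in chunk_names(names, 2)]

for text in files:
    print(text)
```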
A final option would be to ask someone at ITIS to handle the matching for you (leaving you to decide re the multiply-matched names). This might be simple from your end, but is suboptimal as it leaves you in the same position as you are now should you want or need to compare names again in the future (whether due to acquiring new names in your system, or wanting to check against a later updated version of ITIS), and it pulls someone here (probably me) off of the push to get more updates into ITIS. But in a pinch, I'm certainly willing to try to help you, should it come down to that! I would just ask that you seriously consider the web services option (in particular) or the others above first.
I hope this helps some. If you have already run all your matches against the old "ITIS" data via uBio then you might consider re-running (against the current ITIS data) at least the leftover names that you did not yet get matched. Let us know if you have questions (the itiswebmaster@itis.gov address goes to me, Alan, and several others, so that might be the best bet for a follow-up unless you have a question specifically for me).
Regards,
Dave
David Nicolson
Data Development Coordinator, Integrated Taxonomic Information System
Biologist, USGS Core Science Systems, Biological Informatics Program
nicolsod@si.edu Office 202-633-2149 Fax 202-786-2934
"Nihil sumas necesse est..."
-----Original Message-----
Date: Fri, 20 May 2011 05:42:03 -0500
From: Steve Baskauf <steve.baskauf@vanderbilt.edu>
Subject: Re: [tdwg-content] ITIS TSNID to uBio NamebankIDs mapping
To: "David Remsen (GBIF)" <dremsen@gbif.org>
Cc: "tdwg-content@lists.tdwg.org" <tdwg-content@lists.tdwg.org>
Message-ID: <4DD6457B.2080204@vanderbilt.edu>
Content-Type: text/plain; charset="iso-8859-1"
Thanks, all, for the responses. The "Compare to ITIS" function does
just what I want. I did a test run of 1000 names and it worked like a
charm. I will need to do a little massaging because sometimes two or
more ITIS IDs come back for each uBio ID. But I can handle that.
Steve
David Remsen (GBIF) wrote:
Steve
Have you tried this?
http://www.ubio.org/clients/ITIS/index.php
or this?
http://www.ubio.org/services/mapper/index2.php
All this uBio talk makes me think we were on to something. Worth a thought about adopting the new standards and tools and making it really smooth.
DR
On 20 May 2011, at 04:46, Steve Baskauf wrote:
I have generated a csv spreadsheet of about 39 000 plant names for the
U.S. which has the ITIS TSNIDs for the names in a column. I would like
to have the uBio Namebank IDs in another column of the table. I have
been looking them up on the uBio website by typing in the names as I
need to know the IDs, but after doing about 300 of them, I'm getting
tired of it. Does anybody have a clever idea of a way to get the other
38 000 Namebank IDs without looking them up? I'm sure that it would be
possible to find this out because uBio gets names from ITIS. However, I
haven't seen any clues about how to do it in an automated fashion. I'm
guessing that there might be some way to use the uBio web services, but
if so, it isn't obvious and I probably don't have the skills to carry it
out anyway.
Any ideas?
Steve
--
Steven J. Baskauf, Ph.D., Senior Lecturer
Vanderbilt University Dept. of Biological Sciences
postal mail address:
VU Station B 351634
Nashville, TN 37235-1634, U.S.A.
delivery address:
2125 Stevenson Center
1161 21st Ave., S.
Nashville, TN 37235
office: 2128 Stevenson Center
phone: (615) 343-4582, fax: (615) 343-6707
http://bioimages.vanderbilt.edu
tdwg-content mailing list
tdwg-content@lists.tdwg.org
http://lists.tdwg.org/mailman/listinfo/tdwg-content
--
Pete DeVries
Department of Entomology
University of Wisconsin - Madison
445 Russell Laboratories
1630 Linden Drive
Madison, WI 53706
Email: pdevries@wisc.edu
TaxonConcept http://www.taxonconcept.org/ & GeoSpecies http://about.geospecies.org/ Knowledge Bases
A Semantic Web, Linked Open Data http://linkeddata.org/ Project
This is exactly why this problem still exists and will be very complex to solve - everyone says "we should have a single ID for a specific taxon name, there seem to be several IDs 'out there' that refer to the same taxon name, so I'm going to create another ID to link them all up" - yet another ID that no one will particularly want to follow - you would have to get everyone to agree that your combination/integration of taxon names is the best one and hope everyone follows it - unlikely in this domain.
My thoughts are that the most likely way this will be solved is by standard market-type pressures - i.e. the best solution/IDs will be used the most and "float" to the top. It is easy to say that the global taxon name data is a mess, but if you think about it, 30 years ago taxon name data were very disparate, duplicated, unconnected, many with NO IDs at all. So I believe we are making progress and that we will continue to do so, albeit at a fairly slow rate.
Kevin
"I agree. This was one of the reasons that I setup TaxonConcept the way I did. It attempts to connect both the LOD entities and the foreign key based entities."
Please consider the environment before printing this email Warning: This electronic message together with any attachments is confidential. If you receive it in error: (i) you must not read, use, disclose, copy or retain it; (ii) please contact the sender immediately by reply email and then delete the emails. The views expressed in this email may not be those of Landcare Research New Zealand Limited. http://www.landcareresearch.co.nz
Hi Kevin,
On Tue, May 31, 2011 at 3:27 PM, Kevin Richards <RichardsK@landcareresearch.co.nz> wrote:
This is exactly why this problem still exists and will be very complex to solve - everyone says "we should have a single ID for a specific taxon name, there seem to be several IDs 'out there' that refer to the same taxon name, so I'm going to create another ID to link them all up" - yet another ID that no one will particularly want to follow - you would have to get everyone to agree that your combination/integration of taxon names is the best one and hope everyone follows it - unlikely in this domain.
Isn't this kind of what The Plant List and eBird already do?
A difference being that they tie these to a specific name and specific classification.
The Plant List is not really even open, so it is difficult for people to adopt it en masse.
For instance, if I manage a herbarium, how do I easily reconcile my species list with the entities represented in the Plant List?
eBird has millions of records which implies that they have been able to convince the observers in the field to adopt their system. You are correct in that there are probably a lot of taxonomists that don't like their list. It differs from many of the other classifications, but remember the system rewards them for not agreeing. Note the difference between the microbial taxonomists and other taxonomists. In the case of the microbial workers, the system rewards them for solving problems not debating alternatives. Also, if a good idea comes out that will make it easier for the microbiologists to solve the problems they are rewarded for solving, they are less likely to care whose idea it is.
Like the microbiologists, there are lots of biologists that work with species with the goal of addressing some non-taxonomic problem.
They don't really care if the name is *Aedes triseriatus* or *Ochlerotatus triseriatus*, but they do care that the identifier that they connect their data to is stable.
In regards to the issue of market forces, I suspect (but have no knowledge of this) that there were probably decisions made in devising these lists that had more to do with appeasing certain personalities than creating the best list. With the way this system rewards people it is likely that the "correct" version will float to the top only after that person has passed away. I don't have much faith that the best system will always float to the top. That has a lot to do with the personalities and how the system rewards are set up. Theoretically, it is possible for one strong personality or group to force others to adopt their less than optimal solution - at least this seems to happen in other environments.
Also, there are all sorts of ways that people can use the publication record to rewrite history. Simply cite the review paper that cites the original paper. Or don't cite it at all.
I would have used only the ITIS TSN but if the name changes the ID changes. This isn't "wrong", it just does not solve my problem.
* ITIS also should add the spiders from the World Spider Catalog.
Another issue that I think has inhibited adoption of a common list is that people can't agree on a particular name or a particular classification.
Since you can model a species concept as having many names and many classifications why not do so?
If this idea had been accepted originally, I would not have needed to create TaxonConcept.org.
My plan has always been to get something that works to solve some problems and then let some larger group take it over.
In a sense, I am more like the microbiologists in that I am not being paid to solve this or debate this problem.
I am doing it because I think something like this is needed, and it is an interesting and personally rewarding puzzle.
- Pete
Hi Kevin,
I forgot to mention some other things that are different about my project.
You can write a simple SPARQL query to get a list of all the TaxonConcepts that have ITIS IDs, or all those that have ITIS and NCBI IDs, etc.
You can do this on any SPARQL endpoint that hosts the data.
You can download the entire data set and run the queries on your own endpoint.
You can write a script that runs the query and downloads the ITIS numbers and exports them to CSV etc.
- Pete
Pete, I'm not trying to say what you are doing is a waste of time/impossible. I actually think RDF + semantics are a good way forward, but this really implies that we need to rely on the semantics and linkages rather than having a SINGLE ID for a taxon name. (which is what I thought Steve was getting at). Each instance of a taxon name can have its own ID and then all these instances are connected via ontology defined semantic links. This seems more appropriate to me than insisting everyone uses the "Global Taxon Name ID X".
In your example of Aedes triseriatus and Ochlerotatus triseriatus - these are two different names so they need two different IDs, they may be linked by a single taxon concept, but they are separate names. So which of these now 3 IDs do you expect people to use, and according to what source??
For example if we have a name, e.g. the Robin, Erithacus rubecula, mentioned in ITIS (TSN: 559964) and also in EOL (http://www.eol.org/pages/1051567), also in GBIF (http://data.gbif.org/species/21266780), also in Avibase (http://avibase.bsc-eoc.org/species.jsp?avibaseid=C809B2B90399A43D), which ID are you hoping people will use? Would you put the ITIS ID in your own dataset as the ID for that name? Unlikely. Or would it be better to link them up with semantic linkages?
What I take from recommendation 8 of the GUID applicability guide (as Steve puts it, "stop making up new identifiers when somebody else already has one for the thing you are talking about") is that if you DON'T already have a record in your own database for a taxon name/concept, then reuse an existing one, NOT ditch all your current IDs and adopt someone else's (especially hard considering how difficult it is to work out which of the multitude of name and concept IDs directly relates to your taxon name).
I am all for limiting the number of IDs for the "same" thing, but in some cases it is more useful to build linkages than force this tight integration of data and IDs. Especially for taxon names and concepts, where it is complex to define if you are even talking about the "same" thing or not.
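A small Python sketch of what "build linkages" could look like in practice, using the Robin identifiers above. It is a deliberate simplification: plain undirected "same name" links held in memory rather than RDF with ontology-defined predicates, and the `IdentifierLinks` class is hypothetical. Each source keeps its own ID, and a lookup walks the links to find the others.

```python
from collections import defaultdict


class IdentifierLinks:
    """Undirected links between (source, id) pairs that refer to the same name."""

    def __init__(self):
        self._links = defaultdict(set)

    def link(self, a, b):
        """Record that identifiers a and b refer to the same name."""
        self._links[a].add(b)
        self._links[b].add(a)

    def equivalents(self, start):
        """Return every identifier reachable from start, excluding start."""
        seen = {start}
        stack = [start]
        while stack:
            current = stack.pop()
            for neighbour in self._links[current]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        return seen - {start}


links = IdentifierLinks()
itis = ("ITIS", "559964")
eol = ("EOL", "1051567")
gbif = ("GBIF", "21266780")
avibase = ("Avibase", "C809B2B90399A43D")

# Links can be asserted pairwise by different parties; nobody has to
# abandon their own identifier.
links.link(itis, eol)
links.link(eol, gbif)
links.link(gbif, avibase)

print(sorted(links.equivalents(itis)))
```

Because the links are followed transitively, a dataset keyed on the TSN can still reach the Avibase record without either side adopting a new "global" ID.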
Kevin
Hi Kevin,
Sorry, I thought you were mad at me for creating yet another set of IDs. :-\
Yes I agree that it is unlikely that we will get everyone to adopt the same name.
In some cases, it is unclear what the "right name" is. The mosquito community seems to be split about *Aedes triseriatus* and *Ochlerotatus triseriatus*, with the differences largely based on name stability. The only way to really see this is to watch what names are appearing in articles.
As a check I did a Google Scholar search to see how many recent publications use *Felis concolor* rather than *Puma concolor*. The Felis variant count was *3,920* since the year 2000.
Another advantage of the linking approach you suggest is that it allows for different interpretations and approaches and makes those visible and findable.
In order to make it clearer whether something is the same as another thing we need to encourage people to better document each species with e-type like LOD pages that contain links to specimens, etc.
For some uses I think this page and its linked resources are good enough that someone could use them to determine if what they observed on a BioBlitz was an instance of this concept.
http://lod.taxonconcept.org/ses/CTZ8z.html
For a number of other species you really need links to good photographs that help clarify characters and the locations of specimens that they could use for comparison.
Unfortunately, the system does not reward the creation of clearer descriptions of existing or new species. It tends to reward the re-categorizing of previously described species.
If you had open, easily citable and trackable species descriptions then there might be a shift in how different activities are rewarded.
One reason being that the descriptions are accessible to, and findable by, anyone with an internet connection.
In addition a masters student might not be able to fully document a species but they might be able to document the distribution, variability in morphology and DNA for a species over a specific geographical area like the North American Midwest.
It is these kinds of resources that are needed by other biologists, citizens and governments.
Once it is easier to determine what butterfly is visiting what flower, then it is possible to work out what plants each butterfly species seems to be dependent on.
Respectfully,
- Pete
On Tue, May 31, 2011 at 8:53 PM, Kevin Richards < RichardsK@landcareresearch.co.nz> wrote:
Pete,
I’m not trying to say what you are doing is a waste of time/impossible. I actually think RDF + semantics are a good way forward, but this really implies that we need to rely on the semantics and linkages rather than having a SINGLE ID for a taxon name. (which is what I thought Steve was getting at). Each instance of a taxon name can have its own ID and then all these instances are connected via ontology defined semantic links. This seems more appropriate to me than insisting everyone uses the “Global Taxon Name ID X”.
In your example of *Aedes triseriatus* and *Ochlerotatus triseriatus* – these are two different names so they need two different IDs, they may be linked by a single taxon concept, but they are separate names. So which of these now 3 IDs do you expect people to use, and according to what source??
For example if we have a name, e.g. the Robin, Erithacus rubecula, mentioned in ITIS (TSN: 559964) and also in EOL (http://www.eol.org/pages/1051567), also in GBIF (http://data.gbif.org/species/21266780), also in Avibase (http://avibase.bsc-eoc.org/species.jsp?avibaseid=C809B2B90399A43D), which ID are you hoping people will use? Would you put the ITIS ID in your own dataset as the ID for that name? Unlikely. Or would it be better to link them up with semantic linkages?
What I take from recommendation 8 of the GUID applicability guide (as Steve puts it, "stop making up new identifiers when somebody else already has one for the thing you are talking about") is that if you DON'T already have a record in your own database for a taxon name/concept, then reuse an existing one, NOT ditch all your current IDs and adopt someone else's (especially hard considering how difficult it is to work out which of the multitude of name and concept IDs directly relates to your taxon name).
I am all for limiting the number of IDs for the “same” thing, but in some cases it is more useful to build linkages than force this tight integration of data and IDs. Especially for taxon names and concepts, where it is complex to define if you are even talking about the “same” thing or not.
Kevin
*From:* Peter DeVries [mailto:pete.devries@gmail.com] *Sent:* Wednesday, 1 June 2011 12:38 p.m. *To:* Kevin Richards *Cc:* Steve Baskauf; tdwg-content@lists.tdwg.org; Gerald Guala; Nicolson, David; Alan J Hampson; Orrell, Thomas
*Subject:* Re: [tdwg-content] ITIS TSNID to uBio NamebankIDs mapping
Hi Kevin,
I forgot to mention some other things that are different about my project.
You can write a simple SPARQL query to get a list of all the TaxonConcepts that have ITIS IDs, or all those that have ITIS and NCBI IDs, etc.
You can do this on any SPARQL endpoint that hosts the data.
You can download the entire data set and run the queries on your own endpoint.
You can write a script that runs the query and downloads the ITIS numbers and exports them to CSV etc.
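A minimal sketch of such a script follows, in Python. It is only an illustration: the predicate URI `ITIS_PREDICATE` and the endpoint address in the usage note are hypothetical placeholders, since the thread does not give the actual TaxonConcept vocabulary terms or endpoint URL. It assumes the endpoint accepts queries by HTTP GET and can return the standard SPARQL JSON results format.

```python
import csv
import io
import json
import urllib.parse
import urllib.request

# Hypothetical predicate URI: the real TaxonConcept term linking a concept
# to its ITIS TSN is not given in this thread; substitute the actual one.
ITIS_PREDICATE = "http://example.org/vocab#hasITISTsn"

QUERY = f"""
SELECT ?concept ?tsn WHERE {{
  ?concept <{ITIS_PREDICATE}> ?tsn .
}}
"""


def run_query(endpoint, query):
    """Send a SPARQL SELECT by HTTP GET and return the parsed JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def results_to_csv(results):
    """Convert SPARQL JSON results into CSV text, one row per binding."""
    columns = results["head"]["vars"]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(columns)
    for binding in results["results"]["bindings"]:
        # Unbound variables are written as empty cells.
        writer.writerow([binding.get(c, {}).get("value", "") for c in columns])
    return out.getvalue()
```

With a real endpoint and predicate in place, something like `print(results_to_csv(run_query("http://example.org/sparql", QUERY)))` would produce the CSV; the same two functions work unchanged against a local endpoint loaded with the downloaded data set.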
- Pete
On Tue, May 31, 2011 at 5:16 PM, Peter DeVries pete.devries@gmail.com wrote:
Hi Kevin,
On Tue, May 31, 2011 at 3:27 PM, Kevin Richards < RichardsK@landcareresearch.co.nz> wrote:
This is exactly why this problem still exists and will be very complex to solve - everyone says "we should have a single ID for a specific taxon name, there seems to be several IDs 'out there' that refer to the same taxon name, so I'm going to create another ID to link them all up" - yet another ID that no one will particularly want to follow - you would have to get everyone to agree that your combinations/integration of taxon names is the best one and hope everyone follows it - unlikely in this domain.
Isn't this kind of what The Plant List and eBird already do?
A difference being that they tie these to a specific name and specific classification.
The Plant List is not really even open, so it is difficult for people to adopt it en masse.
For instance, if I manage a herbarium, how do I easily reconcile my species list with the entities represented in the Plant List?
eBird has millions of records which implies that they have been able to convince the observers in the field to adopt their system. You are correct in that there are probably a lot of taxonomists that don't like their list.
It differs from many of the other classifications, but remember the system rewards them for not agreeing. Note the difference between the microbial taxonomists and other taxonomists. In the case of the microbial workers, the system rewards them for solving problems not debating alternatives. Also, if a good idea comes out that will make it easier for the microbiologists to solve the problems they are rewarded for solving, they are less likely to care whose idea it is.
Like the microbiologists, there are lots of biologists that work with species with the goal of addressing some non-taxonomic problem.
They don't really care if the name is *Aedes triseriatus* or *Ochlerotatus triseriatus*, but they do care that the identifier that they connect their data to is stable.
In regards to the issue of market forces, I suspect (but have no knowledge of this) that there were probably decisions made in devising these lists that had more to do with appeasing certain personalities than with creating the best list. With the way this system rewards people, it is likely that the "correct" version will float to the top only after that person has passed away. I don't have much faith that the best system will always float to the top. That has a lot to do with the personalities and how the system rewards are set up. Theoretically, it is possible for one strong personality or group to force others to adopt their less than optimal solution - at least this seems to happen in other environments.
Also, there are all sorts of ways that people can use the publication record to rewrite history. Simply cite the review paper that cites the original paper. Or don't cite it at all.
I would have used only the ITIS TSN but if the name changes the ID changes. This isn't "wrong", it just does not solve my problem.
- ITIS also should add the spiders from the World Spider Catalog.
Another issue that I think has inhibited adoption of a common list is that people can't agree on a particular name or a particular classification.
Since you can model a species concept as having many names and many classifications why not do so?
If this idea was originally accepted, I would not have needed to create TaxonConcept.org.
My plan has always been to get something that works to solve some problems and then let some larger group take it over.
In a sense, I am more like the microbiologists in that I am not being paid to solve this or debate this problem.
I am doing it because I think something like this is needed, and it is an interesting and personally rewarding puzzle.
- Pete
My thoughts are that the most likely way this will be solved is by standard market type pressures - ie the best solution/IDs will be used the most and "float" to the top. It is easy to say that the global taxon name data is a mess, but if you think about it 30 years ago taxon name data were very disparate, duplicated, unconnected, many with NO IDs at all. So I believe we are making progress and that we will continue to do so albeit at a fairly slow rate.
Kevin
"I agree. This was one of the reasons that I setup TaxonConcept the way I did. It attempts to connect both the LOD entities and the foreign key based entities."
Please consider the environment before printing this email Warning: This electronic message together with any attachments is confidential. If you receive it in error: (i) you must not read, use, disclose, copy or retain it; (ii) please contact the sender immediately by reply email and then delete the emails. The views expressed in this email may not be those of Landcare Research New Zealand Limited. http://www.landcareresearch.co.nz
--
Pete DeVries Department of Entomology University of Wisconsin - Madison 445 Russell Laboratories 1630 Linden Drive Madison, WI 53706 Email: pdevries@wisc.edu TaxonConcept http://www.taxonconcept.org/ & GeoSpecies http://about.geospecies.org/ Knowledge Bases A Semantic Web, Linked Open Data http://linkeddata.org/ Project
My email access has been sporadic since this thread developed, so at this point I'll respond to points made in several of the messages.
First, I should note that there has been previous discussion on this list on a similar topic from http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002231.html through http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002231.html. One can review what was said at that time rather quickly by starting on the first linked message and clicking on the "Next Message" link until you get to the end of the range I gave above.
My reason for the request for information that started this thread was that I wanted to link to a URI that would anchor the name portion of a name/sensu pair (TNU or Taxon Concept a la TCS if you prefer) as in this RDF snippet:
<tc:nameString>Quercus rubra L.</tc:nameString>
<tc:hasName rdf:about="http://www.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank:448439"/>
At this point in the discussion, I'm not actually talking about creating a link to a taxon concept but rather to a taxon name, so some of the issues Pete raised don't apply here (e.g. what's the "right" name for a concept - the question here is simply what's a stable identifier for the name) . In principle, I could probably just provide the name string and be done with it. However, having some degree of faith that Smart, Computer Savvy People might some day be able to use the metadata returned by the URI (or perhaps metadata which they already have in a triple store onsite) to do cool things like knowing that my name is the same as an orthographic variant or that "Quercus rubra L." is basically the same thing as "Quercus rubra", I would like to also provide a functional URI.
As an end user who isn't very interested in the technical issues involving names, I don't really care what URI I use. I would prefer for it to be widely recognized and for it to "work" (i.e. be resolvable). In the earlier (January) thread, there was discussion about existing identifiers. There were a number of posts, but in particular http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002258.html and http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002259.html discussed the relative merits of ITIS and uBio ID numbers. My take-home message from this was that uBio represented the largest single set of names with assigned identifiers (see http://gni.globalnames.org/data_sources cited in Pete's email) and that uBio metadata provides useful references. Hence my interest in referencing uBio ids as a URI. However, as a practical matter, the organizations that I share images with either want ITIS TSNs (EOL and Morphbank) or just names (Discover Life). Nobody is asking for uBio identifiers or any other identifier.
I found Kevin's comment at http://lists.tdwg.org/pipermail/tdwg-content/2011-May/002486.html very thought-provoking: "My thoughts are that the most likely way this will be solved is by standard market type pressures - ie the best solution/IDs will be used the most and 'float' to the top." I'm not going to make a judgment about what is the "best" solution or ID. But I would say that in "computer" history, being the "best" doesn't necessarily mean that something will be used. Take for example, the FOAF vocabulary. What the heck is Friend of a Friend? I would venture to say that most of the people using the FOAF vocabulary don't know or care. The FOAF vocabulary was the one that people started to use and once that happened, people didn't switch even if there was something better. I'm not familiar with the history of other stuff like YouTube and Craig's List, but I would guess that they weren't necessarily "the best" systems - they were just the ones that most people started using first and once that happened, people didn't switch. I'm using ITIS IDs because they are easy to get and the people I communicate with want them. Whether they are the "best" or "done correctly" doesn't matter to me as much as the fact that they are widely recognized and stable (and that thus far every name that I've looked for has been in their database).
I think that one reason why this question has been on my mind is that I've been waiting for GNUB (Global Name Use Bank) to come out. I'm not really up on how it is going to work, but my impression is that it was going to be based on the Global Name Index (GNI) which was mentioned in that earlier January thread. At that point, the GNI names didn't have any identifiers that were exposed to the public as permanent GUIDs. I'm assuming that if GNUB refers to GNI names, they will have some kind of identifiers. So if that happens how is the GUID recommendation 8 going to be followed? As Kevin said in http://lists.tdwg.org/pipermail/tdwg-content/2011-June/002499.html "What I take from recommendation 8 of the GUID applicability guide ... is that if you DON'T already have a record in your own database for a taxon name/concept, then reuse an existing one. " What we have here with GNI is a situation where none of the records have identifiers. In my mind, the "best practice" according to recommendation 8 would be for the GNI to reuse existing identifiers where they exist and NOT make up new ones. This is a bit more complicated because the ITIS identifiers (which are in common use) don't have an http URI version that is resolvable, and while the uBio identifiers have a resolvable http URI, it's in the form of a proxied LSID, which I've already complained is very ugly. So I'd like to hear some ideas about how to have "reused" identifiers in the GNI.
One thing that comes to my mind would be to have a "domain name" like "http://purl.org/gni/" or "http://purl.org/tn/" ("tn" for "taxon name") and to follow it with a namespace/id combination similar to what is done with LSIDs. So for example "itis/19408" and "ubio/448439" could be appended, creating http://purl.org/gni/itis/19408 and http://purl.org/gni/ubio/448439 for "Quercus rubra L." Both URIs could point to the same RDF and that RDF could indicate that the two identifiers are owl:sameAs. I realize from what Bob Morris has cautioned in the past that there are problems with owl:sameAs when the two things aren't actually the same thing (e.g. if the uBio ID refers to a name string only but the ITIS TSN refers to the name plus an "accepted" status and a relationship to parent taxa). However, if there were an understanding that the GNI only refers to name strings, then one could still refer to http://purl.org/gni/itis/19408 as an identifier for the name string of the thing (whatever it is) that is referred to by an ITIS TSN of 19408. I don't think there would be a problem saying that it and the uBio ID were "owl:sameAs". Some kind of solution like this would allow people to easily generate a resolvable URI for a name if they were using ITIS TSNs or uBio IDs. If the name that one wanted to use was so obscure that it was one of the 9.5 million names that uBio has that ITIS doesn't have, then that name would only have the uBio version. I have no idea whether this would be a good idea or not, but I was really cringing to think about 19 million newly minted UUIDs appended to "http://gni.globalnames.org/" and figuring out how to connect those horrid things to the names and ITIS TSNs that I'm already using.
I think that I said this before, but using the purl.org domain rather than one like http://gni.globalnames.org/ would in the future allow somebody else to take over management of providing the metadata when the GUIDs are resolved without having to deal with issues of who "owns" the domain name.
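To make the proposal concrete, here is a rough sketch in Python of how such URIs and the owl:sameAs statement linking them could be generated. The "http://purl.org/gni/" base is only the suggestion made above, not a service that is known to exist, and the helper names `name_uri` and `same_as_triple` are made up for this illustration.

```python
# Standard OWL predicate URI for owl:sameAs.
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Proposed base from the message above; hypothetical, not a live resolver.
BASE = "http://purl.org/gni/"


def name_uri(namespace, local_id):
    """Build a URI of the form BASE + namespace/id, e.g. .../itis/19408."""
    return f"{BASE}{namespace}/{local_id}"


def same_as_triple(uri_a, uri_b):
    """Return one N-Triples line asserting that the two URIs are owl:sameAs."""
    return f"<{uri_a}> <{OWL_SAME_AS}> <{uri_b}> ."


# The two identifiers given above for "Quercus rubra L."
itis_uri = name_uri("itis", 19408)
ubio_uri = name_uri("ubio", 448439)
print(same_as_triple(itis_uri, ubio_uri))
```

Anyone who already holds an ITIS TSN or a uBio NamebankID could build the URI locally in this way, without a lookup service; whether owl:sameAs is the right predicate is subject to the caution noted above.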
Steve
Hi Steve,
I don't have time to go through this in detail, and I can't speak for the GNI, but I can tell you about how the GNI URIs work, at least for now.
A while back Dima Mozzherin and I were looking into how triples etc. might be of use to the GNI.
We needed a way to generate unique URIs for each name.
We wanted to avoid having to keep these in sync and not require everyone to look each ID up through some service.
Dima came up with the following plan. We use the namestring as a seed to generate a unique UUID.
Basically this is a shared algorithm which the GNI and TaxonConcept both use. But it could be used by anyone.
You feed the name string to the algorithm and it spits out a UUID. We then append that to a URI and web service so it is resolvable.
So the name Danaus plexippus (Linnaeus 1758) => 4ef223c4-0c3e-5e84-ace9-755c34c601ec
So if the GNI and another group have the same namestring, they have the same UUID.
People can then link their data set to the GNI with the following URI:
http://gni.globalnames.org/name_strings/4ef223c4-0c3e-5e84-ace9-755c34c601ec
RDF: http://gni.globalnames.org/name_strings/4ef223c4-0c3e-5e84-ace9-755c34c601ec.rdf
If you think of your data set as one table and the GNI as another, this URI serves as the foreign key that connects them together.
Some on the list don't like how these look, but there is a tremendous advantage in not having to worry about syncing two large data sets and determining if a given integer is already in use.
Also Rod Page has written recently about UUIDs: http://iphylo.blogspot.com/2011/05/zoobank-on-couchdb-uuids-replication.html
There may be a way to do something similar with bit.ly-like identifiers that are shorter (mCcSp), but I think the general idea is a good one.
If you recall from my talk at TDWG, I was able to use these to make statements that one namestring was a synonym etc. of another.
The algorithm we use is written in Ruby, but it could be ported to many different languages since UUIDs are widely supported.
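As a rough illustration of the general idea (not the actual GNI code), a name-seeded UUID can be produced in Python with the standard library. The version digit in the example UUID above is 5, which suggests a name-based (version 5, SHA-1) UUID, but the namespace used below is an assumption: this message does not say which namespace the shared algorithm uses, so this sketch is not claimed to reproduce the UUID shown above.

```python
import uuid

# Assumption: a namespace derived from the DNS name "globalnames.org".
# The namespace actually used by the GNI is not stated in this thread.
GNI_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "globalnames.org")


def name_string_uuid(name_string):
    """Derive a deterministic version 5 UUID from a name string."""
    return uuid.uuid5(GNI_NAMESPACE, name_string)


def name_string_uri(name_string):
    """Append the UUID to the GNI name_strings URI pattern shown above."""
    return f"http://gni.globalnames.org/name_strings/{name_string_uuid(name_string)}"


print(name_string_uri("Danaus plexippus (Linnaeus 1758)"))
```

Because the UUID depends only on the namespace and the exact characters of the name string, two groups that agree on both will compute the same identifier independently. The flip side is that any difference in spelling, spacing or authorship formatting yields a different UUID.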
Respectfully,
- Pete
On Thu, Jun 2, 2011 at 11:41 PM, Steven J. Baskauf < steve.baskauf@vanderbilt.edu> wrote:
My email access has been sporadic since this thread developed, so at this point I'll respond to points made in several of the messages.
First, I should note that there has been previous discussion on this list on a similar topic from http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002231.htmlthrough http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002231.html. One can review what was said at that time rather quickly by starting on the first linked message and clicking on the "Next Message" link until you get to the end of the range I gave above.
My reason for the request for information that started this thread was that I wanted to link to a URI that would anchor the name portion of a name/sensu pair (TNU or Taxon Concept a la TCS if you prefer) as in this RDF snippet:
tc:nameStringQuercus rubra L.</tc:nameString> <tc:hasName rdf:about="http://www.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank:4..." http://www.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank:448439/>
At this point in the discussion, I'm not actually talking about creating a link to a taxon concept but rather to a taxon name, so some of the issues Pete raised don't apply here (e.g. what's the "right" name for a concept - the question here is simply what's a stable identifier for the name) . In principle, I could probably just provide the name string and be done with it. However, having some degree of faith that Smart, Computer Savvy People might some day be able to use the metadata returned by the URI (or perhaps metadata which they already have in a triple store onsite) to do cool things like knowing that my name is the same as an orthographic variant or that "Quercus rubra L." is basically the same thing as "Quercus rubra", I would like to also provide a functional URI.
As an end -user who isn't very interested in the technical issues involving names, I don't really care what URI I use. I would prefer for it to be widely recognized and for it to "work" (i.e. be resolvable). In the earlier (January) thread, there was discussion about existing identifiers. There were a number of posts, but in particular http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002258.html http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002259.htmldiscuss... the relative merits of ITIS and uBio ID numbers. My take-home message from this was that uBio represented the largest single set of names with assigned identifiers (see http://gni.globalnames.org/data_sourcescited in Pete's email) and that uBio metadata provides useful references. Hence my interest in referencing uBio ids as a URI. However, as a practical matter, the organizations that I share images with either want ITIS TSNs (EOL and Morphbank) or just names (Discover Life). Nobody is asking for uBio identifiers or any other identifier.
I found Kevin's comment at http://lists.tdwg.org/pipermail/tdwg-content/2011-May/002486.html very thought-provoking: "My thoughts are that the most likely way this will be solved is by standard market type pressures - ie the best solution/IDs will be used the most and 'float' to the top." I'm not going to make a judgment about what is the "best" solution or ID. But I would say that in "computer" history, being the "best" doesn't necessarily mean that something will be used. Take for example, the FOAF vocabulary. What the heck is Friend of a Friend? I would venture to say that most of the people using the FOAF vocabulary don't know or care. The FOAF vocabulary was the one that people started to use and once that happened, people didn't switch even if there was something better. I'm not familiar with the history of other stuff like YouTube and Craig's List, but I would guess that they weren't necessarily "the best" systems - they were just the one that the most people started using first and once that happened, people didn't switch. I'm using ITIS IDs because they are easy to get and the people I communicate with want them. Whether they are the "best" or "done correctly" doesn't matter to me as much as the fact that that they are widely recognized and stable (and that thus far every name that I've looked for has been in their database).
I think that one reason why this question has been on my mind is that I've been waiting for GNUB (Global Name Use Bank) to come out. I'm not really up on how it is going to work, but my impression is that it was going to be based on the Global Name Index (GNI) which was mentioned in that earlier January thread. At that point, the GNI names didn't have any identifiers that were exposed to the public as permanent GUIDs. I'm assuming that if GNUB refers to GNI names, they will have some kind of identifiers. So if that happens how is the GUID recommendation 8 going to be followed? As Kevin said in http://lists.tdwg.org/pipermail/tdwg-content/2011-June/002499.html "What I take from recommendation 8 of the GUID applicability guide ... is that if you DON'T already have a record in your own database for a taxon name/concept, then reuse an existing one. " What we have here with GNI is a situation where none of the records have identifiers. In my mind, the "best practice" according to recommendation 8 would be for the GNI to reuse existing identifiers where they exist and NOT make up new ones. This is a bit more complicated because the ITIS identifiers (which are in common use) don't have an http URI version that is resolvable, and while the uBio identifiers have a resolvable http URI, it's in the form of a proxied LSID, which I've already complained is very ugly. So I'd like to hear some ideas about how to have "reused" identifiers in the GNI.
One thing that comes to my mind would be to have a "domain name" like "http://purl.org/gni/" http://purl.org/gni/ or "http://purl.org/tn/"http://purl.org/tn/("tn" for "taxon name") and to follow it with a namespace/id combination similar to what is done with lsids. So for example "itis/19408" and "ubio/448439" could be appended, creating http://purl.org/gni/itis/19408and http://purl.org/gni/ubio/448439 for "Quercus rubra L." Both URIs could point to the same RDF and that RDF could indicate that the two identifiers are owl:sameAs . I realize from what Bob Morris has cautioned in the past that there are problems with owl:sameAs when the two things aren't actually the same thing (e.g. if the uBio ID refers to a name string only but the ITIS TSN refers to the name plus an "accepted" status and a relationship to parent taxa). However, if there were an understanding that the GNI only refers to name strings, then one could still refer to http://purl.org/gni/itis/19408 as an identifier for the name string of the thing (whatever it is) that is referred to by an ITIS TSN of 19408. I don't think there would be a problem saying that and the ubio ID were "owl:sameAs". Some kind of solution like this would allow people to easily generate a resolvable URI for a name if they were using ITIS TSNs or uBio IDs. If the name that one wanted to use was so obscure that it was one of the 9.5 million names that uBio has that ITIS doesn't have, then that name would only have the ubio version. I have no idea whether this would be a good idea or not, but I was really cringing to think about 19 million newly minted UUIDs appended to "http://gni.globalnames.org/"http://gni.globalnames.org/and figuring out how to connect those horrid things to the names and ITIS TSNs that I'm already using. 
I think that I said this before, but using the purl.org domain rather than one like http://gni.globalnames.org/ would in the future allow somebody else to take over management of providing the metadata when the GUIDs are resolved without having to deal with issues of who "owns" the domain name.
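As a rough illustration of the namespace/id idea above, here is a minimal Python sketch. The base http://purl.org/gni/ is only the hypothetical domain floated in this message, not a registered PURL, and the function names are made up for the example. It builds the two URIs for "Quercus rubra L." and writes the owl:sameAs statement as an N-Triples line.

```python
# Hypothetical base taken from the proposal above; not a registered PURL domain.
PURL_BASE = "http://purl.org/gni/"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def name_uri(namespace: str, local_id: str) -> str:
    """Build a URI of the form <base><namespace>/<id>, e.g. .../itis/19408."""
    return f"{PURL_BASE}{namespace}/{local_id}"


def same_as_triples(uris: list[str]) -> list[str]:
    """Return N-Triples lines linking the first URI to each of the others."""
    first, *others = uris
    return [f"<{first}> <{OWL_SAME_AS}> <{other}> ." for other in others]


# The two identifiers mentioned in the message for "Quercus rubra L."
itis_uri = name_uri("itis", "19408")
ubio_uri = name_uri("ubio", "448439")
triples = same_as_triples([itis_uri, ubio_uri])

for line in triples:
    print(line)
```

Whether owl:sameAs is the right predicate depends on the caveat raised above: it is only safe if both URIs are understood to identify the name string and nothing more.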
Steve
Kevin Richards wrote:
Pete,
I’m not trying to say what you are doing is a waste of time/impossible. I actually think RDF + semantics are a good way forward, but this really implies that we need to rely on the semantics and linkages rather than having a SINGLE ID for a taxon name. (which is what I thought Steve was getting at). Each instance of a taxon name can have its own ID and then all these instances are connected via ontology defined semantic links. This seems more appropriate to me than insisting everyone uses the “Global Taxon Name ID X”.
In your example of *Aedes triseriatus* and *Ochlerotatus triseriatus* – these are two different names so they need two different IDs, they may be linked by a single taxon concept, but they are separate names. So which of these now 3 IDs do you expect people to use, and according to what source??
For example if we have a name, eg the Robin, Erithacus rubecula, mentioned in ITIS (TSN : 559964) and also in EOL (www.eol.org/pages/1051567), also in GBIF (http://data.gbif.org/species/21266780), also in avibase ( http://avibase.bsc-eoc.org/species.jsp?avibaseid=C809B2B90399A43D), which ID are you hoping people will use?? Would you put the ITIS ID in your own dataset as the ID for that name – unlikely. Or would it be better to link them up with semantic linkages.
What I take from recommendation 8 of the GUID applicability guide (as Steve puts it, “stop making up new identifiers when somebody else already has one for the thing you are talking about”) is that if you DON’T already have a record in your own database for a taxon name/concept, then reuse an existing one. NOT ditch all your current IDs and adopt someone else’s (especially hard considering it is so hard to work out which of the multitude of name and concept IDs directly relates to your taxon name).
I am all for limiting the number of IDs for the “same” thing, but in some cases it is more useful to build linkages than force this tight integration of data and IDs. Especially for taxon names and concepts, where it is complex to define if you are even talking about the “same” thing or not.
Kevin
*From:* Peter DeVries [mailto:pete.devries@gmail.com]
*Sent:* Wednesday, 1 June 2011 12:38 p.m.
*To:* Kevin Richards
*Cc:* Steve Baskauf; tdwg-content@lists.tdwg.org; Gerald Guala; Nicolson, David; Alan J Hampson; Orrell, Thomas
*Subject:* Re: [tdwg-content] ITIS TSNID to uBio NamebankIDs mapping
Hi Kevin,
I forgot to mention some other things that are different about my project.
You can write a simple SPARQL query to get a list of all the TaxonConcepts that have ITIS IDs, or all those that have ITIS and NCBI IDs etc.
You can do this on any SPARQL endpoint that hosts the data.
You can download the entire data set and run the queries on your own endpoint.
You can write a script that runs the query and downloads the ITIS numbers and exports them to CSV etc.
- Pete
On Tue, May 31, 2011 at 5:16 PM, Peter DeVries pete.devries@gmail.com wrote:
Hi Kevin,
On Tue, May 31, 2011 at 3:27 PM, Kevin Richards < RichardsK@landcareresearch.co.nz> wrote:
This is exactly why this problem still exists and will be very complex to solve - everyone says "we should have a single ID for a specific taxon name, there seems to be several IDs 'out there' that refer to the same taxon name, so I'm going to create another ID to link them all up" - yet another ID that no one will particularly want to follow - you would have to get everyone to agree that your combinations/integration of taxon names is the best one and hope everyone follows it - unlikely in this domain.
Isn't this kind of what The Plant List and eBird already do?
A difference being that they tie these to a specific name and specific classification.
The Plant List is not really even open, so it is difficult for people to adopt it en masse.
For instance, if I manage a herbarium, how do I easily reconcile my species list with the entities represented in the Plant List?
eBird has millions of records which implies that they have been able to convince the observers in the field to adopt their system. You are correct in that there are probably a lot of taxonomists that don't like their list.
It differs from many of the other classifications, but remember the system rewards them for not agreeing. Note the difference between the microbial taxonomists and other taxonomists. In the case of the microbial workers, the system rewards them for solving problems, not debating alternatives. Also, if a good idea comes out that will make it easier for the microbiologists to solve the problems they are rewarded for solving, they are less likely to care whose idea it is.
Like the microbiologists, there are lots of biologists that work with species with the goal of addressing some non-taxonomic problem.
They don't really care if the name is *Aedes triseriatus* or *Ochlerotatus triseriatus*, but they do care that the identifier that they connect their data to is stable.
In regards to the issue of market forces, I suspect (but have no knowledge of) that there were probably decisions made in devising these lists that have more to do with appeasing certain personalities than creating the best list. With the way this system rewards people it is likely that the "correct" version will float to the top only after that person has passed away. I don't have much faith that the best system will always float to the top. That has a lot to do with the personalities and how the system rewards are set up. Theoretically, it is possible for one strong personality or group to force others to adopt their less than optimal solution - at least this seems to happen in other environments.
Also, there are all sorts of ways that people can use the publication record to rewrite history. Simply cite the review paper that cites the original paper. Or don't cite it at all.
I would have used only the ITIS TSN but if the name changes the ID changes. This isn't "wrong", it just does not solve my problem.
- ITIS also should add the spiders from the World Spider Catalog.
Another issue that I think has inhibited adoption of a common list is that people can't agree on a particular name or a particular classification.
Since you can model a species concept as having many names and many classifications why not do so?
If this idea was originally accepted, I would not have needed to create TaxonConcept.org.
My plan has always been to get something that works to solve some problems and then let some larger group take it over.
In a sense, I am more like the microbiologists in that I am not being paid to solve this or debate this problem.
I am doing it because I think something like this is needed, and it is an interesting and personally rewarding puzzle.
- Pete
My thoughts are that the most likely way this will be solved is by standard market type pressures - ie the best solution/IDs will be used the most and "float" to the top. It is easy to say that the global taxon name data is a mess, but if you think about it 30 years ago taxon name data were very disparate, duplicated, unconnected, many with NO IDs at all. So I believe we are making progress and that we will continue to do so albeit at a fairly slow rate.
Kevin
"I agree. This was one of the reasons that I setup TaxonConcept the way I did. It attempts to connect both the LOD entities and the foreign key based entities."
Please consider the environment before printing this email Warning: This electronic message together with any attachments is confidential. If you receive it in error: (i) you must not read, use, disclose, copy or retain it; (ii) please contact the sender immediately by reply email and then delete the emails. The views expressed in this email may not be those of Landcare Research New Zealand Limited. http://www.landcareresearch.co.nz
--
Pete DeVries Department of Entomology University of Wisconsin - Madison 445 Russell Laboratories 1630 Linden Drive Madison, WI 53706 Email: pdevries@wisc.edu TaxonConcept http://www.taxonconcept.org/ & GeoSpecies http://about.geospecies.org/ Knowledge Bases A Semantic Web, Linked Open Data http://linkeddata.org/ Project
-- Steven J. Baskauf, Ph.D., Senior Lecturer Vanderbilt University Dept. of Biological Sciences
postal mail address: VU Station B 351634 Nashville, TN 37235-1634, U.S.A.
delivery address: 2125 Stevenson Center 1161 21st Ave., S. Nashville, TN 37235
office: 2128 Stevenson Center phone: (615) 343-4582, fax: (615) 343-6707 http://bioimages.vanderbilt.edu
Why not use the name as the basis for the resolvable identifier instead of a UUID? Isn't there a 1:1 cardinality between the name and the UUID in the GNI? Doesn't that mean that
http://gni.globalnames.org/name_strings/4ef223c4-0c3e-5e84-ace9-755c34c601ec and http://gni.globalnames.org/name_strings/Danaus_plexippus_(Linnaeus_1758)
are equally unique? The latter is certainly more readable. In those cases where the namestring is a homonym like
http://gni.globalnames.org/name_strings/Oenanthe
couldn't you just return the addresses of the two globally unique forms of the name when you resolve it?
http://gni.globalnames.org/name_strings/Oenanthe_Smith_1899
http://gni.globalnames.org/name_strings/Oenanthe_Jones_1900
Wouldn't those be as globally unique and easier to read and adjust to? Or am I missing something. I always wanted to do that with ubio IDs after a back and forth with Gregor Hagedorn and wished we hadn't exposed those integers.
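To make the suggestion concrete, here is a toy Python sketch of a resolver that accepts a name-based path segment and, for a bare homonym, returns the addresses of the fuller forms. The in-memory name list, the underscore convention, and the function names are all assumptions for illustration; "Smith 1899" and "Jones 1900" are the placeholder authorships from the example above, and nothing here reflects how the GNI actually resolves requests.

```python
BASE = "http://gni.globalnames.org/name_strings/"

# Toy stand-in for the index; the Oenanthe authorships are placeholders.
FULL_NAMES = [
    "Oenanthe Smith 1899",
    "Oenanthe Jones 1900",
    "Danaus plexippus (Linnaeus 1758)",
]


def to_segment(name_string: str) -> str:
    """Turn a name string into a URL path segment by replacing spaces."""
    return name_string.replace(" ", "_")


def resolve(segment: str) -> list[str]:
    """Return one address for an exact name, or all fuller forms of a bare name."""
    wanted = segment.replace("_", " ")
    if wanted in FULL_NAMES:
        return [BASE + segment]
    # Crude prefix match: a bare name expands to every string that starts with it.
    return [BASE + to_segment(n) for n in FULL_NAMES if n.startswith(wanted + " ")]


homonym_hits = resolve("Oenanthe")
exact_hit = resolve("Danaus_plexippus_(Linnaeus_1758)")

print(homonym_hits)
print(exact_hit)
```

The prefix match is deliberately simplistic; a real service would need a proper notion of which strings are fuller forms of the same name.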
DR
Hi Steve,
I don't have time to go through this in detail, and I can't speak for the GNI, but I can tell you about how the GNI URI's work at least for now.
A while back Dima Mozzherin and I were looking into how triples etc. might be of use to the GNI.
We needed a way to generate unique URI's for each name.
We wanted to avoid having to keep these in sync and not require everyone to look each ID up through some service.
Dima came up with the following plan. We use the namestring as seed to generate a unique UUID.
Basically this is a shared algorithm which the GNI and TaxonConcept both use. But it could be used by anyone.
You feed the name string to the algorithm and it spits out a UUID. We then append that to a URI and web service so it is resolvable.
So the name Danaus plexippus (Linnaeus 1758) => 4ef223c4-0c3e-5e84-ace9-755c34c601ec
So if the GNI and another group have the same namestring they have the same UUID.
People can then link their data set to the GNI with the following URI
http://gni.globalnames.org/name_strings/4ef223c4-0c3e-5e84-ace9-755c34c601ec
RDF: http://gni.globalnames.org/name_strings/4ef223c4-0c3e-5e84-ace9-755c34c601ec.rdf
If you think of your data set as one table and the GNI as another, this URI serves as the foreign key that connects them together.
Some on the list don't like how these look, but there is a tremendous advantage in not having to worry about syncing two large data sets and determining if a given integer is already in use.
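The name-string-to-UUID step described above can be sketched in Python with the standard uuid module, which supports name-based (version 5) UUIDs. The namespace is an assumption: this message does not say which one the shared algorithm uses, so the sketch derives one from the DNS name "globalnames.org", and its output should not be expected to match the UUID quoted above unless that guess happens to be right.

```python
import uuid

# Assumed namespace; the thread does not state the one the shared algorithm uses.
ASSUMED_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "globalnames.org")
GNI_BASE = "http://gni.globalnames.org/name_strings/"


def name_uuid(name_string: str) -> uuid.UUID:
    """Derive a version 5 UUID from a name string; no lookup service is needed."""
    return uuid.uuid5(ASSUMED_NAMESPACE, name_string)


def name_uri(name_string: str) -> str:
    """Append the derived UUID to the base to get a linkable URI."""
    return GNI_BASE + str(name_uuid(name_string))


monarch = "Danaus plexippus (Linnaeus 1758)"

# Two parties running the same function on the same string get the same UUID.
first_party = name_uuid(monarch)
second_party = name_uuid(monarch)

print(first_party == second_party)
print(name_uri(monarch))
```

The point of the design is that the identifier is a pure function of the name string, so two data sets can be joined on it without either side consulting the other.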
Also, Rod Page has written recently about UUIDs: http://iphylo.blogspot.com/2011/05/zoobank-on-couchdb-uuids-replication.html
There may be a way to do something similar with bit.ly-like identifiers that are shorter (mCcSp), but I think the general idea is a good one.
If you recall from my talk at TDWG, I was able to use these to make statements that one namestring was a synonym etc. of another etc.
The algorithm we use is written in Ruby but it could be ported to many different languages since UUIDs are widely supported.
Respectfully,
- Pete
On Thu, Jun 2, 2011 at 11:41 PM, Steven J. Baskauf < steve.baskauf@vanderbilt.edu> wrote:
My email access has been sporadic since this thread developed, so at this point I'll respond to points made in several of the messages.
First, I should note that there has been previous discussion on this list on a similar topic from http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002231.html through http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002231.html. One can review what was said at that time rather quickly by starting on the first linked message and clicking on the "Next Message" link until you get to the end of the range I gave above.
My reason for the request for information that started this thread was that I wanted to link to a URI that would anchor the name portion of a name/sensu pair (TNU or Taxon Concept a la TCS if you prefer) as in this RDF snippet:
<tc:nameString>Quercus rubra L.</tc:nameString> <tc:hasName rdf:about="http://www.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank:448439"/>
At this point in the discussion, I'm not actually talking about creating a link to a taxon concept but rather to a taxon name, so some of the issues Pete raised don't apply here (e.g. what's the "right" name for a concept; the question here is simply what's a stable identifier for the name). In principle, I could probably just provide the name string and be done with it. However, having some degree of faith that Smart, Computer Savvy People might some day be able to use the metadata returned by the URI (or perhaps metadata which they already have in a triple store onsite) to do cool things like knowing that my name is the same as an orthographic variant or that "Quercus rubra L." is basically the same thing as "Quercus rubra", I would like to also provide a functional URI.
As an end-user who isn't very interested in the technical issues involving names, I don't really care what URI I use. I would prefer for it to be widely recognized and for it to "work" (i.e. be resolvable). In the earlier (January) thread, there was discussion about existing identifiers. There were a number of posts, but in particular http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002258.html and http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002259.html discuss the relative merits of ITIS and uBio ID numbers. My take-home message from this was that uBio represented the largest single set of names with assigned identifiers (see http://gni.globalnames.org/data_sources cited in Pete's email) and that uBio metadata provides useful references. Hence my interest in referencing uBio IDs as a URI. However, as a practical matter, the organizations that I share images with either want ITIS TSNs (EOL and Morphbank) or just names (Discover Life). Nobody is asking for uBio identifiers or any other identifier.
I found Kevin's comment at http://lists.tdwg.org/pipermail/tdwg-content/2011-May/002486.html very thought-provoking: "My thoughts are that the most likely way this will be solved is by standard market type pressures - ie the best solution/IDs will be used the most and 'float' to the top." I'm not going to make a judgment about what is the "best" solution or ID. But I would say that in "computer" history, being the "best" doesn't necessarily mean that something will be used. Take for example, the FOAF vocabulary. What the heck is Friend of a Friend? I would venture to say that most of the people using the FOAF vocabulary don't know or care. The FOAF vocabulary was the one that people started to use and once that happened, people didn't switch even if there was something better. I'm not familiar with the history of other stuff like YouTube and Craig's List, but I would guess that they weren't necessarily "the best" systems - they were just the one that the most people started using first and once that happened, people didn't switch. I'm using ITIS IDs because they are easy to get and the people I communicate with want them. Whether they are the "best" or "done correctly" doesn't matter to me as much as the fact that that they are widely recognized and stable (and that thus far every name that I've looked for has been in their database).
tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content
---------------------------------------------------------------------------- David Remsen, Senior Programme Officer Electronic Catalog of Names of Known Organisms Global Biodiversity Information Facility Secretariat Universitetsparken 15, DK-2100 Copenhagen, Denmark Tel: +45-35321472 Fax: +45-35321480 Skype: dremsen ----------------------------------------------------------------------------
In my opinion UUIDs have a few advantages over strings --
1. It is a UUID, so it will work with UUID tools (current and future ones).
2. It is less ambiguous. For example, what is the difference between Betulа and Betula to your eyes? (One of them has a Cyrillic 'a'.)
3. Database-wise it is faster to search, because it is just a 128-bit number, while a name is at least a 245-byte varchar. In relational databases the size of keys directly affects search speed.
4. UUID v. 5 (http://en.wikipedia.org/wiki/Universally_unique_identifier) allows generating a UUID algorithmically without looking up a database (no need for a network connection).
5. Links like http://gni.globalnames.org/name_strings/Danaus_plexippus_(Linnaeus_1758) might be ambiguous: I can think of several ways to represent the name string part in the URL, and they would all resolve to the same thing in GNI.
6. Unescaped Unicode characters in URLs containing literal name strings (people will forget to escape them) will depend on the implementation of the URL resolver.
That said, links like http://gni.globalnames.org/name_strings/Danaus_plexippus_(Linnaeus_1758) are definitely attractive, and it is good to have them as another way to access a name! My personal preference would be not to use them as the main identifier, because of reasons 1, 2, 3 and 5.
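A few of the points above (2, 3, 5 and 6) can be checked with a short Python sketch. The namespace used for the UUIDs is an assumption made only so the example runs; the escaping uses the standard library's default percent-encoding, which is one of several ways a client might spell the same name in a URL.

```python
import unicodedata
import urllib.parse
import uuid

# Assumed namespace for illustration only; the real one is not stated here.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "globalnames.org")

latin = "Betula"
cyrillic = "Betul\u0430"  # final letter is U+0430, CYRILLIC SMALL LETTER A

# Point 2: the two strings look alike but differ, and so do their UUIDs.
last_letter_name = unicodedata.name(cyrillic[-1])
latin_id = uuid.uuid5(NAMESPACE, latin)
cyrillic_id = uuid.uuid5(NAMESPACE, cyrillic)
print(latin == cyrillic, last_letter_name, latin_id != cyrillic_id)

# Point 3: a UUID is always 16 bytes (128 bits), whatever the length of the name.
print(len(latin_id.bytes))

# Points 5 and 6: escaping gives the same name more than one URL spelling.
raw_segment = "Danaus_plexippus_(Linnaeus_1758)"
escaped_segment = urllib.parse.quote(raw_segment)
escaped_cyrillic = urllib.parse.quote(cyrillic)
print(raw_segment)
print(escaped_segment)
print(escaped_cyrillic)
```

A resolver that accepts literal name strings would have to treat the raw and escaped spellings as the same request, which is the ambiguity raised in point 5.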
Dima
On Fri, Jun 3, 2011 at 7:59 AM, David Remsen (GBIF) dremsen@gbif.org wrote:
Why not use the name as the basis for the resolvable identifier instead of a UUID? Isn't there a 1:1 cardinality between the name and the UUID in the GNI? Doesn't that mean that
http://gni.globalnames.org/name_strings/4ef223c4-0c3e-5e84-ace9-755c34c601ec and http://gni.globalnames.org/name_strings/Danaus_plexippus_(Linnaeus_1758)
are equally unique? The latter is certainly more readable. In those cases where the namestring is a homonym like
http://gni.globalnames.org/name_strings/Oenanthe
couldn't you just return the addresses of the two globally unique forms of the name when you resolve it?
http://gni.globalnames.org/name_strings/Oenanthe_Smith_1899
http://gni.globalnames.org/name_strings/Oenanthe_Jones_1900
Wouldn't those be as globally unique and easier to read and adjust to? Or am I missing something? I always wanted to do that with uBio IDs after a back and forth with Gregor Hagedorn and wished we hadn't exposed those integers.
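The homonym idea above could be sketched roughly as follows in Python. The index contents are hypothetical (the authors and years are just the placeholders used in this message), and the space-to-underscore convention is inferred from the example URLs; this is not how GNI is known to behave.

```python
from urllib.parse import quote

BASE = "http://gni.globalnames.org/name_strings/"

# HYPOTHETICAL index: bare name -> all full name strings that share it.
INDEX = {
    "Oenanthe": ["Oenanthe Smith 1899", "Oenanthe Jones 1900"],
    "Danaus plexippus": ["Danaus plexippus (Linnaeus 1758)"],
}

def name_url(name_string):
    """Build a readable URL: spaces become '_', everything else is escaped."""
    slug = name_string.replace(" ", "_")
    return BASE + quote(slug, safe="()")

def resolve(bare_name):
    """Return the URLs of every unambiguous form of a (possibly homonymous) name."""
    return [name_url(full) for full in INDEX.get(bare_name, [])]

for url in resolve("Oenanthe"):
    print(url)
```

A name with a single full form resolves to one URL; a homonym resolves to a short list the caller has to choose from.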
DR
Hi Steve,
I don't have time to go through this in detail, and I can't speak for the GNI, but I can tell you about how the GNI URIs work, at least for now.
A while back Dima Mozzherin and I were looking into how triples etc. might be of use to the GNI.
We needed a way to generate unique URIs for each name.
We wanted to avoid having to keep these in sync and not require everyone to look each ID up through some service.
Dima came up with the following plan. We use the namestring as a seed to generate a unique UUID.
Basically this is a shared algorithm which the GNI and TaxonConcept both use. But it could be used by anyone.
You feed the name string to the algorithm and it spits out a UUID. We then append that to a URI and web service so it is resolvable.
So the name Danaus plexippus (Linnaeus 1758) => 4ef223c4-0c3e-5e84-ace9-755c34c601ec
So if the GNI and another group have the same namestring, they have the same UUID.
People can then link their data set to the GNI with the following URI
http://gni.globalnames.org/name_strings/4ef223c4-0c3e-5e84-ace9-755c34c601ec
RDF: http://gni.globalnames.org/name_strings/4ef223c4-0c3e-5e84-ace9-755c34c601ec.rdf
If you think of your data set as one table and the GNI as another, this URI serves as the foreign key that connects them together.
Some on the list don't like how these look, but there is a tremendous advantage in not having to worry about syncing two large data sets and determining if a given integer is already in use.
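A minimal Python sketch of the "name string in, UUID out" idea, using the standard library's version-5 UUIDs. The namespace below is an assumption; this message does not say which namespace or string normalization the shared algorithm uses, so the output is not guaranteed to match the UUID quoted above.

```python
import uuid

# ASSUMPTION: a namespace derived from the "globalnames.org" DNS name.
# The thread does not specify the namespace the real algorithm uses.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "globalnames.org")

BASE = "http://gni.globalnames.org/name_strings/"

def name_uuid(name_string):
    """Derive a deterministic version-5 UUID from a name string."""
    return uuid.uuid5(NAMESPACE, name_string)

def name_uri(name_string, rdf=False):
    """Append the UUID to the base URI; optionally ask for the RDF form."""
    uri = BASE + str(name_uuid(name_string))
    return uri + ".rdf" if rdf else uri

name = "Danaus plexippus (Linnaeus 1758)"
print(name_uuid(name))
print(name_uri(name))
print(name_uri(name, rdf=True))
```

Two groups that run the same function on the same string get the same identifier without contacting each other, which is the "no syncing" advantage described above.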
Also, Rod Page has written recently about UUIDs: http://iphylo.blogspot.com/2011/05/zoobank-on-couchdb-uuids-replication.html
There may be a way to do something similar with bit.ly-like identifiers that are shorter (mCcSp), but I think the general idea is a good one.
If you recall from my talk at TDWG, I was able to use these to make statements that one namestring was a synonym etc. of another.
The algorithm we use is written in Ruby, but it could be ported to many different languages since UUIDs are widely supported.
Respectfully,
- Pete
On Thu, Jun 2, 2011 at 11:41 PM, Steven J. Baskauf < steve.baskauf@vanderbilt.edu> wrote:
My email access has been sporadic since this thread developed, so at this point I'll respond to points made in several of the messages.
First, I should note that there has been previous discussion on this list on a similar topic from http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002231.html through http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002231.html. One can review what was said at that time rather quickly by starting on the first linked message and clicking on the "Next Message" link until you get to the end of the range I gave above.
My reason for the request for information that started this thread was that I wanted to link to a URI that would anchor the name portion of a name/sensu pair (TNU or Taxon Concept a la TCS if you prefer) as in this RDF snippet:
<tc:nameString>Quercus rubra L.</tc:nameString>
<tc:hasName rdf:about="http://www.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank:448439"/>
At this point in the discussion, I'm not actually talking about creating a link to a taxon concept but rather to a taxon name, so some of the issues Pete raised don't apply here (e.g. what's the "right" name for a concept; the question here is simply what's a stable identifier for the name). In principle, I could probably just provide the name string and be done with it. However, having some degree of faith that Smart, Computer Savvy People might some day be able to use the metadata returned by the URI (or perhaps metadata which they already have in a triple store onsite) to do cool things like knowing that my name is the same as an orthographic variant or that "Quercus rubra L." is basically the same thing as "Quercus rubra", I would like to also provide a functional URI.
As an end user who isn't very interested in the technical issues involving names, I don't really care what URI I use. I would prefer for it to be widely recognized and for it to "work" (i.e. be resolvable). In the earlier (January) thread, there was discussion about existing identifiers. There were a number of posts, but in particular http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002258.html and http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002259.html discussed the relative merits of ITIS and uBio ID numbers. My take-home message from this was that uBio represented the largest single set of names with assigned identifiers (see http://gni.globalnames.org/data_sources cited in Pete's email) and that uBio metadata provides useful references. Hence my interest in referencing uBio IDs as a URI. However, as a practical matter, the organizations that I share images with either want ITIS TSNs (EOL and Morphbank) or just names (Discover Life). Nobody is asking for uBio identifiers or any other identifier.
I found Kevin's comment at http://lists.tdwg.org/pipermail/tdwg-content/2011-May/002486.html very thought-provoking: "My thoughts are that the most likely way this will be solved is by standard market type pressures - ie the best solution/IDs will be used the most and 'float' to the top." I'm not going to make a judgment about what is the "best" solution or ID. But I would say that in "computer" history, being the "best" doesn't necessarily mean that something will be used. Take, for example, the FOAF vocabulary. What the heck is Friend of a Friend? I would venture to say that most of the people using the FOAF vocabulary don't know or care. The FOAF vocabulary was the one that people started to use and once that happened, people didn't switch even if there was something better. I'm not familiar with the history of other stuff like YouTube and Craig's List, but I would guess that they weren't necessarily "the best" systems - they were just the ones that the most people started using first and once that happened, people didn't switch. I'm using ITIS IDs because they are easy to get and the people I communicate with want them. Whether they are the "best" or "done correctly" doesn't matter to me as much as the fact that they are widely recognized and stable (and that thus far every name that I've looked for has been in their database).
I think that one reason why this question has been on my mind is that I've been waiting for GNUB (Global Name Use Bank) to come out. I'm not really up on how it is going to work, but my impression is that it was going to be based on the Global Name Index (GNI) which was mentioned in that earlier January thread. At that point, the GNI names didn't have any identifiers that were exposed to the public as permanent GUIDs. I'm assuming that if GNUB refers to GNI names, they will have some kind of identifiers. So if that happens how is the GUID recommendation 8 going to be followed? As Kevin said in http://lists.tdwg.org/pipermail/tdwg-content/2011-June/002499.html "What I take from recommendation 8 of the GUID applicability guide ... is that if you DON'T already have a record in your own database for a taxon name/concept, then reuse an existing one. " What we have here with GNI is a situation where none of the records have identifiers. In my mind, the "best practice" according to recommendation 8 would be for the GNI to reuse existing identifiers where they exist and NOT make up new ones. This is a bit more complicated because the ITIS identifiers (which are in common use) don't have an http URI version that is resolvable, and while the uBio identifiers have a resolvable http URI, it's in the form of a proxied LSID, which I've already complained is very ugly. So I'd like to hear some ideas about how to have "reused" identifiers in the GNI.
One thing that comes to my mind would be to have a "domain name" like http://purl.org/gni/ or http://purl.org/tn/ ("tn" for "taxon name") and to follow it with a namespace/id combination similar to what is done with LSIDs. So for example "itis/19408" and "ubio/448439" could be appended, creating http://purl.org/gni/itis/19408 and http://purl.org/gni/ubio/448439 for "Quercus rubra L." Both URIs could point to the same RDF and that RDF could indicate that the two identifiers are owl:sameAs. I realize from what Bob Morris has cautioned in the past that there are problems with owl:sameAs when the two things aren't actually the same thing (e.g. if the uBio ID refers to a name string only but the ITIS TSN refers to the name plus an "accepted" status and a relationship to parent taxa). However, if there were an understanding that the GNI only refers to name strings, then one could still refer to http://purl.org/gni/itis/19408 as an identifier for the name string of the thing (whatever it is) that is referred to by an ITIS TSN of 19408. I don't think there would be a problem saying that it and the uBio ID were "owl:sameAs". Some kind of solution like this would allow people to easily generate a resolvable URI for a name if they were using ITIS TSNs or uBio IDs. If the name that one wanted to use was so obscure that it was one of the 9.5 million names that uBio has that ITIS doesn't have, then that name would only have the uBio version. I have no idea whether this would be a good idea or not, but I was really cringing to think about 19 million newly minted UUIDs appended to http://gni.globalnames.org/ and figuring out how to connect those horrid things to the names and ITIS TSNs that I'm already using.
I think that I said this before, but using the purl.org domain rather than one like http://gni.globalnames.org/ would in the future allow somebody else to take over management of providing the metadata when the GUIDs are resolved without having to deal with issues of who "owns" the domain name.
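To make the proposal concrete, here is a small Python sketch that builds the two URIs from a namespace/id pair and emits an owl:sameAs statement in N-Triples syntax. The http://purl.org/gni/ base is only the suggestion made above, not an existing service, and the function names are illustrative.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# HYPOTHETICAL base: the PURL domain proposed in this message.
PURL_BASE = "http://purl.org/gni/"

def purl(namespace, local_id):
    """Combine a source namespace and its local ID into one URI."""
    return f"{PURL_BASE}{namespace}/{local_id}"

def same_as_triple(uri_a, uri_b):
    """Return one N-Triples line asserting that two URIs denote the same name string."""
    return f"<{uri_a}> <{OWL_SAME_AS}> <{uri_b}> ."

itis_uri = purl("itis", 19408)     # ITIS TSN for "Quercus rubra L." (from the text)
ubio_uri = purl("ubio", 448439)    # uBio NamebankID for the same name string

print(same_as_triple(itis_uri, ubio_uri))
```

Anyone already holding a TSN or a NamebankID could generate the URI locally, which is the convenience being argued for here.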
Steve
Kevin Richards wrote:
Pete,
I’m not trying to say what you are doing is a waste of time/impossible. I actually think RDF + semantics are a good way forward, but this really implies that we need to rely on the semantics and linkages rather than having a SINGLE ID for a taxon name. (which is what I thought Steve was getting at). Each instance of a taxon name can have its own ID and then all these instances are connected via ontology defined semantic links. This seems more appropriate to me than insisting everyone uses the “Global Taxon Name ID X”.
In your example of *Aedes triseriatus* and *Ochlerotatus triseriatus* – these are two different names so they need two different IDs, they may be linked by a single taxon concept, but they are separate names. So which of these now 3 IDs do you expect people to use, and according to what source??
For example, if we have a name, e.g. the Robin, Erithacus rubecula, mentioned in ITIS (TSN: 559964) and also in EOL (www.eol.org/pages/1051567), also in GBIF (http://data.gbif.org/species/21266780), also in Avibase (http://avibase.bsc-eoc.org/species.jsp?avibaseid=C809B2B90399A43D), which ID are you hoping people will use? Would you put the ITIS ID in your own dataset as the ID for that name? Unlikely. Or would it be better to link them up with semantic linkages?
What I take from recommendation 8 of the GUID applicability guide (as Steve puts it, "stop making up new identifiers when somebody else already has one for the thing you are talking about") is that if you DON'T already have a record in your own database for a taxon name/concept, then reuse an existing one; NOT ditch all your current IDs and adopt someone else's (especially hard considering how difficult it is to work out which of the multitude of name and concept IDs directly relates to your taxon name).
I am all for limiting the number of IDs for the “same” thing, but in some cases it is more useful to build linkages than force this tight integration of data and IDs. Especially for taxon names and concepts, where it is complex to define if you are even talking about the “same” thing or not.
Kevin
*From:* Peter DeVries [mailto:pete.devries@gmail.com]
*Sent:* Wednesday, 1 June 2011 12:38 p.m.
*To:* Kevin Richards
*Cc:* Steve Baskauf; tdwg-content@lists.tdwg.org; Gerald Guala; Nicolson, David; Alan J Hampson; Orrell, Thomas
*Subject:* Re: [tdwg-content] ITIS TSNID to uBio NamebankIDs mapping
Hi Kevin,
I forgot to mention some other things that are different about my project.
You can write a simple SPARQL query to get a list of all the TaxonConcepts that have ITIS IDs, or all those that have ITIS and NCBI IDs, etc.
You can do this on any SPARQL endpoint that hosts the data.
You can download the entire data set and run the queries on your own endpoint.
You can write a script that runs the query and downloads the ITIS numbers and exports them to CSV etc.
- Pete
On Tue, May 31, 2011 at 5:16 PM, Peter DeVries pete.devries@gmail.com wrote:
Hi Kevin,
On Tue, May 31, 2011 at 3:27 PM, Kevin Richards < RichardsK@landcareresearch.co.nz> wrote:
This is exactly why this problem still exists and will be very complex to solve - everyone says "we should have a single ID for a specific taxon name, there seem to be several IDs 'out there' that refer to the same taxon name, so I'm going to create another ID to link them all up" - yet another ID that no one will particularly want to follow - you would have to get everyone to agree that your combination/integration of taxon names is the best one and hope everyone follows it - unlikely in this domain.
Isn't this kind of what The Plant List and eBird already do?
A difference being that they tie these to a specific name and specific classification.
The Plant List is not really even open, so it is difficult for people to adopt it en masse.
For instance, if I manage a herbarium, how do I easily reconcile my species list with the entities represented in the Plant List?
eBird has millions of records which implies that they have been able to convince the observers in the field to adopt their system. You are correct in that there are probably a lot of taxonomists that don't like their list.
It differs from many of the other classifications, but remember the system rewards them for not agreeing. Note the difference between the microbial taxonomists and other taxonomists. In the case of the microbial workers, the system rewards them for solving problems, not debating alternatives. Also, if a good idea comes out that will make it easier for the microbiologists to solve the problems they are rewarded for solving, they are less likely to care whose idea it is.
Like the microbiologists, there are lots of biologists that work with species with the goal of addressing some non-taxonomic problem.
They don't really care if the name is *Aedes triseriatus* or *Ochlerotatus triseriatus*, but they do care that the identifier that they connect their data to is stable.
In regard to the issue of market forces, I suspect (but have no knowledge) that there were probably decisions made in devising these lists that have more to do with appeasing certain personalities than with creating the best list. With the way this system rewards people it is likely that the "correct" version will float to the top only after that person has passed away. I don't have much faith that the best system will always float to the top. That has a lot to do with the personalities and how the system rewards are set up. Theoretically, it is possible for one strong personality or group to force others to adopt their less than optimal solution - at least this seems to happen in other environments.
Also, there are all sorts of ways that people can use the publication record to rewrite history. Simply cite the review paper that cites the original paper. Or don't cite it at all.
I would have used only the ITIS TSN but if the name changes the ID changes. This isn't "wrong", it just does not solve my problem.
- ITIS also should add the spiders from the World Spider Catalog.
Another issue that I think has inhibited adoption of a common list is that people can't agree on a particular name or a particular classification.
Since you can model a species concept as having many names and many classifications why not do so?
If this idea was originally accepted, I would not have needed to create TaxonConcept.org.
My plan has always been to get something that works to solve some problems and then let some larger group take it over.
In a sense, I am more like the microbiologists in that I am not being paid to solve this or debate this problem.
I am doing it because I think something like this is needed, and it is an interesting and personally rewarding puzzle.
- Pete
My thoughts are that the most likely way this will be solved is by standard market type pressures - ie the best solution/IDs will be used the most and "float" to the top. It is easy to say that the global taxon name data is a mess, but if you think about it, 30 years ago taxon name data were very disparate, duplicated, unconnected, many with NO IDs at all. So I believe we are making progress and that we will continue to do so, albeit at a fairly slow rate.
Kevin
"I agree. This was one of the reasons that I setup TaxonConcept the way I did. It attempts to connect both the LOD entities and the foreign key based entities."
Please consider the environment before printing this email Warning: This electronic message together with any attachments is confidential. If you receive it in error: (i) you must not read, use, disclose, copy or retain it; (ii) please contact the sender immediately by reply email and then delete the emails. The views expressed in this email may not be those of Landcare Research New Zealand Limited. http://www.landcareresearch.co.nz
-- Steven J. Baskauf, Ph.D., Senior Lecturer Vanderbilt University Dept. of Biological Sciences
postal mail address: VU Station B 351634 Nashville, TN 37235-1634, U.S.A.
delivery address: 2125 Stevenson Center 1161 21st Ave., S. Nashville, TN 37235
office: 2128 Stevenson Center phone: (615) 343-4582, fax: (615) 343-6707 http://bioimages.vanderbilt.edu
Working backwards through this thread...
I hadn't read Dima's post until just now, and I see that at least a couple of his points (i.e., #2, #5, #6) apply to exposing the UUIDs externally. However, I think that a simple protocol (such as replacing spaces with "_", and avoiding characters that look the same but are different -- such as the Cyrillic 'a') could go a long way to mitigating those problems.
On the other hand, it really depends on what the identifier is for. The string "Danaus_plexippus_(Linnaeus_1758)" may be more friendly to our eyes, but "A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523" is definitely more friendly to a computer (Dima's points 1, 3 & 4, among others). My feeling is that the push for GUIDs is more about enabling computer-computer conversations, than it is about enabling human-human or human-computer interactions; and therefore we should not get bogged down in the "ugliness" of the identifiers. In the context of electronic data services, the "ugliness" potential of the "Danaus_plexippus_(Linnaeus_1758)" approach to identifiers is far greater than the ugliness potential of "A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523", when it comes to interlinking electronic biodiversity data. It is nothing for a computer to render relevant metadata of the object identified by "A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523" into "Danaus plexippus (Linnaeus_1758)" on a computer screen or piece of paper for human-eyeball consumption. But there are many pitfalls (some noted by Dima) for a computer to unambiguously resolve "Danaus_plexippus_(Linnaeus_1758)" back to a meaningful data object.
I guess my revised point is: GNI (and uBio/NameBank) are essentially the only taxonomic databases out there where a human-friendly persistent/actionable identifier of the sort being discussed is even plausible as an option. It may not even be wise in this context (as per Dima's points), but it *might* be, depending on the need for a human-friendly identifier.
Maybe the simplest thing to do would be to not regard "http://gni.globalnames.org/name_strings/Danaus_plexippus_(Linnaeus_1758)" as an identifier per se, but rather as a protocol for a web service. In other words, if you append a text string to the root URL "http://gni.globalnames.org/name_strings/", GNI would run that text string against its index and return whatever metadata based on a text-string match. This is not mutually exclusive with an "identifier" in the form of "http://gni.globalnames.org/name_strings/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523", that would less ambiguously resolve a known record in GNI. At this point, the line between "identifier" and "service" gets fuzzy, of course. But the analogy is true in ZooBank:
The persistent "Identifier" looks like this: A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
One way that this identifier can be represented as an *actionable* identifier is this: urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
Another "actionable" form of the identifier might be this: http://zoobank.org/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF4...
or this: http://zoobank.org/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
or even this(?): http://lsid.tdwg.org/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5B...
(all of which work, by the way)
However, the following are examples of what I would think of as *services*:
http://www.google.com/search?q=Danaus+plexippus+(Linnaeus+1758)
http://lsid.tdwg.org/summary/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB...
http://darwin.zoology.gla.ac.uk/~rpage/lsid/tester/?q=urn:lsid:zoobank.org:a...
But really, from the perspective of the end-user, does it matter if it's an identifier or a service? Ultimately, they ask the questions, and the answers appear on their computer screens.
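One way to picture the identifier/service split is a resolver that inspects the last path segment: if it parses as a UUID it is treated as an identifier lookup, otherwise as a text search. The Python sketch below is only an illustration of that dispatch; the function name and the underscore-to-space convention are assumptions, not a description of how GNI or ZooBank actually route requests.

```python
import uuid
from urllib.parse import unquote

def interpret(segment):
    """Classify a URL path segment as a UUID identifier or a name-string search."""
    try:
        # uuid.UUID raises ValueError for anything that is not a well-formed UUID.
        return ("identifier", str(uuid.UUID(segment)))
    except ValueError:
        # ASSUMED protocol: percent-decoding, then "_" stands for a space.
        return ("search", unquote(segment).replace("_", " "))

print(interpret("A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523"))
print(interpret("Danaus_plexippus_(Linnaeus_1758)"))
```

The identifier branch returns exactly one known record, while the search branch may return zero, one, or several matches, which is where the ambiguity discussed above comes in.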
Aloha, Rich
-----Original Message-----
From: tdwg-content-bounces@lists.tdwg.org [mailto:tdwg-content-bounces@lists.tdwg.org] On Behalf Of Dmitry Mozzherin
Sent: Friday, June 03, 2011 4:34 AM
To: David Remsen (GBIF)
Cc: tdwg-content@lists.tdwg.org; Dmitry Mozzherin; Orrell, Thomas; Alan J Hampson; Nicolson, David; Gerald Guala
Subject: Re: [tdwg-content] ITIS TSNID to uBio NamebankIDs mapping
In my opinion UUIDs have a few advantages over strings --
1. It is a UUID, so it will work with UUID tools (current and future ones).
2. It is less ambiguous. For example, what is the difference between Betulа and Betula to your eyes? (One of them has a Cyrillic 'a'.)
3. Database-wise it is faster to search, because it is just a 128-bit number, while a name is at least a 245-byte varchar. In relational databases, search speed depends directly on the size of the keys.
4. UUID v5 (http://en.wikipedia.org/wiki/Universally_unique_identifier) allows a UUID to be generated algorithmically without looking it up in a database (no need for a network connection).
5. Links like http://gni.globalnames.org/name_strings/Danaus_plexippus_(Linnaeus_1758) might be ambiguous: I can think of several ways to represent the name-string part in the URL, and they will all resolve to the same thing in GNI.
6. Unescaped Unicode characters in URLs containing literal name strings (people will forget to escape them) will depend on the implementation of the URL resolver.
That said, links like http://gni.globalnames.org/name_strings/Danaus_plexippus_(Linnaeus_1758) are definitely attractive, and it is good to have them as another way to access a name! My personal preference would be not to use them as the main identifier, for reasons 1, 2, 3 and 5.
Dima
On Fri, Jun 3, 2011 at 7:59 AM, David Remsen (GBIF) dremsen@gbif.org wrote:
Why not use the name as the basis for the resolvable identifier instead of a UUID? Isn't there a 1:1 cardinality between the name and the UUID in the GNI? Doesn't that mean that
http://gni.globalnames.org/name_strings/4ef223c4-0c3e-5e84-ace9-755c34c601ec and http://gni.globalnames.org/name_strings/Danaus_plexippus_(Linnaeus_1758)
are equally unique? The latter is certainly more readable. In those cases where the namestring is a homonym like
http://gni.globalnames.org/name_strings/Oenanthe
couldn't you just return the addresses of the two globally unique forms of the name when you resolve it?
http://gni.globalnames.org/name_strings/Oenanthe_Smith_1899
http://gni.globalnames.org/name_strings/Oenanthe_Jones_1900
Wouldn't those be as globally unique and easier to read and adjust to? Or am I missing something? I always wanted to do that with uBio IDs after a back and forth with Gregor Hagedorn and wished we hadn't exposed those integers.
DR
Hi Steve,
I don't have time to go through this in detail, and I can't speak for the GNI, but I can tell you about how the GNI URIs work, at least for now.
A while back Dima Mozzherin and I were looking into how triples etc. might be of use to the GNI.
We needed a way to generate unique URIs for each name.
We wanted to avoid having to keep these in sync and not require everyone to look each ID up through some service.
Dima came up with the following plan. We use the namestring as a seed to generate a unique UUID.
Basically this is a shared algorithm which the GNI and TaxonConcept both use. But it could be used by anyone.
You feed the name string to the algorithm and it spits out a UUID. We then append that to a URI and web service so it is resolvable.
So the name Danaus plexippus (Linnaeus 1758) => 4ef223c4-0c3e-5e84-ace9-755c34c601ec
So if the GNI and another group have the same namestring, they have the same UUID.
People can then link their data set to the GNI with the following URI
http://gni.globalnames.org/name_strings/4ef223c4-0c3e-5e84-ace9-755c34c601ec
RDF: http://gni.globalnames.org/name_strings/4ef223c4-0c3e-5e84-ace9-755c34c601ec.rdf
If you think of your data set as one table and the GNI as another, this URI serves as the foreign key that connects them together.
Some on the list don't like how these look, but there is a tremendous advantage in not having to worry about syncing two large data sets and determining if a given integer is already in use.
Also, Rod Page has written recently about UUIDs: http://iphylo.blogspot.com/2011/05/zoobank-on-couchdb-uuids-replication.html
There may be a way to do something similar with bit.ly-like identifiers that are shorter (mCcSp), but I think the general idea is a good one.
If you recall from my talk at TDWG, I was able to use these to make statements that one namestring was a synonym etc. of another.
The algorithm we use is written in Ruby, but it could be ported to many different languages since UUIDs are widely supported.
Respectfully,
- Pete
On Thu, Jun 2, 2011 at 11:41 PM, Steven J. Baskauf < steve.baskauf@vanderbilt.edu> wrote:
My email access has been sporadic since this thread developed, so at this point I'll respond to points made in several of the messages.
First, I should note that there has been previous discussion on this list on a similar topic from http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002231.html through http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002231.html. One can review what was said at that time rather quickly by starting on the first linked message and clicking on the "Next Message" link until you get to the end of the range I gave above.
My reason for the request for information that started this thread was that I wanted to link to a URI that would anchor the name portion of a name/sensu pair (TNU or Taxon Concept a la TCS if you prefer) as in this RDF snippet:
<tc:nameString>Quercus rubra L.</tc:nameString>
<tc:hasName rdf:about="http://www.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank:448439"/>
At this point in the discussion, I'm not actually talking about creating a link to a taxon concept but rather to a taxon name, so some of the issues Pete raised don't apply here (e.g. what's the "right" name for a concept; the question here is simply what's a stable identifier for the name). In principle, I could probably just provide the name string and be done with it. However, having some degree of faith that Smart, Computer Savvy People might some day be able to use the metadata returned by the URI (or perhaps metadata which they already have in a triple store onsite) to do cool things like knowing that my name is the same as an orthographic variant or that "Quercus rubra L." is basically the same thing as "Quercus rubra", I would like to also provide a functional URI.
As an end user who isn't very interested in the technical issues involving names, I don't really care what URI I use. I would prefer for it to be widely recognized and for it to "work" (i.e. be resolvable). In the earlier (January) thread, there was discussion about existing identifiers. There were a number of posts, but in particular http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002258.html and http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002259.html discussed the relative merits of ITIS and uBio ID numbers. My take-home message from this was that uBio represented the largest single set of names with assigned identifiers (see http://gni.globalnames.org/data_sources cited in Pete's email) and that uBio metadata provides useful references. Hence my interest in referencing uBio IDs as a URI. However, as a practical matter, the organizations that I share images with either want ITIS TSNs (EOL and Morphbank) or just names (Discover Life). Nobody is asking for uBio identifiers or any other identifier.
I found Kevin's comment at http://lists.tdwg.org/pipermail/tdwg-content/2011-May/002486.html very thought-provoking: "My thoughts are that the most likely way this will be solved is by standard market type pressures - ie the best solution/IDs will be used the most and 'float' to the top." I'm not going to make a judgment about what is the "best" solution or ID. But I would say that in "computer" history, being the "best" doesn't necessarily mean that something will be used. Take, for example, the FOAF vocabulary. What the heck is Friend of a Friend? I would venture to say that most of the people using the FOAF vocabulary don't know or care. The FOAF vocabulary was the one that people started to use, and once that happened, people didn't switch even if there was something better. I'm not familiar with the history of other stuff like YouTube and Craigslist, but I would guess that they weren't necessarily "the best" systems - they were just the ones that the most people started using first, and once that happened, people didn't switch. I'm using ITIS IDs because they are easy to get and the people I communicate with want them. Whether they are the "best" or "done correctly" doesn't matter to me as much as the fact that they are widely recognized and stable (and that thus far every name that I've looked for has been in their database).
I think that one reason why this question has been on my mind is that I've been waiting for GNUB (Global Name Use Bank) to come out. I'm not really up on how it is going to work, but my impression is that it was going to be based on the Global Name Index (GNI) which was mentioned in that earlier January thread. At that point, the GNI names didn't have any identifiers that were exposed to the public as permanent GUIDs. I'm assuming that if GNUB refers to GNI names, they will have some kind of identifiers. So if that happens how is the GUID recommendation 8 going to be followed? As Kevin said in http://lists.tdwg.org/pipermail/tdwg-content/2011-June/002499.html "What I take from recommendation 8 of the GUID applicability guide ... is that if you DON'T already have a record in your own database for a taxon name/concept, then reuse an existing one. " What we have here with GNI is a situation where none of the records have identifiers. In my mind, the "best practice" according to recommendation 8 would be for the GNI to reuse existing identifiers where they exist and NOT make up new ones. This is a bit more complicated because the ITIS identifiers (which are in common use) don't have an http URI version that is resolvable, and while the uBio identifiers have a resolvable http URI, it's in the form of a proxied LSID, which I've already complained is very ugly. So I'd like to hear some ideas about how to have "reused" identifiers in the GNI.
One thing that comes to my mind would be to have a "domain name" like "http://purl.org/gni/" or "http://purl.org/tn/" ("tn" for "taxon name") and to follow it with a namespace/id combination similar to what is done with LSIDs. So for example "itis/19408" and "ubio/448439" could be appended, creating http://purl.org/gni/itis/19408 and http://purl.org/gni/ubio/448439 for "Quercus rubra L." Both URIs could point to the same RDF and that RDF could indicate that the two identifiers are owl:sameAs. I realize from what Bob Morris has cautioned in the past that there are problems with owl:sameAs when the two things aren't actually the same thing (e.g. if the uBio ID refers to a name string only but the ITIS TSN refers to the name plus an "accepted" status and a relationship to parent taxa). However, if there were an understanding that the GNI only refers to name strings, then one could still refer to http://purl.org/gni/itis/19408 as an identifier for the name string of the thing (whatever it is) that is referred to by an ITIS TSN of 19408. I don't think there would be a problem saying that it and the uBio ID were "owl:sameAs". Some kind of solution like this would allow people to easily generate a resolvable URI for a name if they were using ITIS TSNs or uBio IDs. If the name that one wanted to use was so obscure that it was one of the 9.5 million names that uBio has that ITIS doesn't have, then that name would only have the uBio version. I have no idea whether this would be a good idea or not, but I was really cringing to think about 19 million newly minted UUIDs appended to "http://gni.globalnames.org/" and figuring out how to connect those horrid things to the names and ITIS TSNs that I'm already using.
I think that I said this before, but using the purl.org domain rather than one like http://gni.globalnames.org/ would in the future allow somebody else to take over management of providing the metadata when the GUIDs are resolved without having to deal with issues of who "owns" the domain name.
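To make the namespace/id idea concrete, here is a minimal Python sketch. The root "http://purl.org/gni/" is only the hypothetical one floated above (nothing is registered there as far as this thread says), and the helper names name_purl and same_as_triples are invented for illustration.

```python
from urllib.parse import quote

# Hypothetical root: the message only proposes "http://purl.org/gni/" as an idea.
PURL_ROOT = "http://purl.org/gni/"

def name_purl(namespace: str, local_id: str, root: str = PURL_ROOT) -> str:
    """Join a source namespace (e.g. 'itis') and its local id into one URI."""
    if not namespace or not local_id:
        raise ValueError("namespace and local_id are both required")
    return f"{root}{quote(namespace.lower(), safe='')}/{quote(str(local_id), safe='')}"

def same_as_triples(uris):
    """State owl:sameAs between every ordered pair of URIs, N-Triples style."""
    pred = "<http://www.w3.org/2002/07/owl#sameAs>"
    return [f"<{a}> {pred} <{b}> ." for a in uris for b in uris if a != b]

# The two identifiers given above for "Quercus rubra L."
quercus = [name_purl("itis", "19408"), name_purl("ubio", "448439")]
triples = same_as_triples(quercus)
```

Whether owl:sameAs is the right predicate is exactly the caveat raised above; a weaker linking property could be swapped in without changing the URI scheme.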
Steve
Kevin Richards wrote:
Pete,
I’m not trying to say what you are doing is a waste of time/impossible. I actually think RDF + semantics are a good way forward, but this really implies that we need to rely on the semantics and linkages rather than having a SINGLE ID for a taxon name. (which is what I thought Steve was getting at). Each instance of a taxon name can have its own ID and then all these instances are connected via ontology defined semantic links. This seems more appropriate to me than insisting everyone uses the “Global Taxon Name ID X”.
In your example of *Aedes triseriatus* and *Ochlerotatus triseriatus* – these are two different names so they need two different IDs, they may be linked by a single taxon concept, but they are separate names. So which of these now 3 IDs do you expect people to use, and according to what source??
For example, if we have a name, eg the Robin, Erithacus rubecula, mentioned in ITIS (TSN: 559964) and also in EOL (www.eol.org/pages/1051567), also in GBIF (http://data.gbif.org/species/21266780), also in Avibase (http://avibase.bsc-eoc.org/species.jsp?avibaseid=C809B2B90399A43D), which ID are you hoping people will use?? Would you put the ITIS ID in your own dataset as the ID for that name? Unlikely. Or would it be better to link them up with semantic linkages?
What I take from recommendation 8 of the GUID applicability guide (as Steve puts it, "stop making up new identifiers when somebody else already has one for the thing you are talking about”) is that if you DON’T already have a record in your own database for a taxon name/concept, then reuse an existing one. NOT ditch all your current IDs and adopt someone else’s (especially hard considering how difficult it is to work out which of the multitude of name and concept IDs directly relates to your taxon name).
I am all for limiting the number of IDs for the “same” thing, but in some cases it is more useful to build linkages than force this tight integration of data and IDs. Especially for taxon names and concepts, where it is complex to define if you are even talking about the “same” thing or not.
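A rough Python sketch of the "link them up" alternative, using the four Robin identifiers quoted above. The source prefixes ("itis:", "eol:" and so on) and the cluster function are illustrative assumptions; the sketch only shows that pairwise links are enough to recover one group without anyone adopting a single master ID.

```python
from collections import defaultdict

def cluster(links):
    """Group identifiers connected by 'same name' links (simple union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps chains short
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for x in list(parent):
        groups[find(x)].add(x)
    return list(groups.values())

# Each pair is one assertion that two records carry the same name.
robin_links = [
    ("itis:559964", "eol:1051567"),
    ("eol:1051567", "gbif:21266780"),
    ("gbif:21266780", "avibase:C809B2B90399A43D"),
]
groups = cluster(robin_links)
```

Each database keeps its own identifier; only the links need to be shared.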
Kevin
*From:* Peter DeVries [mailto:pete.devries@gmail.com]
*Sent:* Wednesday, 1 June 2011 12:38 p.m.
*To:* Kevin Richards
*Cc:* Steve Baskauf; tdwg-content@lists.tdwg.org; Gerald Guala; Nicolson, David; Alan J Hampson; Orrell, Thomas
*Subject:* Re: [tdwg-content] ITIS TSNID to uBio NamebankIDs mapping
Hi Kevin,
I forgot to mention some other things that are different about my project.
You can write a simple SPARQL query to get a list of all the TaxonConcepts that have ITIS IDs, or all those that have ITIS and NCBI IDs, etc.
You can do this on any SPARQL endpoint that hosts the data.
You can download the entire data set and run the queries on your own endpoint.
You can write a script that runs the query and downloads the ITIS numbers and exports them to CSV etc.
- Pete
On Tue, May 31, 2011 at 5:16 PM, Peter DeVries wrote:
Hi Kevin,
On Tue, May 31, 2011 at 3:27 PM, Kevin Richards < RichardsK@landcareresearch.co.nz> wrote:
This is exactly why this problem still exists and will be very complex to solve - everyone says "we should have a single ID for a specific taxon name, there seem to be several IDs 'out there' that refer to the same taxon name, so I'm going to create another ID to link them all up" - yet another ID that no one will particularly want to follow - you would have to get everyone to agree that your combination/integration of taxon names is the best one and hope everyone follows it - unlikely in this domain.
Isn't this kind of what The Plant List and eBird already do?
A difference being that they tie these to a specific name and specific classification.
The Plant List is not really even open, so it is difficult for people to adopt it en masse.
For instance, if I manage a herbarium, how do I easily reconcile my species list with the entities represented in the Plant List?
eBird has millions of records which implies that they have been able to convince the observers in the field to adopt their system. You are correct in that there are probably a lot of taxonomists that don't like their list.
It differs from many of the other classifications, but remember the system rewards them for not agreeing. Note the difference between the microbial taxonomists and other taxonomists. In the case of the microbial workers, the system rewards them for solving problems, not debating alternatives. Also, if a good idea comes out that will make it easier for the microbiologists to solve the problems they are rewarded for solving, they are less likely to care whose idea it is.
Like the microbiologists, there are lots of biologists that work with species with the goal of addressing some non-taxonomic problem.
They don't really care if the name is *Aedes triseriatus* or *Ochlerotatus triseriatus*, but they do care that the identifier that they connect their data to is stable.
In regards to the issue of market forces, I suspect (but have no knowledge of this) that there were probably decisions made in devising these lists that have more to do with appeasing certain personalities than with creating the best list. With the way this system rewards people, it is likely that the "correct" version will float to the top only after that person has passed away. I don't have much faith that the best system will always float to the top. That has a lot to do with the personalities and how the system rewards are set up. Theoretically, it is possible for one strong personality or group to force others to adopt their less than optimal solution - at least this seems to happen in other environments.
Also, there are all sorts of ways that people can use the publication record to rewrite history. Simply cite the review paper that cites the original paper. Or don't cite it at all.
I would have used only the ITIS TSN but if the name changes the ID changes. This isn't "wrong", it just does not solve my problem.
- ITIS also should add the spiders from the World Spider Catalog.
Another issue that I think has inhibited adoption of a common list is that people can't agree on a particular name or a particular classification.
Since you can model a species concept as having many names and many classifications why not do so?
If this idea was originally accepted, I would not have needed to create TaxonConcept.org.
My plan has always been to get something that works to solve some problems and then let some larger group take it over.
In a sense, I am more like the microbiologists in that I am not being paid to solve this or debate this problem.
I am doing it because I think something like this is needed, and it is an interesting and personally rewarding puzzle.
- Pete
My thoughts are that the most likely way this will be solved is by standard market type pressures - ie the best solution/IDs will be used the most and "float" to the top. It is easy to say that the global taxon name data is a mess, but if you think about it, 30 years ago taxon name data were very disparate, duplicated, unconnected, many with NO IDs at all. So I believe we are making progress and that we will continue to do so, albeit at a fairly slow rate.
Kevin
"I agree. This was one of the reasons that I setup TaxonConcept the way I did. It attempts to connect both the LOD entities and the foreign key based entities."
Please consider the environment before printing this email. Warning: This electronic message together with any attachments is confidential. If you receive it in error: (i) you must not read, use, disclose, copy or retain it; (ii) please contact the sender immediately by reply email and then delete the emails. The views expressed in this email may not be those of Landcare Research New Zealand Limited. http://www.landcareresearch.co.nz
--
Pete DeVries Department of Entomology University of Wisconsin - Madison 445 Russell Laboratories 1630 Linden Drive Madison, WI 53706 Email: pdevries@wisc.edu TaxonConcept http://www.taxonconcept.org/ & GeoSpecies http://about.geospecies.org/ Knowledge Bases A Semantic Web, Linked Open Data http://linkeddata.org/ Project
-- Steven J. Baskauf, Ph.D., Senior Lecturer Vanderbilt University Dept. of Biological Sciences
postal mail address: VU Station B 351634 Nashville, TN 37235-1634, U.S.A.
delivery address: 2125 Stevenson Center 1161 21st Ave., S. Nashville, TN 37235
office: 2128 Stevenson Center phone: (615) 343-4582, fax: (615) 343-6707 http://bioimages.vanderbilt.edu
tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content
David Remsen, Senior Programme Officer Electronic Catalog of Names of Known Organisms Global Biodiversity Information Facility Secretariat Universitetsparken 15, DK-2100 Copenhagen, Denmark Tel: +45-35321472 Fax: +45-35321480 Skype: dremsen
Oh, now that I have read Rich's email here, it seems we are in agreement, of sorts. I think there is obviously a need for both of these "identifier" approaches - ie a record based ID that no one should really ever need to interact with directly, and a human friendly "ID" that allows people to discuss the same semantic "thing".
It is interesting all this discussion of identifiers when in the end it doesn’t matter too much what the identifier is, just that you have an identifier at all. The important thing is the semantics, the "are we talking about the same thing" question - so this is where I believe RDF/semantic web comes in - I might see if I can come up with some RDF/sem web example for TDWG that could demonstrate this, hmmm... I think underlying all this discussion about identifiers are thoughts like "but what is it that that person is meaning when they quote identifier X?". This is obviously semantics, and where semantic approaches can help - ie don’t worry too much about what identifier you give your own record of "Aus bus", so long as you semantically map your record/ID to as many other records/IDs of "Aus bus" as you can.
Kevin
-----Original Message----- From: tdwg-content-bounces@lists.tdwg.org [mailto:tdwg-content-bounces@lists.tdwg.org] On Behalf Of Richard Pyle Sent: Saturday, 4 June 2011 10:49 a.m. To: tdwg-content@lists.tdwg.org Subject: Re: [tdwg-content] ITIS TSNID to uBio NamebankIDs mapping
Working backwards through this thread...
I hadn't read Dima's post until just now, and I see that at least a couple of his points (i.e., #2, #5, #6) apply to exposing the UUIDs externally. However, I think that a simple protocol (such as replacing spaces with "_", and avoiding characters that look the same but are different -- such as the Cyrillic 'a') could go a long way to mitigating those problems.
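A small Python sketch of what such a protocol might look like. The function names to_slug and non_latin_letters are invented here, and the look-alike check is deliberately crude: it only flags letters whose Unicode name is not Latin, which is an assumption about what "avoiding" such characters would mean in practice.

```python
import unicodedata

def to_slug(name: str) -> str:
    """Collapse runs of whitespace and join the pieces with underscores."""
    return "_".join(name.split())

def non_latin_letters(name: str):
    """Return the letters whose Unicode name does not start with 'LATIN'."""
    return [ch for ch in name
            if ch.isalpha() and not unicodedata.name(ch, "").startswith("LATIN")]

latin = "Betula"
mixed = "Betul\u0430"  # last letter is CYRILLIC SMALL LETTER A

slug = to_slug("Danaus plexippus (Linnaeus 1758)")
suspects = non_latin_letters(mixed)
```

A service could reject or flag any name string for which non_latin_letters returns something, rather than silently treating the two spellings as different names.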
On the other hand, it really depends on what the identifier is for. The string "Danaus_plexippus_(Linnaeus_1758)" may be more friendly to our eyes, but "A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523" is definitely more friendly to a computer (Dima's points 1, 3 & 4, among others). My feeling is that the push for GUIDs is more about enabling computer-computer conversations, than it is about enabling human-human or human-computer interactions; and therefore we should not get bogged down in the "ugliness" of the identifiers. In the context of electronic data services, the "ugliness" potential of the "Danaus_plexippus_(Linnaeus_1758)" approach to identifiers is far greater than the ugliness potential of "A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523", when it comes to interlinking electronic biodiversity data. It is nothing for a computer to render relevant metadata of the object identified by "A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523" into "Danaus plexippus (Linnaeus_1758)" on a computer screen or piece of paper for human-eyeball consumption. But there are many pitfalls (some noted by Dima) for a computer to unambiguously resolve "Danaus_plexippus_(Linnaeus_1758)" back to a meaningful data object.
I guess my revised point is: GNI (and uBio/NameBank) are essentially the only taxonomic databases out there where a human-friendly persistent/actionable identifier of the sort being discussed is even plausible as an option. It may not even be wise in this context (as per Dima's points), but it *might* be, depending on the need for a human-friendly identifier.
Maybe the simplest thing to do would be to not regard "http://gni.globalnames.org/name_strings/Danaus_plexippus_(Linnaeus_1758)" as an identifier per se, but rather as a protocol for a web service. In other words, if you append a text string to the root URL "http://gni.globalnames.org/name_strings/", GNI would run that text string against its index and return whatever metadata based on a text-string match. This is not mutually exclusive with an "identifier" in the form of "http://gni.globalnames.org/name_strings/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523", that would less ambiguously resolve a known record in GNI. At this point, the line between "identifier" and "service" gets fuzzy, of course. But the analogy is true in ZooBank:
The persistent "Identifier" looks like this: A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
One way that this identifier can be represented as an *actionable* identifier is this: urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
Another "actionable" form of the identifier might be this: http://zoobank.org/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF4...
or this: http://zoobank.org/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
or even this(?): http://lsid.tdwg.org/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5B...
(all of which work, by the way)
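The relationship between the bare identifier and its actionable spellings can be sketched in Python as below. The function zoobank_forms is hypothetical, and the two proxied forms are rebuilt from the visible prefixes only (those URLs are truncated in the archive), so treat them as an assumption about the pattern rather than a statement of what ZooBank or the TDWG resolver actually serve.

```python
import uuid

def zoobank_forms(raw: str) -> dict:
    """Derive several spellings of one identifier from the bare UUID."""
    # uuid.UUID raises ValueError on malformed input; upper-case to match the message.
    canonical = str(uuid.UUID(raw)).upper()
    lsid = f"urn:lsid:zoobank.org:act:{canonical}"
    return {
        "bare": canonical,
        "lsid": lsid,
        "http": f"http://zoobank.org/{canonical}",
        # Assumed patterns: proxy root followed by the full LSID.
        "http_lsid": f"http://zoobank.org/{lsid}",
        "tdwg_proxy": f"http://lsid.tdwg.org/{lsid}",
    }

forms = zoobank_forms("A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523")
```

The point is that only the 36-character UUID needs to be stored; every other form is a mechanical wrapping of it.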
However, the following are examples of what I would think of as *services*: http://www.google.com/search?q=Danaus+plexippus+(Linnaeus+1758) http://lsid.tdwg.org/summary/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB... http://darwin.zoology.gla.ac.uk/~rpage/lsid/tester/?q=urn:lsid:zoobank.org:a...
But really, from the perspective of the end-user, does it matter if it's an identifier or a service? Ultimately, they ask the questions, and the answers appear on their computer screens.
Aloha, Rich
-----Original Message----- From: tdwg-content-bounces@lists.tdwg.org [mailto:tdwg-content-bounces@lists.tdwg.org] On Behalf Of Dmitry Mozzherin Sent: Friday, June 03, 2011 4:34 AM To: David Remsen (GBIF) Cc: tdwg-content@lists.tdwg.org; Dmitry Mozzherin; Orrell, Thomas; Alan J Hampson; Nicolson, David; Gerald Guala Subject: Re: [tdwg-content] ITIS TSNID to uBio NamebankIDs mapping
In my opinion UUIDs have a few advantages over strings --
1. It is a UUID, so it will work with UUID tools (current and future ones).
2. It is less ambiguous. For example, what is the difference between Betulа and Betula for your eyes? (One of them has a Cyrillic 'a'.)
3. Database-wise it is faster to search, because it is just a 128-bit number, while a name is at least a 245-byte varchar; in relational databases the size of the keys directly affects search speed.
4. UUID v. 5 (http://en.wikipedia.org/wiki/Universally_unique_identifier) allows a UUID to be generated algorithmically without looking up a database (no need for a network connection).
5. Links like http://gni.globalnames.org/name_strings/Danaus_plexippus_(Linnaeus_1758) might be ambiguous -- I can think of several ways I can represent the name string part in the URL, and they will all resolve to the same thing in GNI.
6. Unescaped Unicode characters in URLs containing literal name strings (people will forget to escape them) will depend on the implementation of the URL resolver.
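Points 3, 5 and 6 can be illustrated with a short Python sketch. The three URL spellings are just plausible examples of how a client might write the same name, and to_name is an invented helper; the sketch is not a description of how GNI itself parses paths.

```python
import uuid
from urllib.parse import quote, unquote

name = "Danaus plexippus (Linnaeus 1758)"

# Point 5: three plausible URL spellings of one and the same name string.
spellings = [
    name.replace(" ", "_"),   # underscores, as in the links discussed
    quote(name, safe="()"),   # %20 for spaces, parentheses left alone
    quote(name),              # parentheses percent-escaped as well
]

def to_name(path_part: str) -> str:
    """Undo percent-escaping and underscores to recover the name string."""
    return unquote(path_part).replace("_", " ")

decoded = {to_name(s) for s in spellings}

# Point 6: a Cyrillic 'a' only becomes visible once the path is escaped.
cyrillic_path = quote("Betul\u0430")

# Point 3: a UUID key is always 16 bytes; a name string key is longer and varies.
key = uuid.UUID("4ef223c4-0c3e-5e84-ace9-755c34c601ec")
sizes = (len(key.bytes), len(name.encode("utf-8")))
```

All three spellings decode to the same name, which is the ambiguity in point 5: several distinct URLs would have to be treated as one identifier.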
That said, links like http://gni.globalnames.org/name_strings/Danaus_plexippus_(Linnaeus_1758) are definitely attractive, and it is good to have them as another way to access a name! My personal preference would be not to use them as the main identifier, because of reasons 1, 2, 3 and 5.
Dima
On Fri, Jun 3, 2011 at 7:59 AM, David Remsen (GBIF) dremsen@gbif.org wrote:
Why not use the name as the basis for the resolvable identifier instead of a UUID? Isn't there a 1:1 cardinality between the name and the UUID in the GNI? Doesn't that mean that
http://gni.globalnames.org/name_strings/4ef223c4-0c3e-5e84-ace9-755c34c601ec
and
http://gni.globalnames.org/name_strings/Danaus_plexippus_(Linnaeus_175
are equally unique? The latter is certainly more readable. In those cases where the namestring is a homonym like
http://gni.globalnames.org/name_strings/Oenanthe
couldn't you just return the addresses of the two globally unique forms of the name when you resolve it?
http://gni.globalnames.org/name_strings/Oenanthe_Smith_1899
http://gni.globalnames.org/name_strings/Oenanthe_Jones_1900
Wouldn't those be as globally unique and easier to read and adjust to? Or am I missing something. I always wanted to do that with ubio IDs after a back and forth with Gregor Hagedorn and wished we hadn't exposed those integers.
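The homonym behaviour suggested here could look roughly like the Python sketch below. INDEX is a toy list built from the made-up homonyms in this message, not real GNI content, and resolve is a hypothetical function name.

```python
ROOT = "http://gni.globalnames.org/name_strings/"

# Toy index using the illustrative homonyms above; not real GNI data.
INDEX = [
    "Oenanthe_Smith_1899",
    "Oenanthe_Jones_1900",
    "Danaus_plexippus_(Linnaeus_1758)",
]

def resolve(slug: str):
    """Return the exact match if there is one, else every fuller form of the bare name."""
    if slug in INDEX:
        return [ROOT + slug]
    return [ROOT + entry for entry in INDEX if entry.startswith(slug + "_")]

homonyms = resolve("Oenanthe")
single = resolve("Danaus_plexippus_(Linnaeus_1758)")
```

A bare homonym then resolves to a list of candidates, and the caller picks the fully qualified form it actually means.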
DR
Hi Steve,
I don't have time to go through this in detail, and I can't speak for the GNI, but I can tell you about how the GNI URI's work at least for now.
A while back Dima Mozzherin and I were looking into how triples etc. might be of use to the GNI.
We needed a way to generate unique URI's for each name.
We wanted to avoid having to keep these in sync and not require everyone to look each ID up through some service.
Dima came up with the following plan. We use the namestring as seed to generate a unique UUID.
Basically this is a shared algorithm which the GNI and TaxonConcept both use. But it could be used by anyone.
You feed the name string to the algorithm and it spits out a UUID. We then append that to a URI and web service so it is resolvable.
So the name Danaus plexippus (Linnaeus 1758) => 4ef223c4-0c3e-5e84-ace9-755c34c601ec
So if the GNI and another group have the same namestring, they have the same UUID.
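A minimal Python sketch of this kind of name-seeded identifier, using the standard library's version 5 UUIDs. The namespace below (derived from the DNS name "globalnames.org") is an assumption made for illustration; the thread does not state which namespace the shared algorithm uses, so the output is not claimed to match the UUID quoted above.

```python
import uuid

# Assumption: a namespace derived from the DNS name "globalnames.org".
GN_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "globalnames.org")

def name_uuid(name_string: str) -> uuid.UUID:
    """Derive a UUID from the name string alone; no database lookup is involved."""
    return uuid.uuid5(GN_NAMESPACE, name_string)

def name_uri(name_string: str,
             root: str = "http://gni.globalnames.org/name_strings/") -> str:
    """Append the derived UUID to a resolver root to get a linkable URI."""
    return f"{root}{name_uuid(name_string)}"

a = name_uuid("Danaus plexippus (Linnaeus 1758)")
b = name_uuid("Danaus plexippus (Linnaeus 1758)")  # computed independently
c = name_uuid("Danaus plexippus")                  # different string, different id
```

Because the UUID is a pure function of the string, two groups holding the same name string arrive at the same identifier without ever exchanging data; the flip side is that any difference in the string, even whitespace, yields a different identifier.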
People can then link their data set to the GNI with the following URI:
http://gni.globalnames.org/name_strings/4ef223c4-0c3e-5e84-ace9-755c34c601ec
RDF: http://gni.globalnames.org/name_strings/4ef223c4-0c3e-5e84-ace9-755c34c601ec.rdf
If you think of your data set as one table and the GNI as another, this URI serves as the foreign key that connects them together.
Some on the list don't like how these look, but there is a tremendous advantage in not having to worry about syncing two large data sets and determining if a given integer is already in use.
Also, Rod Page has written recently about UUIDs: http://iphylo.blogspot.com/2011/05/zoobank-on-couchdb-uuids-replication.html
There may be a way to do something similar with bit.ly-like identifiers that are shorter (mCcSp), but I think the general idea is a good one.
If you recall from my talk at TDWG, I was able to use these to make statements that one namestring was a synonym etc. of another etc.
The algorithm we use is written in Ruby, but it could be ported to many different languages since UUIDs are widely supported.
Respectfully,
- Pete
On Thu, Jun 2, 2011 at 11:41 PM, Steven J. Baskauf < steve.baskauf@vanderbilt.edu> wrote:
My email access has been sporadic since this thread developed, so at this point I'll respond to points made in several of the messages.
First, I should note that there has been previous discussion on this list on a similar topic, from http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002231.html through http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002231.html.
One can review what was said at that time rather quickly by starting on the first linked message and clicking on the "Next Message" link until you get to the end of the range I gave above.
My reason for the request for information that started this thread was that I wanted to link to a URI that would anchor the name portion of a name/sensu pair (TNU or Taxon Concept a la TCS if you prefer) as in this RDF snippet:
<tc:nameString>Quercus rubra L.</tc:nameString>
<tc:hasName rdf:about="http://www.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank:448439"/>
At this point in the discussion, I'm not actually talking about creating a link to a taxon concept but rather to a taxon name, so some of the issues Pete raised don't apply here (e.g. what's the "right" name for a concept; the question here is simply what's a stable identifier for the name). In principle, I could probably just provide the name string and be done with it. However, having some degree of faith that Smart, Computer Savvy People might some day be able to use the metadata returned by the URI (or perhaps metadata which they already have in a triple store onsite) to do cool things like knowing that my name is the same as an orthographic variant or that "Quercus rubra L." is basically the same thing as "Quercus rubra", I would like to also provide a functional URI.
As an end-user who isn't very interested in the technical issues involving names, I don't really care what URI I use. I would prefer for it to be widely recognized and for it to "work" (i.e. be resolvable). In the earlier (January) thread, there was discussion about existing identifiers. There were a number of posts, but in particular http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002258.html and http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002259.html discussed the relative merits of ITIS and uBio ID numbers. My take-home message from this was that uBio represented the largest single set of names with assigned identifiers (see http://gni.globalnames.org/data_sources cited in Pete's email) and that uBio metadata provides useful references. Hence my interest in referencing uBio IDs as a URI. However, as a practical matter, the organizations that I share images with either want ITIS TSNs (EOL and Morphbank) or just names (Discover Life). Nobody is asking for uBio identifiers or any other identifier.
I found Kevin's comment at http://lists.tdwg.org/pipermail/tdwg-content/2011-May/002486.html very thought-provoking: "My thoughts are that the most likely way this will be solved is by standard market type pressures - ie the best solution/IDs will be used the most and 'float' to the top." I'm not going to make a judgment about what is the "best" solution or ID. But I would say that in "computer" history, being the "best" doesn't necessarily mean that something will be used. Take, for example, the FOAF vocabulary. What the heck is Friend of a Friend? I would venture to say that most of the people using the FOAF vocabulary don't know or care. The FOAF vocabulary was the one that people started to use, and once that happened, people didn't switch even if there was something better. I'm not familiar with the history of other stuff like YouTube and Craigslist, but I would guess that they weren't necessarily "the best" systems - they were just the ones that the most people started using first, and once that happened, people didn't switch. I'm using ITIS IDs because they are easy to get and the people I communicate with want them. Whether they are the "best" or "done correctly" doesn't matter to me as much as the fact that they are widely recognized and stable (and that thus far every name that I've looked for has been in their database).
I think that one reason why this question has been on my mind is that I've been waiting for GNUB (Global Name Use Bank) to come out. I'm not really up on how it is going to work, but my impression is that it was going to be based on the Global Name Index (GNI) which was mentioned in that earlier January thread. At that point, the GNI names didn't have any identifiers that were exposed to the public as permanent GUIDs. I'm assuming that if GNUB refers to GNI names, they will have some kind of identifiers. So if that happens how is the GUID recommendation 8 going to be followed? As Kevin said in http://lists.tdwg.org/pipermail/tdwg-content/2011-June/002499.html "What I take from recommendation 8 of the GUID applicability guide ... is that if you DON'T already have a record in your own database for a taxon name/concept, then reuse an existing one. " What we have here with GNI is a situation where none of the records have identifiers. In my mind, the "best practice" according to recommendation 8 would be for the GNI to reuse existing identifiers where they exist and NOT make up new ones. This is a bit more complicated because the ITIS identifiers (which are in common use) don't have an http URI version that is resolvable, and while the uBio identifiers have a resolvable http URI, it's in the form of a proxied LSID, which I've already complained is very ugly. So I'd like to hear some ideas about how to have "reused" identifiers in the GNI.
One thing that comes to my mind would be to have a "domain name" like "http://purl.org/gni/" or "http://purl.org/tn/" ("tn" for "taxon name") and to follow it with a namespace/id combination similar to what is done with LSIDs. So for example "itis/19408" and "ubio/448439" could be appended, creating http://purl.org/gni/itis/19408 and http://purl.org/gni/ubio/448439 for "Quercus rubra L." Both URIs could point to the same RDF and that RDF could indicate that the two identifiers are owl:sameAs. I realize from what Bob Morris has cautioned in the past that there are problems with owl:sameAs when the two things aren't actually the same thing (e.g. if the uBio ID refers to a name string only but the ITIS TSN refers to the name plus an "accepted" status and a relationship to parent taxa). However, if there were an understanding that the GNI only refers to name strings, then one could still refer to http://purl.org/gni/itis/19408 as an identifier for the name string of the thing (whatever it is) that is referred to by an ITIS TSN of 19408. I don't think there would be a problem saying that it and the uBio ID were "owl:sameAs". Some kind of solution like this would allow people to easily generate a resolvable URI for a name if they were using ITIS TSNs or uBio IDs. If the name that one wanted to use was so obscure that it was one of the 9.5 million names that uBio has that ITIS doesn't have, then that name would only have the uBio version. I have no idea whether this would be a good idea or not, but I was really cringing to think about 19 million newly minted UUIDs appended to "http://gni.globalnames.org/" and figuring out how to connect those horrid things to the names and ITIS TSNs that I'm already using.
I think that I said this before, but using the purl.org domain rather than one like http://gni.globalnames.org/ would in the future allow somebody else to take over management of providing the metadata when the GUIDs are resolved without having to deal with issues of who "owns" the domain name.
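To make the namespace/id idea concrete, here is a minimal sketch of how such URIs and the owl:sameAs statements linking them might be generated. The base http://purl.org/gni/ is only the domain proposed above, not a registered PURL, and the helper names name_uri and same_as_triples are invented for illustration.

```python
# Base URI proposed in the message above; it is a suggestion, not a live service.
BASE = "http://purl.org/gni/"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def name_uri(namespace: str, local_id: str) -> str:
    """Build a URI of the form <base><namespace>/<id>, e.g. .../itis/19408."""
    return f"{BASE}{namespace}/{local_id}"


def same_as_triples(uris):
    """Return N-Triples lines asserting the first URI owl:sameAs each other URI."""
    first, *rest = uris
    return [f"<{first}> <{OWL_SAME_AS}> <{other}> ." for other in rest]


itis = name_uri("itis", "19408")    # ITIS TSN quoted above for Quercus rubra L.
ubio = name_uri("ubio", "448439")   # uBio NamebankID quoted above for the same name

for line in same_as_triples([itis, ubio]):
    print(line)
```

Whether owl:sameAs is the right predicate depends on the caveat above: it only holds if both URIs are understood to identify the name string and nothing more.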
Steve
Kevin Richards wrote:
Pete,
I’m not trying to say what you are doing is a waste of time/impossible. I actually think RDF + semantics are a good way forward, but this really implies that we need to rely on the semantics and linkages rather than having a SINGLE ID for a taxon name. (which is what I thought Steve was getting at). Each instance of a taxon name can have its own ID and then all these instances are connected via ontology defined semantic links. This seems more appropriate to me than insisting everyone uses the “Global Taxon Name ID X”.
In your example of *Aedes triseriatus* and *Ochlerotatus triseriatus* – these are two different names so they need two different IDs, they may be linked by a single taxon concept, but they are separate names. So which of these now 3 IDs do you expect people to use, and according to what source??
For example, if we have a name, e.g. the Robin, Erithacus rubecula, mentioned in ITIS (TSN: 559964) and also in EOL (www.eol.org/pages/1051567), also in GBIF (http://data.gbif.org/species/21266780), also in Avibase (http://avibase.bsc-eoc.org/species.jsp?avibaseid=C809B2B90399A43D), which ID are you hoping people will use?? Would you put the ITIS ID in your own dataset as the ID for that name - unlikely. Or would it be better to link them up with semantic linkages?
What I take from recommendation 8 of the GUID applicability guide (as Steve puts it, "stop making up new identifiers when somebody else already has one for the thing you are talking about") is that if you DON'T already have a record in your own database for a taxon name/concept, then reuse an existing one. NOT ditch all your current IDs and adopt someone else's (especially hard considering how difficult it is to work out which of the multitude of name and concept IDs directly relates to your taxon name).
I am all for limiting the number of IDs for the “same” thing, but in some cases it is more useful to build linkages than force this tight integration of data and IDs. Especially for taxon names and concepts, where it is complex to define if you are even talking about the “same” thing or not.
Kevin
*From:* Peter DeVries [mailto:pete.devries@gmail.com]
*Sent:* Wednesday, 1 June 2011 12:38 p.m.
*To:* Kevin Richards
*Cc:* Steve Baskauf; tdwg-content@lists.tdwg.org; Gerald Guala; Nicolson, David; Alan J Hampson; Orrell, Thomas
*Subject:* Re: [tdwg-content] ITIS TSNID to uBio NamebankIDs mapping
Hi Kevin,
I forgot to mention some other things that are different about my project.
You can write a simple SPARQL query to get a list of all the TaxonConcepts that have ITIS IDs, or all those that have ITIS and NCBI IDs, etc.
You can do this on any SPARQL endpoint that hosts the data.
You can download the entire data set and run the queries on your own endpoint.
You can write a script that runs the query and downloads the ITIS numbers and exports them to CSV etc.
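As a rough illustration of the script idea, the sketch below builds a query and converts a SPARQL JSON results document into CSV. The predicate URI is a placeholder, since the actual TaxonConcept vocabulary term for ITIS IDs is not given in this thread, and the sample result is made up so the example runs without a network connection.

```python
import csv
import io

# Hypothetical predicate: the real TaxonConcept term for an ITIS TSN may differ.
ITIS_PREDICATE = "http://example.org/vocab#itisTsn"


def build_query(predicate: str = ITIS_PREDICATE) -> str:
    """Return a SPARQL query listing every concept that has the given predicate."""
    return f"SELECT ?concept ?tsn WHERE {{ ?concept <{predicate}> ?tsn }}"


def bindings_to_csv(results: dict) -> str:
    """Convert a document in the SPARQL JSON results format to CSV text."""
    variables = results["head"]["vars"]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(variables)
    for row in results["results"]["bindings"]:
        # Unbound variables are simply absent from a binding row.
        writer.writerow([row.get(v, {}).get("value", "") for v in variables])
    return out.getvalue()


# Made-up response standing in for what an endpoint would return.
sample = {
    "head": {"vars": ["concept", "tsn"]},
    "results": {"bindings": [
        {"concept": {"type": "uri", "value": "http://example.org/concept/1"},
         "tsn": {"type": "literal", "value": "19408"}},
    ]},
}

print(build_query())
print(bindings_to_csv(sample))
```

In practice the query string would be sent to whichever SPARQL endpoint hosts the data, with JSON results requested; only the conversion step is shown here.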
- Pete
On Tue, May 31, 2011 at 5:16 PM, Peter DeVries pete.devries@gmail.com wrote:
Hi Kevin,
On Tue, May 31, 2011 at 3:27 PM, Kevin Richards < RichardsK@landcareresearch.co.nz> wrote:
This is exactly why this problem still exists and will be very complex to solve - everyone says "we should have a single ID for a specific taxon name, there seem to be several IDs 'out there' that refer to the same taxon name, so I'm going to create another ID to link them all up" - yet another ID that no one will particularly want to follow - you would have to get everyone to agree that your combination/integration of taxon names is the best one and hope everyone follows it - unlikely in this domain.
Isn't this kind of what The Plant List and eBird already do?
A difference being that they tie these to a specific name and specific classification.
The Plant List is not really even open, so it is difficult for people to adopt it en masse.
For instance, if I manage a herbarium, how do I easily reconcile my species list with the entities represented in the Plant List?
eBird has millions of records which implies that they have been able to convince the observers in the field to adopt their system. You are correct in that there are probably a lot of taxonomists that don't like their list.
It differs from many of the other classifications, but remember the system rewards them for not agreeing. Note the difference between the microbial taxonomists and other taxonomists. In the case of the microbial workers, the system rewards them for solving problems, not debating alternatives. Also, if a good idea comes out that will make it easier for the microbiologists to solve the problems they are rewarded for solving, they are less likely to care whose idea it is.
Like the microbiologists, there are lots of biologists that work with species with the goal of addressing some non-taxonomic problem.
They don't really care if the name is *Aedes triseriatus* or *Ochlerotatus triseriatus*, but they do care that the identifier that they connect their data to is stable.
In regards to the issue of market forces, I suspect (but have no knowledge of this) that there were probably decisions made in devising these lists that have more to do with appeasing certain personalities than with creating the best list. With the way this system rewards people, it is likely that the "correct" version will float to the top only after that person has passed away. I don't have much faith that the best system will always float to the top. That has a lot to do with the personalities and how the system rewards are set up. Theoretically, it is possible for one strong personality or group to force others to adopt their less than optimal solution - at least this seems to happen in other environments.
Also, there are all sorts of ways that people can use the publication record to rewrite history. Simply cite the review paper that cites the original paper. Or don't cite it at all.
I would have used only the ITIS TSN but if the name changes the ID changes. This isn't "wrong", it just does not solve my problem.
- ITIS also should add the spiders from the World Spider Catalog.
Another issue that I think has inhibited adoption of a common list is that people can't agree on a particular name or a particular classification.
Since you can model a species concept as having many names and many classifications why not do so?
If this idea had been accepted originally, I would not have needed to create TaxonConcept.org.
My plan has always been to get something that works to solve some problems and then let some larger group take it over.
In a sense, I am more like the microbiologists in that I am not being paid to solve this or debate this problem.
I am doing it because I think something like this is needed, and it is an interesting and personally rewarding puzzle.
- Pete
My thoughts are that the most likely way this will be solved is by standard market type pressures - i.e. the best solution/IDs will be used the most and "float" to the top. It is easy to say that the global taxon name data is a mess, but if you think about it, 30 years ago taxon name data were very disparate, duplicated, unconnected, many with NO IDs at all. So I believe we are making progress and that we will continue to do so, albeit at a fairly slow rate.
Kevin
"I agree. This was one of the reasons that I setup TaxonConcept the way I did. It attempts to connect both the LOD entities and the foreign key based entities."
Please consider the environment before printing this email. Warning: This electronic message together with any attachments is confidential. If you receive it in error: (i) you must not read, use, disclose, copy or retain it; (ii) please contact the sender immediately by reply email and then delete the emails. The views expressed in this email may not be those of Landcare Research New Zealand Limited. http://www.landcareresearch.co.nz
--
Pete DeVries
Department of Entomology, University of Wisconsin - Madison
445 Russell Laboratories, 1630 Linden Drive, Madison, WI 53706
Email: pdevries@wisc.edu
TaxonConcept http://www.taxonconcept.org/ & GeoSpecies http://about.geospecies.org/ Knowledge Bases
A Semantic Web, Linked Open Data http://linkeddata.org/ Project
-- Steven J. Baskauf, Ph.D., Senior Lecturer Vanderbilt University Dept. of Biological Sciences
postal mail address: VU Station B 351634 Nashville, TN 37235-1634, U.S.A.
delivery address: 2125 Stevenson Center 1161 21st Ave., S. Nashville, TN 37235
office: 2128 Stevenson Center phone: (615) 343-4582, fax: (615) 343-6707
http://bioimages.vanderbilt.edu
tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content
David Remsen, Senior Programme Officer Electronic Catalog of Names of Known Organisms Global Biodiversity Information Facility Secretariat Universitetsparken 15, DK-2100 Copenhagen, Denmark Tel: +45-35321472 Fax: +45-35321480 Skype: dremsen
Hi Rich,
In a way the GNI URIs are really only "exposed" to those who are talking about them.
Note that in this Knowledge Base view of another butterfly, *Papilio canadensis*, the GNI URI is replaced in the human view with the skos:prefLabel (rdfs:label will do the same thing).
The address of the view is simply http://lsd.taxonconcept.org/describe/?url= followed by the URL-encoded species concept URI:
http://lsd.taxonconcept.org/describe/?url=http%3A%2F%2Flod.taxonconcept.org%... (short link: http://bit.ly/giZd7t)
The RDF that makes this possible is shown below.
<txn:SpeciesNameString rdf:about="http://gni.globalnames.org/name_strings/9bead295-36c9-58d0-b3bc-554264ee8908">
  <skos:prefLabel>Pterourus canadensis</skos:prefLabel>
  <txn:speciesNameStringHasSpeciesTaxonConcept rdf:resource="http://lod.taxonconcept.org/ses/wbbPl#Species"/>
  <wdrs:describedby rdf:resource="http://lod.taxonconcept.org/ses/wbbPl.rdf"/>
</txn:SpeciesNameString>
If you click on one of these GNI linked names you are taken to a URL like
http://lsd.taxonconcept.org/describe/?url=http%3A%2F%2Fgni.globalnames.org%2... (short link: http://bit.ly/in5dGL)
Which is the same "describe prefix" followed by the GNI URI.
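The "describe prefix plus URL-encoded URI" construction can be sketched in a few lines. This is only an illustration of the pattern described above; the function name describe_url is made up, and the prefix is taken from the links in this message.

```python
from urllib.parse import quote

# Prefix of the Knowledge Base "describe" view, as quoted in this message.
DESCRIBE_PREFIX = "http://lsd.taxonconcept.org/describe/?url="


def describe_url(uri: str) -> str:
    """Append the percent-encoded URI to the describe prefix.

    safe="" forces ':', '/' and '#' to be encoded as well, so the whole
    URI travels as a single query-string value.
    """
    return DESCRIBE_PREFIX + quote(uri, safe="")


concept = "http://lod.taxonconcept.org/ses/wbbPl#Species"
name = "http://gni.globalnames.org/name_strings/9bead295-36c9-58d0-b3bc-554264ee8908"

print(describe_url(concept))
print(describe_url(name))
```

The same function covers both cases mentioned above: a species concept URI and a GNI name string URI.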
In a sense the KB is providing a "describe service" for human browsers of the URIs in the data set.
My point is that humans see the name string, but it is actually represented and processed by computers as a UUID-based URI.
This system creates a series of URIs that can be used to make statements about how various name strings relate to each other.
It will also allow you to make statements about how your GNUB entities relate to different namestrings.
Below I have attached a small screenshot of the part of the KB view that shows the GNI names.
Respectfully,
- Pete
I agree with Dima here (probably my technical background :-)).
Using a name string as an identifier breaks most of the rules of good identifiers.
Seems like the debate is really about what is exposed externally/publicly - by all means have a web service/url where you can put http://.../Aus bus, that does a 'search'/get of that name, but it should not be the ID for that name (or is it even the ID for that 'instance' of that name string?? - what if two people had two records about the same name string but treated them differently/had different properties etc for that name string? => results in bad conclusions, inferences).
Kevin
-----Original Message-----
From: tdwg-content-bounces@lists.tdwg.org [mailto:tdwg-content-bounces@lists.tdwg.org] On Behalf Of Dmitry Mozzherin
Sent: Saturday, 4 June 2011 2:34 a.m.
To: David Remsen (GBIF)
Cc: tdwg-content@lists.tdwg.org; Dmitry Mozzherin; Orrell, Thomas; Alan J Hampson; Nicolson, David; Gerald Guala
Subject: Re: [tdwg-content] ITIS TSNID to uBio NamebankIDs mapping
In my opinion UUIDs have a few advantages over strings --
1. It is a UUID, so it will work with UUID tools (current and future ones).
2. It is less ambiguous. For example, what is the difference between Betulа and Betula for your eyes? (One of them has a Cyrillic 'a'.)
3. Database-wise it is faster to search, because it is just a 128-bit number, while a name is at least a 245-byte varchar. This makes searching much faster, because in relational databases the size of the keys directly affects search speed.
4. UUID v. 5 (http://en.wikipedia.org/wiki/Universally_unique_identifier) allows a UUID to be generated algorithmically without looking it up in a database (no need for a network connection).
5. Links like http://gni.globalnames.org/name_strings/Danaus_plexippus_(Linnaeus_1758) might be ambiguous: I can think of several ways to represent the name string part in the URL, and they will all resolve to the same thing in GNI.
6. Unescaped Unicode characters in URLs containing literal name strings (people will forget to escape them) will depend on the implementation of the URL resolver.
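Points 2 and 5 are easy to demonstrate. The short sketch below, which is illustrative only and uses a made-up helper name, shows that two visually identical strings can differ by a single Cyrillic letter, and that one name string has more than one plausible URL spelling.

```python
import unicodedata
from urllib.parse import quote

latin = "Betula"
mixed = "Betul\u0430"  # final letter is CYRILLIC SMALL LETTER A, not Latin 'a'


def non_latin_letters(name: str):
    """List (character, Unicode name) pairs for letters outside the Latin script."""
    return [
        (ch, unicodedata.name(ch, "UNKNOWN"))
        for ch in name
        if ch.isalpha() and not unicodedata.name(ch, "").startswith("LATIN")
    ]


print(latin == mixed)            # the two strings are not equal
print(non_latin_letters(mixed))  # reveals the Cyrillic letter

# Two plausible URL spellings of the same name string (point 5):
name = "Danaus plexippus (Linnaeus 1758)"
underscored = name.replace(" ", "_")
escaped = quote(name, safe="")
print(underscored)
print(escaped)
```

A resolver that accepts literal names has to decide which of these spellings it treats as equivalent, which is the ambiguity raised in point 5.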
Having said this, links like http://gni.globalnames.org/name_strings/Danaus_plexippus_(Linnaeus_1758) are definitely attractive, and it is good to have them as another way to access a name! My personal preference would be not to use them as the main identifier, for reasons 1, 2, 3 and 5.
Dima
On Fri, Jun 3, 2011 at 7:59 AM, David Remsen (GBIF) dremsen@gbif.org wrote:
Why not use the name as the basis for the resolvable identifier instead of a UUID? Isn't there a 1:1 cardinality between the name and the UUID in the GNI? Doesn't that mean that
http://gni.globalnames.org/name_strings/4ef223c4-0c3e-5e84-ace9-755c34c601ec and http://gni.globalnames.org/name_strings/Danaus_plexippus_(Linnaeus_1758)
are equally unique? The latter is certainly more readable. In those cases where the namestring is a homonym like
http://gni.globalnames.org/name_strings/Oenanthe
couldn't you just return the addresses of the two globally unique forms of the name when you resolve it?
http://gni.globalnames.org/name_strings/Oenanthe_Smith_1899
http://gni.globalnames.org/name_strings/Oenanthe_Jones_1900
Wouldn't those be as globally unique and easier to read and adjust to? Or am I missing something. I always wanted to do that with ubio IDs after a back and forth with Gregor Hagedorn and wished we hadn't exposed those integers.
DR
Hi Steve,
I don't have time to go through this in detail, and I can't speak for the GNI, but I can tell you about how the GNI URIs work, at least for now.
A while back Dima Mozzherin and I were looking into how triples etc. might be of use to the GNI.
We needed a way to generate unique URIs for each name.
We wanted to avoid having to keep these in sync and not require everyone to look each ID up through some service.
Dima came up with the following plan. We use the name string as a seed to generate a unique UUID.
Basically this is a shared algorithm which the GNI and TaxonConcept both use. But it could be used by anyone.
You feed the name string to the algorithm and it spits out a UUID. We then append that to a URI and web service so it is resolvable.
So the name Danaus plexippus (Linnaeus 1758) => 4ef223c4-0c3e-5e84-ace9-755c34c601ec
So if the GNI and another group have the same name string, they have the same UUID.
People can then link their data set to the GNI with the following URI
http://gni.globalnames.org/name_strings/4ef223c4-0c3e-5e84-ace9-755c34c601ec
RDF: http://gni.globalnames.org/name_strings/4ef223c4-0c3e-5e84-ace9-755c34c601ec.rdf
If you think of your data set as one table and the GNI as another, this URI serves as the foreign key that connects them together.
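A minimal sketch of the idea, using Python's standard uuid module in place of the Ruby code mentioned in this thread. The namespace below (a version 5 UUID derived from the DNS name "globalnames.org") is an assumption, and the exact string normalization used by GNI is not stated here, so this sketch should not be expected to reproduce the UUID quoted above without checking both.

```python
import uuid

# Assumption: a namespace derived from the DNS name "globalnames.org".
# The namespace actually used by GNI is not given in this thread.
GN_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "globalnames.org")

BASE = "http://gni.globalnames.org/name_strings/"


def name_string_uuid(name: str) -> uuid.UUID:
    """Derive a version 5 (SHA-1, name-based) UUID from a name string."""
    return uuid.uuid5(GN_NAMESPACE, name)


def gni_uri(name: str) -> str:
    """Build a resolvable-looking URI by appending the UUID to the base."""
    return BASE + str(name_string_uuid(name))


name = "Danaus plexippus (Linnaeus 1758)"
print(name_string_uuid(name))
print(gni_uri(name))
```

Because the UUID is a pure function of the namespace and the name string, two groups running the same algorithm on the same string get the same identifier without contacting each other, which is the property described above.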
Some on the list don't like how these look, but there is a tremendous advantage in not having to worry about syncing two large data sets and determining if a given integer is already in use.
Also, Rod Page has written recently about UUIDs: http://iphylo.blogspot.com/2011/05/zoobank-on-couchdb-uuids-replication.html
There may be a way to do something similar with bit.ly-like identifiers that are shorter (mCcSp), but I think the general idea is a good one.
If you recall from my talk at TDWG, I was able to use these to make statements that one namestring was a synonym etc. of another etc.
The algorithm we use is written in Ruby, but it could be ported to many different languages since UUIDs are widely supported.
Respectfully,
- Pete
On Thu, Jun 2, 2011 at 11:41 PM, Steven J. Baskauf < steve.baskauf@vanderbilt.edu> wrote:
My email access has been sporadic since this thread developed, so at this point I'll respond to points made in several of the messages.
First, I should note that there has been previous discussion on this list on a similar topic from http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002231.html through http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002231.html. One can review what was said at that time rather quickly by starting on the first linked message and clicking on the "Next Message" link until you get to the end of the range I gave above.
My reason for the request for information that started this thread was that I wanted to link to a URI that would anchor the name portion of a name/sensu pair (TNU or Taxon Concept a la TCS if you prefer) as in this RDF snippet:
<tc:nameString>Quercus rubra L.</tc:nameString>
<tc:hasName rdf:about="http://www.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank:448439"/>
At this point in the discussion, I'm not actually talking about creating a link to a taxon concept but rather to a taxon name, so some of the issues Pete raised don't apply here (e.g. what's the "right" name for a concept; the question here is simply what's a stable identifier for the name). In principle, I could probably just provide the name string and be done with it. However, having some degree of faith that Smart, Computer Savvy People might some day be able to use the metadata returned by the URI (or perhaps metadata which they already have in a triple store onsite) to do cool things like knowing that my name is the same as an orthographic variant or that "Quercus rubra L." is basically the same thing as "Quercus rubra", I would like to also provide a functional URI.
As an end user who isn't very interested in the technical issues involving names, I don't really care what URI I use. I would prefer for it to be widely recognized and for it to "work" (i.e. be resolvable). In the earlier (January) thread, there was discussion about existing identifiers. There were a number of posts, but in particular http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002258.html and http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002259.html discuss the relative merits of ITIS and uBio ID numbers. My take-home message from this was that uBio represented the largest single set of names with assigned identifiers (see http://gni.globalnames.org/data_sources cited in Pete's email) and that uBio metadata provides useful references. Hence my interest in referencing uBio IDs as a URI. However, as a practical matter, the organizations that I share images with either want ITIS TSNs (EOL and Morphbank) or just names (Discover Life). Nobody is asking for uBio identifiers or any other identifier.
I found Kevin's comment at http://lists.tdwg.org/pipermail/tdwg-content/2011-May/002486.html very thought-provoking: "My thoughts are that the most likely way this will be solved is by standard market type pressures - ie the best solution/IDs will be used the most and 'float' to the top." I'm not going to make a judgment about what is the "best" solution or ID. But I would say that in "computer" history, being the "best" doesn't necessarily mean that something will be used. Take, for example, the FOAF vocabulary. What the heck is Friend of a Friend? I would venture to say that most of the people using the FOAF vocabulary don't know or care. The FOAF vocabulary was the one that people started to use, and once that happened, people didn't switch even if there was something better. I'm not familiar with the history of other stuff like YouTube and Craig's List, but I would guess that they weren't necessarily "the best" systems - they were just the ones that the most people started using first, and once that happened, people didn't switch. I'm using ITIS IDs because they are easy to get and the people I communicate with want them. Whether they are the "best" or "done correctly" doesn't matter to me as much as the fact that they are widely recognized and stable (and that thus far every name that I've looked for has been in their database).
I think that one reason why this question has been on my mind is that I've been waiting for GNUB (Global Name Use Bank) to come out. I'm not really up on how it is going to work, but my impression is that it was going to be based on the Global Name Index (GNI) which was mentioned in that earlier January thread. At that point, the GNI names didn't have any identifiers that were exposed to the public as permanent GUIDs. I'm assuming that if GNUB refers to GNI names, they will have some kind of identifiers. So if that happens how is the GUID recommendation 8 going to be followed? As Kevin said in http://lists.tdwg.org/pipermail/tdwg-content/2011-June/002499.html "What I take from recommendation 8 of the GUID applicability guide ... is that if you DON'T already have a record in your own database for a taxon name/concept, then reuse an existing one. " What we have here with GNI is a situation where none of the records have identifiers. In my mind, the "best practice" according to recommendation 8 would be for the GNI to reuse existing identifiers where they exist and NOT make up new ones. This is a bit more complicated because the ITIS identifiers (which are in common use) don't have an http URI version that is resolvable, and while the uBio identifiers have a resolvable http URI, it's in the form of a proxied LSID, which I've already complained is very ugly. So I'd like to hear some ideas about how to have "reused" identifiers in the GNI.
One thing that comes to my mind would be to have a "domain name" like http://purl.org/gni/ or http://purl.org/tn/ ("tn" for "taxon name") and to follow it with a namespace/id combination similar to what is done with LSIDs. So for example "itis/19408" and "ubio/448439" could be appended, creating http://purl.org/gni/itis/19408 and http://purl.org/gni/ubio/448439 for "Quercus rubra L." Both URIs could point to the same RDF, and that RDF could indicate that the two identifiers are owl:sameAs. I realize from what Bob Morris has cautioned in the past that there are problems with owl:sameAs when the two things aren't actually the same thing (e.g. if the uBio ID refers to a name string only but the ITIS TSN refers to the name plus an "accepted" status and a relationship to parent taxa). However, if there were an understanding that the GNI only refers to name strings, then one could still refer to http://purl.org/gni/itis/19408 as an identifier for the name string of the thing (whatever it is) that is referred to by an ITIS TSN of 19408. I don't think there would be a problem saying that it and the uBio ID were owl:sameAs. Some kind of solution like this would allow people to easily generate a resolvable URI for a name if they were using ITIS TSNs or uBio IDs. If the name that one wanted to use was so obscure that it was one of the 9.5 million names that uBio has that ITIS doesn't have, then that name would only have the uBio version. I have no idea whether this would be a good idea or not, but I was really cringing to think about 19 million newly minted UUIDs appended to http://gni.globalnames.org/ and figuring out how to connect those horrid things to the names and ITIS TSNs that I'm already using.
I think that I said this before, but using the purl.org domain rather than one like http://gni.globalnames.org/ would in the future allow somebody else to take over management of providing the metadata when the GUIDs are resolved without having to deal with issues of who "owns" the domain name.
Steve
Kevin Richards wrote:
Pete,
I’m not trying to say what you are doing is a waste of time/impossible. I actually think RDF + semantics are a good way forward, but this really implies that we need to rely on the semantics and linkages rather than having a SINGLE ID for a taxon name. (which is what I thought Steve was getting at). Each instance of a taxon name can have its own ID and then all these instances are connected via ontology defined semantic links. This seems more appropriate to me than insisting everyone uses the “Global Taxon Name ID X”.
In your example of *Aedes triseriatus* and *Ochlerotatus triseriatus* – these are two different names so they need two different IDs, they may be linked by a single taxon concept, but they are separate names. So which of these now 3 IDs do you expect people to use, and according to what source??
For example if we have a name, eg the Robin, Erithacus rubecula, mentioned in IT IS (TSN : 559964) and also in EOL (www.eol.org/pages/1051567), also in GBIF (http://data.gbif.org/species/21266780), also in avibase ( http://avibase.bsc-eoc.org/species.jsp?avibaseid=C809B2B90399A43D), which ID are you hoping people will use?? Would you put the IT IS ID in your own dataset as the ID for that name – unlikely. Or would it be better to link them up with semantic linkages.
What I take from recommendation 8 of the GUID applicability guide (as Steve puts is "stop making up new identifiers when somebody else already has one for the thing you are talking about”) is that if you DON’T already have a record in your own database for a taxon name/concept, then reuse an existing one. NOT ditch all your current IDs and adopt someone else’s (especially hard considering it is so hard to work out which if the multitude of names ad concept IDs that directly relates to your taxon name).
I am all for limiting the number of IDs for the “same” thing, but in some cases it is more useful to build linkages than force this tight integration of data and IDs. Especially for taxon names and concepts, where it is complex to define if you are even talking about the “same” thing or not.
Kevin
*From:* Peter DeVries [mailto:pete.devries@gmail.compete.devries@gmail.com]
*Sent:* Wednesday, 1 June 2011 12:38 p.m. *To:* Kevin Richards *Cc:* Steve Baskauf; tdwg-content@lists.tdwg.org; Gerald Guala; Nicolson, David; Alan J Hampson; Orrell, Thomas *Subject:* Re: [tdwg-content] ITIS TSNID to uBio NamebankIDs mapping
Hi Kevin,
I forgot one mention some other things that are different about my project.
You can write a simple SPARQL query to get a list of all the TaxonConcept's that have ITIS ids, or all those that have ITIS and NCBI ID's etc.
You can do this on any SPARQL endpoint that hosts the data.
You can download the entire data set and run the queries on your own endpoint.
You can write a script that runs the query and downloads the ITIS numbers and exports them to CSV etc.
- Pete
On Tue, May 31, 2011 at 5:16 PM, Peter DeVries pete.devries@gmail.com wrote:
Hi Kevin,
On Tue, May 31, 2011 at 3:27 PM, Kevin Richards < RichardsK@landcareresearch.co.nz> wrote:
This is exactly why this problem still exists and will be very complex to solve - everyone says "we should have a single ID for a specific taxon name, there seems to be several IDs 'out there' that refer to the same taxon name, so Im going to create another ID to link them all up" - yet another ID that no one will particularly want to follow - you would have to get everyone to agree that your combinations/integration of taxon names is the best one and hope everyone follows it - unlikely in this domain.
Isn't this kind of what the The Plant List, and eBird already do?
A difference being that they tie these to a specific name and specific classification.
The Plant list is not really even open so it is difficult to people to adopt it in mass.
For instance, if I manage a herbarium, how do I easily reconcile my species list with the entities represented in the Plant List?
eBird has millions of records which implies that they have been able to convince the observers in the field to adopt their system. You are correct in that there are probably a lot of taxonomists that don't like their list.
It differs from many of the other classifications, but remember the system rewards them for not agreeing. Note the difference between the microbial taxonomists and other taxonomists. In the case of the microbial
workers, the system rewards them for solving problems not debating alternatives. Also, if a good idea comes out that will make it easier for the microbiologists to solve the problems they are rewarded for solving, they are less likely to care whose idea it is.
Like the microbiologists, there are lots of biologists that work with species with the goal of addressing some non-taxonomic problem.
They don't really care if the name is *Aedes triseriatus* or *Ochlerotatus triseriatus, *but they do care that the identifier that they connect their data to is stable.
In regards to the issue of market forces,I suspect (but have no knowledge of) that there were probably decisions made in devising these lists that have more to do with appeasing certain personalities that creating best list. With the way this system rewards people it is likely that the "correct" version will float to the top only after that person has passed away. I don't have much faith that the best system will always float to the top, That has a lot to do with the personalities and how the system rewards are setup. Theoretically, it is possible for one strong personality or group to force others to adopt their less than optimal solution - at least this seems to happen in other environments.
Also, there are all sorts of ways that people can use the publication record to rewrite history. Simply cite the review paper that cites the original paper. Or don't cite it at all.
I would have used only the ITIS TSN but if the name changes the ID changes. This isn't "wrong", it just does not solve my problem.
- ITIS also should add the spiders from the World Spider Catalog.
Another issue that I think has inhibited adoption of a common list is that people can't agree on a particular name or a particular classification.
Since you can model a species concept as having many names and many classifications why not do so?
If this idea was originally accepted, I would not have needed to create TaxonConcept.org.
My plan has always been to get something that works to solve some problems and then let some larger group take it over.
In a sense, I am more like the microbiologists in that I am not being paid to solve this or debate this problem.
I am doing it because I think something like this is needed, and it is an interesting and personally rewarding puzzle.
- Pete
My thoughts are that the most likely way this will be solved is by standard market type pressures - ie the best solution/IDs will be used the most and "float" to the top. It is easy to say that the global taxon name data is a mess, but if you think about it, 30 years ago taxon name data were very disparate, duplicated, unconnected, many with NO IDs at all. So I believe we are making progress and that we will continue to do so, albeit at a fairly slow rate.
Kevin
"I agree. This was one of the reasons that I setup TaxonConcept the way I did. It attempts to connect both the LOD entities and the foreign key based entities."
Please consider the environment before printing this email Warning: This electronic message together with any attachments is confidential. If you receive it in error: (i) you must not read, use, disclose, copy or retain it; (ii) please contact the sender immediately by reply email and then delete the emails. The views expressed in this email may not be those of Landcare Research New Zealand Limited. http://www.landcareresearch.co.nz
--
Pete DeVries Department of Entomology University of Wisconsin - Madison 445 Russell Laboratories 1630 Linden Drive Madison, WI 53706 Email: pdevries@wisc.edu TaxonConcept http://www.taxonconcept.org/ & GeoSpecies http://about.geospecies.org/ Knowledge Bases A Semantic Web, Linked Open Data http://linkeddata.org/ Project
-- Steven J. Baskauf, Ph.D., Senior Lecturer Vanderbilt University Dept. of Biological Sciences
postal mail address: VU Station B 351634 Nashville, TN 37235-1634, U.S.A.
delivery address: 2125 Stevenson Center 1161 21st Ave., S. Nashville, TN 37235
office: 2128 Stevenson Center phone: (615) 343-4582, fax: (615) 343-6707 http://bioimages.vanderbilt.edu
tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content
David Remsen, Senior Programme Officer Electronic Catalog of Names of Known Organisms Global Biodiversity Information Facility Secretariat Universitetsparken 15, DK-2100 Copenhagen, Denmark Tel: +45-35321472 Fax: +45-35321480 Skype: dremsen
+1
I suspect that for many uses (e.g., linking names in a publication to taxonomic databases) the name will be the obvious, and only practical, way to make the links. We "just" need a little secret sauce at the other end to handle the same name string mapping to more than one taxonomic name, and variations on the same name. Wikipedia has done pretty well using just names as identifiers.
UUIDs work well for decentralised minting of identifiers; having a seeded UUID service pretty much defeats the point of UUIDs.
Regards
Rod
On 3 Jun 2011, at 12:59, David Remsen (GBIF) wrote:
Why not use the name as the basis for the resolvable identifier instead of a UUID? Isn't there a 1:1 cardinality between the name and the UUID in the GNI? Doesn't that mean that
http://gni.globalnames.org/name_strings/4ef223c4-0c3e-5e84-ace9-755c34c601ec and http://gni.globalnames.org/name_strings/Danaus_plexippus_(Linnaeus_1758)
are equally unique? The latter is certainly more readable. In those cases where the namestring is a homonym like
http://gni.globalnames.org/name_strings/Oenanthe
couldn't you just return the addresses of the two globally unique forms of the name when you resolve it?
http://gni.globalnames.org/name_strings/Oenanthe_Smith_1899
http://gni.globalnames.org/name_strings/Oenanthe_Jones_1900
Wouldn't those be as globally unique and easier to read and adjust to? Or am I missing something. I always wanted to do that with ubio IDs after a back and forth with Gregor Hagedorn and wished we hadn't exposed those integers.
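A minimal sketch of the resolution behaviour suggested above, in Python. The index contents and the `resolve` function are hypothetical; "Smith 1899" and "Jones 1900" are the placeholder authorships from this message, not real records, and nothing here reflects how the GNI actually resolves names.

```python
BASE_URI = "http://gni.globalnames.org/name_strings/"

# Hypothetical index: bare name string -> globally unique forms of the name.
UNIQUE_FORMS = {
    "Oenanthe": ["Oenanthe_Smith_1899", "Oenanthe_Jones_1900"],
    "Danaus_plexippus": ["Danaus_plexippus_(Linnaeus_1758)"],
}


def resolve(name_slug: str) -> list[str]:
    """Return the URI of every unique form the slug could refer to.

    A homonym yields several URIs; an already unique form yields itself;
    an unknown slug yields an empty list.
    """
    forms = UNIQUE_FORMS.get(name_slug)
    if forms is None:
        known = {form for group in UNIQUE_FORMS.values() for form in group}
        forms = [name_slug] if name_slug in known else []
    return [BASE_URI + form for form in forms]
```

Resolving `"Oenanthe"` returns both addresses, so the client (or the reader) picks the intended one.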
DR
Hi Steve,
I don't have time to go through this in detail, and I can't speak for the GNI, but I can tell you about how the GNI URI's work at least for now.
A while back Dima Mozzherin and I were looking into how triples etc. might be of use to the GNI.
We needed a way to generate unique URI's for each name.
We wanted to avoid having to keep these in sync and not require everyone to look each ID up through some service.
Dima came up with the following plan. We use the namestring as a seed to generate a unique UUID.
Basically this is a shared algorithm which the GNI and TaxonConcept both use. But it could be used by anyone.
You feed the name string to the algorithm and it spits out a UUID. We then append that to a URI and web service so it is resolvable.
So the name Danaus plexippus (Linnaeus 1758) => 4ef223c4-0c3e-5e84-ace9-755c34c601ec
So if the GNI and another group have the same namestring, they have the same UUID.
People can then link their data set to the GNI with the following URI
http://gni.globalnames.org/name_strings/4ef223c4-0c3e-5e84-ace9-755c34c601ec
RDF: http://gni.globalnames.org/name_strings/4ef223c4-0c3e-5e84-ace9-755c34c601ec.rdf
If you think of your data set as one table and the GNI as another, this URI serves as the foreign key that connects them together.
Some on the list don't like how these look, but there is a tremendous advantage in not having to worry about syncing two large data sets and determining if a given integer is already in use.
Also, Rod Page has written recently about UUIDs: http://iphylo.blogspot.com/2011/05/zoobank-on-couchdb-uuids-replication.html
There may be a way to do something similar with bit.ly-like identifiers that are shorter (mCcSp), but I think the general idea is a good one.
If you recall from my talk at TDWG, I was able to use these to make statements that one namestring was a synonym etc. of another etc.
The algorithm we use is written in Ruby, but it could be ported to many different languages since UUIDs are widely supported.
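As a rough illustration of the seeded-UUID idea in another language, here is a minimal Python sketch. It assumes the algorithm is a name-based version 5 UUID (the example identifier above carries the version digit 5), and it assumes a namespace derived from the DNS name "globalnames.org". The message does not state the namespace, so that value is a guess and the output is not guaranteed to match the GNI's identifiers.

```python
import uuid

# Assumption: namespace derived from the "globalnames.org" DNS name.
# The thread does not specify the real namespace used by the GNI.
GN_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "globalnames.org")

BASE_URI = "http://gni.globalnames.org/name_strings/"


def name_string_uuid(name_string: str) -> uuid.UUID:
    """Derive a deterministic (version 5) UUID from a name string."""
    return uuid.uuid5(GN_NAMESPACE, name_string)


def name_string_uri(name_string: str) -> str:
    """Append the derived UUID to the base URI so it can be resolved."""
    return BASE_URI + str(name_string_uuid(name_string))
```

Because the UUID depends only on the namespace and the name string, two groups that run this on the same string get the same identifier without consulting each other, which is the "foreign key" property described above.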
Respectfully,
- Pete
On Thu, Jun 2, 2011 at 11:41 PM, Steven J. Baskauf < steve.baskauf@vanderbilt.edu> wrote:
My email access has been sporadic since this thread developed, so at this point I'll respond to points made in several of the messages.
First, I should note that there has been previous discussion on this list on a similar topic from http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002231.html through http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002231.html. One can review what was said at that time rather quickly by starting on the first linked message and clicking on the "Next Message" link until you get to the end of the range I gave above.
My reason for the request for information that started this thread was that I wanted to link to a URI that would anchor the name portion of a name/sensu pair (TNU or Taxon Concept a la TCS if you prefer) as in this RDF snippet:
<tc:nameString>Quercus rubra L.</tc:nameString> <tc:hasName rdf:about="http://www.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank:448439"/>
At this point in the discussion, I'm not actually talking about creating a link to a taxon concept but rather to a taxon name, so some of the issues Pete raised don't apply here (e.g. what's the "right" name for a concept; the question here is simply what's a stable identifier for the name). In principle, I could probably just provide the name string and be done with it. However, having some degree of faith that Smart, Computer Savvy People might some day be able to use the metadata returned by the URI (or perhaps metadata which they already have in a triple store onsite) to do cool things like knowing that my name is the same as an orthographic variant or that "Quercus rubra L." is basically the same thing as "Quercus rubra", I would like to also provide a functional URI.
As an end-user who isn't very interested in the technical issues involving names, I don't really care what URI I use. I would prefer for it to be widely recognized and for it to "work" (i.e. be resolvable). In the earlier (January) thread, there was discussion about existing identifiers. There were a number of posts, but in particular http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002258.html and http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002259.html discuss the relative merits of ITIS and uBio ID numbers. My take-home message from this was that uBio represented the largest single set of names with assigned identifiers (see http://gni.globalnames.org/data_sources cited in Pete's email) and that uBio metadata provides useful references. Hence my interest in referencing uBio IDs as a URI. However, as a practical matter, the organizations that I share images with either want ITIS TSNs (EOL and Morphbank) or just names (Discover Life). Nobody is asking for uBio identifiers or any other identifier.
I found Kevin's comment at http://lists.tdwg.org/pipermail/tdwg-content/2011-May/002486.html very thought-provoking: "My thoughts are that the most likely way this will be solved is by standard market type pressures - ie the best solution/IDs will be used the most and 'float' to the top." I'm not going to make a judgment about what is the "best" solution or ID. But I would say that in "computer" history, being the "best" doesn't necessarily mean that something will be used. Take, for example, the FOAF vocabulary. What the heck is Friend of a Friend? I would venture to say that most of the people using the FOAF vocabulary don't know or care. The FOAF vocabulary was the one that people started to use, and once that happened, people didn't switch even if there was something better. I'm not familiar with the history of other stuff like YouTube and Craig's List, but I would guess that they weren't necessarily "the best" systems - they were just the ones that the most people started using first, and once that happened, people didn't switch. I'm using ITIS IDs because they are easy to get and the people I communicate with want them. Whether they are the "best" or "done correctly" doesn't matter to me as much as the fact that they are widely recognized and stable (and that thus far every name that I've looked for has been in their database).
I think that one reason why this question has been on my mind is that I've been waiting for GNUB (Global Name Use Bank) to come out. I'm not really up on how it is going to work, but my impression is that it was going to be based on the Global Name Index (GNI) which was mentioned in that earlier January thread. At that point, the GNI names didn't have any identifiers that were exposed to the public as permanent GUIDs. I'm assuming that if GNUB refers to GNI names, they will have some kind of identifiers. So if that happens how is the GUID recommendation 8 going to be followed? As Kevin said in http://lists.tdwg.org/pipermail/tdwg-content/2011-June/002499.html "What I take from recommendation 8 of the GUID applicability guide ... is that if you DON'T already have a record in your own database for a taxon name/concept, then reuse an existing one. " What we have here with GNI is a situation where none of the records have identifiers. In my mind, the "best practice" according to recommendation 8 would be for the GNI to reuse existing identifiers where they exist and NOT make up new ones. This is a bit more complicated because the ITIS identifiers (which are in common use) don't have an http URI version that is resolvable, and while the uBio identifiers have a resolvable http URI, it's in the form of a proxied LSID, which I've already complained is very ugly. So I'd like to hear some ideas about how to have "reused" identifiers in the GNI.
One thing that comes to my mind would be to have a "domain name" like http://purl.org/gni/ or http://purl.org/tn/ ("tn" for "taxon name") and to follow it with a namespace/id combination similar to what is done with LSIDs. So for example "itis/19408" and "ubio/448439" could be appended, creating http://purl.org/gni/itis/19408 and http://purl.org/gni/ubio/448439 for "Quercus rubra L." Both URIs could point to the same RDF and that RDF could indicate that the two identifiers are owl:sameAs. I realize from what Bob Morris has cautioned in the past that there are problems with owl:sameAs when the two things aren't actually the same thing (e.g. if the uBio ID refers to a name string only but the ITIS TSN refers to the name plus an "accepted" status and a relationship to parent taxa). However, if there were an understanding that the GNI only refers to name strings, then one could still refer to http://purl.org/gni/itis/19408 as an identifier for the name string of the thing (whatever it is) that is referred to by an ITIS TSN of 19408. I don't think there would be a problem saying that it and the uBio ID were "owl:sameAs". Some kind of solution like this would allow people to easily generate a resolvable URI for a name if they were using ITIS TSNs or uBio IDs. If the name that one wanted to use was so obscure that it was one of the 9.5 million names that uBio has that ITIS doesn't have, then that name would only have the uBio version. I have no idea whether this would be a good idea or not, but I was really cringing to think about 19 million newly minted UUIDs appended to http://gni.globalnames.org/ and figuring out how to connect those horrid things to the names and ITIS TSNs that I'm already using.
I think that I said this before, but using the purl.org domain rather than one like http://gni.globalnames.org/ would in the future allow somebody else to take over management of providing the metadata when the GUIDs are resolved without having to deal with issues of who "owns" the domain name.
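To make the namespace/id proposal concrete, here is a minimal Python sketch that builds the two URIs and writes the owl:sameAs statement as an N-Triples line. The http://purl.org/gni/ prefix is only the suggestion made above, not an existing service, and the function names are made up for illustration.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Proposed prefix from the message above; not a live resolver.
PURL_BASE = "http://purl.org/gni/"


def name_uri(namespace: str, local_id: str) -> str:
    """Combine a source namespace and its local ID into one URI."""
    return f"{PURL_BASE}{namespace}/{local_id}"


def same_as_triple(uri_a: str, uri_b: str) -> str:
    """Serialize one owl:sameAs statement as an N-Triples line."""
    return f"<{uri_a}> <{OWL_SAME_AS}> <{uri_b}> ."


itis_uri = name_uri("itis", "19408")
ubio_uri = name_uri("ubio", "448439")
triple = same_as_triple(itis_uri, ubio_uri)
```

Anyone who already holds an ITIS TSN or a uBio ID could generate the URI locally, without a lookup; the owl:sameAs caveat about name strings versus richer records still applies.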
Steve
Kevin Richards wrote:
Pete,
I’m not trying to say what you are doing is a waste of time/impossible. I actually think RDF + semantics are a good way forward, but this really implies that we need to rely on the semantics and linkages rather than having a SINGLE ID for a taxon name. (which is what I thought Steve was getting at). Each instance of a taxon name can have its own ID and then all these instances are connected via ontology defined semantic links. This seems more appropriate to me than insisting everyone uses the “Global Taxon Name ID X”.
In your example of *Aedes triseriatus* and *Ochlerotatus triseriatus* – these are two different names so they need two different IDs, they may be linked by a single taxon concept, but they are separate names. So which of these now 3 IDs do you expect people to use, and according to what source??
For example if we have a name, eg the Robin, Erithacus rubecula, mentioned in ITIS (TSN: 559964) and also in EOL (www.eol.org/pages/1051567), also in GBIF (http://data.gbif.org/species/21266780), also in avibase (http://avibase.bsc-eoc.org/species.jsp?avibaseid=C809B2B90399A43D), which ID are you hoping people will use? Would you put the ITIS ID in your own dataset as the ID for that name? Unlikely. Or would it be better to link them up with semantic linkages?
What I take from recommendation 8 of the GUID applicability guide (as Steve puts it, "stop making up new identifiers when somebody else already has one for the thing you are talking about") is that if you DON’T already have a record in your own database for a taxon name/concept, then reuse an existing one. NOT ditch all your current IDs and adopt someone else’s (especially considering it is so hard to work out which of the multitude of name and concept IDs directly relates to your taxon name).
I am all for limiting the number of IDs for the “same” thing, but in some cases it is more useful to build linkages than force this tight integration of data and IDs. Especially for taxon names and concepts, where it is complex to define if you are even talking about the “same” thing or not.
Kevin
*From:* Peter DeVries [mailto:pete.devries@gmail.com]
*Sent:* Wednesday, 1 June 2011 12:38 p.m.
*To:* Kevin Richards
*Cc:* Steve Baskauf; tdwg-content@lists.tdwg.org; Gerald Guala; Nicolson, David; Alan J Hampson; Orrell, Thomas
*Subject:* Re: [tdwg-content] ITIS TSNID to uBio NamebankIDs mapping
Hi Kevin,
I forgot to mention some other things that are different about my project.
You can write a simple SPARQL query to get a list of all the TaxonConcepts that have ITIS IDs, or all those that have ITIS and NCBI IDs, etc.
You can do this on any SPARQL endpoint that hosts the data.
You can download the entire data set and run the queries on your own endpoint.
You can write a script that runs the query and downloads the ITIS numbers and exports them to CSV etc.
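The script idea in the list above could look roughly like this in Python. The endpoint URL and the predicate in the query are placeholders (neither is given in this message), so substitute the real ones from whichever endpoint hosts the data. The sketch sends the query as an HTTP GET, asks for SPARQL JSON results, and flattens the bindings to CSV.

```python
import csv
import io
import json
import urllib.parse
import urllib.request

# Placeholders: the real endpoint and predicate are not stated in the thread.
ENDPOINT = "http://example.org/sparql"
QUERY = """
SELECT ?concept ?itis WHERE {
  ?concept <http://example.org/ns#hasItisTsn> ?itis .
}
"""


def build_request_url(endpoint: str, query: str) -> str:
    """Encode the query as a GET request (SPARQL Protocol 'query' parameter)."""
    return endpoint + "?" + urllib.parse.urlencode({"query": query})


def fetch_results(endpoint: str, query: str) -> dict:
    """Run the query and return the parsed SPARQL JSON results."""
    request = urllib.request.Request(
        build_request_url(endpoint, query),
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def results_to_csv(results: dict) -> str:
    """Flatten SPARQL JSON bindings into CSV text, one row per solution."""
    columns = results["head"]["vars"]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(columns)
    for binding in results["results"]["bindings"]:
        # Unbound variables are written as empty cells.
        writer.writerow([binding.get(col, {}).get("value", "") for col in columns])
    return out.getvalue()
```

Against a real endpoint, `results_to_csv(fetch_results(ENDPOINT, QUERY))` would give the CSV text to save or load into a spreadsheet.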
- Pete
On Tue, May 31, 2011 at 5:16 PM, Peter DeVries pete.devries@gmail.com wrote:
Hi Kevin,
On Tue, May 31, 2011 at 3:27 PM, Kevin Richards < RichardsK@landcareresearch.co.nz> wrote:
This is exactly why this problem still exists and will be very complex to solve - everyone says "we should have a single ID for a specific taxon name, there seems to be several IDs 'out there' that refer to the same taxon name, so I'm going to create another ID to link them all up" - yet another ID that no one will particularly want to follow - you would have to get everyone to agree that your combinations/integration of taxon names is the best one and hope everyone follows it - unlikely in this domain.
Isn't this kind of what The Plant List and eBird already do?
--------------------------------------------------------- Roderic Page Professor of Taxonomy Institute of Biodiversity, Animal Health and Comparative Medicine College of Medical, Veterinary and Life Sciences Graham Kerr Building University of Glasgow Glasgow G12 8QQ, UK
Email: r.page@bio.gla.ac.uk Tel: +44 141 330 4778 Fax: +44 141 330 2792 AIM: rodpage1962@aim.com Facebook: http://www.facebook.com/profile.php?id=1112517192 Twitter: http://twitter.com/rdmpage Blog: http://iphylo.blogspot.com Home page: http://taxonomy.zoology.gla.ac.uk/rod/rod.html
Hi David,
In cases like http://gni.globalnames.org/name_strings/Danaus_plexippus you are correct.
But remember how URLs are encoded: many things will not work, and different systems seem to treat these differently.
The space " " in the name in the above example would need to be percent-encoded.
I would suggest that we use underscores "_" for spaces.
Any spaces, accented characters, commas or parentheses would need to be percent-encoded, so:
http://gni.globalnames.org/name_strings/Danaus_plexippus_(Linnaeus_1758) => http://gni.globalnames.org/name_strings/Danaus_plexippus_%28Linnaeus_1758%29
Try pasting the first url into the browser and then copy it back. Depending on what browser you use it will either be % encoded or not.
So to the extent that the URIs can be "Cool URIs" (http://www.w3.org/TR/cooluris/), using the name will work.
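For anyone who wants to see the encoding without relying on browser behaviour, here is a minimal Python sketch using the standard library. The underscore-for-space convention is the suggestion made above; the function names are made up for illustration.

```python
from urllib.parse import quote, unquote

BASE_URI = "http://gni.globalnames.org/name_strings/"


def name_to_path_segment(name_string: str) -> str:
    """Replace spaces with underscores, then percent-encode everything else
    that is not an unreserved character (parentheses, commas, accents)."""
    return quote(name_string.replace(" ", "_"), safe="")


def path_segment_to_name(segment: str) -> str:
    """Decode a segment back to a name string.

    Lossy if the original name itself contained underscores.
    """
    return unquote(segment).replace("_", " ")


encoded = name_to_path_segment("Danaus plexippus (Linnaeus 1758)")
uri = BASE_URI + encoded
```

Here `encoded` is `Danaus_plexippus_%28Linnaeus_1758%29`, matching the second form shown above; non-ASCII letters are encoded from their UTF-8 bytes.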
Respectfully,
- Pete
On Thu, Jun 2, 2011 at 11:41 PM, Steven J. Baskauf < steve.baskauf@vanderbilt.edu> wrote:
My email access has been sporadic since this thread developed, so at this point I'll respond to points made in several of the messages.
First, I should note that there has been previous discussion on this list on a similar topic from http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002231.html through http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002231.html. One can review what was said at that time rather quickly by starting on the first linked message and clicking on the "Next Message" link until you get to the end of the range I gave above.
My reason for the request for information that started this thread was that I wanted to link to a URI that would anchor the name portion of a name/sensu pair (TNU or Taxon Concept a la TCS if you prefer) as in this RDF snippet:
<tc:nameString>Quercus rubra L.</tc:nameString>
<tc:hasName rdf:about="http://www.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank:4..." />
At this point in the discussion, I'm not actually talking about creating a link to a taxon concept but rather to a taxon name, so some of the issues Pete raised don't apply here (e.g. what's the "right" name for a concept - the question here is simply what's a stable identifier for the name). In principle, I could probably just provide the name string and be done with it. However, having some degree of faith that Smart, Computer Savvy People might some day be able to use the metadata returned by the URI (or perhaps metadata which they already have in a triple store onsite) to do cool things like knowing that my name is the same as an orthographic variant or that "Quercus rubra L." is basically the same thing as "Quercus rubra", I would like to also provide a functional URI.
As an end-user who isn't very interested in the technical issues involving names, I don't really care what URI I use. I would prefer for it to be widely recognized and for it to "work" (i.e. be resolvable). In the earlier (January) thread, there was discussion about existing identifiers. There were a number of posts, but in particular http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002258.html and http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002259.html discussed the relative merits of ITIS and uBio ID numbers. My take-home message from this was that uBio represented the largest single set of names with assigned identifiers (see http://gni.globalnames.org/data_sources cited in Pete's email) and that uBio metadata provides useful references. Hence my interest in referencing uBio ids as a URI. However, as a practical matter, the organizations that I share images with either want ITIS TSNs (EOL and Morphbank) or just names (Discover Life). Nobody is asking for uBio identifiers or any other identifier.
I found Kevin's comment at http://lists.tdwg.org/pipermail/tdwg-content/2011-May/002486.html very thought-provoking: "My thoughts are that the most likely way this will be solved is by standard market type pressures - ie the best solution/IDs will be used the most and 'float' to the top." I'm not going to make a judgment about what is the "best" solution or ID. But I would say that in "computer" history, being the "best" doesn't necessarily mean that something will be used. Take for example, the FOAF vocabulary. What the heck is Friend of a Friend? I would venture to say that most of the people using the FOAF vocabulary don't know or care. The FOAF vocabulary was the one that people started to use and once that happened, people didn't switch even if there was something better. I'm not familiar with the history of other stuff like YouTube and Craig's List, but I would guess that they weren't necessarily "the best" systems - they were just the ones that the most people started using first and once that happened, people didn't switch. I'm using ITIS IDs because they are easy to get and the people I communicate with want them. Whether they are the "best" or "done correctly" doesn't matter to me as much as the fact that they are widely recognized and stable (and that thus far every name that I've looked for has been in their database).
I think that one reason why this question has been on my mind is that I've been waiting for GNUB (Global Name Use Bank) to come out. I'm not really up on how it is going to work, but my impression is that it was going to be based on the Global Name Index (GNI) which was mentioned in that earlier January thread. At that point, the GNI names didn't have any identifiers that were exposed to the public as permanent GUIDs. I'm assuming that if GNUB refers to GNI names, they will have some kind of identifiers. So if that happens how is the GUID recommendation 8 going to be followed? As Kevin said in http://lists.tdwg.org/pipermail/tdwg-content/2011-June/002499.html "What I take from recommendation 8 of the GUID applicability guide ... is that if you DON'T already have a record in your own database for a taxon name/concept, then reuse an existing one." What we have here with GNI is a situation where none of the records have identifiers. In my mind, the "best practice" according to recommendation 8 would be for the GNI to reuse existing identifiers where they exist and NOT make up new ones. This is a bit more complicated because the ITIS identifiers (which are in common use) don't have an http URI version that is resolvable, and while the uBio identifiers have a resolvable http URI, it's in the form of a proxied LSID, which I've already complained is very ugly. So I'd like to hear some ideas about how to have "reused" identifiers in the GNI.
One thing that comes to my mind would be to have a "domain name" like "http://purl.org/gni/" or "http://purl.org/tn/" ("tn" for "taxon name") and to follow it with a namespace/id combination similar to what is done with lsids. So for example "itis/19408" and "ubio/448439" could be appended, creating http://purl.org/gni/itis/19408 and http://purl.org/gni/ubio/448439 for "Quercus rubra L." Both URIs could point to the same RDF and that RDF could indicate that the two identifiers are owl:sameAs. I realize from what Bob Morris has cautioned in the past that there are problems with owl:sameAs when the two things aren't actually the same thing (e.g. if the uBio ID refers to a name string only but the ITIS TSN refers to the name plus an "accepted" status and a relationship to parent taxa). However, if there were an understanding that the GNI only refers to name strings, then one could still refer to http://purl.org/gni/itis/19408 as an identifier for the name string of the thing (whatever it is) that is referred to by an ITIS TSN of 19408. I don't think there would be a problem saying that it and the uBio ID were "owl:sameAs". Some kind of solution like this would allow people to easily generate a resolvable URI for a name if they were using ITIS TSNs or uBio IDs. If the name that one wanted to use was so obscure that it was one of the 9.5 million names that uBio has that ITIS doesn't have, then that name would only have the ubio version. I have no idea whether this would be a good idea or not, but I was really cringing to think about 19 million newly minted UUIDs appended to "http://gni.globalnames.org/" and figuring out how to connect those horrid things to the names and ITIS TSNs that I'm already using. I think that I said this before, but using the purl.org domain rather than one like http://gni.globalnames.org/ would in the future allow somebody else to take over management of providing the metadata when the GUIDs are resolved without having to deal with issues of who "owns" the domain name.
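To make the namespace/id idea above concrete, here is a minimal Python sketch. It only builds strings: http://purl.org/gni/ is the base suggested in this message, not a service known to exist, and the function names purl_uri and same_as_triple are invented for illustration.

```python
# Sketch only: PURL_BASE is the base proposed in this message, not a live service.
PURL_BASE = "http://purl.org/gni/"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def purl_uri(namespace, local_id):
    """Build a namespace/id URI such as http://purl.org/gni/itis/19408."""
    return f"{PURL_BASE}{namespace}/{local_id}"


def same_as_triple(uri_a, uri_b):
    """Return one N-Triples line asserting owl:sameAs between two URIs."""
    return f"<{uri_a}> <{OWL_SAME_AS}> <{uri_b}> ."


# The two identifiers given above for "Quercus rubra L."
itis = purl_uri("itis", 19408)
ubio = purl_uri("ubio", 448439)
print(same_as_triple(itis, ubio))
```

Whether owl:sameAs is the right predicate still depends on the caveat above: it only holds if both URIs are understood to identify the name string and nothing more.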
Steve
Kevin Richards wrote:
Pete,
I’m not trying to say what you are doing is a waste of time/impossible. I actually think RDF + semantics are a good way forward, but this really implies that we need to rely on the semantics and linkages rather than having a SINGLE ID for a taxon name. (which is what I thought Steve was getting at). Each instance of a taxon name can have its own ID and then all these instances are connected via ontology defined semantic links. This seems more appropriate to me than insisting everyone uses the “Global Taxon Name ID X”.
In your example of *Aedes triseriatus* and *Ochlerotatus triseriatus* – these are two different names so they need two different IDs, they may be linked by a single taxon concept, but they are separate names. So which of these now 3 IDs do you expect people to use, and according to what source??
For example if we have a name, eg the Robin, Erithacus rubecula, mentioned in ITIS (TSN : 559964) and also in EOL (www.eol.org/pages/1051567), also in GBIF (http://data.gbif.org/species/21266780), also in avibase ( http://avibase.bsc-eoc.org/species.jsp?avibaseid=C809B2B90399A43D), which ID are you hoping people will use?? Would you put the ITIS ID in your own dataset as the ID for that name – unlikely. Or would it be better to link them up with semantic linkages?
What I take from recommendation 8 of the GUID applicability guide (as Steve puts it, “stop making up new identifiers when somebody else already has one for the thing you are talking about”) is that if you DON’T already have a record in your own database for a taxon name/concept, then reuse an existing one. NOT ditch all your current IDs and adopt someone else’s (especially hard considering it is so hard to work out which of the multitude of name and concept IDs directly relates to your taxon name).
I am all for limiting the number of IDs for the “same” thing, but in some cases it is more useful to build linkages than force this tight integration of data and IDs. Especially for taxon names and concepts, where it is complex to define if you are even talking about the “same” thing or not.
Kevin
*From:* Peter DeVries [mailto:pete.devries@gmail.com]
*Sent:* Wednesday, 1 June 2011 12:38 p.m. *To:* Kevin Richards *Cc:* Steve Baskauf; tdwg-content@lists.tdwg.org; Gerald Guala; Nicolson, David; Alan J Hampson; Orrell, Thomas *Subject:* Re: [tdwg-content] ITIS TSNID to uBio NamebankIDs mapping
Hi Kevin,
I forgot to mention some other things that are different about my project.
You can write a simple SPARQL query to get a list of all the TaxonConcepts that have ITIS IDs, or all those that have ITIS and NCBI IDs, etc.
You can do this on any SPARQL endpoint that hosts the data.
You can download the entire data set and run the queries on your own endpoint.
You can write a script that runs the query and downloads the ITIS numbers and exports them to CSV etc.
- Pete
On Tue, May 31, 2011 at 5:16 PM, Peter DeVries pete.devries@gmail.com wrote:
Hi Kevin,
On Tue, May 31, 2011 at 3:27 PM, Kevin Richards < RichardsK@landcareresearch.co.nz> wrote:
This is exactly why this problem still exists and will be very complex to solve - everyone says "we should have a single ID for a specific taxon name, there seems to be several IDs 'out there' that refer to the same taxon name, so I'm going to create another ID to link them all up" - yet another ID that no one will particularly want to follow - you would have to get everyone to agree that your combinations/integration of taxon names is the best one and hope everyone follows it - unlikely in this domain.
Isn't this kind of what The Plant List and eBird already do?
A difference being that they tie these to a specific name and specific classification.
The Plant List is not really even open, so it is difficult for people to adopt it en masse.
For instance, if I manage a herbarium, how do I easily reconcile my species list with the entities represented in the Plant List?
eBird has millions of records which implies that they have been able to convince the observers in the field to adopt their system. You are correct in that there are probably a lot of taxonomists that don't like their list.
It differs from many of the other classifications, but remember the system rewards them for not agreeing. Note the difference between the microbial taxonomists and other taxonomists. In the case of the microbial workers, the system rewards them for solving problems not debating alternatives. Also, if a good idea comes out that will make it easier for the microbiologists to solve the problems they are rewarded for solving, they are less likely to care whose idea it is.
Like the microbiologists, there are lots of biologists that work with species with the goal of addressing some non-taxonomic problem.
They don't really care if the name is *Aedes triseriatus* or *Ochlerotatus triseriatus*, but they do care that the identifier that they connect their data to is stable.
In regards to the issue of market forces, I suspect (but have no direct knowledge) that decisions were probably made in devising these lists that have more to do with appeasing certain personalities than with creating the best list. With the way this system rewards people it is likely that the "correct" version will float to the top only after that person has passed away. I don't have much faith that the best system will always float to the top. That has a lot to do with the personalities and how the system rewards are set up. Theoretically, it is possible for one strong personality or group to force others to adopt their less than optimal solution - at least this seems to happen in other environments.
Also, there are all sorts of ways that people can use the publication record to rewrite history. Simply cite the review paper that cites the original paper. Or don't cite it at all.
I would have used only the ITIS TSN but if the name changes the ID changes. This isn't "wrong", it just does not solve my problem.
- ITIS also should add the spiders from the World Spider Catalog.
Another issue that I think has inhibited adoption of a common list is that people can't agree on a particular name or a particular classification.
Since you can model a species concept as having many names and many classifications why not do so?
If this idea was originally accepted, I would not have needed to create TaxonConcept.org.
My plan has always been to get something that works to solve some problems and then let some larger group take it over.
In a sense, I am more like the microbiologists in that I am not being paid to solve this or debate this problem.
I am doing it because I think something like this is needed, and it is an interesting and personally rewarding puzzle.
- Pete
My thoughts are that the most likely way this will be solved is by standard market type pressures - ie the best solution/IDs will be used the most and "float" to the top. It is easy to say that the global taxon name data is a mess, but if you think about it 30 years ago taxon name data were very disparate, duplicated, unconnected, many with NO IDs at all. So I believe we are making progress and that we will continue to do so albeit at a fairly slow rate.
Kevin
"I agree. This was one of the reasons that I setup TaxonConcept the way I did. It attempts to connect both the LOD entities and the foreign key based entities."
Please consider the environment before printing this email Warning: This electronic message together with any attachments is confidential. If you receive it in error: (i) you must not read, use, disclose, copy or retain it; (ii) please contact the sender immediately by reply email and then delete the emails. The views expressed in this email may not be those of Landcare Research New Zealand Limited. http://www.landcareresearch.co.nz
--
Pete DeVries Department of Entomology University of Wisconsin - Madison 445 Russell Laboratories 1630 Linden Drive Madison, WI 53706 Email: pdevries@wisc.edu TaxonConcept http://www.taxonconcept.org/ & GeoSpecies http://about.geospecies.org/ Knowledge Bases A Semantic Web, Linked Open Data http://linkeddata.org/ Project
-- Steven J. Baskauf, Ph.D., Senior Lecturer Vanderbilt University Dept. of Biological Sciences
postal mail address: VU Station B 351634 Nashville, TN 37235-1634, U.S.A.
delivery address: 2125 Stevenson Center 1161 21st Ave., S. Nashville, TN 37235
office: 2128 Stevenson Center phone: (615) 343-4582, fax: (615) 343-6707 http://bioimages.vanderbilt.edu
tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content
David Remsen, Senior Programme Officer Electronic Catalog of Names of Known Organisms Global Biodiversity Information Facility Secretariat Universitetsparken 15, DK-2100 Copenhagen, Denmark Tel: +45-35321472 Fax: +45-35321480 Skype: dremsen
I agree with David on this in the context of GNI/uBio. At last year's TDWG, I asked Dima and Peter why they generated UUIDs as a surrogate representation of the text string. The answer (which involved reducing all names to a consistent 128 bits, among other technical reasons) made a lot of sense from a technical perspective for internal processing purposes, but I never thought it would be a good idea to expose them as such externally.
The key to understand here, though, is what makes GNI and uBio (and NameBank?) different from essentially all other taxonomic databases (ITIS, CoL, Catalog of Fishes, WoRMS, Hymenoptera Name Server, BDWD, etc., etc. etc.). Whereas those other databases generate records to represent some flavor of a "taxon name", or "taxon concept", or something in-between, GNI/uBio/[NameBank] generate records *FOR THE TEXT STRING ITSELF*. The text-string *is* the basis of the record. The other databases use the text string as the human-friendly access point to a database record that includes metadata relevant to a taxon name or taxon concept (as variously defined).
One other point regarding my previous message: in my comparison between GNUB and GNI as being opposing endpoints on a spectrum, I want to make it absolutely clear that I do NOT think that either end of this spectrum is somehow more important or more powerful or more relevant to biodiversity informatics than the other. At the moment, I think GNI is much closer to the mark in terms of where biodiversity informatics is at this point in history, because the vast majority of data linked to taxon names provide little more than a text-string name as context for the name. The GNUB end of the spectrum is more important to nomenclaturalists and hard-core taxonomists, but is currently woefully lacking in content and services. My hope and intention is that this will change in the coming months, such that the relevancy of the GNUB end will grow to match that of the GNI end. I think that both ends of the spectrum will be useful and relevant for a long time to come.
Aloha, Rich
Richard L. Pyle, PhD Database Coordinator for Natural Sciences Associate Zoologist in Ichthyology Dive Safety Officer Department of Natural Sciences, Bishop Museum 1525 Bernice St., Honolulu, HI 96817 Ph: (808)848-4115, Fax: (808)847-8252 email: deepreef@bishopmuseum.org http://hbs.bishopmuseum.org/staff/pylerichard.html
-----Original Message----- From: tdwg-content-bounces@lists.tdwg.org [mailto:tdwg-content-bounces@lists.tdwg.org] On Behalf Of David Remsen (GBIF) Sent: Friday, June 03, 2011 2:00 AM To: Peter DeVries Cc: tdwg-content@lists.tdwg.org; Dmitry Mozzherin; Orrell, Thomas; Alan J Hampson; Gerald Guala; Nicolson, David Subject: Re: [tdwg-content] ITIS TSNID to uBio NamebankIDs mapping
Why not use the name as the basis for the resolvable identifier instead of a uuid? Isn't there a 1:1 cardinality between the name and the uuid in the GNI? Doesn't that mean that
http://gni.globalnames.org/name_strings/4ef223c4-0c3e-5e84-ace9-755c34c601ec and http://gni.globalnames.org/name_strings/Danaus_plexippus_(Linnaeus_1758)
are equally unique? The latter is certainly more readable. In those cases where the namestring is a homonym like
http://gni.globalnames.org/name_strings/Oenanthe
couldn't you just return the addresses of the two globally unique forms of the name when you resolve it?
http://gni.globalnames.org/name_strings/Oenanthe_Smith_1899
http://gni.globalnames.org/name_strings/Oenanthe_Jones_1900
Wouldn't those be as globally unique and easier to read and adjust to? Or am I missing something? I always wanted to do that with ubio IDs after a back and forth with Gregor Hagedorn and wished we hadn't exposed those integers.
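A minimal Python sketch of that suggestion follows, assuming spaces become underscores as in the example URLs above. The HOMONYMS table is hypothetical and holds only the two illustrative Oenanthe forms from this message; a real resolver would look them up in the index.

```python
from typing import List
from urllib.parse import quote

BASE = "http://gni.globalnames.org/name_strings/"

# Hypothetical homonym table, filled with the two illustrative forms above.
HOMONYMS = {"Oenanthe": ["Oenanthe Smith 1899", "Oenanthe Jones 1900"]}


def name_uri(name_string: str) -> str:
    """Turn a name string into a readable URI under BASE."""
    slug = name_string.strip().replace(" ", "_")
    # Keep parentheses readable; percent-encode anything else unsafe.
    return BASE + quote(slug, safe="()")


def resolve(name_string: str) -> List[str]:
    """Return one URI, or the URIs of each disambiguated form of a homonym."""
    forms = HOMONYMS.get(name_string, [name_string])
    return [name_uri(form) for form in forms]


print(name_uri("Danaus plexippus (Linnaeus 1758)"))
print(resolve("Oenanthe"))
```

One thing the sketch makes visible: the URI is only as stable as the exact spelling and punctuation of the name string, so two renderings of the same authorship would give two different addresses.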
DR
Hi Steve,
I don't have time to go through this in detail, and I can't speak for the GNI, but I can tell you about how the GNI URIs work, at least for now.
A while back Dima Mozzherin and I were looking into how triples etc. might be of use to the GNI.
We needed a way to generate unique URI's for each name.
We wanted to avoid having to keep these in sync and not require everyone to look each ID up through some service.
Dima came up with the following plan. We use the namestring as seed to generate a unique UUID.
Basically this is a shared algorithm which the GNI and TaxonConcept both use. But it could be used by anyone.
You feed the name string to the algorithm and it spits out a UUID. We then append that to a URI and web service so it is resolvable.
So the name Danaus plexippus (Linnaeus 1758) => 4ef223c4-0c3e-5e84-ace9-755c34c601ec
So if the GNI and another group have the same namestring they have the same UUID.
People can then link their data set to the GNI with the following URI
http://gni.globalnames.org/name_strings/4ef223c4-0c3e-5e84-ace9-755c34c601ec
RDF http://gni.globalnames.org/name_strings/4ef223c4-0c3e-5e84-ace9-755c34c601ec.rdf
If you think of your data set as one table and the GNI as another, this URI serves as the foreign key that connects them together.
Some on the list don't like how these look, but there is a tremendous advantage in not having to worry about syncing two large data sets and determining if a given integer is already in use.
Also, Rod Page has written recently about UUIDs. http://iphylo.blogspot.com/2011/05/zoobank-on-couchdb-uuids-replication.html
There may be a way to do something similar with bit.ly-like identifiers that are shorter (mCcSp), but I think the general idea is a good one.
If you recall from my talk at TDWG, I was able to use these to make statements that one namestring was a synonym, etc., of another.
The algorithm we use is written in Ruby, but it could be ported to many different languages since UUIDs are widely supported.
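For anyone curious what such a port might look like, here is a minimal Python sketch using the standard uuid module. The UUID quoted above has a 5 in the version position, which is consistent with a name-based (SHA-1, version 5) UUID, but the namespace used below is an assumption: this message does not say which namespace the Ruby code uses, so the output is not guaranteed to match the UUID quoted above.

```python
import uuid

# Assumption: a namespace derived from the DNS name "globalnames.org".
# The message does not state the namespace the GNI actually uses.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "globalnames.org")
BASE = "http://gni.globalnames.org/name_strings/"


def name_uuid(name_string: str) -> uuid.UUID:
    """Derive a deterministic version 5 UUID from a name string."""
    return uuid.uuid5(NAMESPACE, name_string)


def gni_uri(name_string: str, rdf: bool = False) -> str:
    """Append the UUID to the base URI, optionally with the .rdf suffix."""
    suffix = ".rdf" if rdf else ""
    return f"{BASE}{name_uuid(name_string)}{suffix}"


name = "Danaus plexippus (Linnaeus 1758)"
print(name_uuid(name))
print(gni_uri(name, rdf=True))
```

Because the UUID is computed from the string alone, two groups holding the same namestring arrive at the same identifier without consulting each other, which is the property described above.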
Respectfully,
- Pete
On Thu, Jun 2, 2011 at 11:41 PM, Steven J. Baskauf < steve.baskauf@vanderbilt.edu> wrote:
My email access has been sporadic since this thread developed, so at this point I'll respond to points made in several of the messages.
First, I should note that there has been previous discussion on this list on a similar topic from http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002231.html through http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002231.html. One can review what was said at that time rather quickly by starting on the first linked message and clicking on the "Next Message" link until you get to the end of the range I gave above.
My reason for the request for information that started this thread was that I wanted to link to a URI that would anchor the name portion of a name/sensu pair (TNU or Taxon Concept a la TCS if you prefer) as in this RDF snippet:
<tc:nameString>Quercus rubra L.</tc:nameString>
<tc:hasName rdf:about="http://www.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank:448439" />
At this point in the discussion, I'm not actually talking about creating a link to a taxon concept but rather to a taxon name, so some of the issues Pete raised don't apply here (e.g. what's the "right" name for a concept - the question here is simply what's a stable identifier for the name). In principle, I could probably just provide the name string and be done with it. However, having some degree of faith that Smart, Computer Savvy People might some day be able to use the metadata returned by the URI (or perhaps metadata which they already have in a triple store onsite) to do cool things like knowing that my name is the same as an orthographic variant or that "Quercus rubra L." is basically the same thing as "Quercus rubra", I would like to also provide a functional URI.
As an end-user who isn't very interested in the technical issues involving names, I don't really care what URI I use. I would prefer for it to be widely recognized and for it to "work" (i.e. be resolvable). In the earlier (January) thread, there was discussion about existing identifiers. There were a number of posts, but in particular http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002258.html and http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002259.html discussed the relative merits of ITIS and uBio ID numbers. My take-home message from this was that uBio represented the largest single set of names with assigned identifiers (see http://gni.globalnames.org/data_sources cited in Pete's email) and that uBio metadata provides useful references. Hence my interest in referencing uBio ids as a URI. However, as a practical matter, the organizations that I share images with either want ITIS TSNs (EOL and Morphbank) or just names (Discover Life). Nobody is asking for uBio identifiers or any other identifier.
I found Kevin's comment at http://lists.tdwg.org/pipermail/tdwg-content/2011-May/002486.html very thought-provoking: "My thoughts are that the most likely way this will be solved is by standard market type pressures - ie the best solution/IDs will be used the most and 'float' to the top." I'm not going to make a judgment about what is the "best" solution or ID. But I would say that in "computer" history, being the "best" doesn't necessarily mean that something will be used. Take for example, the FOAF vocabulary. What the heck is Friend of a Friend? I would venture to say that most of the people using the FOAF vocabulary don't know or care. The FOAF vocabulary was the one that people started to use and once that happened, people didn't switch even if there was something better. I'm not familiar with the history of other stuff like YouTube and Craig's List, but I would guess that they weren't necessarily "the best" systems - they were just the ones that the most people started using first and once that happened, people didn't switch. I'm using ITIS IDs because they are easy to get and the people I communicate with want them. Whether they are the "best" or "done correctly" doesn't matter to me as much as the fact that they are widely recognized and stable (and that thus far every name that I've looked for has been in their database).
I think that one reason why this question has been on my mind is that I've been waiting for GNUB (Global Name Use Bank) to come out. I'm not really up on how it is going to work, but my impression is that it was going to be based on the Global Name Index (GNI) which was mentioned in that earlier January thread. At that point, the GNI names didn't have any identifiers that were exposed to the public as permanent GUIDs. I'm assuming that if GNUB refers to GNI names, they will have some kind of identifiers. So if that happens how is the GUID recommendation 8 going to be followed? As Kevin said in http://lists.tdwg.org/pipermail/tdwg-content/2011-June/002499.html "What I take from recommendation 8 of the GUID applicability guide ... is that if you DON'T already have a record in your own database for a taxon name/concept, then reuse an existing one. " What we have here with GNI is a situation where none of the records have identifiers. In my mind, the "best practice" according to recommendation 8 would be for the GNI to reuse existing identifiers where they exist and NOT make up new ones. This is a bit more complicated because the ITIS identifiers (which are in common use) don't have an http URI version that is resolvable, and while the uBio identifiers have a resolvable http URI, it's in the form of a proxied LSID, which I've already complained is very ugly. So I'd like to hear some ideas about how to have "reused" identifiers in the GNI.
One thing that comes to my mind would be to have a "domain name" like "http://purl.org/gni/" or "http://purl.org/tn/" ("tn" for "taxon name") and to follow it with a namespace/id combination similar to what is done with lsids. So for example "itis/19408" and "ubio/448439" could be appended, creating http://purl.org/gni/itis/19408 and http://purl.org/gni/ubio/448439 for "Quercus rubra L." Both URIs could point to the same RDF and that RDF could indicate that the two identifiers are owl:sameAs. I realize from what Bob Morris has cautioned in the past that there are problems with owl:sameAs when the two things aren't actually the same thing (e.g. if the uBio ID refers to a name string only but the ITIS TSN refers to the name plus an "accepted" status and a relationship to parent taxa). However, if there were an understanding that the GNI only refers to name strings, then one could still refer to http://purl.org/gni/itis/19408 as an identifier for the name string of the thing (whatever it is) that is referred to by an ITIS TSN of 19408. I don't think there would be a problem saying that it and the uBio ID were "owl:sameAs". Some kind of solution like this would allow people to easily generate a resolvable URI for a name if they were using ITIS TSNs or uBio IDs. If the name that one wanted to use was so obscure that it was one of the 9.5 million names that uBio has that ITIS doesn't have, then that name would only have the ubio version. I have no idea whether this would be a good idea or not, but I was really cringing to think about 19 million newly minted UUIDs appended to "http://gni.globalnames.org/" and figuring out how to connect those horrid things to the names and ITIS TSNs that I'm already using.
I think that I said this before, but using the purl.org domain rather than one like http://gni.globalnames.org/ would in the future allow somebody else to take over management of providing the metadata when the GUIDs are resolved without having to deal with issues of who "owns" the domain name.
Steve
Kevin Richards wrote:
Pete,
I'm not trying to say what you are doing is a waste of time/impossible. I actually think RDF + semantics are a good way forward, but this really implies that we need to rely on the semantics and linkages rather than having a SINGLE ID for a taxon name. (which is what I thought Steve was getting at). Each instance of a taxon name can have its own ID and then all these instances are connected via ontology defined semantic links. This seems more appropriate to me than insisting everyone uses the "Global Taxon Name ID X".
In your example of *Aedes triseriatus* and *Ochlerotatus triseriatus* - these are two different names so they need two different IDs, they may be linked by a single taxon concept, but they are separate names. So which of these now 3 IDs do you expect people to use, and according to what source??
For example if we have a name, eg the Robin, Erithacus rubecula, mentioned in ITIS (TSN : 559964) and also in EOL (www.eol.org/pages/1051567), also in GBIF (http://data.gbif.org/species/21266780), also in avibase ( http://avibase.bsc-eoc.org/species.jsp?avibaseid=C809B2B90399A43D), which ID are you hoping people will use?? Would you put the ITIS ID in your own dataset as the ID for that name - unlikely. Or would it be better to link them up with semantic linkages?
What I take from recommendation 8 of the GUID applicability guide (as Steve puts it, "stop making up new identifiers when somebody else already has one for the thing you are talking about") is that if you DON'T already have a record in your own database for a taxon name/concept, then reuse an existing one. NOT ditch all your current IDs and adopt someone else's (especially hard considering it is so hard to work out which of the multitude of name and concept IDs directly relates to your taxon name).
I am all for limiting the number of IDs for the "same" thing, but in some cases it is more useful to build linkages than force this tight integration of data and IDs. Especially for taxon names and concepts, where it is complex to define if you are even talking about the "same" thing or not.
Kevin
*From:* Peter DeVries [mailto:pete.devries@gmail.com]
*Sent:* Wednesday, 1 June 2011 12:38 p.m.
*To:* Kevin Richards
*Cc:* Steve Baskauf; tdwg-content@lists.tdwg.org; Gerald Guala; Nicolson, David; Alan J Hampson; Orrell, Thomas
*Subject:* Re: [tdwg-content] ITIS TSNID to uBio NamebankIDs mapping
Hi Kevin,
I forgot to mention some other things that are different about my project.
You can write a simple SPARQL query to get a list of all the TaxonConcept's that have ITIS ids, or all those that have ITIS and NCBI ID's etc.
You can do this on any SPARQL endpoint that hosts the data.
You can download the entire data set and run the queries on your own endpoint.
You can write a script that runs the query and downloads the ITIS numbers and exports them to CSV etc.
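A rough sketch of the "query, then export to CSV" workflow in Python. The predicate URIs in the query (ex:hasItisTsn, ex:hasNcbiId) and the sample row are placeholders, not the actual TaxonConcept vocabulary or data; the only thing taken as given is the standard SPARQL 1.1 JSON results layout (head.vars and results.bindings) that an endpoint would return. No network call is made here.

```python
import csv
import io

# Hypothetical predicates: substitute whatever the dataset really uses.
QUERY = """
PREFIX ex: <http://example.org/vocab#>
SELECT ?concept ?itis ?ncbi
WHERE {
  ?concept ex:hasItisTsn ?itis .
  OPTIONAL { ?concept ex:hasNcbiId ?ncbi . }
}
"""


def bindings_to_csv(results):
    """Flatten a SPARQL 1.1 JSON results document into CSV text."""
    columns = results["head"]["vars"]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(columns)
    for row in results["results"]["bindings"]:
        # Unbound variables (e.g. a missing OPTIONAL match) become empty cells.
        writer.writerow([row.get(col, {}).get("value", "") for col in columns])
    return out.getvalue()


# Made-up response standing in for what an endpoint would send back.
sample = {
    "head": {"vars": ["concept", "itis", "ncbi"]},
    "results": {
        "bindings": [
            {
                "concept": {"type": "uri", "value": "http://example.org/concept/1"},
                "itis": {"type": "literal", "value": "19408"},
            }
        ]
    },
}
print(bindings_to_csv(sample))
```

The same function works unchanged whether the JSON comes from a public endpoint or from a local copy of the data loaded into your own triple store.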
- Pete
On Tue, May 31, 2011 at 5:16 PM, Peter DeVries wrote:
Hi Kevin,
On Tue, May 31, 2011 at 3:27 PM, Kevin Richards < RichardsK@landcareresearch.co.nz> wrote:
This is exactly why this problem still exists and will be very complex to solve - everyone says "we should have a single ID for a specific taxon name, there seems to be several IDs 'out there' that refer to the same taxon name, so I'm going to create another ID to link them all up" - yet another ID that no one will particularly want to follow - you would have to get everyone to agree that your combinations/integration of taxon names is the best one and hope everyone follows it - unlikely in this domain.
Isn't this kind of what The Plant List and eBird already do?
A difference being that they tie these to a specific name and specific classification.
The Plant List is not really even open, so it is difficult for people to adopt it en masse.
For instance, if I manage a herbarium, how do I easily reconcile my species list with the entities represented in the Plant List?
eBird has millions of records which implies that they have been able to convince the observers in the field to adopt their system. You are correct in that there are probably a lot of taxonomists that don't like their list.
It differs from many of the other classifications, but remember the system rewards them for not agreeing. Note the difference between the microbial taxonomists and other taxonomists. In the case of the microbial workers, the system rewards them for solving problems not debating alternatives. Also, if a good idea comes out that will make it easier for the microbiologists to solve the problems they are rewarded for solving, they are less likely to care whose idea it is.
Like the microbiologists, there are lots of biologists that work with species with the goal of addressing some non-taxonomic problem.
They don't really care if the name is *Aedes triseriatus* or *Ochlerotatus triseriatus*, but they do care that the identifier that they connect their data to is stable.
In regards to the issue of market forces, I suspect (but have no knowledge of) that there were probably decisions made in devising these lists that have more to do with appeasing certain personalities than creating the best list. With the way this system rewards people it is likely that the "correct" version will float to the top only after that person has passed away. I don't have much faith that the best system will always float to the top. That has a lot to do with the personalities and how the system rewards are set up. Theoretically, it is possible for one strong personality or group to force others to adopt their less than optimal solution - at least this seems to happen in other environments.
Also, there are all sorts of ways that people can use the publication record to rewrite history. Simply cite the review paper that cites the original paper. Or don't cite it at all.
I would have used only the ITIS TSN but if the name changes the ID changes. This isn't "wrong", it just does not solve my problem.
- ITIS also should add the spiders from the World Spider Catalog.
Another issue that I think has inhibited adoption of a common list is that people can't agree on a particular name or a particular classification.
Since you can model a species concept as having many names and many classifications why not do so?
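As an illustration only, here is one way "a species concept with many names and many classifications" could be modelled in Python. The class, the field names, the concept identifier and the "source A"/"source B" labels are all invented for this sketch and do not describe how TaxonConcept.org actually stores its data.

```python
from dataclasses import dataclass, field


@dataclass
class SpeciesConcept:
    """One stable identifier; names and classifications hang off it."""
    concept_id: str
    names: set = field(default_factory=set)
    # Maps a classification source to the parent it places the species in.
    classifications: dict = field(default_factory=dict)


# Hypothetical identifier; the two names are the ones from this thread.
concept = SpeciesConcept("example:concept/1")
concept.names.update({"Aedes triseriatus", "Ochlerotatus triseriatus"})
concept.classifications["source A"] = "Aedes"
concept.classifications["source B"] = "Ochlerotatus"

# Data attached to concept_id is unaffected by which name a source prefers.
print(concept.concept_id, sorted(concept.names))
```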
If this idea was originally accepted, I would not have needed to create TaxonConcept.org.
My plan has always been to get something that works to solve some problems and then let some larger group take it over.
In a sense, I am more like the microbiologists in that I am not being paid to solve this or debate this problem.
I am doing it because I think something like this is needed, and it is an interesting and personally rewarding puzzle.
- Pete
My thoughts are that the most likely way this will be solved is by standard market type pressures - i.e. the best solution/IDs will be used the most and "float" to the top. It is easy to say that the global taxon name data is a mess, but if you think about it 30 years ago taxon name data were very disparate, duplicated, unconnected, many with NO IDs at all. So I believe we are making progress and that we will continue to do so albeit at a fairly slow rate.
Kevin
"I agree. This was one of the reasons that I setup TaxonConcept the way I did. It attempts to connect both the LOD entities and the foreign key based entities."
Please consider the environment before printing this email Warning: This electronic message together with any attachments is confidential. If you receive it in error: (i) you must not read, use, disclose, copy or retain it; (ii) please contact the sender immediately by reply email and then delete the emails. The views expressed in this email may not be those of Landcare Research New Zealand Limited. http://www.landcareresearch.co.nz
--
---------------------------------------------------------------------------- --------
Pete DeVries Department of Entomology University of Wisconsin - Madison 445 Russell Laboratories 1630 Linden Drive Madison, WI 53706 Email: pdevries@wisc.edu TaxonConcept http://www.taxonconcept.org/ & GeoSpecies http://about.geospecies.org/ Knowledge Bases A Semantic Web, Linked Open Data http://linkeddata.org/ Project
---------------------------------------------------------------------------- ----------
-- Steven J. Baskauf, Ph.D., Senior Lecturer Vanderbilt University Dept. of Biological Sciences
postal mail address: VU Station B 351634 Nashville, TN 37235-1634, U.S.A.
delivery address: 2125 Stevenson Center 1161 21st Ave., S. Nashville, TN 37235
office: 2128 Stevenson Center phone: (615) 343-4582, fax: (615) 343-6707 http://bioimages.vanderbilt.edu
tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content
----------------------------------------------------------------------------
David Remsen, Senior Programme Officer Electronic Catalog of Names of Known Organisms Global Biodiversity Information Facility Secretariat Universitetsparken 15, DK-2100 Copenhagen, Denmark Tel: +45-35321472 Fax: +45-35321480 Skype: dremsen
----------------------------------------------------------------------------
Hi All,
I'm just catching up on email now, after a series of other work-related obligations and virtual attendance at a cybertaxonomy/e-literature meeting in Chicago this week. I do not now have time to review the entire thread, so I'll jump into the stream with Steve's recent post.
I think that one reason why this question has been on my mind is that I've been waiting for GNUB (Global Name Use Bank) to come out.
Just a quick update: due to budgetary woes in the U.S. Federal Government, NSF funding for awarded proposals has been pushed ever further back. If I'm not mistaken, something like 18 months passed between proposal submission and availability of funds for the BiSciCol grant, which our institution was only able to (finally!) start processing within the past few months. Why is this relevant to GNUB? Because the BiSciCol grant includes the most substantial funding yet for implementation of GNUB (indeed, the only funding for GNUB by name). The good news is that, now that funding is in hand and money (finally) flowing, development & implementation of GNUB is ramping up quickly. And the promise of more (and more substantial) funding is just around the corner (watch this space).
I'm not really up on how it is going to work, but my impression is that it was going to be based on the Global Name Index (GNI) which was mentioned in that earlier January thread.
Not exactly. GNI and GNUB represent two ends of a spectrum. GNI is at the "minimal metadata/maximal content" end of the spectrum -- basically a repository of any text-string purported to represent a taxon name that can be linked via a resolvable identifier. GNUB is at the "richly metadata'd/carefully curated" end of the spectrum, representing a highly normalized structure with permanent resolvable GUIDs and the potential for robust information/data services. In the vernacular, GNI is the "dirty bucket", and GNUB is the "clean bucket". At the moment, the connection between GNUB and GNI is unidirectional, in that the content of the progenitor of GNUB has been indexed in GNI, but there is no mechanism (yet) for GNI content to feed into GNUB. The reason for this is fairly straightforward: it's very easy to flatten out normalized content into simple text strings (GNUB-->GNI), but it's much more difficult (impossible?) to migrate metadata-poor, moderately parsed content into a highly structured system.
At that point, the GNI names didn't have any identifiers that were exposed to the public as permanent GUIDs. I'm assuming that if GNUB refers to GNI names, they will have some kind of identifiers. So if that happens how is the GUID recommendation 8 going to be followed? As Kevin said in http://lists.tdwg.org/pipermail/tdwg-content/2011-June/002499.html "What I take from recommendation 8 of the GUID applicability guide ... is that if you DON'T already have a record in your own database for a taxon name/concept, then reuse an existing one." What we have here with GNI is a situation where none of the records have identifiers. In my mind, the "best practice" according to recommendation 8 would be for the GNI to reuse existing identifiers where they exist and NOT make up new ones. This is a bit more complicated because the ITIS identifiers (which are in common use) don't have an http URI version that is resolvable, and while the uBio identifiers have a resolvable http URI, it's in the form of a proxied LSID, which I've already complained is very ugly. So I'd like to hear some ideas about how to have "reused" identifiers in the GNI.
In terms of GUIDs, the objects in GNUB and the objects in GNI are not the same, and therefore cannot share identifiers. The core object in GNI is a text-string. Indeed, the text string itself can be the actual identifier, because it *is* the thing being identified. In other words, because the essential uniqueness of an instance (record) in GNI by definition *is* the text string (i.e., the series of UTF-8-encoded characters), then that text string represents a perfectly suitable unique identifier. There is no need to generate a surrogate identifier like an integer number or UUID or LSID or whatever (except, perhaps, for internal use as a primary key for joining tables; but those identifiers need not/should not be exposed to the outside world).
By contrast, the core object in GNUB is a taxon name usage instance -- which is a purely abstract notion of the usage of a taxon name within some documentation source (like a publication). In this case, the text-string name is merely a property of the GUID-identified object, and would be an extremely BAD choice to use as a unique identifier. This is why GNUB needs to generate a unique identifier to represent this core data object. The form that identifier takes (UUID, LSID, integer, DOI, whatever) from the perspective of the end user should be completely irrelevant, because the user should rarely (if ever) see it, and should certainly *never* be in a position to type it on a keyboard (we can discuss the appearance of ZooBank LSIDs on printed pages separately). All that matters is that it is a persistent, globally unique identifier that can be used to cross-link information and can be conveniently resolved to the metadata of the object it represents.
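The contrast between the two kinds of record can be sketched in Python. This is not GNI or GNUB code: gni_key, NameUsage and the two "hypothetical publication" sources are invented for illustration, and a random UUID simply stands in for whatever identifier form GNUB ends up using.

```python
import uuid
from dataclasses import dataclass, field


def gni_key(name_string):
    """GNI-style record: the exact character sequence is its own identifier."""
    return name_string


@dataclass(frozen=True)
class NameUsage:
    """GNUB-style record: the name string is only a property of the usage."""
    name_string: str
    source: str
    # Surrogate identifier, generated because nothing intrinsic is unique.
    guid: str = field(default_factory=lambda: str(uuid.uuid4()))


usage_a = NameUsage("Quercus rubra L.", "hypothetical publication A")
usage_b = NameUsage("Quercus rubra L.", "hypothetical publication B")

# Same text string, so one GNI-style record...
print(gni_key(usage_a.name_string) == gni_key(usage_b.name_string))
# ...but two distinct usage instances, each needing its own identifier.
print(usage_a.guid != usage_b.guid)
```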
But the point is, recommendation 8 of the GUID applicability guide is not being violated in the context of GNI and GNUB.
The real problem in all of this is the inconsistent meaning people apply to the notion of a "taxon name". In GNI-space, the name is simply a text string. In GNUB-space, the "name-object" is a code-compliant Protonym that serves to cross-link Name-usages to each other. ITIS is different still. My understanding (David N.: correct me if I'm wrong), is that all TSNs that correspond to "valid/accepted" names (where [taxonomic_units].[usage]='valid'|'accepted') essentially represent a taxon concept. The rest of the TSNs (where [taxonomic_units].[usage]='invalid'|'not accepted') represent a variety of things, ranging from different combinations to alternate spellings to subjective synonyms, each of which is referable to one of the "valid/accepted" names. CoL uses names as proxies to taxon concepts (not sure how they handle synonyms vs. misspellings, etc.). And there are other variations as well -- to most botanists, "Aus bus L." and "Xus bus (L.) Smith" represent "different names", whereas most zoologists (who would not bother to include the "Smith") regard them as "different combinations of the same name" (zoologists are less consistent than botanists in this regard).
The point is, this inconsistency and heterogeneity of what is meant by a "name" in taxonomy is, in my opinion, the single GREATEST obstacle in achieving informatics harmony among biodiversity datasets.
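To make the ITIS case concrete, the usage filter mentioned above can be sketched in Python. The rows below are invented placeholders (the TSNs and names are not real ITIS records); only the four usage values are taken from the message. The split is purely mechanical and says nothing about whether an accepted TSN really "is" a taxon concept.

```python
ACCEPTED_USAGES = {"valid", "accepted"}


def split_by_usage(rows):
    """Separate rows whose usage is valid/accepted from all the others."""
    accepted = [r for r in rows if r["usage"] in ACCEPTED_USAGES]
    other = [r for r in rows if r["usage"] not in ACCEPTED_USAGES]
    return accepted, other


# Placeholder rows in the shape of a taxonomic_units extract.
rows = [
    {"tsn": 1, "name": "Aus bus", "usage": "valid"},
    {"tsn": 2, "name": "Xus bus", "usage": "invalid"},
    {"tsn": 3, "name": "Cus dus", "usage": "accepted"},
    {"tsn": 4, "name": "Cus duus", "usage": "not accepted"},
]

accepted, other = split_by_usage(rows)
print([r["tsn"] for r in accepted], [r["tsn"] for r in other])
```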
One thing that comes to my mind would be to have a "domain name" like "http://purl.org/gni/" or "http://purl.org/tn/" ("tn" for "taxon name") and to follow it with a namespace/id combination similar to what is done with lsids. So for example "itis/19408" and "ubio/448439" could be appended, creating http://purl.org/gni/itis/19408 and http://purl.org/gni/ubio/448439 for "Quercus rubra L." Both URIs could point to the same RDF and that RDF could indicate that the two identifiers are owl:sameAs.
This syntax is basically what ZooBank does (and GNUB will do), within their own domain name. But I like the idea of a common URL domain that allows these qualified identifiers to be appended.
The real problem is what you describe next:
I realize from what Bob Morris has cautioned in the past that there are problems with owl:sameAs when the two things aren't actually the same thing (e.g. if the uBio ID refers to a name string only but the ITIS TSN refers to the name plus an "accepted" status and a relationship to parent taxa).
Do NOT underestimate the significance of this point.
However, if there were an understanding that the GNI only refers to name strings, then one could still refer to http://purl.org/gni/itis/19408 as an identifier for the name string of the thing (whatever it is) that is referred to by an ITIS TSN of 19408.
Here be dragons -- for lots of reasons. At this point, you might as well just do a text-string match on the name. The problem is, you'll miss the match if authorship is not identical, but you risk homonymy mis-match if authorship is not included.
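The trade-off in that last sentence can be shown with a toy example in Python. The records use the placeholder name "Aus bus" with invented authorships, and the two matching functions are deliberately naive; real name reconciliation would need authorship normalisation and much more.

```python
# Each record is (name without authorship, authorship). All values invented.
records = [
    ("Aus bus", "L."),
    ("Aus bus", "Linnaeus"),  # same name, authorship written differently
    ("Aus bus", "Smith"),     # hypothetical homonym by a different author
]


def match_with_author(query, candidates):
    """Strict match: name and authorship must be character-identical."""
    return [c for c in candidates if c == query]


def match_without_author(query, candidates):
    """Loose match: compare the name part only, ignoring authorship."""
    return [c for c in candidates if c[0] == query[0]]


query = ("Aus bus", "L.")
strict_hits = match_with_author(query, records)
loose_hits = match_without_author(query, records)

# Strict matching misses the "Linnaeus" spelling of the same authorship.
print(len(strict_hits))
# Loose matching finds it, but also pulls in the "Smith" homonym.
print(len(loose_hits))
```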
I have no idea whether this would be a good idea or not, but I was really cringing to think about 19 million newly minted UUIDs appended to "http://gni.globalnames.org/" and figuring out how to connect those horrid things to the names and ITIS TSNs that I'm already using. I think that I said this before, but using the purl.org domain rather than one like http://gni.globalnames.org/ would in the future allow somebody else to take over management of providing the metadata when the GUIDs are resolved without having to deal with issues of who "owns" the domain name.
As I said before, I think it's perfectly fine to generate UUIDs for internal purposes within GNI for various performance reasons (or whatever), but I don't think it's wise to expose those UUIDs externally. Because the uniqueness of a GNI record *is* the text string, then it makes more sense to me to simply use the text string. However, that only works for GNI/uBio/NameBank, where the essence of the record *is* the text string. It's a non-starter for other datasets like GNUB, ITIS, CoL, and most others, where the essence of the record is something altogether different.
Aloha, Rich
Richard L. Pyle, PhD Database Coordinator for Natural Sciences Associate Zoologist in Ichthyology Dive Safety Officer Department of Natural Sciences, Bishop Museum 1525 Bernice St., Honolulu, HI 96817 Ph: (808)848-4115, Fax: (808)847-8252 email: deepreef@bishopmuseum.org http://hbs.bishopmuseum.org/staff/pylerichard.html
Rich (et al.)... Just a quick comment re ITIS TSNs, since Rich posited:
=======================
My understanding (David N.: correct me if I'm wrong), is that all TSNs that correspond to "valid/accepted" names (where [taxonomic_units].[usage]='valid'|'accepted') essentially represent a taxon concept. The rest of the TSNs (where [taxonomic_units].[usage]='invalid'|'not accepted') represent a variety of things, ranging from different combinations to alternate spellings to subjective synonyms, each of which is referable to one of the "valid/accepted" names.
=======================
I would say "sort of".... The TSNs do not themselves correspond with much of anything other than a unique, persistent, non-intelligent identifier for a "scientific name" (I realize that begs your next question/point of what that term "means") record in the context of the ITIS data system. See this linked from the "About ITIS" page: http://www.itis.gov/pdf/faq_itis_tsn.pdf
Re "Scientific Name", as you hopefully see in the above document, the term in ITIS generally corresponds to what I see from the ICBN use (Art. 16-24) and the ICZN use (Art. 4-5 in particular, as the "combination" formation, rather than the more atomized uses like "specific name" which is like "epithet"). There are of course other things in ITIS with TSNs, like database artifacts, that are labeled as such and retained but hidden from most users to avoid confusion and not strand any user that might already have the TSN.
By way of an example of the use of "name" fields.... Just recently I was given a nice "finished" world dataset for a modest animal-family-that-shall-remain-nameless, and the "name" fields were in some cases just as ITIS uses them, and in others there were additional things like authorship and so on lumped in with the name parts in those fields, though there were no years provided even then. So, usable, but the amount of work to essentially re-parse the data was surprising for just a couple hundred names, and even then they were inconsistent and incomplete, so someone now has to go collect all the missing details and go over it all again, and it clearly needs some smoothing around the edges as well. That was for just 200+ names from a single source. Ugh, thanks....
As to the relationship to taxon concept, if you squinted your eyes "just so" you could qualify as Rich did above and suggest that those TSNs that happen to represent names with usage=valid/accepted (and preferably those with some level of verification indicated, vs. the legacy data we're still dealing with!) "essentially represent a taxon concept", but I don't really think that is appropriate at this point.... actually the closest thing in ITIS to a "taxon concept" would be certain entries in the reference_links table (the intersection between the scientific names entries and the reference entries), but even that is too abstract in my view. Since any number of references may be linked to a single TSN, that TSN won't necessarily yield something that maps to "a taxon concept" unless you're thinking "sensu ITIS v2011-05-31" or something of that ilk, which is I guess another way to think about it, with its own pros/cons.
And I agree with Rich's warnings of many pitfalls below (dragons and such).
I'll leave it there. Oops. So much for the "quick" comment....
Best, Dave
David Nicolson Data Development Coordinator, Integrated Taxonomic Information System Biologist, USGS Core Science Systems, Biological Informatics Program nicolsod@si.edu Office 202-633-2149 Fax 202-786-2934 http://www.itis.gov/ http://www.cbif.gc.ca/itis/ "Nihil sumas necesse est..."
-----Original Message----- From: Richard Pyle [mailto:deepreef@bishopmuseum.org] Sent: Friday, June 03, 2011 3:16 PM To: 'Steven J. Baskauf'; 'Kevin Richards' Cc: tdwg-content@lists.tdwg.org; Orrell, Thomas; 'Alan J Hampson'; Nicolson, David; 'Gerald Guala' Subject: RE: [tdwg-content] ITIS TSNID to uBio NamebankIDs mapping
Hi All,
I'm just catching up on email now, after a series of other work-related obligations and virtual attendance at a cybertaxonomy/e-literature meeting in Chicago this week. I do not now have time to review the entire thread, so I'll jump into the stream with Steve's recent post.
I think that one reason why this question has been on my mind is that I've
been waiting for
GNUB (Global Name Use Bank) to come out.
Just a quick update, due to budgetary woes in the U.S. Federal Government, NSF funding for awarded proposals has been pushed every further back. If I'm not mistaken, something like 18 months passed between proposal submission and availability of funds for the BiSciCol grant, which our institution was only able to (finally!) start processing within the past few months. Why is this relevant to GNUB? Because the BiSciCol grant includes the most substantial funding yet for implementation of GNUB (indeed, the only funding for GNUB by name). The good news is that, now that funding is in hand and money (finally) flowing, development & implementation of GNUB is ramping up quickly. And the promise of more (and more substantial) funding is just around the corner (watch this space).
I'm not really up on how it is going to work, but my impression is that it
was going
to be based on the Global Name Index (GNI) which was mentioned in that
earlier
January thread.
Not exactly. GNI and GNUB represent two ends of a spectrum. GNI is at the "minimal metadata/maximal content" end of the spectrum -- basically a repository of any text-string purported to represent a taxon name that can be linked via a resolvable identifier. GNUB is at the "richly metadata'd/carefully curated" end of the spectrum, representing a highly normalized structure with permanent resolvable GUIDs and the potential for robust information/data services. In the vernacular, GNI is the "dirty bucket", and GNUB is the "clean bucket". At the moment, the connection between GNUB and GNI is unidirectional, in that the content of the progenitor of GNUB has been indexed in GNI, but there is no mechanism (yet) for GNI content to feed into GNUB. The reason for this is fairly straightforward: it's very easy to flatten out normalized content into simple text strings (GNUB-->GNI), but it's much more difficult (impossible?) to migrate metadata-poor, moderately parsed content into a highly structured system.
At that point, the GNI names didn't have any identifiers that were exposed
to
the public as permanent GUIDs. I'm assuming that if GNUB refers to GNI
names,
they will have some kind of identifiers. So if that happens how is the
GUID
recommendation 8 going to be followed? As Kevin said in http://lists.tdwg.org/pipermail/tdwg-content/2011-June/002499.html "What I
take
from recommendation 8 of the GUID applicability guide ... is that if you
DON'T
already have a record in your own database for a taxon name/concept, then reuse an existing one. " What we have here with GNI is a situation where
none
of the records have identifiers. In my mind, the "best practice"
according to
recommendation 8 would be for the GNI to reuse existing identifiers where they exist and NOT make up new ones. This is a bit more complicated
because the
ITIS identifiers (which are in common use) don't have an http URI version
that is
resolvable, and while the uBio identifiers have a resolvable http URI,
it's in the
form of a proxied LSID, which I've already complained is very ugly. So
I'd like to
hear some ideas about how to have "reused" identifiers in the GNI.
In terms of GUIDs, the objects in GNUB and the objects in GNI are not the same, and therefore cannot share identifiers. The core object in GNI is a text-string. Indeed, the text string itself can be the actual identifier, because it *is* the thing being identified. In other words, because the essential uniqueness of an instance (record) in GNI by definition *is* the text string (i.e., the series of UTF-8-encoded characters), then that text string represents a perfectly suitable unique identifier. There is no need to generate a surrogate identifier like an integer number or UUID or LSID or whatever (except, perhaps, for internal use as a primary key for joining tables; but those identifiers need not/should not be exposed to the outside world).
By contrast, the core object in GNUB is a taxon name usage instance -- which is a purely abstract notion of the usage of a taxon name within some documentation source (like a publication). In this case, the text-string name is merely a property of the GUID-identified object, and would be an extremely BAD choice to use as a unique identifier. This is why GNUB needs to generate a unique identifier to represent this core data object. The form that identifier takes (UUID, LSID, integer, DOI, whatever) from the perspective of the end user should be completely irrelevant, because the user should rarely (if ever) see it, and should certainly *never* be in a position to type it on a keyboard (we can discuss the appearance of ZooBank LSIDs on printed pages separately). All that matters is that it is persistent, globally unique identifier that can be used to cross-link information and can be conveniently resolved to the metadata of the object it represents.
But the point is, recommendation 8 of the GUID applicability guide is not being violated in the context of GNI and GNUB.
The real problem in all of this is the inconsistent meaning people apply to the notion of a "taxon name". In GNI-space, the name is simply a text string. In GNUB-space, the "name-object" is a code-compliant Protonym that serves to cross-link Name-usages to each other. ITIS is different still. My understanding (David N.: correct me if I'm wrong), is that all TSNs that correspond to "valid/accepted" names (where [taxonomic_units].[usage]='valid'|'accepted') essentially represent a taxon concept. The rest of the TSNs (where [taxonomic_units].[usage]='invalid'|'not accepted') represent a variety of things, ranging from different combinations to alternate spellings to subjective synonyms, each of which is referable to one of the "valid/accepted" names. CoL uses names as proxies to taxon concepts (not sure how they handle synonyms vs. misspellings, etc.) And there are other variations as well -- to most botanists, "Aus bus L." and "Xus bus (L.) Smith" represent "different names", whereas to most zoologists (who would not bother to include the "Smith"), regard them as the "different combinations of the same name" (zoologists are less consistent than botanists in this regard).
The point is, this inconsistency and heterogeneity of what is meant by a "name" in taxonomy is, in my opinion, the single GREATEST obstacle in achieving informatics harmony among biodiversity datsets.
One thing that comes to my mind would be to have a "domain name" like "http://purl.org/gni/" or "http://purl.org/tn/" ("tn" for "taxon name")
and
to follow it with a namespace/id combination similar to what is done with
lsids.
So for example "itis/19408" and "ubio/448439" could be appended, creating http://purl.org/gni/itis/19408 and
http://purl.org/gni/ubio/448439 for "
Quercus rubra L." Both URIs could point to the same RDF and that RDF
could
indicate that the two identifiers are owl:sameAs .
This syntax is basically what ZooBank does (and GNUB will do), within their own domain name. But I like the idea of a common URL domain that allows these qualified identifiers to be appended.
The real problem is what you describe next:
I realize from what Bob Morris has cautioned in the past that there are
problems
with owl:sameAs when the two things aren't actually the same thing (e.g. if the uBio ID refers to a name string only but the ITIS TSN refers
to the name
plus an "accepted" status and a relationship to parent taxa).
Do NOT underestimate the significance of this point.
However, if there were an understanding that the GNI only refers to name
strings,
then one could still refer to http://purl.org/gni/itis/19408 as an
identifier for the
name string of the thing (whatever it is) that is referred to by an ITIS
TSN of 19408.
Here be dragons -- for lots of reasons. At this point, you might as well just do a text-string match on the name. The problem is, you'll miss the match if authorship is not identical, but you risk homonymy mis-match if authorship is not included.
I have no idea whether this would be a good idea or not, but I was really
cringing
to think about 19 million newly minted UUIDs appended to
and figuring out how to connect those horrid things to the names and ITIS
TSNs
that I'm already using. I think that I said this before, but using the
purl.org domain
rather than one like http://gni.globalnames.org/ would in the future allow
somebody else to take over management of providing the metadata when the GUIDs are resolved without having to deal with issues of who "owns" the
domain name.
As I said before, I think it's perfectly fine to generate UUIDs for internal purposes within GNI for various performance reasons (or whatever), but I don't think it's wise to expose those UUIDs externally. Because the uniqueness of a GNI record *is* the text string, it makes more sense to me to simply use the text string. However, that only works for GNI/uBio/NameBank, where the essence of the record *is* the text string. It's a non-starter for other datasets like GNUB, ITIS, CoL, and most others, where the essence of the record is something altogether different.
Aloha, Rich
Richard L. Pyle, PhD Database Coordinator for Natural Sciences Associate Zoologist in Ichthyology Dive Safety Officer Department of Natural Sciences, Bishop Museum 1525 Bernice St., Honolulu, HI 96817 Ph: (808)848-4115, Fax: (808)847-8252 email: deepreef@bishopmuseum.org http://hbs.bishopmuseum.org/staff/pylerichard.html
Thanks for the clarification, Dave.
Re "Scientific Name", as you hopefully see in the above document, the term in ITIS generally corresponds to what I see from the ICBN use (Art. 16-24) and the ICZN use (Art. 4-5 in particular, as the "combination" formation, rather than the more atomized uses like "specific name" which is like "epithet").
Yes, agreed.
There are of course other things in ITIS with TSNs, like database artifacts, that are labeled as such and retained but hidden from most users to avoid confusion and not strand any user that might already have the TSN.
Yes, I didn't even want to bring those up -- I was just talking about the "legit" records.
As to the relationship to taxon concept, if you squinted your eyes "just so" you could qualify as Rich did above and suggest that those TSNs that happen to represent names with usage=valid/accepted (and preferably those with some level of verification indicated, vs. the legacy data we're still dealing with!) "essentially represent a taxon concept", but I don't really think that is appropriate at this point....
Fair enough.
I guess my point was that if you filter down to just those tsn's deemed to be valid/accepted, you end up with (ideally) a mosaic that covers the biodiversity landscape. That is, they ultimately are intended to represent a "preferred view" of taxon concept circumscriptions that collectively cover all within-scope biodiversity (allowing for the fact that it's not complete yet, and the scope may change). As such, those TSN's can be thought of as "representing" a taxon concept (as per the assertion of the pool of ITIS taxonomic experts).
Of the remaining TSN's, some apply to subjective/heterotypic junior synonyms that represent smaller taxon concept circumscriptions created by a splitter (when the ITIS expert was more of a lumper). Others apply to subjective/heterotypic junior synonyms that represent taxon concept circumscriptions congruent with the corresponding valid/accepted TSN-represented circumscription, in cases where a taxonomist established a name for a clear-cut taxon that already had an earlier name (unbeknownst to the later taxonomist). Others apply to the same specific epithet combined with a different genus epithet (same species epithet, different genus name, same or different taxon circumscription) -- that is, alternate combinations. Others apply to alternate spellings of the same genus/species combination, without known implications to concept circumscription relationship.
So the point is, in cases where [usage]="valid"|"accepted", then TSN can (I think) reasonably be taken (eyes squinted) as representative/proxy for a taxon concept circumscription; but for the other TSNs, the implications for taxon concepts are different, depending (in part) on the value of the [unaccept_reason] field.
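As a rough sketch of that rule of thumb: the field names [usage] and [unaccept_reason] are the ones mentioned above, but the sample rows, the placeholder TSN values, the "junior synonym" reason, and the returned labels are all illustrative assumptions rather than real ITIS data.

```python
from typing import Optional

ACCEPTED_USAGE = {"valid", "accepted"}


def concept_role(usage: str, unaccept_reason: Optional[str] = None) -> str:
    """Say how far a TSN can stand in for a taxon concept circumscription."""
    if usage in ACCEPTED_USAGE:
        # Eyes squinted: a proxy for the circumscription preferred by ITIS.
        return "concept-proxy"
    if unaccept_reason:
        # Synonym, alternate combination, spelling variant, etc.; the
        # implication for concepts depends on the stated reason.
        return f"see-unaccept-reason:{unaccept_reason}"
    return "undetermined"


# Hypothetical rows; the tsn values are placeholders, not real TSNs.
rows = [
    {"tsn": 1, "usage": "accepted", "unaccept_reason": None},
    {"tsn": 2, "usage": "not accepted", "unaccept_reason": "junior synonym"},
    {"tsn": 3, "usage": "invalid", "unaccept_reason": None},
]
roles = {row["tsn"]: concept_role(row["usage"], row["unaccept_reason"]) for row in rows}
```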
actually the closest thing in ITIS to a "taxon concept" would be certain entries in the reference_links table (the intersection between the scientific names entries and the reference entries),
These are what I/GNUB would call "Taxon Name Usage" instances. And as you stated for TSNs, while such "TNUs" may be thought of as a proxy for a taxon concept, this only applies to a subset of all possible TNUs. Sort of like TSNs :-)
Aloha, Rich
Thanks to all of you for taking the time to explain your perspective on the issues relating to taxon names and their identifiers. I suspect that some of this was probably explained last year in earlier threads that I zoned out on, so thanks for your patience in "re-explaining" - I'm understanding this better now. For example, I think I "get" the reason for the use of UUIDs that Pete and Dima explained - thanks for explaining that again. I see the problem with character encodings in URIs, etc.
I think that part of the difficulty that we are having in nailing down this issue is that different people have different things that they want their identifiers to "do". In a sense this is a good thing because "clever" GUIDs actually CAN do multiple things at once. Some of these things are:
1. uniquely identifying a resource globally
2. use http as a resolution mechanism
3. providing a means for tracking provenance
4. providing a single, stable point of reference to which multiple people/institutions can anchor the properties of their own resources
5. providing a way to unambiguously refer to the resource in a publication (i.e. type-able)
6. providing a means for a human to find information about the thing via a webpage
7. providing a means for a computer to find information about the thing via RDF/XML
8. not change
Some people are primarily interested in a few of these things. Some people are interested in others. If the identifier is for internal or temporary use, then it doesn't really matter what the form of the identifier is, or whether it only meets a few of these functions. For example a URI that is used to identify your shopping cart when you make a purchase on amazon.com is going to be globally unique, but not stable or type-able. A private URI that you are using to call up information within your organization (maybe with a query string on the end) may meet some of these functions and might actually be globally unique. However, I would not count either of these things as a "GUID" in the sense of an identifier for permanent, public exposure and consumption, and intended as an object for a property in somebody else's RDF. I would assert that a "good" GUID in that sense ought to meet all of the 8 criteria that I listed above. We know how to do those things.
There are plenty of resources online about how to achieve content negotiation, "cool" URIs (http://www.w3.org/TR/cooluris/), GUID standards (http://www.tdwg.org/standards/150/) and best practices (http://www2.gbif.org/Persistent-Identifiers.pdf http://links.gbif.org/persistent_identifiers_guide_en_v1.pdf), etc. that tell us how to do this. We have examples in our community that show us how to do these things and people who know how to do them. There is, therefore, no excuse for us to be creating public GUIDs that only do part of these things.
Rich provides these examples of identifiers:
A. A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
B. urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
C. http://zoobank.org/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF4...
D. http://zoobank.org/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
E. http://lsid.tdwg.org/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5B...
F. http://www.google.com/search?q=Danaus+plexippus+(Linnaeus+1758)
G. http://lsid.tdwg.org/summary/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB...
H. http://darwin.zoology.gla.ac.uk/~rpage/lsid/tester/?q=urn:lsid:zoobank.org:a...
and then says: "But really, from the perspective of the end-user, does it matter if it's an identifier or a service? Ultimately, they ask the questions, and the answers appear on their computer screens." I would answer this question by saying "yes, it does matter!" - it is important that a well-designed GUID do more than just throw something up onto a human user's web browser. All 8 of these examples are globally unique (test 1). However, many of them flunk one or more of the seven other "tests" that I've laid out above. A and B flunk test 2. I applied test 6 to all 8 of them by pasting them into Firefox to see if they would produce a human-friendly webpage and this test was flunked by A, B, and E. I applied test 7 to all of them by checking them with the OpenLink RDF browser (http://demo.openlinksw.com/rdfbrowser2/). Only E passed that test. All of them are pretty rotten on test 5, although D isn't too bad. The fact that there are at least five identifiers suggested which might be appropriate for use in metadata (A through E) makes the set as a whole flunk test 4. The bottom line is that not one of the 8 suggested identifiers passes all 8 of the tests. To me that is unacceptable. We simply should not be satisfied with that when it is clear that we can create identifiers that will meet all 8 tests.
Lest anyone doubt me, here are three examples of "good" GUIDs created by members of our community which pass all 8 tests; there are undoubtedly more. http://biodiversity.org.au/apni.taxon/118883 http://lod.taxonconcept.org/ses/v6n7p http://biocol.org/urn:lsid:biocol.org:col:35259 All three pass the "cool URI" test and clearly resolve to metadata for both humans and computers. What I am wanting for taxon names is identifiers that pass the 8 tests I listed.
An important question that I think has been underlying much of this discussion is whether GUIDs are actually needed for names. If one takes the position that a "name" can never be more than a string without crossing the line into being something more complicated like a TNU or TaxonConcept, then I think one could make the case that the answer to this is "no". There isn't a whole lot that one would want to know about the string that couldn't just be imparted by letting it be a string literal. If one takes this position, then "Quercus alba L." is a different "thing" (i.e. resource) from "Quercus alba" or "Quercus alba Linnaeus". It seems that something like this is the position that Rich and the GNI are taking. Under this scenario, there is little point in creating URI GUIDs for the name strings.
On the other hand, if one takes the position that a name can be a conceptual entity that has properties which include its name string(s) and parts thereof, then it does make sense to apply GUIDs to that kind of entity. I am thinking about a tn:TaxonName as defined in the TDWG ontology (see http://code.google.com/p/tdwg-ontology/source/browse/trunk/ontology/voc/Taxo...), which comes out of the TCS schema (see http://code.google.com/p/darwin-sw/wiki/ClassTaxon for info and links regarding TCS). A tn:TaxonName is "An object that represents a single scientific biological name..." i.e. an "object" NOT defined as a string. Under the TCS, it is not kosher for a tn:TaxonName to have properties that tell how it is related to higher taxa - this is reserved for TaxonConcept instances. However, it is perfectly fine for an instance of tn:TaxonName to have properties that define the parts of its name (genus, species, and infraspecificEpithet if any) and other metadata directly associated with the name. Assigning a "good" GUID to a tn:TaxonName would be a desirable thing from the standpoint of test #4 because it would allow multiple people to assert that they were talking about the same name without having to worry about whether they did or did not include the author in the string and whether "Quercus lobata Née" is the same thing as "Quercus lobata Nee". The "atomized" parts of the name could be provided in a single unit of metadata rather than having to repeat them for every concept like "Quercus alba L. sensu Weakley 2010", "Quercus alba L. sensu Gleason and Cronquist 1991", "Quercus alba L. sensu Radford et al. 1968", "Quercus alba L. sensu Wofford and Chester 2002", etc.
Currently I'm following the approach for marking up Taxon metadata that was outlined by Cam and me at http://code.google.com/p/darwin-sw/wiki/ClassTaxon - it's based primarily on ideas from TCS and influenced by RDF examples posted by several people on this list (specifically it uses the Taxon Concept parts of the unfinished TDWG ontology which is NOT a standard, but is based on the TCS standard). We considered it to be relatively "safe" because it was based primarily on TCS which is a ratified TDWG standard and seems to mesh well with what most people seem to intend when they are talking about Taxon/TaxonConcept instances. We hope that this would avoid descending into "taxon concept hell" as would happen if we defined our own personal idea of what we think a taxon concept is. (In the interest of moving forward constructively in a relatively short period of time we hope that others will follow the same practice.) What I'm doing is creating temporary tc:TaxonConcept (a.k.a. tc:Taxon) resources to which I'm linking the dwc:Identification instances. I call them temporary because I don't really intend for others to use them and I'm hoping eventually to replace them with links to GNUB URIs when such things exist. In a nutshell, the RDF looks like this:
<tc:TaxonConcept rdf:about="http://bioimages.vanderbilt.edu/taxon/19290-weakley2010">
  <tc:nameString>Quercus alba L.</tc:nameString>
  <tc:hasName rdf:resource="http://www.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank:4..."/>
  <tc:accordingToString>Weakley, A.S., 2010. Flora of the southern and mid-Atlantic states, working draft of 8 March 2010. University of North Carolina Herbarium, North Carolina Botanical Garden, Chapel Hill, NC, US. http://www.herbarium.unc.edu/flora.htm</tc:accordingToString>
  <tc:accordingTo rdf:resource="http://www.herbarium.unc.edu/flora.htm"/>
</tc:TaxonConcept>
In reality the RDF is more complicated than that but for the purposes of discussion, these are the important features. [For now you can ignore the "accordingTo" parts because it's not really right according to TCS and the URI is for a web page, not the actual reference.] For the purposes of "stupid" clients (e.g. a Linked Data browser), the tc:nameString and tc:accordingToString literals would give them something to throw on a screen for humans to look at. A smarter client might be able to parse out the name string and do something clever with it. However, as someone who has listened to Pete's preaching and is trying hard to be a Linked Data/Semantic Web Believer, I want to put a functioning URI as the object of the tc:hasName property so that a real smarty-pants client could do exceptionally clever things like decide that maybe my "Quercus alba L." pictures should be put on the same webpage as somebody else's "Quercus alba" pictures and with a third person's "Quercus alba Linnaeus" pictures. I don't know how to do that, but I have faith that there are people on the list who know how, and so I want to give them the Linked Data resources to make that possible. This is similar to the rationale for using the URI http://bioimages.vanderbilt.edu/contact/baskauf to refer to me rather than the strings "Steve Baskauf", "Steven J. Baskauf", "S. Baskauf", etc. etc.
As I think about it, a primary purpose of providing a URI for a taxon name is to allow a machine to know that names which are referred to in two different taxon concepts are the same. This can be accomplished in one of two ways. One is simply make sure that users who are trying to look up two variant name strings for the same taxon name are directed to the same identifier. This is essentially what happens when one looks up a TSN on the ITIS website. If I search the ITIS website for either "Quercus alba L." or "Quercus alba", I'm sent to the record for TSN 19290. In this scenario, there is a single identifier for all of the lexical variants of Quercus alba L. When comparing two taxon instances to see if they have the same taxon name part, a machine has virtually no work to do in order to know that two name strings represent the same tn:TaxonName because even if the tc:nameString literal for the taxon has a different string literal value, as long as the URI object of tc:hasName is the same, then the names are the same.
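To make the "virtually no work" point concrete, here is a small sketch. The class is a stand-in for a tc:TaxonConcept, and the hasName URI is hypothetical (as far as I know there is no resolvable http URI for an ITIS TSN); only the comparison logic matters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaxonConcept:
    name_string: str   # literal, like tc:nameString
    has_name: str      # URI, like the object of tc:hasName
    according_to: str  # literal, like tc:accordingToString


def same_taxon_name(a: TaxonConcept, b: TaxonConcept) -> bool:
    """Two concepts share a name if their hasName URIs are identical."""
    return a.has_name == b.has_name


# Hypothetical URI built from TSN 19290; not a real resolvable identifier.
NAME_URI = "http://purl.org/gni/itis/19290"

weakley = TaxonConcept("Quercus alba L.", NAME_URI, "Weakley 2010")
gleason = TaxonConcept("Quercus alba", NAME_URI, "Gleason and Cronquist 1991")

literals_differ = weakley.name_string != gleason.name_string
names_agree = same_taxon_name(weakley, gleason)
```

The string literals differ, but the client never has to parse them to conclude that both concepts use the same name.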
The situation at uBio is different. uBio assigns a different identifier for each string and if one searches for different lexical variants, one gets different identifiers (although as one types a name on their website, the name version with the preferred author abbreviation shows up at the top of the "suggestions" list). However, although uBio provides separate identifiers for separate name strings, the RDF metadata that is returned when the "urn:lsid:ubio.org:namebank:448328" URI for "Quercus alba L." is resolved includes statements like <ubio:lexicalVariant rdf:resource="urn:lsid:ubio.org:namebank:2645936"/> whose object is the URI for "Quercus alba". So although two users who are creating tc:Taxon records for taxa with the same tn:TaxonName may get two different uBio ID numbers for two lexical variants, the properties of a uBio name string include metadata property connections which could allow a client to figure out whether two name strings are really the same tn:TaxonName.
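One way a client might use those connections is sketched below. It assumes the ubio:lexicalVariant statements have already been fetched and reduced to (subject, object) pairs; no request to uBio is made here, and treating the links as symmetric and transitive is my assumption, not something the uBio metadata promises.

```python
from collections import defaultdict, deque
from typing import Iterable, Tuple


def same_name(a: str, b: str, links: Iterable[Tuple[str, str]]) -> bool:
    """True if a and b are connected through lexicalVariant links."""
    graph = defaultdict(set)
    for subject, obj in links:
        graph[subject].add(obj)
        graph[obj].add(subject)  # assumption: variants are symmetric
    seen, queue = {a}, deque([a])
    while queue:
        current = queue.popleft()
        if current == b:
            return True
        for neighbour in graph.get(current, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return False


ALBA_L = "urn:lsid:ubio.org:namebank:448328"   # "Quercus alba L."
ALBA = "urn:lsid:ubio.org:namebank:2645936"    # "Quercus alba"
RUBRA_L = "urn:lsid:ubio.org:namebank:448439"  # "Quercus rubra L."

links = [(ALBA_L, ALBA)]
variants_agree = same_name(ALBA, ALBA_L, links)
unrelated = same_name(ALBA_L, RUBRA_L, links)
```

So the work that ITIS does up front (one identifier for all variants) is pushed onto the client here, but the metadata makes it possible.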
Now what about the GNI? At the moment, if I search for either "Quercus alba L." or "Quercus alba", I get a web page showing "Lexical Groups" which I guess may or may not actually be the same conceptual name. I'm not directed to any preferred version nor am I told which ones are considered variants of others. I don't know the algorithm for generating the UUIDs for these strings so I can't actually look up what RDF is returned for those Quercus alba strings, but based on the examples of other names that Pete gave, the RDF just seems to say "this string is the scientific name Quercus alba" and has no information telling me what other strings are variants. So what I'm struggling to figure out is what purpose there actually would be for me to use any kind of GNI URI as the object of my tc:hasName property. A URI based either on the name string or a URI containing a UUID generated from the name string would simply tell me that the thing that I'm talking about is that name string. That is exactly the same information that I already provide in the literal value of the tc:nameString property. Maybe there is a plan at some time in the future to add RDF metadata of the kind that uBio provides about variants and then I guess there would be a point in using a URI with tc:hasName that resolves to that RDF. But if the GNI is only a "dirty bucket" that accumulates every name string that anybody has ever used in history but with little or no metadata, then I can't see that I have any use for a URI pointing to it, at least as something to which I would refer in RDF. I'm not saying that there isn't a use for the GNI. I think what I'm saying is that there doesn't seem to be any point in worrying about how to create URIs for the GNI when those URIs don't "do" anything different from what a string literal does. I think this is essentially what Rich was saying: "that text string represents a perfectly suitable unique identifier. There is no need to generate a surrogate identifier like an integer number or UUID or LSID or whatever".
From the standpoint of what I think of as the "general Bob Morris naughty test" (am I being naughty to assert that some resource has an rdf:type that doesn't make sense?), I believe that one could technically use either a uBio http proxied LSID or a resolvable identifier created from an ITIS TSN (which unfortunately does not exist to my knowledge) to represent a tn:TaxonName. Although Rich has been very cautionary about maintaining the distinction between ITIS TSNs, which he believes to represent some kind of minimal TNU and uBio IDs which he believes to represent a name string, I haven't been able to find any evidence that it would be "naughty" to assert that either one is a tn:TaxonName. The RDF returned when a uBio LSID is resolved does not give any rdf:type. It simply says that the resource has a dc:type of "scientific name". This could really mean anything we want since scientific name is not a part of the DC type vocabulary. There is also nothing in the RDF that defines the resource as being of the string data type (although one of its properties, ubio:canonicalName, does). So I can't see that I would be creating any "collision" of rdf:types by asserting the resource to be of type tn:TaxonName. The ITIS website does return tc:Taxon (a.k.a. tc:TaxonConcept) type information when one looks up a TSN, but since as far as I know ITIS doesn't provide any RDF describing the resources identified by their TSNs, there isn't any collision if I assert that such a resource is a tn:TaxonName. Also, if one reads the information that was referenced earlier in the thread at http://www.itis.gov/pdf/faq_itis_tsn.pdf , it seems pretty clear that ITIS intends for the subject of their TSNs to be names, not taxon concepts even if the metadata they send to humans on their website seems to say otherwise.
As far as what URI I stick in my RDF as the object of the tc:hasName property (which incidentally does not HAVE to be a tn:TaxonName, since tc:hasName does not have a defined range), at this point I guess the uBio URI is the only choice since there aren't ITIS http URIs (as far as I know). Steve
Nicolson, David wrote:
Rich (et al.)... Just a quick comment re ITIS TSNs, since Rich posited:
My understanding (David N.: correct me if I'm wrong), is that all TSNs that correspond to "valid/accepted" names (where [taxonomic_units].[usage]='valid'|'accepted') essentially represent a taxon concept. The rest of the TSNs (where [taxonomic_units].[usage]='invalid'|'not accepted') represent a variety of things, ranging from different combinations to alternate spellings to subjective synonyms, each of which is referable to one of the "valid/accepted" names.
I would say "sort of".... The TSNs do not themselves correspond with much of anything other than a unique, persistent, non-intelligent identifier for a "scientific name" (I realize that begs your next question/point of what that term "means") record in the context of the ITIS data system. See this linked from the "About ITIS" page: http://www.itis.gov/pdf/faq_itis_tsn.pdf
Re "Scientific Name", as you hopefully see in the above document, the term in ITIS generally corresponds to what I see from the ICBN use (Art. 16-24) and the ICZN use (Art. 4-5 in particular, as the "combination" formation, rather than the more atomized uses like "specific name" which is like "epithet"). There are of course other things in ITIS with TSNs, like database artifacts, that are labeled as such and retained but hidden from most users to avoid confusion and not strand any user that might already have the TSN.
By way of an example of the use of "name" fields.... Just recently I was given a nice "finished" world dataset for a modest animal-family-that-shall-remain-nameless, and the "name" fields were in some cases just as ITIS uses them, and in others there were additional things like authorship and so on lumped in with the name parts in those fields, though there were no years provided even then. So, usable, but the amount of work to essentially re-parse the data was surprising for just a couple hundred names, and even then they were inconsistent and incomplete, so someone now has to go collect all the missing details and go over it all again, and it clearly needs some smoothing around the edges as well. That was for just 200+ names from a single source. Ugh, thanks....
As to the relationship to taxon concept, if you squinted your eyes "just so" you could qualify as Rich did above and suggest that those TSNs that happen to represent names with usage=valid/accepted (and preferably those with some level of verification indicated, vs. the legacy data we're still dealing with!) "essentially represent a taxon concept", but I don't really think that is appropriate at this point.... actually the closest thing in ITIS to a "taxon concept" would be certain entries in the reference_links table (the intersection between the scientific names entries and the reference entries), but even that is too abstract in my view. Since any number of references may be linked to a single TSN, that TSN won't necessarily yield something that maps to "a taxon concept" unless you're thinking "sensu ITIS v2011-05-31" or something of that ilk, which is I guess another way to think about it, with its own pros/cons.
And I agree with Rich's warnings of many pitfalls below (dragons and such).
I'll leave it there. Oops. So much for the "quick" comment....
Best, Dave
David Nicolson Data Development Coordinator, Integrated Taxonomic Information System Biologist, USGS Core Science Systems, Biological Informatics Program nicolsod@si.edu Office 202-633-2149 Fax 202-786-2934 http://www.itis.gov/ http://www.cbif.gc.ca/itis/ "Nihil sumas necesse est..."
-----Original Message----- From: Richard Pyle [mailto:deepreef@bishopmuseum.org] Sent: Friday, June 03, 2011 3:16 PM To: 'Steven J. Baskauf'; 'Kevin Richards' Cc: tdwg-content@lists.tdwg.org; Orrell, Thomas; 'Alan J Hampson'; Nicolson, David; 'Gerald Guala' Subject: RE: [tdwg-content] ITIS TSNID to uBio NamebankIDs mapping
Hi All,
I'm just catching up on email now, after a series of other work-related obligations and virtual attendance at a cybertaxonomy/e-literature meeting in Chicago this week. I do not now have time to review the entire thread, so I'll jump into the stream with Steve's recent post.
I think that one reason why this question has been on my mind is that I've been waiting for GNUB (Global Name Use Bank) to come out.
Just a quick update: due to budgetary woes in the U.S. Federal Government, NSF funding for awarded proposals has been pushed ever further back. If I'm not mistaken, something like 18 months passed between proposal submission and availability of funds for the BiSciCol grant, which our institution was only able to (finally!) start processing within the past few months. Why is this relevant to GNUB? Because the BiSciCol grant includes the most substantial funding yet for implementation of GNUB (indeed, the only funding for GNUB by name). The good news is that, now that funding is in hand and money (finally) flowing, development & implementation of GNUB is ramping up quickly. And the promise of more (and more substantial) funding is just around the corner (watch this space).
I'm not really up on how it is going to work, but my impression is that it was going to be based on the Global Name Index (GNI) which was mentioned in that earlier January thread.
Not exactly. GNI and GNUB represent two ends of a spectrum. GNI is at the "minimal metadata/maximal content" end of the spectrum -- basically a repository of any text-string purported to represent a taxon name that can be linked via a resolvable identifier. GNUB is at the "richly metadata'd/carefully curated" end of the spectrum, representing a highly normalized structure with permanent resolvable GUIDs and the potential for robust information/data services. In the vernacular, GNI is the "dirty bucket", and GNUB is the "clean bucket". At the moment, the connection between GNUB and GNI is unidirectional, in that the content of the progenitor of GNUB has been indexed in GNI, but there is no mechanism (yet) for GNI content to feed into GNUB. The reason for this is fairly straightforward: it's very easy to flatten out normalized content into simple text strings (GNUB-->GNI), but it's much more difficult (impossible?) to migrate metadata-poor, moderately parsed content into a highly structured system.
At that point, the GNI names didn't have any identifiers that were exposed to the public as permanent GUIDs. I'm assuming that if GNUB refers to GNI names, they will have some kind of identifiers. So if that happens how is the GUID recommendation 8 going to be followed? As Kevin said in http://lists.tdwg.org/pipermail/tdwg-content/2011-June/002499.html "What I take from recommendation 8 of the GUID applicability guide ... is that if you DON'T already have a record in your own database for a taxon name/concept, then reuse an existing one. " What we have here with GNI is a situation where none of the records have identifiers. In my mind, the "best practice" according to recommendation 8 would be for the GNI to reuse existing identifiers where they exist and NOT make up new ones. This is a bit more complicated because the ITIS identifiers (which are in common use) don't have an http URI version that is resolvable, and while the uBio identifiers have a resolvable http URI, it's in the form of a proxied LSID, which I've already complained is very ugly. So I'd like to hear some ideas about how to have "reused" identifiers in the GNI.
In terms of GUIDs, the objects in GNUB and the objects in GNI are not the same, and therefore cannot share identifiers. The core object in GNI is a text-string. Indeed, the text string itself can be the actual identifier, because it *is* the thing being identified. In other words, because the essential uniqueness of an instance (record) in GNI by definition *is* the text string (i.e., the series of UTF-8-encoded characters), then that text string represents a perfectly suitable unique identifier. There is no need to generate a surrogate identifier like an integer number or UUID or LSID or whatever (except, perhaps, for internal use as a primary key for joining tables; but those identifiers need not/should not be exposed to the outside world).
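To illustrate the distinction, here is a sketch of the two options for a name-string record: using the UTF-8 text itself (percent-encoded so it can travel in a URI) versus deriving a surrogate UUID from it. The example.org base and the use of uuid5 are assumptions made for the illustration; this is not a description of how GNI actually generates its UUIDs.

```python
import uuid
from urllib.parse import quote, unquote

# Hypothetical base; stands in for whatever domain would host the names.
BASE = "http://example.org/name/"


def string_uri(name_string: str) -> str:
    """Use the text itself as the identifier, percent-encoded as UTF-8."""
    return BASE + quote(name_string, safe="")


def surrogate_uuid(name_string: str) -> uuid.UUID:
    """Derive a deterministic surrogate from the same text (assumed scheme)."""
    return uuid.uuid5(uuid.NAMESPACE_URL, BASE + name_string)


with_accent = "Quercus lobata Née"
without_accent = "Quercus lobata Nee"

uri = string_uri(with_accent)
round_trip = unquote(uri[len(BASE):])
```

Either form identifies exactly one string and nothing more: the two spellings above yield different URIs and different UUIDs. The encoded URI can be decoded back to the name, while the UUID cannot be read without a lookup.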
By contrast, the core object in GNUB is a taxon name usage instance -- which is a purely abstract notion of the usage of a taxon name within some documentation source (like a publication). In this case, the text-string name is merely a property of the GUID-identified object, and would be an extremely BAD choice to use as a unique identifier. This is why GNUB needs to generate a unique identifier to represent this core data object. The form that identifier takes (UUID, LSID, integer, DOI, whatever) from the perspective of the end user should be completely irrelevant, because the user should rarely (if ever) see it, and should certainly *never* be in a position to type it on a keyboard (we can discuss the appearance of ZooBank LSIDs on printed pages separately). All that matters is that it is a persistent, globally unique identifier that can be used to cross-link information and can be conveniently resolved to the metadata of the object it represents.
But the point is, recommendation 8 of the GUID applicability guide is not being violated in the context of GNI and GNUB.
The real problem in all of this is the inconsistent meaning people apply to the notion of a "taxon name". In GNI-space, the name is simply a text string. In GNUB-space, the "name-object" is a code-compliant Protonym that serves to cross-link Name-usages to each other. ITIS is different still. My understanding (David N.: correct me if I'm wrong), is that all TSNs that correspond to "valid/accepted" names (where [taxonomic_units].[usage]='valid'|'accepted') essentially represent a taxon concept. The rest of the TSNs (where [taxonomic_units].[usage]='invalid'|'not accepted') represent a variety of things, ranging from different combinations to alternate spellings to subjective synonyms, each of which is referable to one of the "valid/accepted" names. CoL uses names as proxies to taxon concepts (not sure how they handle synonyms vs. misspellings, etc.) And there are other variations as well -- to most botanists, "Aus bus L." and "Xus bus (L.) Smith" represent "different names", whereas most zoologists (who would not bother to include the "Smith") regard them as "different combinations of the same name" (zoologists are less consistent than botanists in this regard).
The point is, this inconsistency and heterogeneity of what is meant by a "name" in taxonomy is, in my opinion, the single GREATEST obstacle in achieving informatics harmony among biodiversity datasets.
One thing that comes to my mind would be to have a "domain name" like "http://purl.org/gni/" or "http://purl.org/tn/" ("tn" for "taxon name") and to follow it with a namespace/id combination similar to what is done with lsids. So for example "itis/19408" and "ubio/448439" could be appended, creating http://purl.org/gni/itis/19408 and http://purl.org/gni/ubio/448439 for "Quercus rubra L." Both URIs could point to the same RDF and that RDF could indicate that the two identifiers are owl:sameAs .
This syntax is basically what ZooBank does (and GNUB will do), within their own domain name. But I like the idea of a common URL domain that allows these qualified identifiers to be appended.
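To make the proposed syntax concrete, here is a minimal sketch in Python. It assumes the purl.org/gni/ base from the quoted suggestion (a proposal in this thread, not a service known to exist); the function names are hypothetical. It only shows how the qualified identifiers and an owl:sameAs statement would be spelled, and says nothing about whether asserting owl:sameAs is appropriate, which is the caveat raised next.

```python
# Sketch only: BASE is the domain *proposed* in the quoted message,
# not an existing resolver. Function names are illustrative.
BASE = "http://purl.org/gni/"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def qualified_uri(namespace: str, local_id: str) -> str:
    """Append a namespace/id pair to the shared base, LSID-style."""
    return f"{BASE}{namespace}/{local_id}"


def same_as_triple(uri_a: str, uri_b: str) -> str:
    """Return one N-Triples line asserting owl:sameAs between two URIs."""
    return f"<{uri_a}> <{OWL_SAME_AS}> <{uri_b}> ."


itis_uri = qualified_uri("itis", "19408")
ubio_uri = qualified_uri("ubio", "448439")
print(same_as_triple(itis_uri, ubio_uri))
```

The namespace segment ("itis", "ubio") plays the same role as the authority and namespace parts of an LSID: it qualifies a local identifier so that it is unique under the shared base.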
The real problem is what you describe next:
I realize from what Bob Morris has cautioned in the past that there are problems with owl:sameAs when the two things aren't actually the same thing (e.g. if the uBio ID refers to a name string only but the ITIS TSN refers to the name plus an "accepted" status and a relationship to parent taxa).
Do NOT underestimate the significance of this point.
However, if there were an understanding that the GNI only refers to name strings, then one could still refer to http://purl.org/gni/itis/19408 as an identifier for the name string of the thing (whatever it is) that is referred to by an ITIS TSN of 19408.
Here be dragons -- for lots of reasons. At this point, you might as well just do a text-string match on the name. The problem is, you'll miss the match if authorship is not identical, but you risk homonymy mis-match if authorship is not included.
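The trade-off can be illustrated with a small Python sketch. The `canonical` helper is a deliberately crude assumption (keep the first two words); it is not how GNI, uBio or ITIS parse names, and the "Aus bus" homonyms are made-up placeholders.

```python
def canonical(name: str) -> str:
    """Crude canonical form: genus + epithet only (illustrative assumption).

    Ignores infraspecific ranks, hybrids, subgenera and so on.
    """
    return " ".join(name.split()[:2])


def match(a: str, b: str, use_authorship: bool) -> bool:
    """Exact string match, or match on the canonical form only."""
    if use_authorship:
        return a == b
    return canonical(a) == canonical(b)


# Same name, authorship written two ways: the strict match misses it.
strict_miss = match("Quercus rubra L.", "Quercus rubra Linnaeus", use_authorship=True)
loose_hit = match("Quercus rubra L.", "Quercus rubra Linnaeus", use_authorship=False)

# Two hypothetical homonyms by different authors: the loose match
# wrongly treats them as the same name.
homonym_false_positive = match("Aus bus Smith", "Aus bus Jones", use_authorship=False)

print(strict_miss, loose_hit, homonym_false_positive)
```

Neither setting is safe on its own: including authorship loses true matches, and dropping it merges homonyms.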
I have no idea whether this would be a good idea or not, but I was really
cringing
to think about 19 million newly minted UUIDs appended to
and figuring out how to connect those horrid things to the names and ITIS
TSNs
that I'm already using. I think that I said this before, but using the
purl.org domain
rather than one like http://gni.globalnames.org/ would in the future allow
somebody else to take over management of providing the metadata when the GUIDs are resolved without having to deal with issues of who "owns" the
domain name.
As I said before, I think it's perfectly fine to generate UUIDs for internal purposes within GNI for various performance reasons (or whatever), but I don't think it's wise to expose those UUIDs externally. Because the uniqueness of a GNI record *is* the text string, then it makes more sense to me to simply use the text string. However, that only works for GNI/uBio/NameBank, where the essence of the record *is* the text string. It's a non-starter for other datasets like GNUB, ITIS, CoL, and most others, where the essence of the record is something altogether different.
Aloha, Rich
Richard L. Pyle, PhD Database Coordinator for Natural Sciences Associate Zoologist in Ichthyology Dive Safety Officer Department of Natural Sciences, Bishop Museum 1525 Bernice St., Honolulu, HI 96817 Ph: (808)848-4115, Fax: (808)847-8252 email: deepreef@bishopmuseum.org http://hbs.bishopmuseum.org/staff/pylerichard.html
Hi Steve,
Excellent post!
I like your list of what we want "GUIDs" (see below) to do, and I think it's an excellent starting point for a bar we should all strive for. I'm particularly grateful to learn that the existing ZooBank service fails so many of them. I've forwarded your post to Rob Whitton, who will be working on Gen-2 of ZooBank in the coming weeks, and asked him if we can use your 8 tests as a metric to adhere to. Watch this space.
Meanwhile...
"But really, from the perspective of the end-user, does it matter if it's an identifier or a service? Ultimately, they ask the questions, and the answers appear on their computer screens."
I would answer this question by saying "yes, it does matter!" - it is important that a well-designed GUID do more than just throw something up onto a human user's web browser.
I absolutely agree with you, but that's not the distinction I was making in my quoted text. I was only talking about whether we call something an "identifier" (not GUID, which has more specific implications), or a "service", in the context of human-machine conversations. I think your enumeration of things we want GUIDs to do is a very good framework for discussion. I would only caution that "GUID" means different things to different people (some people use it synonymously with UUID, for example), and also that GUID does not imply "actionable". There has been a bit of a debate over the importance of embedding "actionability" into identifiers inherently (the Tim Berners-Lee perspective), vs thinking about "identification" separately from how we perform some action on it. For example, UUIDs and Social Security numbers are extremely useful identifiers, even though they are not inherently actionable. It's amazingly easy to perform action on a non-actionable identifier by simply appending it to an actionable prefix. For example, going back to the list of "identifiers":
A. A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
B. urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
C. http://zoobank.org/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
D. http://zoobank.org/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
E. http://lsid.tdwg.org/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
F. http://www.google.com/search?q=Danaus+plexippus+(Linnaeus+1758)
G. http://lsid.tdwg.org/summary/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
H. http://darwin.zoology.gla.ac.uk/~rpage/lsid/tester/?q=urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523&submit=Go
There are two different ways of looking at this:
1) There are 8 different identifiers
2) There is one identifier (A), and 6 ways to perform action on it (B-E, G-H).
If you treat them all as distinct identifiers, then let me add a few more to the list:
I. http://zoobank.org/?lsid=urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
J. http://zoobank.org/?id=urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
K. http://zoobank.org/?id=A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
L. http://zoobank.org/?uuid=A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
Note that all four of the above, plus B-D in the original list, are all resolved through zoobank.org. Why are there so many different ways to perform action on the "same" identifier? Because I wanted the ZooBank resolution service to be flexible. And, because in my mind, there is only one identifier (A); and lots of different ways to retrieve the metadata of the object it represents.
Now consider this from the TB-L perspective. Eleven different identifiers for the same object (excluding F). Does that mean we need to generate owl:sameAs statements for all pair-wise relationships? That's a lot of owl:sameAs statements! Maybe I'm the bad guy for foolishly allowing so many different ways to resolve ZooBank identifiers, and for needlessly fabricating so many "different" identifiers for the same thing. Fair enough. But I still think we're a lot better off by disentangling identifiers from the services we use to perform action on them.
One of the arguments on the TB-L side is that a non-actionable identifier by itself is useless if you cannot inherently perform action on it. For example, if you were walking through the park and stumbled upon a slip of paper with "A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523" written on it, you probably wouldn't be able to do much with it. But in reality, that's not what happens. We never expose identifiers as simple context-free identifiers in their non-resolvable form. These identifiers are *always* exposed in some context. The problem is that if you treat the "resolution metadata" (as I call it -- e.g., "urn:lsid:zoobank.org:act:" or "http://zoobank.org/") as *part* of the identifier (as you have to do if you make things like "urn:lsid:ubio.org:namebank:11815"), then it becomes difficult for an application to distinguish between "http://zoobank.org/?id=A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523", and "http://zoobank.org/?uuid=A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523"; which, to a human, obviously refer to the same thing. In other words, absent all those owl:sameAs statements, an application could break if it harvests content from different sources that use different resolution metadata for the "same" (sensu Pyle) identifier.
Maybe what we need to think about is a registry of "persistent resolution services", which our community relies on. That way, we can apply the owl:sameAs statements to the resolution services, rather than to every single individual identifier.
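A rough Python sketch of the two views discussed above. It uses a regular expression to pull the bare UUID out of each form, which is only a stand-in for a real registry of resolution services (no such registry is described in this thread); `bare_identifier` is a hypothetical helper. It also counts how many pair-wise owl:sameAs statements the eleven forms would need if each were treated as a distinct identifier.

```python
import re
from itertools import combinations

# Matches the 8-4-4-4-12 hex layout of a UUID anywhere in a string.
UUID_RE = re.compile(r"[0-9A-Fa-f]{8}(?:-[0-9A-Fa-f]{4}){3}-[0-9A-Fa-f]{12}")


def bare_identifier(form: str):
    """Strip any 'resolution metadata' and return the bare UUID, or None."""
    found = UUID_RE.search(form)
    return found.group(0).upper() if found else None


UUID = "A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523"
LSID = "urn:lsid:zoobank.org:act:" + UUID

# Forms A-E and G-L from the lists above (F, the Google search, is excluded).
forms = [
    UUID,                                        # A
    LSID,                                        # B
    "http://zoobank.org/" + LSID,                # C
    "http://zoobank.org/" + UUID,                # D
    "http://lsid.tdwg.org/" + LSID,              # E
    "http://lsid.tdwg.org/summary/" + LSID,      # G
    "http://darwin.zoology.gla.ac.uk/~rpage/lsid/tester/?q=" + LSID + "&submit=Go",  # H
    "http://zoobank.org/?lsid=" + LSID,          # I
    "http://zoobank.org/?id=" + LSID,            # J
    "http://zoobank.org/?id=" + UUID,            # K
    "http://zoobank.org/?uuid=" + UUID,          # L
]

distinct = {bare_identifier(f) for f in forms}
pairwise_same_as = len(list(combinations(forms, 2)))

print(len(forms), len(distinct), pairwise_same_as)  # 11 1 55
```

Treated as strings, the eleven forms need 55 unordered owl:sameAs pairs; treated as one identifier plus resolution metadata, they collapse to a single value.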
An important question that I think has been underlying much of this discussion is whether GUIDs are actually needed for names.
I think the answer is clearly "yes". The problem is defining what is meant by the word "name".
If one takes the position that a "name" can never be more than a string without crossing the line into being something more complicated like a TNU or TaxonConcept, then I think one could make the case that the answer to this is "no".
Perhaps, but I don't know of anyone who takes that position. GNI/uBio/NameBank exist for a very specific purpose, and in that very narrow context, the "name" is equivalent to the UTF-8-encoded string of characters. The architects of these systems would be the first to say that this is a very limited context for what a "name" is, and *none* of them would assert that a "name" can never be more than this. Everyone I know understands that all other flavors of "name" imply something much, much more than the string of text characters.
There isn't a whole lot that one would want to know about the string that couldn't just be imparted by letting it be a string literal. If one takes this position, then "Quercus alba L." is a different "thing" (i.e. resource) from "Quercus alba" or "Quercus alba Linnaeus". It seems that something like this is the position that Rich and the GNI are taking. Under this scenario, there is little point in creating URI GUIDs for the name strings.
I only took that position in the *very narrow* context of GNI, which is unusual among the millions of taxonomic datasets in treating a "name" as a distinct text string. And I backed off from that position after reading Dima's post.
On the other hand, if one takes the position that a name can be a conceptual entity that has properties which include its name string(s)
...as, I think, everyone does...
and parts thereof, then it does make sense to apply GUIDs to that kind of entity. I am thinking about a tn:TaxonName as defined in the TDWG ontology (see http://code.google.com/p/tdwg-ontology/source/browse/trunk/ontology/voc/TaxonName.rdf), which comes out of the TCS schema (see http://code.google.com/p/darwin-sw/wiki/ClassTaxon for info and links regarding TCS).
A tn:TaxonName is "An object that represents a single scientific biological name..." i.e. an "object" NOT defined as a string.
While it's nice to see the explicit representation of a "name" as an object, rather than a string; unfortunately that doesn't address the elephant in the room; that is, that different people have different notions of what "a single scientific biological name" is. I'm not talking subtly different shades of fundamentally the same thing; I'm talking about fundamentally different things with different implied sets of properties. This is one of the issues I continued to hammer on during the development of TCS, and the one that gave me the biggest qualms about TCS 1.0. My hope was that it would be resolved in TCS 2.0. I wanted to reduce both names and concepts to the same core entity: usage instances. That's exactly what we're doing with GNUB.
But if the GNI is only a "dirty bucket" that accumulates every name string that anybody has ever used in history but with little or no metadata, then I can't see that I have any use for a URI pointing to it, at least as something to which I would refer in RDF.
I think it's helpful to see GNI and GNUB as a yin-and-yang sort of thing. There *needs* to be a service at the dirty end of the spectrum, because for the vast majority of existing biodiversity data (digitized or not), the only link we have to a taxon concept is a text-string name. There needs to be a service that manages names-as-text strings. GNUB, at the other end of the spectrum, has the rich full-context metadata that I think you are interested in, allowing for unambiguous reconciliation of different text strings as applied to type specimens, or enumerating all spelling variants of the "same" name, etc., etc. What's missing (but DEFINITELY planned and already sketched out), are the services that connect GNUB and GNI together. As soon as we hear definitively from NSF (should be soon now), we'll have the resources to start building those services.
I'm not saying that there isn't a use for the GNI. I think what I'm saying is that there doesn't seem to be any point in worrying about how to create URIs for the GNI when those URIs don't "do" anything different from what a string literal does.
I think this is essentially what Rich was saying: "that text string represents a perfectly suitable unique identifier. There is no need to generate a surrogate identifier like an integer number or UUID or LSID or whatever".
Yes, I think that's exactly what I was saying. Dima's post has forced me to reconsider this somewhat, but even still, more broadly, I never saw GNI as a service in need of "GUIDs" (in the sense that you outlined at the beginning of your post). Certainly there is value in having internal data structures to perform certain functions, but as far as I can tell, the interface between GNI and the outside world should probably be limited to human-readable name-strings.
Although Rich has been very cautionary about maintaining the distinction between ITIS TSNs, which he believes to represent some kind of minimal TNU
I would defer to Dave N.'s post concerning what a TSN is, and represents.
and uBio IDs which he believes to represent a name string, I haven't been able to find any evidence that it would be "naughty" to assert that either one is a tn:TaxonName.
That's only true to the extent that tn:TaxonName may be too broadly (imprecisely) defined (just like dwc:Taxon).
Aloha, Rich
I just added URLs with name strings as alternative 'ids' in GNI. So it is possible to have something like
http://gni.globalnames.org/name_strings/Quercus_alba
and even
http://gni.globalnames.org/name_strings/Quercus alba L. 'Elongata'
Also links like
http://gni.globalnames.org/name_strings/10507390
will be converted when accessed by a human via browser to
http://gni.globalnames.org/name_strings/Quercus_alba_L._%27Elongata'
There are still some problems, for example names ending with a period do not work yet.
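For anyone building such links programmatically, a minimal sketch in Python follows. The convention it uses (spaces become underscores, everything else outside the unreserved set is percent-encoded) is an assumption inferred from the example URLs above, not a description of how GNI actually routes requests; `name_to_url` and `url_to_name` are hypothetical helpers.

```python
from urllib.parse import quote, unquote

BASE = "http://gni.globalnames.org/name_strings/"


def name_to_url(name: str) -> str:
    """Build a name-string URL (assumed convention: space -> underscore)."""
    return BASE + quote(name.replace(" ", "_"), safe="")


def url_to_name(url: str) -> str:
    """Reverse of name_to_url; lossy if the name itself contains underscores."""
    return unquote(url[len(BASE):]).replace("_", " ")


url = name_to_url("Quercus alba L. 'Elongata'")
print(url)
print(url_to_name(url))
```

Note that this sketch encodes both apostrophes as %27, and that a name containing a literal underscore would not survive the round trip.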
Dima
On Sun, Jun 5, 2011 at 2:56 PM, Richard Pyle deepreef@bishopmuseum.org wrote:
tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content
Reading this thread makes me despair. It's as if we are determined not to make progress, forever debating identifiers and what they identify, with seemingly little hope of resolution, and no clear vision of what the goals are. We wallow in acronym soup, and enjoy the technical challenges, but don't actually get anywhere (http://iphylo.blogspot.com/2010/04/biodiversity-informatic-fail-and-what.htm... )
Here's one vision of a way forward.
1. Someone or some entity shows a little vision and courage, and provides a taxonomic classification where every node in the tree gets a DOI.
2. The nodes in the taxonomic classification are citable objects (hence the DOIs).
3. Nodes in the classification can be accessed either by DOI or by name-based HTTP URI (multiple nodes for a name bounce to ambiguity resolution pages).
4. A taxon-name extraction service locates names in text.
5. We build a taxon name/literature index (which we pretty much have already, albeit distributed and partly proprietary).
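Point 3 above (access by DOI or by name, with ambiguous names bounced to a disambiguation page) could be sketched as follows in Python. Everything here is hypothetical: the nodes, the DOIs and the return values are made up for illustration and do not describe any existing service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    doi: str     # made-up DOI, for illustration only
    name: str    # taxon name at this node
    parent: str  # parent node name, which tells homonyms apart


# Hypothetical classification nodes; "Aus" appears twice (a homonym).
NODES = [
    Node("10.5555/example.1", "Aus", "Family-one"),
    Node("10.5555/example.2", "Aus", "Family-two"),
    Node("10.5555/example.3", "Xus", "Family-one"),
]


def resolve_by_name(name: str):
    """Name-based lookup: one hit redirects, several hits need disambiguation."""
    hits = [node.doi for node in NODES if node.name == name]
    if len(hits) == 1:
        return ("redirect", hits[0])
    if hits:
        return ("ambiguous", hits)
    return ("not_found", None)


print(resolve_by_name("Xus"))
print(resolve_by_name("Aus"))
print(resolve_by_name("Zus"))
```

The DOI stays the citable identifier; the name-based route is only a convenience that either lands on a single node or asks the reader to choose.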
Now, when an author (in any field) writes a paper or publishes some data on a taxon they cite the node in the classification as they would any scientific paper. Through CrossRef's citation tracking mechanism, the taxon database automatically accumulates the scientific literature relevant to that taxon.
When a journal publishes an article it calls the name extraction service to make the names clickable, avoiding the need for the journal to create its own taxon pages (a la Pensoft), and automatically building a taxonomic index to the literature. Publishers get enriched content, we get an always up to date taxonomic index.
So, we have services that authors and publishers can use, and use an identifier scheme that publishers understand (and so do some, if not most authors).
But you say "what about RDF and the linked web?". Relax, DOIs are now linked data compliant.
But you say "what about versions?" OK, to a first approximation nobody cares about versions. They really don't. Obviously previous versions will be accessible, but the identifier always points to the most recent version.
But you say "ah but it costs money". Yep, anything worthwhile does. Last time I checked journals cost money, yet we seem to have lots of those.
But you say, "which classification to use?" Does anybody (outside taxonomy) actually care? Classification is a navigational convenience. If you care deeply, you'll make a phylogeny.
But you say, "why DOIs?" Several reasons: 1) publishers and authors understand them, 2) they avoid branding, 3) there's an infrastructure underpinning them, 4) they show that we are serious.
But "what about different taxon concepts?" To a first approximation nobody cares, and for the bulk of life we know too little for there to be much ambiguity. If we do care we can read the literature, which we have conveniently indexed.
Now, there are lots of things we could argue about, but if, say, EOL had done something like this at the start, namely embedded itself in the publication process, and major journals were citing EOL pages and linking to EOL pages, we would have a wonderful tool that was actually useful, way outside our own narrow concerns. Note that I'm using EOL as an example of an organisation with sufficient scope; GBIF would be another candidate.
I suspect a major reason for our continued failure is the lack of clearly identified users (and I don't mean people who read this list, or TAXACOM), and a failure of ambition.
Anyway, my coffee has arrived. I don't hold out much hope of any of this happening, and I fully expect us to be debating these issues a year from now. Pity.
Regards
Rod
--------------------------------------------------------- Roderic Page Professor of Taxonomy Institute of Biodiversity, Animal Health and Comparative Medicine College of Medical, Veterinary and Life Sciences Graham Kerr Building University of Glasgow Glasgow G12 8QQ, UK
Email: r.page@bio.gla.ac.uk Tel: +44 141 330 4778 Fax: +44 141 330 2792 AIM: rodpage1962@aim.com Facebook: http://www.facebook.com/profile.php?id=1112517192 Twitter: http://twitter.com/rdmpage Blog: http://iphylo.blogspot.com Home page: http://taxonomy.zoology.gla.ac.uk/rod/rod.html
Hi Rod,
There is no reason that the EoL, GBIF or both could not adopt and take ownership of what I am working on.
We have been testing different approaches and models as well as discussing various ways that this could be done.
I am actually with some of the MBL Woods Hole people today.
Since my taxonconcepts don't entail a particular classification, you can apply whatever classification you think is most appropriate.
*A species concept can have many classifications*
I have created a basic classification ontology that follows the Catalog of Life to Orders
http://lod.taxonconcept.org/ontology/phylo/CoL/CoL_2010_base.owl
HTML Doc http://lod.taxonconcept.org/ontology/phylo/CoL/doc/index.html
and Bio2RDF/Uniprot have a system based on NCBI taxon_id's
Example http://purl.uniprot.org/taxonomy/27807
In regards to DOI's, I don't see how they add more value than simple URI's.
In addition, they continue the problem of an identifier that is not the same as its form of resolution. With URI's, the identifier is the address to the informative documentation.
In addition, if we were to adopt DOI's, we would have the same problem that we have with LSID's - an identifier system that works unlike anything else in the semantic web.
It is difficult for the EoL or GBIF to anticipate all the issues and design a system *a priori* that will work and be accepted.
For now, I believe their best approach is to facilitate the discussion and test various ideas and approaches, which I believe they are doing.
Despite your argument that this process should really only include the "major players", they seem to be open to considering ideas from the entire community.
In the area of computers and information technology, how many of the major innovations came from the "major players" vs two dropouts in a garage or as the side project of some graduate students?
Respectfully,
- Pete
On Mon, Jun 6, 2011 at 4:21 AM, Roderic Page r.page@bio.gla.ac.uk wrote:
Reading this thread makes me despair. It's as if we are determined not to make progress, forever debating identifiers and what they identify, with seemingly little hope of resolution, and no clear vision of what the goals are. We wallow in acronym soup, and enjoy the technical challenges, but don't actually get anywhere ( http://iphylo.blogspot.com/2010/04/biodiversity-informatic-fail-and-what.htm...)
Here's one vision of a way forward.
- Someone or some entity shows a little vision and courage, and provides a taxonomic classification where every node in the tree gets a DOI.
- The nodes in the taxonomic classification are citable objects (hence the DOIs).
- Nodes in the classification can be accessed either by DOI or by name-based HTTP URI (multiple nodes for a name bounce to ambiguity resolution pages).
- A taxon-name extraction service locates names in text.
- We build a taxon name/literature index (which we pretty much have already, albeit distributed and partly proprietary).
Now, when an author (in any field) writes a paper or publishes some data on a taxon they cite the node in the classification as they would any scientific paper. Through CrossRef's citation tracking mechanism, the taxon database automatically accumulates the scientific literature relevant to that taxon.
When a journal publishes an article it calls the name extraction service to make the names clickable, avoiding the need for the journal to create its own taxon pages (a la Pensoft), and automatically building a taxonomic index to the literature. Publishers get enriched content, we get an always up to date taxonomic index.
So, we have services that authors and publishers can use, and use an identifier scheme that publishers understand (and so do some, if not most authors).
But you say "what about RDF and the linked web?". Relax, DOIs are now linked data compliant.
But you say "what about versions?" OK, to a first approximation nobody cares about versions. They really don't. Obviously previous versions will be accessible, but the identifier always points to the most recent version.
But you say "ah but it costs money". Yep, anything worthwhile does. Last time I checked, journals cost money, yet we seem to have lots of those.
But you say, "which classification to use?" Does anybody (outside taxonomy) actually care? Classification is a navigational convenience. If you care deeply, you'll make a phylogeny.
But you say, "why DOIs?" Several reasons: 1) publishers and authors understand them, 2) they avoid branding, 3) there's an infrastructure underpinning them, 4) they show that we are serious.
But "what about different taxon concepts?" To a first approximation nobody cares, and for the bulk of life we know too little for there to be much ambiguity. If we do care we can read the literature, which we have conveniently indexed.
Now, there are lots of things we could argue about, but if, say, EOL had done something like this at the start, namely embedded itself in the publication process, and major journals were citing EOL pages and linking to EOL pages, we would have a wonderful tool that was actually useful, way outside our own narrow concerns. Note that I'm using EOL as an example of an organisation with sufficient scope; GBIF would be another candidate.
I suspect a major reason for our continued failure is the lack of clearly identified users (and I don't mean people who read this list, or TAXACOM), and a failure of ambition.
Anyway, my coffee has arrived. I don't hold out much hope of any of this happening, and I fully expect us to be debating these issues in a year from now. Pity.
Regards
Rod
Roderic Page Professor of Taxonomy Institute of Biodiversity, Animal Health and Comparative Medicine College of Medical, Veterinary and Life Sciences Graham Kerr Building University of Glasgow Glasgow G12 8QQ, UK
Email: r.page@bio.gla.ac.uk Tel: +44 141 330 4778 Fax: +44 141 330 2792 AIM: rodpage1962@aim.com Facebook: http://www.facebook.com/profile.php?id=1112517192 Twitter: http://twitter.com/rdmpage Blog: http://iphylo.blogspot.com Home page: http://taxonomy.zoology.gla.ac.uk/rod/rod.html
Dear Pete,
A few quick comments:
1. The point about DOIs is that (if we are talking about CrossRef DOIs) we get citation and cross linking. Identifiers by themselves aren't much use, it's the links that matter, both forward and back. If we had a "CrossRef for biology" we'd be well on the way to being able to do useful things.
2. DOIs can be linked data compliant (this happened a few weeks ago when CrossRef flicked a switch and started pumping out RDF).
3. Enough with the "major players" already -- I'm all for two guys in a garage (or one guy in my case), that's where innovation comes from. But we're not talking about innovation, we're talking about getting a large bunch of disparate organisations to effect a change in the way we do things. If, say, GBIF started issuing resolvable identifiers for every specimen, and got publishers, authors, and GenBank to use those, we could do some cool things. But there seems to me to be a lack of ambition, hence we muck around with toy projects.
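To make point 2 concrete, here is a minimal sketch of how a client might ask for RDF about a DOI through HTTP content negotiation. It assumes the public DOI proxy at https://doi.org/ and that the registration agency behind the DOI answers an `Accept: text/turtle` header with RDF; the DOI in the usage note is a made-up placeholder, and the function names are hypothetical.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen

# Assumption: the public DOI proxy; not stated in the thread.
DOI_RESOLVER = "https://doi.org/"


def doi_rdf_request(doi: str, media_type: str = "text/turtle") -> Request:
    """Build (but do not send) a request asking the resolver for RDF about a DOI."""
    # Keep "/" unescaped so the prefix/suffix structure of the DOI is preserved.
    url = DOI_RESOLVER + quote(doi, safe="/")
    return Request(url, headers={"Accept": media_type})


def fetch_doi_rdf(doi: str, media_type: str = "text/turtle") -> str:
    """Send the request and return the response body as text (needs network access)."""
    with urlopen(doi_rdf_request(doi, media_type), timeout=30) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_doi_rdf("10.1000/example")` with a real DOI in place of the placeholder should return a Turtle description if the agency supports that media type; passing `"application/rdf+xml"` instead asks for RDF/XML.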
Regards
Rod
--
Pete DeVries Department of Entomology University of Wisconsin - Madison 445 Russell Laboratories 1630 Linden Drive Madison, WI 53706 Email: pdevries@wisc.edu TaxonConcept & GeoSpecies Knowledge Bases A Semantic Web, Linked Open Data Project
--------------------------------------------------------- Roderic Page Professor of Taxonomy Institute of Biodiversity, Animal Health and Comparative Medicine College of Medical, Veterinary and Life Sciences Graham Kerr Building University of Glasgow Glasgow G12 8QQ, UK
Email: r.page@bio.gla.ac.uk Tel: +44 141 330 4778 Fax: +44 141 330 2792 AIM: rodpage1962@aim.com Facebook: http://www.facebook.com/profile.php?id=1112517192 Twitter: http://twitter.com/rdmpage Blog: http://iphylo.blogspot.com Home page: http://taxonomy.zoology.gla.ac.uk/rod/rod.html
Just wanted to chime in that I'm associated with one of those "major players" (though not in a particularly technical way) and we very much appreciate having researchers exploring the options and proposing fruitful approaches. I'm glad you are in Woods Hole, Pete. And we're working to convince publishers that linking is of value. We'll get there.
Cyndy
tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content
There is no reason that the EoL, GBIF or both could not adopt and take ownership of what I am working on.
I think one of the problems is that we have no model of collaborative working under Open Source / Open Content licenses in the biodiversity community. Most try to define unlicensed monopolies (whether for commercial exploitation or just the next grant).
EoL is almost exclusively built on closed content (with bilateral agreements allowing them to display the "Creative Commons Non-Commercial" content). I don't imply this is a failure of EoL, it is the status of the community.
As a result, there is no "we" and little synergy.
Gregor
Hi Gregor,
You are right. I don't really get any info directly from the EoL other than some names, or the authorship of a name I have manually.
I have started to link to images on DBpedia and if I pull any in because the links are unstable then I add the image metadata in the RDF.
It is my understanding that thumbnail representations of 128x128, 135x95 or smaller are OK to use, based on cases involving Google's and Bing's use of thumbnails.
It would be useful to create an open but attributed set of images for each species that we could share.
It is my understanding that if they are open, Amazon might be willing to host them for free.
I would like to be able to link to specimens in GBIF etc or publications in the BHL that contain at least a minimal amount of metadata.
* It would also be useful to have a standard set of URI's that could be used to track credit to ITIS etc. I could create a simple RDF vocabulary for this but it might be best to have it hosted at TDWG, EoL or GBIF.
- Pete
I have started to link to images on DBpedia and if I pull any in because the links are unstable then I add the image metadata in the RDF. It is my understanding that thumbnail representations of 128x128, 135x95 or smaller are OK to use, based on cases involving Google's and Bing's use of thumbnails.
thumbs: probably generally ok. In the case of dbpedia (which is the interface to Wikipedias) the images are however truly open content and usable in full resolution, given they are attributed appropriately and the license is cited.
It would be useful to create an open but attributed set of images for each species that we could share. It is my understanding that if they are open, Amazon might be willing to host them for free.
Any repository that provides open content licensed images is a great contribution. Images can be put on commons.wikimedia, but cannot be semantically annotated very well there. Species-ID/OpenMedia does allow that (running Semantic Mediawiki) -- we welcome collaboration to improve semantic annotation.
Images can also be put on Morphbank, but care has to be taken to choose an open content license. Most contributors choose to submit the images only under the Closed Content, almost non re-usable "non-commercial" clause (which does not only exclude those making profits, but also non-profits requiring cost re-imbursement as well as anyone gaining non-monetary advantages from using the image).
Gregor
Reading this thread makes me despair. It's as if we are determined not to make progress, forever debating identifiers and what they identify, with seemingly little hope of resolution, and no clear vision of what the goals are. We wallow in acronym soup, and enjoy the technical challenges, but don't actually get anywhere (http://iphylo.blogspot.com/2010/04/biodiversity-informatic-fail-and-what.htm... )
Here's one vision of a way forward.
1. Someone or some entity shows a little vision and courage, and provides a taxonomic classification where every node in the tree gets a DOI.
2. The nodes in the taxonomic classification are citable objects (hence the DOIs).
3. Nodes in the classification can be accessed either by DOI or by name-based HTTP URI (multiple nodes for a name bounce to ambiguity resolution pages).
4. A taxon-name extraction service locates names in text.
5. We build a taxon name/literature index (which we pretty much have already, albeit distributed and partly proprietary).
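Point 3 above can be sketched as a small lookup: a name with a single node redirects to that node, while a homonym lands on an ambiguity page. This is only a minimal Python illustration, not any existing service; the `NODES` index, the `10.9999/...` DOIs, and the `example.org` URL are all invented for the example.

```python
# Hypothetical name index: name string -> DOIs of classification nodes.
# The DOIs are invented placeholders, not registered identifiers.
NODES = {
    "Danaus plexippus": ["10.9999/node.1"],
    # Morus is a homonym: a plant genus (mulberries) and a bird genus (gannets).
    "Morus": ["10.9999/node.2", "10.9999/node.3"],
}


def resolve_name(name):
    """Return (status, target) for a name-based URI lookup."""
    dois = NODES.get(name, [])
    if not dois:
        return ("not_found", None)
    if len(dois) == 1:
        # Unambiguous: redirect straight to the node's DOI.
        return ("redirect", "https://doi.org/" + dois[0])
    # Several nodes share the name: send the user to a disambiguation page.
    return ("ambiguous", "http://example.org/disambiguate/" + name.replace(" ", "_"))
```

A real service would presumably also handle spelling variants and authorship strings, which this sketch ignores.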
Now, when an author (in any field) writes a paper or publishes some data on a taxon they cite the node in the classification as they would any scientific paper. Through CrossRef's citation tracking mechanism, the taxon database automatically accumulates the scientific literature relevant to that taxon.
When a journal publishes an article it calls the name extraction service to make the names clickable, avoiding the need for the journal to create its own taxon pages (a la Pensoft), and automatically building a taxonomic index to the literature. Publishers get enriched content, we get an always up to date taxonomic index.
So, we have services that authors and publishers can use, and use an identifier scheme that publishers understand (and so do some, if not most authors).
But you say "what about RDF and the linked web?". Relax, DOIs are now linked data compliant.
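For readers who have not tried it, "linked data compliant" here means that a client can ask the DOI resolver for RDF through HTTP content negotiation. The sketch below only builds such a request in Python and does not send it; the DOI is an invented placeholder, and whether RDF actually comes back depends on the registration agency behind a given DOI.

```python
from urllib.request import Request


def rdf_request_for_doi(doi):
    """Build (but do not send) an HTTP request asking the DOI resolver for RDF."""
    return Request(
        "https://doi.org/" + doi,
        # The Accept header is what distinguishes this from a browser visit.
        headers={"Accept": "application/rdf+xml"},
    )


req = rdf_request_for_doi("10.9999/node.1")  # placeholder DOI, not registered
print(req.full_url, req.get_header("Accept"))
```

Passing `req` to `urllib.request.urlopen` would perform the lookup against the live resolver.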
But you say "what about versions?" OK, to a first approximation nobody cares about versions. They really don't. Obviously previous versions will be accessible, but the identifier always points to the most recent version.
But you say "ah but it costs money". Yep, anything worthwhile does. Last time I checked journals cost money, yet we seem to have lots of those.
But you say, "which classification to use?" Does anybody (outside taxonomy) actually care? Classification is a navigational convenience. If you care deeply, you'll make a phylogeny.
But you say, "why DOIs?" Several reasons, 1) publishers and authors understand them, 2) they avoid branding 3) there's an infrastructure underpinning them 4) they show that we are serious
But "what about different taxon concepts?" To a first approximation nobody cares, and for the bulk of life we know too little for there to be much ambiguity. If we do care we can read the literature, which we have conveniently indexed.
Now, there are lots of things we could argue about, but if, say, EOL had done something like this at the start, namely embedded itself in the publication process, and major journals were citing EOL pages and linking to EOL pages, we would have a wonderful tool that was actually useful, way outside our own narrow concerns. Note that I'm using EOL as an example of an organisation with sufficient scope; GBIF would be another candidate.
I suspect a major reason for our continued failure is the lack of clearly identified users (and I don't mean people who read this list, or TAXACOM), and a failure of ambition.
Anyway, my coffee has arrived. I don't hold out much hope of any of this happening, and I fully expect us to be debating these issues in a year from now. Pity.
Regards
Rod
--------------------------------------------------------- Roderic Page Professor of Taxonomy Institute of Biodiversity, Animal Health and Comparative Medicine College of Medical, Veterinary and Life Sciences Graham Kerr Building University of Glasgow Glasgow G12 8QQ, UK
Email: r.page@bio.gla.ac.uk Tel: +44 141 330 4778 Fax: +44 141 330 2792 AIM: rodpage1962@aim.com Facebook: http://www.facebook.com/profile.php?id=1112517192 Twitter: http://twitter.com/rdmpage Blog: http://iphylo.blogspot.com Home page: http://taxonomy.zoology.gla.ac.uk/rod/rod.html
Rod,
Despair? Except for DOIs we do all of this already. Most of the infrastructure has been in place for many years. It has taken a long time but the early work has attracted funds and we are now all out on building on the underlying datasets.
We currently support URIs and URN:LSIDs but if DOIs add value and we can afford to apply them perhaps we should add them too. We would need a little under 1 million handles for AFD and APNI/APC if we dare to try. As we continue your work to map nomenclatural and taxonomic events into BHL, where do we apply these DOIs? To the indexed object or to the place on the page? Do we pretend that they are the same thing?
In reality it is the adoption of name and taxon objects that brings re-usability. True citability will come when these objects are found in taxonomic works. Our indices are quite happy dealing with prepackaged identifiers - even if they are DOIs.
At the nomenclator level we are pretty close to setting name objects free to be the factual open community resource they were designed to be - at least in the botanical domain.
I agree that versioning is not an issue (twittered comments excepted) - these objects re-present facts which we should be able to repair together. But concepts anchor value-added product and they need to be identified. One's choice of concept (accepted/valid) depends largely on context, which is usually provided as a classification (albeit fragmentary). Maybe you're right and nobody cares and they just take the current concept within some default context, but over time we find that documenting these choices turns out to be significant as concepts and/or context drift to accommodate current thinking. Appropriate infrastructure design supports re-use of these decisions anyway. The scale here may be months or centuries - these principles still apply.
greg
On 6 June 2011 19:35, Roderic Page r.page@bio.gla.ac.uk wrote:
[...]
tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content
This is a bit OT, but it's related to DOIs... I'm sure you noticed L. Penev's (Pensoft) post on TAXACOM:
Instructions to authors wishing to submit Data Papers based on metadata entered via GBIF's IPT can be found in the Data Publishing Policies and Guidelines ( http://www.pensoft.net/J_FILES/Pensoft_Data_Publishing_Policies_and_Guidelines.pdf ) issued by Pensoft.
From this document:
"The well-established norm for citing genetic data is that one simply cited the Genbank identifier (accession number) in the text. Similar usage is also commonplace for items in other bioinformatics databases. Pensoft is not recommending a change in that practice. The following guidelines apply to more heterogeneous research data published in other institutional or subject-specific data repositories, frequently described in related journal articles or Data Papers (see below). They are intended to permit data citations to be treated as ‚first class‘ citation objects, on a par with bibliographic citations, and to enable them to be more easily harvested from reference lists, so that those who have made the effort to publish their research data might more easily be ascribed academic credit for their work through the normal mechanisms of citation recognition. For such data in data repositories, each published data package and each published data file should always be associated with a persistent unique identifier. A Digital Object Identifier (DOI) issued by DataCite should be used wherever possible."
best regards, Robert
On Mon, Jun 6, 2011 at 11:35 AM, Roderic Page r.page@bio.gla.ac.uk wrote:
[...]
Comments regarding several emails inline:
Richard Pyle wrote:
By contrast, the core object in GNUB is a taxon name usage instance -- which is a purely abstract notion of the usage of a taxon name within some documentation source (like a publication). In this case, the text-string name is merely a property of the GUID-identified object, and would be an extremely BAD choice to use as a unique identifier.
It is possible that I'm not understanding what you are saying here, but if you are saying that the only name-related property of your GNUB taxon instances will be one which has a name string literal as its object, then I think that is a big mistake. That will require any client using your taxon instance metadata to re-process the literal name string to cross reference it with lexical variants, parse it into its pieces, etc. That should only need to be done once and then referenced via a GUID for the name (i.e. in the sense of tn:TaxonName).
This is why GNUB needs to generate a unique identifier to represent this core data object. The form that identifier takes (UUID, LSID, integer, DOI, whatever) from the perspective of the end user should be completely irrelevant, because the user should rarely (if ever) see it, and should certainly *never* be in a position to type it on a keyboard (we can discuss the appearance of ZooBank LSIDs on printed pages separately).
OK, again maybe I'm not understanding what you are saying here, but if you are saying that you don't intend to expose your unique GNUB identifiers to the public, then as far as I'm concerned you are setting up GNUB to be irrelevant from the start. You mention a number of cool taxonomist-geek type things that you hope to accomplish with GNUB. But from my perspective as a non-taxonomist-geek, the main purpose I have for GNUB is as a place to anchor dwc:Identification instances so that I can indicate whether my identified resource is a representative of the same taxon that is being referred to by somebody else (or at least to make it possible for somebody to figure that out via computery cleverness, Semantic Web or otherwise). How am I going to do that if you don't provide me with a good (i.e. meeting the 8 criteria of my last email) GUID to use as the object of my dwc:Identification properties? For over a year, I've heard you lament that the whole problem is that people make identifications and don't indicate the sensu/sec. reference for the names they use. You are now creating a system that would allow people to unambiguously make it clear what taxon they mean but you aren't giving them any way to say what it is? Again, I may just be misunderstanding what you wrote here.
Kevin Richards wrote:
Oh, now that I have read Rich's email here, it seems we are in agreement, of sorts. I think there is obviously a need for both of these "identifier" approaches - ie a record based ID that no one should really ever need to interact with directly, and a human friendly "ID" that allows people to discuss the same semantic "thing".
Yes. This "record based ID" can be anything you want. I really don't and shouldn't have to care about that. The "human friendly ID that allows people to discuss the same semantic thing" is precisely what the TDWG GUID Applicability Statement (a ratified TDWG standard, thanks to Kevin) is talking about. As I read that standard, I don't see any requirement that a GUID be "human friendly", but I would consider "human friendliness" to be a desirable "best practice" (influenced somewhat by http://www.w3.org/Provider/Style/URI and http://www.w3.org/TR/cooluris/) - if we have a choice of creating externally exposed GUIDs that are either human-friendly or not human-friendly, and if either works equally well, why not choose ones that are human-friendly?
It is interesting all this discussion of identifiers when in the end it doesn't matter too much what the identifier is, just that you have an identifier at all. The important thing is the semantics, the "are we talking about the same thing" question - so this is where I believe RDF/semantic web comes in - I might see if I can come up with some RDF/sem web example for TDWG that could demonstrate this, hmmm...
Already done in the context of tc:Taxon and tn:TaxonName and posted on this list in January: http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002204.html . http://biodiversity.org.au/apni.taxon/118883 is an identifier that is friendly to both humans and computers. Through content negotiation a computer gets http://biodiversity.org.au/apni.taxon/118883.rdf and the human gets http://biodiversity.org.au/apni.taxon/118883.html . The resource itself has rdf:type tc:TaxonConcept (defined in the ontology to be equivalent to tc:Taxon), well-known because it is part of the TDWG ontology. In these examples, the approach for referring to name strings through tc:hasName, the subsequent reference to a name record (http://biodiversity.org.au/apni.name/36530), and the structure of that name record in RDF (http://biodiversity.org.au/apni.name/36530.rdf) follow the approach of the TCS standard (as incarnated in the TDWG ontology) very precisely. I can't see anything in these examples that doesn't follow TDWG standards and what I know of as "best practices". Thank you, Paul... Also we have many examples of appropriate HTTP URI GUID use from Pete, although not involving tc:Taxon and tn:TaxonName specifically.
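To make the content negotiation step concrete, here is a rough Python sketch of the decision a server might make for the APNI taxon URI above. It is not the actual biodiversity.org.au implementation; the function name is invented, and as a simplification it ignores q-values and only checks whether an RDF media type is listed.

```python
def negotiate(uri, accept_header):
    """Pick a representation URL for a resource URI from an Accept header.

    Simplification: ignores q-values and only checks for an RDF media type.
    """
    media_types = [part.split(";")[0].strip() for part in accept_header.split(",")]
    if "application/rdf+xml" in media_types:
        # Machine client: hand back the RDF document.
        return uri + ".rdf"
    # Anything else (typically a browser) gets the HTML page.
    return uri + ".html"


taxon = "http://biodiversity.org.au/apni.taxon/118883"
print(negotiate(taxon, "application/rdf+xml"))
print(negotiate(taxon, "text/html,application/xhtml+xml;q=0.9"))
```

The point of the design is that only the one extension-free URI is ever cited; the `.rdf` and `.html` forms are representations of it.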
Richard Pyle wrote:
I like your list of what we want "GUIDs" (see below) to do, and I think it's an excellent starting point for a bar we should all strive for. I'm particularly grateful to learn that the existing ZooBank service fails so many of them. I've forwarded your post to Rob Whitton, who will be working on Gen-2 of ZooBank in the coming weeks, and asked him if we can use your 8 tests as a metric to adhere to.
Better yet, read the TDWG GUID Applicability Statement http://www.tdwg.org/standards/150/ and http://www.w3.org/TR/cooluris/ . My 8 points are just a paraphrase out of my head. Striving is not good enough. Follow the standard.
"But really, from the perspective of the end-user, does it matter if it's an identifier or a service? Ultimately, they ask the questions, and the answers appear on their computer screens."
I would answer this question by saying "yes, it does matter!" - it is important that a well-designed GUID do more than just throw something up onto a human user's web browser.
I absolutely agree with you, but that's not the distinction I was making in my quoted text. I was only talking about whether we call something an "identifier" (not GUID, which has more specific implications), or a "service", in the context of human-machine conversations. I think your enumeration of things we want GUIDs to do is a very good framework for discussion. I would only caution that "GUID" means different things to different people (some people use it synonymously with UUID, for example), and also that GUID does not imply "actionable".
Again I would say read http://www.tdwg.org/standards/150/ . When I say "GUID" I am not throwing around a colloquial term. I intend for it to have the exact technical meaning that it is given in the TDWG standard. At this point in time (i.e. after we finally have a ratified standard on GUIDs), nobody in our community has any business designing and exposing GUIDs without having read this document and completely understanding its requirements and recommendations. I should not have to be "explaining" any of this to anybody on the list. It is explained clearly and concisely in the standard. I really am somewhat flabbergasted about how participants in TDWG, which I think is supposed to be a biodiversity standards organization, generally don't seem to read and follow the ratified standards. I think the process could be helped somewhat if the TDWG website were cleaned up a bit to make the obsolete stuff less easy to find and the important, current stuff easier to find. Also, I don't understand why all important documents aren't linked to the permanent URI page (e.g. http://www.tdwg.org/standards/150/) in pdf format. That would allow users to view the page directly in a web browser rather than having to open a zip file and then open a Word document.
There has been a bit of a debate over the importance of embedding "actionability" into identifiers inherently (the Tim Berners-Lee perspective)
Wrong. "GUIDs should be resolvable" (direct quote of recommendation 7 from the GUID applicability statement).
, vs thinking about "identification" separately from how we perform some action on it. For example, UUIDs and Social Security numbers are extremely useful identifiers, even though they are not inherently actionable. It's amazingly easy to perform action on a non-actionable identifier by simply appending it to an actionable prefix. For example, going back to the list of "identifiers":
A. A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
B. urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
C. http://zoobank.org/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
D. http://zoobank.org/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
E. http://lsid.tdwg.org/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
F. http://www.google.com/search?q=Danaus+plexippus+(Linnaeus+1758)
G. http://lsid.tdwg.org/summary/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
H. http://darwin.zoology.gla.ac.uk/~rpage/lsid/tester/?q=urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523&submit=Go
There are two different ways of looking at this:
- There are 8 different identifiers
- There is one identifier (A)
A is an identifier but A does not meet the requirement of the GUID Applicability statement. Quote recommendation 2: "HTTP GET resolution *must* be provided for non-self resolving GUIDs". Pick one of your proxied HTTP URIs, call it your GUID and stop there. (Note: the emphasis on "must" is present in the standards document, not added by me.)
, and 6 ways to perform action on it (B-E, G-H).
If you treat them all as distinct identifiers, then let me add a few more to the list:
I. http://zoobank.org/?lsid=urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
J. http://zoobank.org/?id=urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
K. http://zoobank.org/?id=A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
L. http://zoobank.org/?uuid=A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
Don't add more of them to the list. Recommendation 3: "Providers *must* assign at most one GUID to any particular object." Recommendation 4: "Only one globally unique identifier should be assigned to each object".
Note that all four of the above, plus B-D in the original list, are all resolved through zoobank.org. Why are there so many different ways to perform action on the "same" identifier? Because I wanted the ZooBank resolution service to be flexible. And, because in my mind, there is only one identifier (A); and lots of different ways to retrieve the metadata of the object it represents.
I would assert that what you "want" and what you have in your mind is at odds with the TDWG standard for GUIDs.
Now consider this from the TB-L perspective. Eleven different identifiers for the same object (excluding F). Does that mean we need to generate owl:sameAs statements for all pair-wise relationships? That's a lot of owl:sameAs statements! Even if I'm the bad guy in foolishly allowing so many different ways to resolve ZooBank identifiers, and needlessly fabricated so many "different" identifiers for the same thing unnecessarily. Fair enough. But I still think we're a lot better off by disentangling identifiers from the services we use to perform action on them.
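As a quick check on how many owl:sameAs statements "a lot" is, the Python sketch below counts them for the eleven identifiers two ways: every unordered pair, and a "star" in which each variant is linked only to A. The star count relies on the assumption that whatever consumes the data applies the symmetric and transitive semantics of owl:sameAs; a client that does no reasoning would still need the full set.

```python
from itertools import combinations

# Labels of the eleven identifiers in this message (A-E, G-L; F is excluded).
labels = list("ABCDEGHIJKL")

# Every unordered pair of identifiers, one owl:sameAs statement each.
pairwise = list(combinations(labels, 2))

# Alternative: link each variant to A only and let a reasoner infer the rest.
star = [(label, "A") for label in labels if label != "A"]

print(len(pairwise), len(star))
```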
This may be your opinion, but it is at odds with the ratified standard which says (recommendation 2) that "HTTP GET resolution *must* be provided for non-self-resolving GUIDs".
One of the arguments on the TB-L side is that a non-actionable identifier by itself is useless if you cannot inherently perform action on it. For example, if you were walking through the park and stumbled upon a slip of paper with "A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523" written on it, you probably wouldn't be able to do much with it. But in reality, that's not what happens. We never expose identifiers as simple context-free identifiers in their non-resolvable form. These identifiers are *always* exposed in some context. The problem is that if you treat the "resolution metadata" (as I call it -- e.g., "urn:lsid:zoobank.org:act:" or "http://zoobank.org/") as *part* of the identifier (as you have to do if you make things like "urn:lsid:ubio.org:namebank:11815"), then it becomes difficult for an application to distinguish between "http://zoobank.org/?id=A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523", and "http://zoobank.org/?uuid=A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523"; which, to a human, obviously refers to the same thing. In other words, absent all those owl:sameAs statements, an application could break if it harvests content from different sources that use different resolution metadata for the "same" (sensu Pyle) identifier.
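One way an application could cope with the two URI forms above, absent owl:sameAs statements, is to strip the "resolution metadata" itself and compare what is left. The following Python sketch does that with a regular expression; it is an illustration of the idea only, the function name is invented, and it works only for identifiers that embed a UUID (it returns None for something like the uBio LSID).

```python
import re

# Standard 8-4-4-4-12 hexadecimal UUID layout.
UUID_RE = re.compile(
    r"[0-9A-Fa-f]{8}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{12}"
)


def core_uuid(identifier):
    """Return the UUID embedded in an identifier string, or None if absent."""
    match = UUID_RE.search(identifier)
    return match.group(0).upper() if match else None


a = core_uuid("http://zoobank.org/?id=A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523")
b = core_uuid("http://zoobank.org/?uuid=A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523")
print(a == b)
```

Note that this bakes knowledge of one provider's URL patterns into every client, which is the kind of coupling a single published form of the identifier would avoid.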
The problem here is caused by you when you create and expose so many different HTTP URI forms of your UUID. Stop doing that (recommendation 4).
Maybe what we need to think about is a registry of "persistent resolution services", which our community relies on. That way, we can apply the owl:sameAs statements to the resolution services, rather than to every single individual identifier.
There is no need for this. Make a single HTTP URI version of your UUID and stick with it. Preferably one without the query string and use Mod rewrite (or whatever it's called) to transform the simple, clear, and permanent version of the URI into whatever flavor of temporary URL you are liking at the moment. Every application today understands HTTP GET. No need for a registry.
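The rewriting idea can be sketched independently of any particular web server. The Python function below stands in for a rewrite rule: the public, permanent path is just the UUID, and it is mapped internally to a query-string URL. The internal `/?uuid=` target is only one of the forms listed earlier, picked for illustration; nothing here reflects how ZooBank is actually configured.

```python
import re

# Canonical public path: a slash followed by a bare UUID and nothing else.
CANONICAL = re.compile(
    r"^/([0-9A-Fa-f]{8}(?:-[0-9A-Fa-f]{4}){3}-[0-9A-Fa-f]{12})$"
)


def rewrite(path):
    """Map a canonical /<UUID> path to a hypothetical internal URL."""
    match = CANONICAL.match(path)
    if match is None:
        # Not an identifier path: leave it untouched.
        return path
    return "/?uuid=" + match.group(1)


print(rewrite("/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523"))
```

If the internal URL scheme changes later, only the rule changes; the published identifier stays the same.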
An important question that I think has been underlying much of this discussion is whether GUIDs are actually needed for names.
I think the answer is clearly "yes". The problem is defining what is meant by the word "name".
Go with the TCS standard and the TDWG ontology as it exists currently.
and parts thereof, then it does make sense to apply GUIDs to that kind of entity. I am thinking about a tn:TaxonName as defined in the TDWG ontology (see http://code.google.com/p/tdwg-ontology/source/browse/trunk/ontology/voc/TaxonName.rdf), which comes out of the TCS schema (see http://code.google.com/p/darwin-sw/wiki/ClassTaxon for info and links regarding TCS). A tn:TaxonName is "An object that represents a single scientific biological name..." i.e. an "object" NOT defined as a string.
While it's nice to see the explicit representation of a "name" as an object, rather than a string; unfortunately that doesn't address the elephant in the room; that is, that different people have different notions of what "a single scientific biological name" is. I'm not talking subtly different shades of fundamentally the same thing; I'm talking about fundamentally different things with different implied sets of properties. This is one of the issues I continued to hammer on during the development of TCS, and the one that gave me the biggest qualms about TCS 1.0. My hope was that it would be resolved in TCS 2.0.
There ain't no TCS 2.0 . There is only TCS 1.2 . I'm sorry about it, but that's the ratified standard.
I wanted to reduce both names and concepts to the same core entity: usage instances. That's exactly what we're doing with GNUB.
There have been any number of things that I would "like" to be the way I want. However, the point of standards is that they get hammered out in a form that satisfies the community in a general way. Individual people often are left without everything that they wanted. From within our own personal projects, we can do anything we darn well please. But when it comes to communicating with others, we should discipline ourselves to follow the standards. I understand that for existing systems, there is considerable time and money required to retrofit old systems to a new standard. But GNUB is not an "old system". It is being built from scratch and I would assert that when it comes to interfacing it with the outside world, it should follow standards such as they exist at the moment. At the moment, people are allowed to think about and describe names without reducing them solely to usage instances as you would like. I spent about an hour yesterday composing a rant about how counterproductive it is for taxonomy and computer geeks to create tools and systems that won't ever actually be used by the people who need them. I decided that it wasn't helpful to actually post it, but now I'm thinking that maybe I should have...
That's only true to the extent that tn:TaxonName may be too broadly (imprecisely) defined (just like dwc:Taxon).
dwc:Taxon doesn't really have much of any useful definition, so I'm with you there. tn:TaxonName is actually rather precisely defined, at least if you look at the RDF (http://code.google.com/p/tdwg-ontology/source/browse/trunk/ontology/voc/Taxo...) and relate it to the TCS documents on which it is based (http://www.tdwg.org/standards/117/ , again it would be extremely useful to have a pdf version of the User Guide directly linked to that page so that people could look at it in their browsers rather than having to download a zip archive. Note also Kennedy et al. 2005 http://www.springerlink.com/content/7bv5pa3falxwrrvx/ which I found helpful for understanding the rationale for TCS). In my opinion, TCS (and by extension, the TDWG ontology) puts a rather restrictive collar and leash on taxon names. I quote from the user guide page 9: "<TaxonName> elements do not represent taxa. They serve only as abstract nomenclatural data structures that encapsulate the core rules of the different nomenclatural codes. Their purpose is to prevent nomenclatural statements becoming confused with statements about the circumscription of, and relationships between, different taxon concepts. No taxonomic opinion can be expressed using <TaxonName> elements in TCS. As a rule of thumb if you are dealing with anything beyond a type specimen and references to it, you are talking about a TaxonConcept of some form." This does not seem like a broad and imprecise definition to me. One is allowed to describe the pieces of the name and that's about it.
When I look carefully at how the TDWG ontology deals with taxon names and taxon concepts, it seems very simple and "usable" to me. If one defines a Taxon to be composed of a name component and a sensu/sec. component as several people (including you, I think) on this list have done and as TCS has done (I think), then representing it in RDF becomes tractable. One anchors the name part to a tn:TaxonName instance (properly collared and chained and wearing a GUID as a dog tag). How one anchors the sensu/sec. part is still a subject for discussion. I have been thinking about the following approach. It is based on a Venn diagram that I have in my head which I created from your descriptions of TNUs on this list. The Venn diagram has a big rectangle labeled "nominal taxon". Inside that is a smaller rectangle named "taxon name usage (TNU)". Inside that is an even smaller rectangle named "taxon concept". In this view, taxon concepts are well-described/circumscribed by a publication. TNUs (which include taxon concepts) are associated with a particular person's idea of what the taxon is, but which may or may not be described in a publication. Nominal taxa are all instances of a scientific name use including those where we have no idea who applied the name or what set of organisms they intended to be included in the taxon. In terms of RDF metadata:
1. Go ahead and let the rdf:type of the thing be tc:Taxon.
2. Make the object of tc:hasName be a GUID (i.e. as described by the TDWG GUID Applicability Statement, not some other kind of GUID)-identified resource, preferably from a well-known source like uBio.
3. If the sensu/sec. is described in a publication (in my mind a true taxon concept), then the object of tc:accordingTo is an HTTP proxied DOI, HTTP URI of a BHL-scanned publication, or if both of those fail, something non-resolvable but globally-unique like an ISBN or URL of a stable web page.
4. If the sensu/sec. is not described in a publication, but is associated with a particular person (in my mind a TNU that isn't a true taxon concept), then the object of tc:accordingTo could be the URI of a foaf:Person or foaf:Group.
5. If the sensu/sec. is completely unknown, then the taxon is a nominal taxon that is not a TNU. I don't know whether it is better for the taxon to simply lack a tc:accordingTo property or to have a tc:accordingTo property that somehow says "we don't know anything about the sensu/sec.".
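The five steps above can be sketched as a small routine that assembles the triples for one taxon. This is only an illustration under stated assumptions: the prefixed names (tc:Taxon, tc:hasName, tc:accordingTo) are taken from the steps as written, the function names are hypothetical, and for the unknown case it arbitrarily picks the first of the two options in step 5 (omitting tc:accordingTo altogether).

```python
def taxon_triples(taxon_uri, name_guid, publication_uri=None, agent_uri=None):
    """Build (subject, predicate, object) triples for one tc:Taxon."""
    triples = [
        (taxon_uri, "rdf:type", "tc:Taxon"),     # step 1
        (taxon_uri, "tc:hasName", name_guid),    # step 2
    ]
    if publication_uri is not None:
        # step 3: sensu/sec. is described in a publication
        triples.append((taxon_uri, "tc:accordingTo", publication_uri))
    elif agent_uri is not None:
        # step 4: sensu/sec. is tied to a person or group only
        triples.append((taxon_uri, "tc:accordingTo", agent_uri))
    # step 5: nothing known, so no tc:accordingTo triple (assumed choice)
    return triples


def kind_of_taxon(publication_uri=None, agent_uri=None):
    """Name the rectangle of the Venn diagram the taxon falls in."""
    if publication_uri is not None:
        return "taxon concept"
    if agent_uri is not None:
        return "TNU (not a taxon concept)"
    return "nominal taxon (not a TNU)"


# All URIs below are placeholders, not real identifiers.
for triple in taxon_triples(
    "http://example.org/taxon/1",
    "http://example.org/name/42",
    agent_uri="http://example.org/people/steve",
):
    print(triple)
```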
I realize that you probably aren't going to like this because it isn't as sophisticated and nuanced as you would like for your GNUB TNUs to be. However, there would be nothing that would prohibit you from creating and adding a myriad of clever properties to the tc:Taxon instance RDF to make it do all of the things you want. The practice I have described would break down the act of defining a taxon into well-known, standardized pieces and it is a practice that could fairly easily be followed by people without sophisticated IT resources. It would allow for the transfer and comparison of taxa information and make it possible to reconcile at some central location (like GNUB) the taxa that are described in a distributed network of users. Doing something like this is, I believe, the entire reason for the existence of TCS, the TDWG ontology, old TDWG TAG roadmaps, etc. Please apply some self-discipline to follow the ratified standards or risk blowing us all back to 2005 where we would have to discuss all of the settled things again. If that is going to happen, I will give up on TDWG because I'll be retired before it is done over again.
In some ways what I'm talking about here is really (as I understand it) the principle that underlies REST. Within your big GNUB kingdom and my little Bioimages kingdom, we are free to do whatever clever things we want, structure databases as we wish, do clever programming stuff or whatever. But when you and I talk, we follow commonly established rules, namely we talk using the HTTP protocol and identify the things that we want to talk about using HTTP URIs. Since we are talking specifically about biodiversity informatics, we should choose to follow more restrictive rules about the identifiers themselves (following the TDWG GUID applicability statement) and the nature of the RDF (following the GUID applicability statement, well-known vocabularies such as the TDWG ontology, FOAF, DCMI, Darwin Core, geo, etc.). If we fail to do that, then every interaction that I have with another entity requires me to establish in advance the rules of that interaction. The Web works well because people follow a defined set of rules about URLs and HTML. I would assert that we now (at last) have a similar model available to us in the biodiversity informatics community if organizations would just have the self-discipline to use it.
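As a rough illustration of "talking HTTP" about a resource, a client might prepare a GET request that asks for RDF. This sketch uses only the Python standard library; the identifier URI is a placeholder, and the request is built but deliberately not sent.

```python
from urllib.request import Request

# Placeholder HTTP URI standing in for a GUID-identified resource.
resource_uri = "http://example.org/id/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523"

# Ask for RDF/XML rather than an HTML page; a server may or may not honor it.
request = Request(resource_uri, headers={"Accept": "application/rdf+xml"})

print(request.get_method())          # a Request without a body is a GET
print(request.get_header("Accept"))
```

Passing the request to urllib.request.urlopen would perform the actual fetch; what comes back depends entirely on the server.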
Roderic Page wrote:
Reading this thread makes me despair. It's as if we are determined not to make progress, forever debating identifiers and what they identify, with seemingly little hope of resolution, and no clear vision of what the goals are. We wallow in acronym soup, and enjoy the technical challenges, but don't actually get anywhere
I have to say that I'm not as pessimistic as Rod is. Maybe that's just because I haven't been involved in the process as long as he has and haven't had sufficient time to develop appropriate cynicism. But I think there has been real progress, even in the couple years I've been tracking TDWG. We DO have a GUID Applicability Statement Standard now. We DO have a Darwin Core standard that defines terms which could be used to describe properties of biodiversity resources. We DO have doi's that are HTTP proxied and which return real metadata. We DO have people in our community who know how to write RDF and set up content negotiation for GUIDs as described in standards and best practices. I would also say that we do have a relatively clear vision of what the goals are. When I look at the old TAG roadmaps from 2006-2008 http://www.tdwg.org/uploads/media/TAG_Roadmap_01.doc (2006) http://www.tdwg.org/fileadmin/subgroups/tag/TAG_Roadmap_2007_final.pdf (2007) http://www.tdwg.org/fileadmin/subgroups/tag/TAG_Roadmap_2008.pdf (2008) the goals laid out there are the same ones I hear people talking about now. The difference is that we now have the tools and standards to do what was desired in 2006-8. We also have a funded project (BiSciCol) that is making progress toward developing a system that will track when changes occur in metadata for resources that are described by GUIDs. So I'm actually pretty optimistic about the whole venture assuming that we can get people and organizations to actually read and try to follow the standards that we have already agreed upon.
Steve
By contrast, the core object in GNUB is a taxon name usage instance -- which is a purely abstract notion of the usage of a taxon name within some documentation source (like a publication). In this case, the text-string name is merely a property of the GUID-identified object, and would be an extremely BAD choice to use as a unique identifier.
It is possible that I'm not understanding what you are saying here, but if you are saying that the only name-related property of your GNUB taxon instances will be one which has a name string literal as its object,
Goodness, no!! The point I was making was that for GNI, the name-string *is* the object. For GNUB, the name-string is merely one (of MANY) properties of the object.
That will require any client using your taxon instance metadata to re-process the literal name string to cross reference it with lexical variants, parse it into its pieces, etc.
No -- that's definitely NOT the case. GNUB is highly normalized/atomized/parsed.
That should only need to be done once and then referenced via a GUID for the name (i.e. in the sense of tn:TaxonName).
Yes, but the name-string is only one of the properties. Other properties include most of the other elements in dwc:Taxon (and more).
This is why GNUB needs to generate a unique identifier to represent this core data object. The form that identifier takes (UUID, LSID, integer, DOI, whatever) from the perspective of the end user should be completely irrelevant, because the user should rarely (if ever) see it, and should certainly *never* be in a position to type it on a keyboard (we can discuss the appearance of ZooBank LSIDs on printed pages separately).
OK, again maybe I'm not understanding what you are saying here, but if you are saying that you don't intend to expose your unique GNUB identifiers to the public, then as far as I'm concerned you are setting up GNUB to be irrelevant from the start.
Let me clarify: Obviously, GNUB identifiers will be fully exposed to the "public", in the sense that anyone who WANTS to see them (developers, IT specialists, hard-core name nerds, etc.), will be able to see them. In fact, anyone who wants a replicate copy of the ENTIRE dataset, including all Identifiers, raw tables, etc., will be able to do so. The idea is that you can download a snapshot of the entire database (all tables in their native structure; not dumbed down or flattened), and then set up a simple replication service that allows your local copy to automatically stay in synch with the "master" copy/ies. So yes, anyone who wants access to the identifiers has full access to them.
The point I was making was that most end-users won't care what the identifier is, or what kind it is, or how beautiful or ugly it is, or whatever. A good analogy is DNS: All users ever see is "google.com". They never see "74.125.224.176" (which google.com maps to from my machine at this moment). But the "ugly" "74.125.224.176" is what actually identifies the server to which google.com takes you. Analogously, users should only ever see "Danaus plexippus (Linnaeus 1758)"; they should never need to see "A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523".
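A toy version of the DNS analogy might look like the following. The index contents and function names are hypothetical; the only pairing used is the name and UUID mentioned together above, and a real service would obviously sit behind a database rather than a dictionary.

```python
import uuid

# Hypothetical index: what the human sees -> what the machines exchange.
NAME_INDEX = {
    "Danaus plexippus (Linnaeus 1758)":
        uuid.UUID("A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523"),
}


def resolve_name(display_name):
    """Like a DNS lookup: human-friendly label in, identifier out."""
    return NAME_INDEX[display_name]


def display_for(identifier):
    """Reverse lookup, so the identifier never has to be shown to a user."""
    for label, guid in NAME_INDEX.items():
        if guid == identifier:
            return label
    raise KeyError(identifier)


guid = resolve_name("Danaus plexippus (Linnaeus 1758)")
print(display_for(guid))
```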
You mention a number of cool taxonomist-geek type things that you hope to accomplish with GNUB. But from my perspective as a non-taxonomist-geek, the main purpose I have for GNUB is as a place to anchor dwc:Identification instances so that I can indicate whether my identified resource is a representative of the same taxon that is being referred to by somebody else (or at least to make it possible for somebody to figure that out via computery cleverness, Semantic Web or otherwise).
Yes, exactly. But remember, GNUB is just an index to information. You will anchor your dwc:Identification instances to GNUB identifiers, which will give you a precise indication of the concept that was used for the Identification. For example, the field guide or taxonomic key that the taxonomist used to make the identification in the first place. No information on the field guide or key that was used when applying the identification to the occurrence? No problem -- generate a new TNU, "authored" by the person/entity making the identification, and voila! You're now plugged into the GNA matrix. What does that give you? Well...a few immediate options include:
- Access to the full literature citation and other nomenclatural details for the name;
- Access to all other usages of that name, including variant spellings, combinations, synonymy treatments, etc.
- Access to all other resources of relevance that are also plugged into the GNA "matrix".
But more to your interests:
- Cross-linking to other usage instances in a way that allows you to figure out via computery cleverness whether you and someone else are referring to the same taxon concept. This little piece of magic can happen in two ways:
1) Implicitly. By comparing other usages in the contexts of their collective synonymies. For example, suppose RefA and RefB both treated "Aus bus" and "Aus xus" as two distinct species. RefC treated "Aus xus" as a heterotypic/junior synonym of "Aus bus". If your identification of a specimen as "Aus bus" links into the TNU associated with RefA, then implicitly we can say that it's (likely to be) congruent with the concept represented by the TNU for that name of RefB; but may or may not be comparable to the concept represented by RefC. This is an example of addressing the "many concepts for one name" problem. Conversely, suppose your specimen identification is linked to the TNU for RefC. In that case, we can infer that your concept of "Aus bus" could apply to representatives of either "Aus bus" *or* "Aus xus" as cited in RefA and RefB. This is addressing the "many names for one concept" problem. These are just two very simple examples (of many possible examples) of the sort of computery cleverness that can be used to infer implicit concept-mapping among TNUs. Obviously, there are assumptions and caveats and such -- but it's still better (a LOT better) than trying to make inferences based on the text-string name only.
2) Explicitly. In the same way that TNUs can serve as the "molecules" behind nomenclatural services (like ZooBank, Index Fungorum, and possibly IPNI/APNI/Tropicos, if/when they embrace GNA), these TNU molecules can also underpin taxon concept services, such as those represented in TCS RelationshipAssertions. In other words, there can be a structure/service that sits on top of GNUB that allows explicit declarations of the sort: TNU1 represents a concept circumscription that is congruent with TNU2; etc. These third-party assertions about concept-concept mappings could provide a very valuable service for making inferences involving both many-names-for-same-concept issues and many-concepts-with-one-name issues, presumably with greater precision and reliability than the implicit mappings.
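The implicit case (1) with RefA, RefB and RefC can be sketched as set comparisons over synonymies. This is a deliberately naive illustration, not how GNUB does it: each treatment is reduced to "accepted name -> set of names it includes", and the function names and result labels are made up for the example.

```python
# Each reference maps an accepted name to the names it treats as included.
TREATMENTS = {
    "RefA": {"Aus bus": {"Aus bus"}, "Aus xus": {"Aus xus"}},
    "RefB": {"Aus bus": {"Aus bus"}, "Aus xus": {"Aus xus"}},
    "RefC": {"Aus bus": {"Aus bus", "Aus xus"}},  # xus sunk into bus
}


def compare(ref1, name1, ref2, name2):
    """Crude concept comparison based only on the included names."""
    a = TREATMENTS[ref1][name1]
    b = TREATMENTS[ref2][name2]
    if a == b:
        return "likely congruent"
    if a & b:
        return "overlapping, may or may not be comparable"
    return "no shared names"


def candidate_names(ref, name, other_ref):
    """Names in other_ref that an identification under ref/name could match."""
    included = TREATMENTS[ref][name]
    return sorted(
        accepted
        for accepted, members in TREATMENTS[other_ref].items()
        if members & included
    )


print(compare("RefA", "Aus bus", "RefB", "Aus bus"))
print(compare("RefA", "Aus bus", "RefC", "Aus bus"))
print(candidate_names("RefC", "Aus bus", "RefA"))
```

The first comparison is the "many concepts for one name" direction; the candidate list for a RefC identification is the "many names for one concept" direction.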
How am I going to do that if you don't provide me with a good (i.e. meeting the 8 criteria of my last email) GUID to use as the object of my dwc:Identification properties?
Have we cleared up that misunderstanding?
For over a year, I've heard you lament that the whole problem is that people make identifications and don't indicate the sensu/sec. reference for the names they use.
Yes, exactly! And that's the real problem with our information domain: one of the key pieces of information needed to apply computery cleverness to identifications of Occurrence instances is missing from the vast majority of datasets. That, unfortunately, means we're limited in our ability to make inferences about concept mappings -- not because an informatic structure doesn't exist to accommodate it, but because one of the key pieces of information is lost (i.e., what *concept* of this taxon were you thinking when you assigned this name to this occurrence instance?)
You are now creating a system that would allow people to unambiguously make it clear what taxon they mean but you aren't giving them any way to say what it is? Again, I may just be misunderstanding what you wrote here.
Indeed, it seems that you are. Please let me know if I have not cleared up the misunderstanding.
Yes. This "record based ID" can be anything you want. I really don't and shouldn't have to care about that. The "human friendly ID that allows people to discuss the same semantic thing" is precisely what the TDWG GUID Applicability Statement (a ratified TDWG standard, thanks to Kevin) is talking about.
Hmmm...my turn to worry that I'm misunderstanding something. I'm fairly certain that the TDWG GUID applicability statement applies primarily to what you are referring to as the "record based ID". I think (not sure) that what Kevin meant by the other thing ("human friendly ID that allows people to discuss the same semantic thing") was more of a human-friendly service that accepts the human-friendly form of an "identifier" (e.g., the text-string taxon name), and then converts that into the real GUID (our "record based ID") for actual embedded linking purposes. Sort of like how DNS converts "google.com" (human-friendly representation of a domain name) to "74.125.224.176" (actual "GUID" used to route to a specific server).
As I read that standard, I don't see any requirement that a GUID be "human friendly", but I would consider "human friendliness" to be a desirable "best practice" (influenced somewhat by http://www.w3.org/Provider/Style/URI and http://www.w3.org/TR/cooluris/) - if we have a choice of creating externally exposed GUIDs that are either human-friendly or not human-friendly, and if either works equally well, why not choose ones that are human-friendly?
Here is where I completely disagree. I've said it before, and I'll keep saying it: GUIDs are (should be) intended and necessary for computer-computer communication; *NOT* for human-computer or human-human communication. Their beauty or ugliness should be determined by what's beautiful or ugly to a computer, not to a human. A consistent 128 bits is "beautiful" to a computer, but a UUID is ugly to a human; whereas "Danaus plexippus (Linnaeus 1758)" is beautiful to a human, but ugly to a computer (for reasons Dima already outlined).
More fundamentally, one lesson of history that seems to be perpetually repeated is the mistake of encoding human-interpretable information into what is intended to be a stable, permanent identifier. INEVITABLY, a system that uses human-interpretable information as identifiers will include some fraction of instances where the human-interpretable part is somehow "wrong" (e.g., a Cyrillic "а" was accidentally entered instead of a Latin "a", or a typographic error in a scientific name, or worst of all, the assignment of a text-string name to a homonym due to a mix-up in authorship). The temptation to "fix" those "wrong" values is enormous. And, of course, by "fixing" them, permanence is broken.
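The Cyrillic/Latin mix-up is easy to demonstrate. In this small sketch (the helper name is made up), two strings that look the same on screen are different identifiers as far as a computer is concerned:

```python
import unicodedata

latin = "Danaus"
mixed = "D\u0430naus"  # second letter is CYRILLIC SMALL LETTER A


def non_latin_letters(text):
    """List the alphabetic characters whose Unicode name is not LATIN."""
    return [
        (ch, unicodedata.name(ch))
        for ch in text
        if ch.isalpha() and not unicodedata.name(ch).startswith("LATIN")
    ]


print(latin == mixed)            # the strings do not compare equal
print(non_latin_letters(mixed))  # shows the stray Cyrillic letter
```

If such a string had been minted as the identifier, correcting the stray letter would change the identifier itself.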
Almost by definition, then, a "beautiful" identifier for computer-computer communication should be "ugly" to a pair of human eyeballs.
It is interesting all this discussion of identifiers when in the end it doesn't matter too much what the identifier is, just that you have an identifier at all.
Yes and no. I guess it depends on what the word "is" means in your "what the identifier is" phrase (Channeling Bill Clinton here). If by "is" you include "is permanent", "is unique", or "is actionable", then it does matter what the identifier "is". If you mean "is a DOI" vs. "is an LSID", then it may matter (see Rod's post), or it may not -- depending on what you want the Identifier to be able to do.
The important thing is the semantics, the "are we talking about the same thing" question - so this is where I believe RDF/semantic web comes in - I might see if I can come up with some RDF/sem web example for TDWG that could demonstrate this, hmmm...
This is where the real problem in our community is. We are *WAY* too fast and loose with the definitions of what our "things" are. We think that by simply distinguishing "Taxon Names" from "Taxon Concepts" that we've removed ambiguity. Not even close. There are multiple flavors within each of those two "domains", and far too few people in our community (both on the IT side *and* the taxonomy side) have thought through the implications of defining the different flavors, let alone trying to establish a "sameAs" between two different flavors.
Better yet, read the TDWG GUID Applicability Statement http://www.tdwg.org/standards/150/ and
I think I helped write that one, so I'm pretty sure I've got a lot of that covered already (except the parts I vehemently disagree with... :-) )
That one I didn't know about, so thanks for the link. Of course, GUIDs (sensu lato) and "uris" are not necessarily the same thing. But that's another argument for another day.
When I say "GUID" I am not throwing around a colloquial term. I intend for it to have the exact technical meaning that it is given in the TDWG standard.
Fair enough -- I must have missed when you defined your use of "GUID" specifically in the context of the TDWG standard.
At this point in time (i.e. after we finally have a ratified standard on GUIDs),
Maybe I'm mistaken, but I don't think we do. I don't think that an "Applicability Statement" rises to the level of "ratified standard", in the sense that TCS 1.0 and DwC are "ratified standards". Someone with better knowledge of the TDWG process can clarify this.
nobody in our community has any business designing and exposing GUIDs without having read this document and completely understanding its requirements and recommendations.
I certainly would agree with that statement.
I should not have to be "explaining" any of this to anybody on the list.
*Sigh* I often feel the same way. Too often, in fact. I hope you realize that when I complimented you on your 8 points I was complimenting you on the way you "paraphrase out of [your] head".
It is explained clearly and concisely in the standard.
...err "applicability statement".
There has been a bit of a debate over the importance of embedding "actionability" into identifiers inherently (the Tim Berners-Lee perspective)
Wrong. "GUIDs should be resolvable" (direct quote of recommendation 7 from the GUID applicability statement).
No, *NOT* wrong! I will say it again to be perfectly clear: There has been (and continues to be) a bit of a debate over the importance of embedding "actionability" into identifiers inherently. This is and continues to be a true statement. The only extent to which that statement is "wrong" is that I understated it with the words "bit of a". I should have either eliminated those words, or replaced them with "robust".
Don't add more of them to the list. Recommendation 3: "Providers must assign at most one GUID to any particular object." Recommendation 4: "Only one globally unique identifier should be assigned to each object".
*Exactly*. That's why I think it's foolish to regard all of these different resolution mechanisms as distinct "identifiers". There is *ONE* GUID. It is: A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523. There are ten different ways to make it actionable. It therefore meets the recommendations of the applicability statement.
I draw your attention to p.7 of the "TDWG GUID Applicability Statement", under the heading "Uniqueness and Resolution", where it states the following:
============================
The global uniqueness of an identifier is often confused with the issue of resolution of the identifier. These two attributes of GUIDs can be distinguished and discussed separately. For example a Universally Unique Identifier (UUID) is a globally unique identifier, but there are no widely known and used protocols for resolving a UUID over the Internet (unlike HTTP URIs). This form of GUID is perfectly acceptable for uniquely identifying data objects within a dataset. Some identifiers therefore provide uniqueness, but not resolvability.
============================
The part that's not written there, but I think should have been written there (and that I argued strongly in favor of writing there when the document was drafted), is that GUIDs that are not self-resolving (i.e., not inherently actionable), can be *made* actionable when represented in the context of resolution metadata.
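To illustrate the distinction being drawn between the identifier and its resolution context, here is a minimal sketch in which several representations reduce to the same UUID. The wrappers shown (the urn:uuid form, an LSID-style string with an example.org authority, and an HTTP proxy URL) are illustrative assumptions, not the actual forms any service exposes.

```python
import re
import uuid

UUID_PATTERN = re.compile(
    r"[0-9A-Fa-f]{8}-(?:[0-9A-Fa-f]{4}-){3}[0-9A-Fa-f]{12}"
)


def core_identifier(representation):
    """Strip away the resolution wrapper and return the UUID itself."""
    match = UUID_PATTERN.search(representation)
    if match is None:
        raise ValueError("no UUID found in %r" % representation)
    return uuid.UUID(match.group(0))


# Hypothetical ways of making the same identifier actionable.
FORMS = [
    "A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523",
    "urn:uuid:a9f435e0-8ed7-46dd-bab4-ea8e5bf41523",
    "urn:lsid:example.org:names:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523",
    "http://example.org/resolve/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523",
]

distinct = {core_identifier(form) for form in FORMS}
print(len(distinct))  # number of distinct underlying identifiers
```

Whether the wrapped forms count as separate identifiers or as one identifier in different resolution contexts is exactly the point under dispute here; the code only shows that the mapping back to a single UUID is mechanical.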
I would assert that what you "want" and what you have in your mind is at odds with the TDWG standard for GUIDs.
I would assert otherwise.
This may be your opinion, but it is at odds with the ratified standard which says.
Again, I don't agree with you on this assertion.
(recommendation 2) that "HTTP GET resolution must be provided for non-self-resolving GUIDs".
Yes, exactly -- and I trust you realize that this is exactly what ZooBank does. Note that the applicability statement does not say there must be *only one* HTTP proxy for the non-self-resolving GUID.
The problem here is caused by you when you create and expose so many different HTTP URI forms of your UUID. Stop doing that (recommendation 4).
And I disagree. ZooBank follows recommendation 4 *precisely*. There is only *ONE* globally unique *identifier* assigned to each object. In this case, that identifier is: A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523. Full stop. End of story. 'Nuff said.
The problem is not that I create and expose so many different HTTP URI forms of my UUID. The problem is when people conflate the function of *identification* with *resolution*. This is where I part company with (what I've been told is) the TB-L school of thought. And no, I don't think I'm smarter than TB-L. If I were the only one who disagreed on this point of conflating resolution metadata with unique and global identification, then I would assume that I'm an idiot and would stop complaining about this. But the more I think about it, and read about it, and understand about it, the more confident I am in standing my ground on this.
There is no need for this. Make a single HTTP URI version of your UUID and stick with it. Preferably one without the query string and use Mod rewrite (or whatever it's called) to transform the simple, clear, and permanent version of the URI into whatever flavor of temporary URL you are liking at the moment. Every application today understands HTTP GET. No need for a registry.
Of course every application understands HTTP GET. That's not the point (at all).
Go with the TCS standard and the TDWG ontology as it exists currently.
If you think that TCS has "the" answer to the "name" problem, then I don't think you fully appreciate the magnitude of the problem.
While it's nice to see the explicit representation of a "name" as an object, rather than a string; unfortunately that doesn't address the elephant in the room; that is, that different people have different notions of what "a single scientific biological name" is. I'm not talking subtly different shades of fundamentally the same thing; I'm talking about fundamentally different things with different implied sets of properties. This is one of the issues I continued to hammer on during the development of TCS, and the one that gave me the biggest qualms about TCS 1.0. My hope was that it would be resolved in TCS 2.0.
There ain't no TCS 2.0. There is only TCS 1.2. I'm sorry about it, but that's the ratified standard.
Please understand, I'm trying to illustrate where the existing standards fall short of what this community *needs* in order to move forward. Of course we have the standards, and if we allowed our hands to be tied to those standards, there wouldn't be any progress. TCS 1.2 DOES NOT MEET THE NEED. I want to move in the direction of something that DOES meet the need.
There have been any number of things that I would "like" to be the way I want. However, the point of standards is that they get hammered out in a form that satisfies the community in a general way.
Are you saying that the standards are written in stone, and we should be happy with them, and simply live with their limitations? If so, then you're operating in a world that I don't want any part of. I don't think you are, but frankly, the tone of this particular email exchange (by either of us) has not been especially helpful. OBVIOUSLY we should use the standards, as they exist, as much as possible WHEN THEY MEET THE NEEDS. What I was talking about (perhaps in an overly friendly, informal and loose way) is where we need to go to next. We clearly disagree on a few specific interpretations of the TDWG GUID applicability statement, but that's fine -- that's what we should be spending our time focused on.
But GNUB is not an "old system". It is being built from scratch and I would assert that when it comes to interfacing it with the outside world, it should follow standards such as they exist at the moment.
*Exactly*, and obviously it will, as much as feasible, practical and desirable, accommodate the existing standards -- and even the applicability statements -- within their inherent limitations. But speaking as someone who was a very active participant in the development of both the GUID applicability statements and TCS back when those were new and on the cutting edge, I have absolutely no interest in *limiting* what GNUB can do to what those standards articulate. We've moved along far enough now that it's time to start pushing to the next level -- time to start overcoming the limitations those existing standards imposed. Many of those limitations were recognized at the time those documents were drafted, and the drafters acknowledged that some of the improvements would need to wait for the next version. With the development of GNA/GNUB, it's time to move on to the next version. We obviously want the next level to be backward compatible with existing standards, and obviously every effort will be made to maintain backward compatibility.
At the moment, people are allowed to think about and describe names without reducing them solely to usage instances as you would like.
Yes -- which is why I keep emphasizing why GNI will remain an important component.
I spent about an hour yesterday composing a rant about how counterproductive it is for taxonomy and computer geeks to create tools and systems that won't ever actually be used by the people who need them. I decided that it wasn't helpful to actually post it, but now I'm thinking that maybe I should have...
Perhaps you should -- but keep in mind that a statement like "won't ever actually be used by the people who need them" is an awfully broad and bold assertion. Backing up such an assertion begs for an articulation of the full scope of all possible users, and a deep understanding of the function of the systems you are making such assertions about.
dwc:Taxon doesn't really have much of any useful definition, so I'm with you there.
tn:TaxonName is actually rather precisely defined, at least if you look at the RDF (http://code.google.com/p/tdwg-ontology/source/browse/trunk/ontology/voc/TaxonName.rdf)
Is this the definition you refer to:
"A scientific biological name. An object that represents a single scientific biological name that either is governed by or appears to be governed by one of the biological codes of nomenclature. These are not taxa. Taxa, whether accepted or not, are represented by TaxonConcept objects."
If so, by that definition, how many TaxonName instances are included in the following list?
Aus bus
Aus dus
Xus bus
Three? Four? Five? I can defend all three of those answers within the scope of the definition above. Assuming no homonyms or misspellings are involved, GNUB would establish four separate Protonyms, each of which can be thought of as a "name object", each with Code-specific properties. Additionally, if these fell under the botanical Code, there would be at least one, and as many as three additional nomenclaturally-relevant TNUs that would establish combination(s) other than the original as distinct "name objects" under the botanical Code.
In my opinion, TCS (and by extension, the TDWG ontology) puts a rather restrictive collar and leash on taxon names.
Enthusiastically Agreed! :-)
I quote from the user guide page 9: "<TaxonName> elements do not represent taxa. They serve only as abstract nomenclatural data structures that encapsulate the core rules of the different nomenclatural codes. Their purpose is to prevent nomenclatural statements becoming confused with statements about the circumscription of, and relationships between, different taxon concepts. No taxonomic opinion can be expressed using <TaxonName> elements in TCS. As a rule of thumb if you are dealing with anything beyond a type specimen and references to it, you are talking about a TaxonConcept of some form." This does not seem like a broad and imprecise definition to me. One is allowed to describe the pieces of the name and that's about it.
Yes, I know -- I helped write that. Unfortunately, it's still not precise enough (as is documented on some wiki somewhere, as we were defining what was originally called "LinneanCore", which later was subsumed into what is now TCS).
When I look carefully at how the TDWG ontology deals with taxon names and taxon concepts, it seems very simple and "usable" to me.
I'll definitely concede that point to you -- it strikes a good balance between ideal and practical. One of the over-arching goals within GNA development is to nudge a bit further towards the "ideal" without compromising on the "simple" and "usable". Whether or not this is possible remains to be seen.
If one defines a Taxon to be composed of a name component and a sensu/sec. component as several people (including you, I think) on this list have done and as TCS has done (I think), then representing it in RDF becomes tractable.
OK, good -- now I'm getting my head back into this conversation. Yes, *my* intent was to keep TCS open-ended such that any "[Name] sec. [Reference]" (=TNU) could be represented through TCS. That is the intention of GNUB. This is where Jessie Kennedy and I had many long debates. From her perspective, only the subset of "[Name] sec. [Reference]" (=TNU) instances that rise to the level of a "taxon definition" should be represented in TCS. This comes down to the fuzzy distinction between an "Identification" and a "Concept Definition". In the latter, presumably one provides a suite of information to help define the boundaries of a taxon-concept circumscription (specimens, characters, synonymy, etc.). In the former, presumably one simply assigns a name-string to an occurrence (or similar) instance of an organism. The problem is that every imaginable version between these two endpoints exists in biodiversity-land, so there is no clear distinction between which instances rise to the level of a "Taxon" and thus are legitimately represented via TCS, and which do not. In my mind, the approach of GNUB should be to not try to establish a distinction, and just accommodate any "[Name] sec. [Reference]" (=TNU) instance.
One anchors the name part to a tn:TaxonName instance (properly collared and chained and wearing a GUID as a dog tag). How one anchors the sensu/sec. part is still a subject for discussion.
This is the essence of a TNU. Except in GNUB-speak, a "TaxonName" is represented by another TNU -- specifically, the TNU that established the name in the first place. So, for example, Linnaeus (1758) established the name "Aus bus". Smith (1990) defines a taxon concept for "Aus bus L.".
TNU1: Aus bus Linnaeus 1758 sec. Linnaeus 1758
TNU2: Aus bus Linnaeus 1758 sec. Smith 1990
The Protonym is TNU1. TNU2 links to TNU1 as the Protonym, and basically translates to "Smith's taxon concept definition labeled with the name 'Aus bus L.'"; or more simply: "Aus bus L. sec. Smith 1990".
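The TNU1/TNU2 example above can be sketched as a small data structure. This is only an illustration in Python, not GNUB's actual schema: the class name `TNU`, its fields, and the `label` method are all hypothetical, and the sketch assumes that a TNU with no Protonym link is itself the Protonym.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TNU:
    """A Taxon Name Usage: a name as treated in one reference (hypothetical model)."""
    name: str                          # name-string as used, e.g. "Aus bus"
    reference: str                     # the "sec." part, e.g. "Smith 1990"
    protonym: Optional["TNU"] = None   # None means this TNU is itself the Protonym

    def is_protonym(self) -> bool:
        return self.protonym is None

    def label(self) -> str:
        # Render as "[Name] [original reference] sec. [Reference]".
        origin = self if self.protonym is None else self.protonym
        return f"{self.name} {origin.reference} sec. {self.reference}"


# Linnaeus (1758) establishes the name; Smith (1990) uses it for a concept.
tnu1 = TNU("Aus bus", "Linnaeus 1758")
tnu2 = TNU("Aus bus", "Smith 1990", protonym=tnu1)

print(tnu1.label())
print(tnu2.label())
```

The point of the sketch is that the "name" side of TNU2 is not a separate name object but a link to another TNU, which is how the Protonym is described above.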
I have been thinking about the following approach. It is based on a Venn diagram that I have in my head which I created from your descriptions of TNUs on this list. The Venn diagram has a big rectangle labeled "nominal taxon".
If I correctly understand what you mean by the "Nominal Taxon", I think this equates in GNUB-speak to a Protonym.
Inside that is a smaller rectangle named "taxon name usage (TNU)". Inside that is an even smaller rectangle named "taxon concept".
Hmmmm...maybe. I need to digest this a bit.
In this view, Taxon concepts are well-described/circumscribed by a publication.
Yes.
TNUs (which include taxon concepts) are associated with a particular person's idea of what the taxon is, but which may or may not be described in a publication.
Yes, I think. I would state it this way: a subset of all TNUs are the TNUs that represent well-defined, published definitions of taxon concepts. That is, all taxon concepts are anchored to (born as?) a TNU, but not all TNUs rise to the level of Taxon Concepts.
Depending on how you distinguish "Publication" from non-publication, this may be somewhat of a distracting parameter. Generally, good taxon concept definitions exist within documentation sources that are what most of us would call "published"; but there's nothing inherent to "publication" that is necessary for "good taxon concept definition". Good taxon concept definitions can certainly exist in what many of us would describe as "unpublished" form; just as many published TNUs don't rise to the level of good taxon concept definition.
Nominal taxa are all instances of a scientific name use including those where we have no idea who applied the name or what set of organisms they intended to be included in the taxon.
Yes! In GNUB, this is represented by the fact that all the relevant TNUs are anchored to the same Protonym (e.g., Aus bus L. sec. Linnaeus 1758).
In terms of RDF metadata:
- Go ahead and let the rdf:type of the thing be tc:Taxon
Ok. But how does that map to dwc:Taxon?
- Make the object of tc:hasName be a GUID (i.e. as described by the TDWG GUID Applicability Statement, not some other kind of GUID)-identified resource, preferably from a well-known source like uBio.
Not sure. I don't see uBio as a source of "name objects" so much as "name-strings". I think a better GUID link would be to a GNUB TNU that is a Protonym. This is what is currently registered in ZooBank: Protonyms (the most common kind of Nomenclatural Act; that is, the TNU that represents the establishment of a new scientific name).
- If the sensu/sec. is described in a publication (in my mind a true taxon concept), then the object of tc:accordingTo is an HTTP proxied DOI, HTTP URI of a BHL-scanned publication, or if both of those fail, something non-resolvable but globally-unique like an ISBN or URL of a stable web page.
OK, yes, I think so. Translated into GNUB-speak, I would say that if the TNU (treatment of a taxon name within a documentation source, like a publication) includes a robust definition of a Taxon Concept, then the linked ReferenceID (GNUB-generated GUID) would ideally be cross-mapped to a content-rich rendering of the identified reference, such as a DOI (presumably resolving to a PDF), an HTTP URI to a set of BHL page-images, or a PLAZI Handle for an XML-marked-up taxon treatment (or any or all of the above).
- If the sensu/sec. is not described in a publication, but is associated with a particular person (in my mind a TNU that isn't a true taxon concept), then the object of tc:accordingTo could be the URI of a foaf:Person or foaf:Group.
Well, that's not exactly how GNUB would handle it -- but close. Basically, a "Reference" in GNUB represents some form of documentation of information that has been authored (e.g., foaf:Person), and is static as of some moment in time (e.g., publication date). Again, I don't think "publication" is the right parameter to distinguish "taxon concept" from non-taxon-concept. There are many, many TNUs appearing in published works that do not really rise to the level of taxon concept definition. In any case, whether it's published or not, and whether it represents a good taxon definition or not, are two different things that may be correlated, but not hard-linked. Also, regardless of whether it's published, any kind of documentation has the potential of authorship (attribution) and some point in time....in other words, a gnub:Reference instance. There's no reason to use the class of "thing" to which a TNU is linked (e.g., publication object vs. Agent object, as you seem to be suggesting) as the delimiter of what should be treated as a "Taxon Concept" and what should not.
- If the sensu/sec. is completely unknown, then the taxon is a nominal taxon that is not a TNU. I don't know whether it is better for the taxon to simply lack a tc:accordingTo property or to have a tc:accordingTo property that somehow says "we don't know anything about the sensu/sec.".
Agreed! In GNUB-speak, the ReferenceID would be null or (my preference from an implementation perspective) "0" (which translates to "we don't have any information about the specific implied usage, so treat it as a nominal taxon").
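As a rough illustration of the three-way split proposed in the quoted bullets, here is a minimal Python sketch. It follows the quoted proposal only; as the replies note, "published or not" is a disputed delimiter. The names `Kind` and `classify` and the `is_publication` flag are hypothetical, and treating both a missing value and "0" as "unknown" is an assumption that merges the two options discussed above.

```python
from enum import Enum
from typing import Optional


class Kind(Enum):
    NOMINAL_TAXON = "nominal taxon"
    TNU = "TNU that is not a taxon concept"
    TAXON_CONCEPT = "taxon concept"


def classify(according_to: Optional[str], is_publication: bool = False) -> Kind:
    """Classify a taxon record by what its sensu/sec. part points to."""
    # Unknown sensu/sec.: no value at all, or the "0" placeholder.
    if not according_to or according_to == "0":
        return Kind.NOMINAL_TAXON
    # Known sensu/sec.: a publication versus a person or group.
    return Kind.TAXON_CONCEPT if is_publication else Kind.TNU


print(classify(None))
print(classify("0"))
print(classify("http://example.org/person/smith"))
print(classify("http://example.org/publication/1990", is_publication=True))
```

The example.org URIs are placeholders, not real identifiers.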
I realize that you probably aren't going to like this because it isn't as sophisticated and nuanced as you would like for your GNUB TNUs to be.
No, actually I think it's perfectly fine. The reason I like normalized back-end data structures is that they give you much greater flexibility in offering any range of services, from extremely simple to as complex as the back-end data model allows. Moreover, as you said:
However, there would be nothing that would prohibit you from creating and adding a myriad of clever properties to the tc:Taxon instance RDF to make it do all of the things you want.
Exactly.
The practice I have described would break down the act of defining a taxon into well-known, standardized pieces and it is a practice that could fairly easily be followed by people without sophisticated IT resources. It would allow for the transfer and comparison of taxa information and make it possible to reconcile at some central location (like GNUB) the taxa that are described in a distributed network of users. Doing something like this is, I believe, the entire reason for the existence of TCS, the TDWG ontology, old TDWG TAG roadmaps, etc.
We are in full agreement!
Please apply some self-discipline to follow the ratified standards or risk blowing us all back to 2005 where we would have to discuss all of the settled things again.
I guess this is where we differ. Besides the semantic issue of "ratified standard" vs. "applicability statement", and the fact that we seem to have somewhat different interpretations of what the GUID applicability statement is actually recommending, I have a somewhat opposite perspective from you on this. In my view, constraining ourselves to TCS 1.2 is forcing us to STAY back in 2005, which had a somewhat different biodiversity informatics landscape from today, and even more different from what (I *hope*) we see emerge over the next 2-3 years. As I said, we want to maintain backward compatibility with TCS 1.2, and we certainly want to adhere to the recommendations of the GUID applicability statement (which I believe I do, except for the specific known issues that are on the "to do" list), but also push forward to overcome the limitations of those technologies as a way to prototype the next generation of these equivalent standards & recommendations.
In some ways what I'm talking about here is really (as I understand it) the principle that underlies REST.
Yes! Ever since I had REST explained to me, I've been anxious to implement those kinds of services. Rob Whitton is already at work on ZooBank 2.0, which will be a complete ground-up re-write, and will be services-based.
Within your big GNUB kingdom and my little Bioimages kingdom, we are free to do whatever clever things we want, structure databases as we wish, do clever programming stuff or whatever. But when you and I talk, we follow commonly established rules, namely we talk using the HTTP protocol
Total agreement!
and identify the things that we want to talk about using HTTP URIs.
Errr..sort of. I say we identify things using GUIDs, and provide services that resolve those GUIDs via actionable HTTP URIs (or, if you prefer, embedding those GUIDs within a resolution metadata "wrapper"). Yes, I know it's all the rage to collapse the functions of actionability and globally unique identification into the same text-string URI (what I've been referring to as the TB-L perspective). But to be perfectly blunt, I see this as a mistake that will, in the long run, slow down our progress.
Since we are talking specifically about biodiversity informatics, we should choose to follow more restrictive rules about the identifiers themselves (following the TDWG GUID applicability statement) and the nature of the RDF (following the GUID applicability statement, well-known vocabularies such as the TDWG ontology, FOAF, DCMI, Darwin Core, geo, etc.). If we fail to do that, then every interaction that I have with another entity requires me to establish in advance the rules of that interaction. The Web works well because people follow a defined set of rules about URLs and HTML. I would assert that we now (at last) have a similar model available to us in the biodiversity informatics community if organizations would just have the self-discipline to use it.
Agreed! I think when we distill this entire exchange, we'll find that we have slightly different interpretations about what the GUID applicability statement actually says & means, and a non-trivial amount of miscommunication, but otherwise (as was the case the last time we had such a voluminous exchange), we're actually more on the same page than not.
So I'm actually pretty optimistic about the whole venture assuming that we can get people and organizations to actually read and try to follow the standards that we have already agreed upon.
I think it's nice to end this email on a point of strong agreement!
Aloha, Rich
Richard L. Pyle, PhD Database Coordinator for Natural Sciences Associate Zoologist in Ichthyology Dive Safety Officer Department of Natural Sciences, Bishop Museum 1525 Bernice St., Honolulu, HI 96817 Ph: (808)848-4115, Fax: (808)847-8252 email: deepreef@bishopmuseum.org http://hbs.bishopmuseum.org/staff/pylerichard.html
Rich, I should say that my inclusion of references, acronym definitions etc. is not to insinuate that you are unaware of those things, but is a recognition that this is a discussion on a public list and that some of the readers may have never heard of these things and may not be aware of the references. Also, the message to which you responded was a response to chunks of several emails - I guess a bad practice intended to cut down on the number of postings and to group related thoughts. I thought I had included enough of the "Roderic Page wrote:" and "Kevin Richards wrote:" headings to make it clear to which message I was referring. A couple of the statements to which you responded were written by Kevin and not me.
For the purposes of clarity, any time I say "GUID" here, I intend it in the sense of the TDWG GUID Applicability Statement. In the GBIF "Adoption of Persistent Identifiers for Biodiversity Informatics" document (http://www2.gbif.org/Persistent-Identifiers.pdf), the term "persistent actionable identifiers" is used instead of GUID, but in the interest of brevity I'll use GUID.
Thanks for taking the time to explain more about how GNUB will work. I am anxious to see it come to fruition and to use it. I have additional comments and questions relative to your description of it, but they will have to wait for another email. I think it would be best to focus this post on the subject of GUIDs because I think that this is the crux of our disagreement here.
First a word about the TDWG GUID Applicability Statement. You were expressing some reservations about calling it a "standard". If you go to http://www.tdwg.org/standards/, you will find it listed under "Current Standards". My understanding is (and I may be corrected by those who know better) that a TDWG Standard can be either an Applicability Statement or a Technical Definition (like Darwin Core). In either case, the standard has gone through the review process, been subjected to public comment, and approved by the TDWG Executive. So I consider either an Applicability Statement or a Technical Definition to have considerably more "weight" than something like a blog post or ad hoc usage guide. One problem with the GUID A.S. (Applicability Statement) is found on the title page. It says "there is, or will be, a separate document for the applicability of each specific GUID technology". Unfortunately, the "there is" part currently only applies to LSIDs - no other GUID technology has its own document. So an understanding of the "appropriate" way to apply something like a UUID must be inferred from the general statements and examples about UUIDs, by "reading between the lines" by considering how general recommendations about GUIDs would impact the handling of UUIDs, and by analogy to how LSIDs (another non-HTTP URI-based GUID) are handled.
You quoted p.7 of the guide:
============================
The global uniqueness of an identifier is often confused with the issue of resolution of the identifier. These two attributes of GUIDs can be distinguished and discussed separately. For example a Universally Unique Identifier (UUID) is a globally unique identifier, but there are no widely known and used protocols for resolving a UUID over the Internet (unlike HTTP URIs). This form of GUID is perfectly acceptable for uniquely identifying data objects within a dataset. Some identifiers therefore provide uniqueness, but not resolvability.
============================
So based on this, you are correct to call a UUID a GUID. However, the part that I disagree with is:
... I think it's foolish to regard all of these different resolution mechanisms as distinct "identifiers". There is *ONE* GUID. It is: A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523. There are ten different ways to make it actionable. It therefore meets the recommendations of the applicability statement.
The problem is that when you create an HTTP URI out of a UUID, you are creating an identifier whether you think you are or not. I suppose as a matter of semantics, you could say "I don't intend for the ten ways I showed of making my UUID actionable to be GUIDs", but if I encounter one of them, how am I supposed to know that? You may not think that an HTTP proxied non-HTTP URI GUID (e.g. an HTTP proxied UUID) is a GUID, but anyone who is interested in describing the properties of the identified resource in RDF (which should be everyone, GUID A.S. recommendation 10) will think so. The GUID A.S. does not contain any RDF examples (unfortunately) but the LSID Applicability Statement talks in detail about how LSIDs should be used in RDF. Recommendation 29 of the LSID A.S. states that "objects must be identified by an LSID in its standard form using the rdf:about attribute". You can do this with an LSID because it is a URN (a subset of the more generic URI) and therefore a describable thing in RDF. However, a UUID cannot be used similarly in an rdf:about attribute because it is not any kind of URI. It is just a globally unique string. Recommendation 31 says "All references to objects identified by LSIDs using the rdf:resource attribute must use a proxy version of the LSID." This is because neither an LSID nor a UUID can be used by a client to retrieve information about the object of the property (the value of the rdf:resource attribute). That can only be done if the GUID is an HTTP URI. Recommendation 30 says that the description of all objects identified by an LSID must contain an owl:sameAs, owl:equivalentProperty or owl:equivalentClass statement expressing the equivalence between the object identifier in its standard form and its proxy version. The RDF example given on page 18 shows how this is to be accomplished (fragment shown here):
<rdf:Description rdf:about="urn:lsid:ubio.org:namebank:11815">
  <dc:identifier>urn:lsid:ubio.org:namebank:11815</dc:identifier>
  <owl:sameAs rdf:resource="http://lsid.tdwg.org/urn:lsid:ubio.org:namebank:11815"/>
  ...
</rdf:Description>
In this example, the HTTP URI "http://lsid.tdwg.org/urn:lsid:ubio.org:namebank:11815" is not just a "resolution mechanism". It IS an identifier whether you want it to be or not. I suppose you could try to define it out of a role as a "GUID" but that would be playing with semantics (no pun intended). Semantic clients would consider it to be just as much an identifier as the unproxied LSID. Now consider how the example you were giving would need to be handled in RDF. I am extrapolating here because as I said, there is no "UUID Applicability guide". To handle all of the identifiers you listed:
A. A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
B. urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
C. http://zoobank.org/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF4...
D. http://zoobank.org/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
E. http://lsid.tdwg.org/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5B...
F. http://www.google.com/search?q=Danaus+plexippus+(Linnaeus+1758)
G. http://lsid.tdwg.org/summary/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB...
H. http://darwin.zoology.gla.ac.uk/~rpage/lsid/tester/?q=urn:lsid:zoobank.org:a...
I. http://zoobank.org/?lsid=urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA...
J. http://zoobank.org/?id=urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E...
K. http://zoobank.org/?id=A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
L. http://zoobank.org/?uuid=A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
one would write this:
<rdf:Description rdf:about="urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523">
  <dc:identifier>A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523</dc:identifier>
  <owl:sameAs rdf:resource="http://zoobank.org/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF4...
  <owl:sameAs rdf:resource="http://zoobank.org/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523"/>
  <owl:sameAs rdf:resource="http://lsid.tdwg.org/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5B...
  <owl:sameAs rdf:resource="http://www.google.com/search?q=Danaus+plexippus+(Linnaeus+1758)"/>
  <owl:sameAs rdf:resource="http://lsid.tdwg.org/summary/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB...
  <owl:sameAs rdf:resource="http://darwin.zoology.gla.ac.uk/~rpage/lsid/tester/?q=urn:lsid:zoobank.org:a...
  <owl:sameAs rdf:resource="http://zoobank.org/?lsid=urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA...
  <owl:sameAs rdf:resource="http://zoobank.org/?id=urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E...
  <owl:sameAs rdf:resource="http://zoobank.org/?id=A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523"/>
  <owl:sameAs rdf:resource="http://zoobank.org/?uuid=A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523"/>
  ...
</rdf:Description>
Note that it would not be necessary (nor in my opinion a good idea) to use the LSID in the rdf:about attribute. Any of the 10 HTTP URIs could have been switched with it. (Well, the google.com one really shouldn't be there because it represents a web page, not a name.) However the UUID can NOT be used in the rdf:about attribute, nor can it be used in an rdf:resource attribute. From the standpoint of the RDF, it has no use as an identifier that the client can "understand" (i.e. use as a subject or object of any object property).
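To make the rdf:about point concrete, here is a minimal Python sketch using only the standard library. It is an illustration, not anything prescribed by the GUID or LSID Applicability Statements: the function names `is_uri` and `describe` are hypothetical, and the scheme check is deliberately rough. It shows a bare UUID being rejected as a subject while the same UUID is still carried as a dc:identifier string literal.

```python
import xml.etree.ElementTree as ET
from typing import List
from urllib.parse import urlparse

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
OWL = "http://www.w3.org/2002/07/owl#"
DC = "http://purl.org/dc/elements/1.1/"

# Register prefixes so the serialized output uses rdf:, owl: and dc:.
for prefix, uri in (("rdf", RDF), ("owl", OWL), ("dc", DC)):
    ET.register_namespace(prefix, uri)


def is_uri(identifier: str) -> bool:
    """Rough check: does the string carry a scheme such as http: or urn:?"""
    return urlparse(identifier).scheme in {"http", "https", "urn"}


def describe(subject: str, literal_id: str, same_as: List[str]) -> str:
    """Build an rdf:Description; the subject must be a URI, the literal need not be."""
    if not is_uri(subject):
        raise ValueError(f"not usable in rdf:about: {subject!r}")
    root = ET.Element(f"{{{RDF}}}RDF")
    desc = ET.SubElement(root, f"{{{RDF}}}Description", {f"{{{RDF}}}about": subject})
    ET.SubElement(desc, f"{{{DC}}}identifier").text = literal_id
    for uri in same_as:
        ET.SubElement(desc, f"{{{OWL}}}sameAs", {f"{{{RDF}}}resource": uri})
    return ET.tostring(root, encoding="unicode")


UUID = "A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523"
xml_text = describe(
    f"http://zoobank.org/{UUID}",           # HTTP URI as the subject
    UUID,                                   # raw UUID survives only as a literal
    [f"urn:lsid:zoobank.org:act:{UUID}"],   # LSID form linked by owl:sameAs
)
print(xml_text)
```

Calling `describe(UUID, UUID, [])` raises `ValueError`, which is the sketch's way of expressing that a raw UUID cannot serve as the subject of the description.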
I don't think you were seriously suggesting that all 12 of the identifiers on the list would actually be used in "real life". You were making a point about how a UUID could be made actionable. But my point is that you simply cannot meet the requirements of the GUID A.S. with ONLY a UUID. You MUST have an HTTP proxied version of it in order to "do the right thing" (i.e. GUID A.S. rec 10) and provide metadata in the form of RDF serialized as XML. That HTTP proxied version isn't *just* going to be seen as a "resolution mechanism". It is going to be the ONLY identifier of any relevance in terms of the operation of the RDF which will see the UUID in the dc:identifier property as nothing more than a string literal. If you and GNUB are going to participate in BiSciCol as I understand it to be developing (and I believe that you are), you will HAVE to have an HTTP URI version of your UUIDs and in that context the raw UUID will be relatively irrelevant.
My point is that you should decide on just one of these HTTP URIs and use that as your identifier when you communicate with the outside world. My preference would be "http://zoobank.org/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523" as the shortest and least complex one that would do everything that needs to get done. I guess that there isn't a problem with the other nine existing, but from my point of view there is nothing but harm to be done by exposing them to the outside world. If you do, there is a chance that people will think that you intend for them to be an HTTP URI GUID for the object and you will be stuck forever having to put owl:sameAs statements about them in your RDF. You noted that the GUID A.S. says about UUIDs: "This form of GUID is perfectly acceptable for uniquely identifying data objects within a dataset." I would put emphasis on the word "within". Outside of that dataset, the UUID is not as useful as its HTTP proxied version. You could (from the standpoint of the outside world) refer to your object by both "http://zoobank.org/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523" and "A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523", or you could ONLY refer to your object in the outside world as "http://zoobank.org/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523". You can't ONLY refer to your object to the outside world as "A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523" and describe it in RDF. From this point of view, why would you want to expose two identifiers when you only need to expose one? This is what I meant when I said you should just pick one and stick with it.
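"Pick one and stick with it" can be sketched as a normalization step: whatever variant a client encounters, reduce it to the single preferred HTTP URI. The Python below is only an illustration. The function name `canonical_uri` and the constant `BASE` are hypothetical, the prefix is simply the preferred form quoted above, and nothing here claims to describe how ZooBank actually resolves identifiers.

```python
import re
import uuid
from typing import Optional

# Matches the 8-4-4-4-12 hexadecimal layout of a UUID anywhere in a string.
UUID_RE = re.compile(r"[0-9A-Fa-f]{8}-(?:[0-9A-Fa-f]{4}-){3}[0-9A-Fa-f]{12}")

# Assumed canonical prefix, taken from the preferred form in the text above.
BASE = "http://zoobank.org/"


def canonical_uri(identifier: str) -> Optional[str]:
    """Map any variant that embeds a complete UUID to one canonical HTTP URI."""
    match = UUID_RE.search(identifier)
    if match is None:
        return None  # e.g. a search URL or a truncated identifier
    return BASE + str(uuid.UUID(match.group(0))).upper()


variants = [
    "A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523",
    "urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523",
    "http://zoobank.org/?uuid=A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523",
    "http://www.google.com/search?q=Danaus+plexippus+(Linnaeus+1758)",
]
for v in variants:
    print(v, "->", canonical_uri(v))
```

A provider that exposes only the canonical form never needs this step on the client side, which is the argument being made: every extra exposed variant pushes this kind of reconciliation onto consumers.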
The other point which I was trying to make is: why would you choose to expose to the outside world an identifier that only does part of the desirable things that we want (i.e. my list of 8 desirable attributes of a GUID), when you could use a modification of that identifier that would do everything you want? You mention how GUIDs for names are primarily of interest to machines. That is undoubtedly true. But with virtually no additional cost (15 minutes of time from somebody who knows how to create a single 3 kB XSLT file) an HTTP URI GUID could resolve to something readable by humans in addition to the more useful machine-readable RDF/XML.
I would assert the same thing about LSIDs. Why would you create an identifier that is part of (what seems to me to be universally recognized as) a dead technology when you could create a simpler HTTP URI that would do the same thing and potentially more? In the case of uBio and Biodiversity Collections Index, they were set up when LSIDs were believed to be the "Next Big Thing". That did not turn out to be the case, so those organizations are stuck with painful HTTP URIs like "http://biocol.org/urn:lsid:biocol.org:col:35115" and "http://www.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank:9..." when they could have had "http://biocol.org/35115" and "http://www.ubio.org/9479554". I would say "lesson learned" - we know how to construct good HTTP URI GUIDs that will do everything people want so why not just do that? If it turns out that Linked Data and the Semantic Web are also "The Next Big Thing" that turns out to be a flop, we still have globally unique strings that are not actionable. But I think that the demonstrations of multiple members of our community show that at least to some degree LOD/Semantic Web technologies "work" and can be implemented by almost anybody.
You said:
Here is where I completely disagree. I've said it before, and I'll keep saying it: GUIDs are (should be) intended and necessary for computer-computer communication; *NOT* for human-computer or human-human communication. Their beauty or ugliness should be determined by what's beautiful or ugly to a computer, not to a human. A consistent 128 bits is "beautiful" to a computer, but a UUID is ugly to a human; whereas "Danaus plexippus (Linnaeus 1758)" is beautiful to a human, but ugly to a computer (for reasons Dima already outlined).
More fundamentally, one lesson of history that seems to be perpetually repeated is the mistake of encoding human-interpretable information into what is intended to be a stable, permanent identifier. INEVITABLY, a system that uses human-interpretable information as identifiers will include some fraction of instances where the human-interpretable part is somehow "wrong" (e.g., a Cyrillic "а" was accidentally entered instead of a Latin "a", or a typographic error in a scientific name, or worst of all, the assignment of a text-string name to a homonym due to a mix-up in authorship). The temptation to "fix" those "wrong" values is enormous. And, of course, by "fixing" them, permanence is broken.
Almost by definition, then, a "beautiful" identifier for computer-computer communication should be "ugly" to a pair of human eyeballs.
I disagree with you completely here. If you haven't read the "Cool URIs" piece, you should before we talk about this more. It is full of examples that are easy to read and type and are intended to be "understood" by both humans and computers. The piece at http://www.w3.org/Provider/Style/URI is an even easier read. GUIDs CAN be easy to "read" and type, although they don't have to be. The degree to which it "matters" whether a GUID is human readable or not depends primarily on the likelihood that humans will see it in print or type it in the URL box of a web browser. In the examples of GUIDs for names that you provided, I will agree that it's not very likely that humans will be seeing them. But if the GUID is of a specimen, an image, or a tree (which could easily appear in print or be written down by somebody to look at its web page), I would argue that readability is desirable, e.g. http://bioimages.vanderbilt.edu/uncg/966 . I realize that everyone does not agree with me on this, particularly the fans of UUIDs. As far as I know, there isn't any rule about what characters should be in an HTTP URI. But there is a general understanding that it is a best practice that an HTTP URI that is intended as an identifier should do content negotiation and produce both HTML for humans and RDF for machines.
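The content negotiation mentioned above (one HTTP URI, HTML for humans and RDF for machines) can be sketched in a few lines of Python. This is a simplified illustration, not a complete implementation of HTTP Accept handling: the function name `choose_representation` is hypothetical, only two media types are considered, and wildcards such as `*/*` simply fall through to HTML.

```python
def choose_representation(accept_header: str) -> str:
    """Return "rdf" if the client prefers RDF/XML, otherwise "html"."""
    best_type, best_q = "text/html", 0.0
    for item in accept_header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0].lower()
        q = 1.0  # default quality when no q parameter is given
        for param in parts[1:]:
            if param.startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0
        # Only the two representations this sketch can serve are considered.
        if media_type in ("application/rdf+xml", "text/html") and q > best_q:
            best_type, best_q = media_type, q
    return "rdf" if best_type == "application/rdf+xml" else "html"


# A browser-like request and a semantic-client-like request for the same URI.
print(choose_representation("text/html,application/xhtml+xml"))
print(choose_representation("application/rdf+xml"))
print(choose_representation("text/html;q=0.5, application/rdf+xml"))
```

The design choice illustrated here is that the identifier stays the same for both audiences; only the representation returned for it changes.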
[lots of stuff cut out here that will have to wait for another email]
Errr..sort of. I say we identify things using GUIDs, and provide services that resolve those GUIDs via actionable HTTP URIs (or, if you prefer, embedding those GUIDs within a resolution metadata "wrapper"). Yes, I know it's all the rage to collapse the functions of actionability and globally unique identification into the same text-string URI (what I've been referring to as the TB-L perspective). But to be perfectly blunt, I see this as a mistake that will, in the long run, slow down our progress.
Why does this slow down our progress? I don't get that at all. I see your viewpoint as the one impeding progress because non-HTTP GUIDs make it difficult or impossible to describe things in RDF.
... Agreed! I think when we distill this entire exchange, we'll find that we have slightly different interpretations about what the GUID applicability statement actually says & means, and a non-trivial amount of miscommunication, but otherwise (as was the case the last time we had such a voluminous exchange), we're actually more on the same page than not.
I'm sure this is probably the case! I hope that I am not coming across as rude or disrespectful in this kind of discussion. When I question your statements and those of others, I expect to often be shown to be wrong and learn from the experience. I also expect that my statements will be subjected to the same scrutiny and criticism that I may dish out! :-)
So I'm actually pretty optimistic about the whole venture assuming that we can get people and organizations to actually read and try to follow the standards that we have already agreed upon.
I think it's nice to end this email on a point of strong agreement!
Likewise! Steve
Aloha, Rich
Richard L. Pyle, PhD Database Coordinator for Natural Sciences Associate Zoologist in Ichthyology Dive Safety Officer Department of Natural Sciences, Bishop Museum 1525 Bernice St., Honolulu, HI 96817 Ph: (808)848-4115, Fax: (808)847-8252 email: deepreef@bishopmuseum.org http://hbs.bishopmuseum.org/staff/pylerichard.html
tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content .
Several points in your dialog with Rich Pyle confuse me. In places I can't tell where you are relying on TDWG Applicability Statements, where on W3C Recommendations, where on IETF RFCs, and where on Cool URIs. The "easier to read" opinion piece by Tim Berners-Lee that you cite carries a warning that it is obsolete in places. (I find it hard to identify those places, but http://www.w3.org/TR/2008/NOTE-cooluris-20081203/ has a status a little less personal than the TBL pieces, and I assume that's what you are resting on.) That CoolURI NOTE explicitly disclaims discussion of non-http URIs, so it is hard for me to see how it supports arguments about http URIs vs non-http URIs, although the last few sections make arguments to bolster its position. Also, as written, the document seems to have a vision of applicability to static web documents. Hence(?) extrapolating to data services seems to require choosing an RDF-based model of data services, and by no means is LOD the only possible such model, ab-hominem (sic) arguments notwithstanding.
I've eliminated so much of the dialog with Rich that I may be ignoring context that will show me wrong below.
2011/6/7 Steve Baskauf steve.baskauf@vanderbilt.edu:
Rich,
[Rich Pyle said:] Here is where I completely disagree. I've said it before, and I'll keep saying it: GUIDs are (should be) intended and necessary for computer-computer communication; *NOT* for human-computer or human-human communication. Their beauty or ugliness should be determined by what's beautiful or ugly to a computer, not to a human. A consistent 128 bits is "beautiful" to a computer, but a UUID is ugly to a human; whereas "Danaus plexippus (Linnaeus 1758)" is beautiful to a human, but ugly to a computer (for reasons Dima already outlined).
[cyrillic character and other mumbles omitted]
Almost by definition, then, a "beautiful" identifier for computer-computer communication should be "ugly" to a pair of human eyeballs.
[Steve replied:] I disagree with you completely here. If you haven't read the "Cool URIs" piece, you should before we talk about this more. It is full of examples that are easy to read and type and are intended to be "understood" by both humans and computers. The piece at http://www.w3.org/Provider/Style/URI is an even easier read. GUIDs CAN be easy to "read" and type, although they don't have to be. The degree to which it "matters" whether a GUID is human readable or not depends primarily on the likelihood that humans will see it in print or type it in the URL box of a web browser. In the examples of GUIDs for names that you provided, I will agree that it's not very likely that humans will be seeing them. But if the GUID is of a specimen, an image, or a tree (which could easily appear in print or be written down by somebody to look at its web page), I would argue that readability is desirable, e.g. http://bioimages.vanderbilt.edu/uncg/966 . I realize that everyone does not agree with me on this, particularly the fans of UUIDs. As far as I know, there isn't any rule about what characters should be in an HTTP URI.
As far as I know, there isn't any rule about what characters should be in an HTTP URI.
The ASCII control characters are forbidden in URIs used in RDF, but I guess that has no impact on your arguments.
[Steve continued:] But there is a general understanding that it is a best practice that an HTTP URI that is intended as an identifier should do content negotiation and produce both HTML for humans and RDF for machines.
This "general understanding" is about particular models of how to solve the dual use problem, and it's quite bound to the http protocol and web browsers as clients. Historically, such problems have sometimes been solved at the client side also. For example, most (all?) modern browsers can do pretty well with the FTP URI and the MAILTO URI.
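To make the "dual use" practice under discussion concrete, here is a minimal sketch of the server-side decision: inspect the HTTP Accept header and choose between an HTML and an RDF representation. The function name, the set of RDF media types, and the fallback to HTML are assumptions made for illustration; nothing here is taken from the Cool URIs note or from any TDWG document.

```python
# Hypothetical sketch of content negotiation for an identifier URI.
# choose_representation() and RDF_TYPES are illustrative names only.

RDF_TYPES = {"application/rdf+xml", "text/turtle"}
HTML_TYPES = {"text/html", "application/xhtml+xml", "*/*"}


def choose_representation(accept_header: str) -> str:
    """Return 'rdf' or 'html' according to the client's stated preference."""
    prefs = []
    for part in accept_header.split(","):
        fields = part.strip().split(";")
        media_type = fields[0].strip().lower()
        q = 1.0  # a media type without an explicit q parameter counts as 1.0
        for param in fields[1:]:
            name, _, value = param.strip().partition("=")
            if name.strip() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        prefs.append((q, media_type))

    # Highest q first; the sort is stable, so earlier entries win ties.
    prefs.sort(key=lambda pref: -pref[0])

    for q, media_type in prefs:
        if q <= 0:
            continue
        if media_type in RDF_TYPES:
            return "rdf"
        if media_type in HTML_TYPES:
            return "html"
    return "html"  # assumed default when nothing recognisable was requested


print(choose_representation("text/html,application/xhtml+xml;q=0.9,*/*;q=0.8"))
print(choose_representation("application/rdf+xml"))
```

As the reply above points out, this decision could equally be made on the client side; the sketch only shows the server-side variant that is bound to HTTP.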
History should not be ignored, especially the history of using protocols du jour. The convergence of mobile telephony and information management and access arrived rather faster than most predicted. In a mobile world, http may well prove a junior player for data-centric apps, and http servers may not be whence data is fetched. (This is already the case for Android phones. See http://developer.android.com/guide/topics/providers/content-providers.html#u... which describes Android's CONTENT URI scheme.) Similarly, a number of popular P2P network clients implement the MAGNET URI scheme (http://en.wikipedia.org/wiki/Magnet_URI_scheme). Indeed, a cynical view of http://www.w3.org/Mobile/ would hold that W3C's direction is a plan to keep the worldwide web relevant. Will it succeed for data? Possibly only with a redefinition of the web. For databases, it's not any harder to make android content: protocol servers than to make http: protocol servers. See http://developer.android.com/guide/topics/providers/content-providers.html
[lots of stuff cut out here that will have to wait for another email]
[Rich said:] Errr..sort of. I say we identify things using GUIDs, and provide services that resolve those GUIDs via actionable HTTP URIs (or, if you prefer, embedding those GUIDs within a resolution metadata "wrapper"). Yes, I know it's all the rage to collapse the functions of actionability and globally unique identification into the same text-string URI (what I've been referring to as the TB-L perspective). But to be perfectly blunt, I see this as a mistake that will, in the long run, slow down our progress.
[Steve replied: ] Why does this slow down our progress? I don't get that at all. I see your viewpoint as the one impeding progress because non-HTTP GUIDs make it difficult or impossible to describe things in RDF.
Non-http GUIDs at worst make it difficult to play with data providers and clients that only understand the http protocol, which is a circular argument. Certainly, for example, non-http GUIDs do not interfere with SPARQL queries, with RDF reasoners, or with RDF data integration. In fact, even LOD has no need of http URIs except for the convenience of the existing infrastructure. Any dereferenceable URI scheme would work. And so would multiple ones, provided only that the clients and servers both understood the schemes.
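A small sketch may help show why the URI scheme does not matter for data integration: statements are merged by comparing subject URIs as plain strings, so a urn:uuid: subject (the URN form defined in RFC 4122) integrates exactly as an http: one would. The UUID below is a placeholder, and to_ntriple() and merge() are hypothetical helpers written for this illustration, not part of any RDF library.

```python
# Hypothetical sketch: RDF-style statements about a non-HTTP subject URI.
# The UUID is a placeholder; only the two property URIs are real terms.

SUBJECT = "urn:uuid:00000000-0000-0000-0000-000000000001"  # placeholder GUID
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
DWC_SCIENTIFIC_NAME = "http://rs.tdwg.org/dwc/terms/scientificName"

# Two independent "datasets" that happen to describe the same subject.
dataset_a = [(SUBJECT, RDFS_LABEL, "Monarch butterfly")]
dataset_b = [(SUBJECT, DWC_SCIENTIFIC_NAME, "Danaus plexippus")]


def to_ntriple(subject: str, predicate: str, literal: str) -> str:
    """Serialise one statement with a plain literal object as an N-Triples line."""
    escaped = literal.replace("\\", "\\\\").replace('"', '\\"')
    return f'<{subject}> <{predicate}> "{escaped}" .'


def merge(*datasets):
    """Group statements by subject URI; the scheme is never inspected."""
    merged = {}
    for dataset in datasets:
        for subject, predicate, obj in dataset:
            merged.setdefault(subject, {})[predicate] = obj
    return merged


merged = merge(dataset_a, dataset_b)
for predicate, obj in merged[SUBJECT].items():
    print(to_ntriple(SUBJECT, predicate, obj))
```

What the urn:uuid: form does not give you for free is dereferencing; that is the part that needs a resolver which both client and server understand.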
Finally, a social issue about your arguments on the importance of the ease of transcribing URIs from paper. Of course, wholly within electronic clients for humans, this is irrelevant because the client can render the identifier in any form mutually agreeable to the human and the software. With no insult intended (well, maybe a friendly little poke... :-) ), the social issue is this: mainly it's people over 30 who find paper publication anything other than a quaint annoyance. Others will be bemused if not astonished that some people think that paper is important for prospective publishing in the sciences.
In the spirit of ending on agreement: I agree with everything where you and Rich agree...oh, wait, that's because I agree with everything Rich said. :-)
Bob Morris
-- Steven J. Baskauf, Ph.D., Senior Lecturer Vanderbilt University Dept. of Biological Sciences
postal mail address: VU Station B 351634 Nashville, TN 37235-1634, U.S.A.
delivery address: 2125 Stevenson Center 1161 21st Ave., S. Nashville, TN 37235
office: 2128 Stevenson Center phone: (615) 343-4582, fax: (615) 343-6707 http://bioimages.vanderbilt.edu
I would like to add an observation that comes from a computer science tradition:
An object should have one responsibility, and several responsibilities should be achieved by a combination of objects.
In the case of identifiers, a scientific name is a good illustration. It works both as an identifier and as a tiny classification. For example, Pinus sylvestris identifies a species and also tells us the genus of that species. As a result, the identifier changes when the classification changes, and you cannot identify a species until you have found a genus placement for it. Combining the two responsibilities has, in my opinion, decreased the usefulness of this particular identifier dramatically.
Now imagine that Linnaeus had also added a resolution responsibility to the identifier. Would the resolution mechanism he built into it still be appropriate in this day and age?
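The single-responsibility point can be sketched in a few lines of Python. Identification, classification, and resolution each get their own object, so a change in one leaves the others alone. All names here (TaxonId, NameUsage, resolve_url), the identifier value, the replacement genus, and the example.org base URL are hypothetical and chosen only for illustration.

```python
# Hypothetical sketch: one responsibility per object.
from dataclasses import dataclass


@dataclass(frozen=True)
class TaxonId:
    """Identification only: an opaque, immutable value with no classification in it."""
    value: str


@dataclass
class NameUsage:
    """Classification only: the current genus placement for an identified taxon."""
    taxon: TaxonId
    genus: str
    epithet: str

    def label(self) -> str:
        return f"{self.genus} {self.epithet}"


def resolve_url(taxon: TaxonId, base: str = "http://example.org/taxon/") -> str:
    """Resolution only: the base URL can change without touching the identifier."""
    return base + taxon.value


pine = TaxonId("taxon-0001")  # made-up identifier
usage = NameUsage(pine, "Pinus", "sylvestris")
print(usage.label(), resolve_url(pine))

# A reclassification (to an invented genus) changes the label, not the identifier.
usage.genus = "Hypotheticus"
print(usage.label(), resolve_url(pine))
```

By contrast, a binomial used as the identifier would have changed on the reclassification step, which is the loss of usefulness described above.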
Dima
On Wed, Jun 8, 2011 at 1:56 AM, Bob Morris morris.bob@gmail.com wrote:
-- Robert A. Morris
Emeritus Professor of Computer Science UMASS-Boston 100 Morrissey Blvd Boston, MA 02125-3390 IT Staff Filtered Push Project Department of Organismal and Evolutionary Biology Harvard University
email: morris.bob@gmail.com web: http://efg.cs.umb.edu/ web: http://etaxonomy.org/mw/FilteredPush http://www.cs.umb.edu/~ram phone (+1) 857 222 7992 (mobile)
And, honoring that tradition, actionable identifiers should fit in 16 bits. Like "ls", "rm", "sh", "wc", and, most honored of all, "cc" :-) Bob
On Wed, Jun 8, 2011 at 1:43 PM, Dmitry Mozzherin dmozzherin@eol.org wrote:
I would like to add an observation that comes from a computer science tradition:
An object should have one responsibility, and several responsibilities should be achieved by a combination of objects.
In the case of identifiers, a scientific name is a good illustration. It works both as an identifier and as a tiny classification. For example, Pinus sylvestris identifies a species and also tells us the genus of that species. As a result, the identifier changes when the classification changes, and you cannot identify a species until you have found a genus placement for it. Combining the two responsibilities has, in my opinion, decreased the usefulness of this particular identifier dramatically.
Now imagine that Linnaeus had also added a resolution responsibility to the identifier. Would the resolution mechanism he built into it still be appropriate in this day and age?
Dima
Hi Steve,
First of all, I owe you (and the list at large) a sincere apology for my excessively long and largely discombobulated email. I was distracted by many things, and I ended up writing it in chunks over the course of a full day. You were very clear on who you were responding to in each section, but I probably lost track of that because of the discontinuous mode of my response. Another problem is that I converted my reply to plain text, and that caused me to lose track in a few places whom I was responding to. Again, my sincere apologies.
For the purposes of clarity, any time I say "GUID" here, I intend it in the sense of the TDWG GUID Applicability Statement.
OK thanks. That became clear as I responded, but somehow I didn't pick up on that when I first started responding. But even the TDWG GUID Applicability Statement (TGAS) is not perfectly clear or consistent in its use of the term GUID. In some cases, the term implies self-actionability; in other cases, it says what to do when GUIDs are not self-actionable.
In the GBIF "Adoption of Persistent Identifiers for Biodiversity Informatics" document (http://www2.gbif.org/Persistent-Identifiers.pdf), the term "persistent actionable identifiers" is used instead of GUID, but in the interest of brevity I'll use GUID.
OK, fair enough. The GBIF document was the most recent one I contributed to, so I was thinking in those terms for using the qualified "persistent actionable identifiers" language in contrast to "GUID"; but I'm perfectly happy using the term "GUID" now that we have it (reasonably) well-defined.
Thanks for taking the time to explain more about how GNUB will work. I am anxious to see it come to fruition and to use it.
I'm hoping that by late summer we'll have it functioning with several core services, and perhaps you and others on this list can help test those services and provide suggestions for new services. Before that can be a productive use of everyone's time, though, we need to hammer out some technical documentation. As I am writing this from my hotel room at Disney's Caribbean Beach Resort in Orlando (while my family naps after a long flight in preparation for some serious Magic Kingdom action tonight), I'm not really in a position to delve into this in too much detail right now. But I'll take a stab at it.
First a word about the TDWG GUID Applicability Statement. You were expressing some reservations about calling it a "standard". If you go to http://www.tdwg.org/standards/, you will find it listed under "Current Standards".
My reservations were mostly about calling it a "ratified standard". I honestly don't know if it is or isn't, but I don't remember a vote on it (like there was for TCS and for the "ratified" DwC). Perhaps Kevin Richards or someone else at TDWG can clarify (for both of us).
So an understanding of the "appropriate" way to apply something like a UUID must be inferred from the general statements and examples about UUIDs, by "reading between the lines" by considering how general recommendations about GUIDs would impact the handling of UUIDs, and by analogy to how LSIDs (another non-HTTP URI-based GUID) are handled.
Perhaps instead of reading between the lines, the discussion surrounding the drafting of the "TGAS" is available online somewhere. That would include details about the thinking behind the final wording.
So based on this, you are correct to call a UUID a GUID. However, the part that I disagree with is:
... I think it's foolish to regard all of these different resolution mechanisms as distinct "identifiers". There is *ONE* GUID. It is: A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523. There are ten different ways to make it actionable. It therefore meets the recommendations of the applicability statement.
You are not alone in disagreeing with me on this.
The problem is that when you create an HTTP URI out of a UUID, you are creating an identifier whether you think you are or not.
Fair enough; but by that definition & logic, *every* HTTP URI (sensu the "Contemporary View" explained at http://www.w3.org/TR/uri-clarification/; i.e., inclusive of things we sometimes call URN or URL) is an identifier. But I think that goes well beyond the scope of the discussion we're having here about GUIDs.
I suppose as a matter of semantics, you could say "I don't intend for the ten ways I showed of making my UUID actionable to be GUIDs", but if I encounter one of them, how am I supposed to know that?
That is *exactly* the point I was trying to get at in my earlier message. Right now, everything that resolves via HTTP GET must be treated as a GUID. But it's not guaranteed to be persistent (thinking again in terms of the more explicit "persistent actionable identifiers"). I think our community can do better than that. The problem is not the resolution -- I can (and intend to) persist all ten service syntax forms, so they will all fit the TGAS recommendation as GUIDs. But that doesn't do you any good if you're trying to compare cited objects in two different datasets that each happened to use different syntax for the resolution mechanism.
A little more context might be helpful here. Those ten different mechanisms to resolve ZooBank identifiers existed before the drafting of the TGAS document. I assumed, at the time I established them, that everyone would see as clearly as I do that the need for identification is different from the need for "resolution" (=actionability). So strong was the opposition to what seemed obvious to me, that I followed my normal pattern in such cases, which is to assume that I was wrong. But the unsettling part is that the more carefully I thought about it, the more obvious it became that I was right, and the opposing viewpoint was wrong (despite the inherent assumption by various big-name web luminaries, who I otherwise hold enormous respect for). So, through the early TDWG/GBIF discussions, and both TDWG/GBIF GUID workshops, and the drafting of the various TDWG and GBIF documents, I stubbornly maintained this perspective (that identification and resolution should not be conflated). I believe that it was my stubbornness that accounts for the acknowledgement of the distinction between identification and resolution in TGAS and other documents.
Now, the easy way out would be to throw in the towel and terminate 9 of those resolution services, and make everyone happy with a single ZooBank URI that can be actioned via HTTP GET. But to do so instills in me the same sort of lack of conviction that I would feel if I confessed to a crime I did not commit just because it was the easy way out. On this issue, I'm not ready to do that, because it is so glaringly obvious to me that we *must* maintain a distinction between identification and resolution.
You may not think that an HTTP proxied non-HTTP URI GUID (e.g. an HTTP proxied UUID) is a GUID, but anyone who is interested in describing the properties of the identified resource in RDF (which should be everyone, GUID A.S. recommendation 10) will think so.
Not everyone. But I concede that most would. And this is what I want to fix.
Another part of the TGAS that I quoted was this part (p 11):
"For non-self-resolving GUIDs, such as UUIDs, resolution of that GUID via the HTTP protocol’s GET method (the standard method by which a resource is retrieved on the web) must be implemented. This ensures that the data for the object being identified can be obtained from the provider of that GUID with tools that a majority of Internet users and developers already understand and use."
This, I believe, is one of the paragraphs inserted because of my insistence that the roles of identification and actionability be distinguished. Nothing in that statement -- or anywhere else in the TGAS that I am aware of -- suggests that HTTP-proxied "non-self-resolving GUIDs" themselves represent distinct GUIDs. Nor does it say that multiple mechanisms for establishing that HTTP-proxied actionability function represent a violation of Recommendation 4.
The GUID A.S. does not contain any RDF examples (unfortunately) but the LSID Applicability Statement talks in detail about how LSIDs should be used in RDF. Recommendation 29 of the LSID A.S. states that "objects must be identified by an LSID in its standard form using the rdf:about attribute". You can do this with an LSID because it is a urn (subset of the more generic URI) and therefore a describable thing in RDF. However, a UUID cannot be used similarly in an rdf:about attribute because it is not any kind of URI. It is just a globally unique string.
Right -- which is exactly why ZooBank identifiers are presented publicly as LSIDs (with proper resolution mechanisms), rather than simply as UUIDs. But that doesn't change the fact that the UUID is the "real" identifier, and is simply "wrapped" in LSID-compliant resolution metadata. But I will say that I also regard the LSID as a bona-fide "identifier" in and of itself, because that's how the LSID spec is written. So I (grudgingly) admit that our minting of LSIDs commits us to treating the full-context LSID as though it is a distinct identifier from the UUID that it encapsulates. However, I don't think this applies to all the flavors of HTTP proxying, because there is no spec (that I am aware of) that says "all HTTP URIs should be treated as though they are GUIDs" -- even though, by some definitions, they technically are.
Recommendation 31 says "All references to objects identified by LSIDs using the rdf:resource attribute must use a proxy version of the LSID."
Right, and this is where I think I dropped the ball on ZooBank LSID resolution. At the moment, resolving a ZooBank LSID directly (e.g., via Rod Page's LSID tester, or TDWG's LSID resolver service) returns the proper RDF (thanks to Kevin Richards, who set that service up). However, the HTTP proxy version returns HTTP by default. I needed to do this because I didn't (and still don't) know enough about applying style sheets to RDF to render them in a human-friendly form. I spoke with Rob Whitton about this last week, and he will have this fixed soon.
Recommendation 30 says that the description of all objects identified by an LSID must contain an owl:sameAs, owl:equivalentProperty or owl:equivalentClass statement expressing the equivalence between the object identifier in its standard form and its proxy version.
Ahh!! OK, this may be the fatal bullet to my argument. But let me explain a bit further:
The "true" GUID for a ZooBank record is the UUID. The standard form of presenting this UUID to the public is as an LSID. I'm happy with saying that the LSID *is* the TDWG-context GUID for the record (calling the UUID the "true" GUID is just a semantic technicality that has no real bearing in the context of TDWG standards). The standard http proxy for ZooBank LSIDs is "http://zoobank.org/[LSID]" -- that is, the LSID appended to a "http://zoobank.org" prefix.
I have no argument with the Recommendation 30 that says there should be an owl:sameAs, owl:equivalentProperty or owl:equivalentClass statement expressing the equivalence between the LSID and its proxy version.
But I do have an argument against the notion that *any* web service that can resolve the LSID into its constituent metadata (whether HTTP, RDF, or whatever) must be treated as a distinct GUID, with a similar need for the owl:sameAs [etc.] statement. Perhaps this, ultimately, is the crux of our argument.
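For concreteness, the single equivalence that Recommendation 30 asks for (LSID in its standard form versus its one standard proxy) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name same_as_triple and the choice of N-Triples as the output syntax are hypothetical, not something the LSID A.S. or ZooBank prescribes; the proxy prefix is the "http://zoobank.org/" form described above.

```python
# Sketch only: emit one owl:sameAs statement linking an LSID to its HTTP proxy.
# The helper name and the N-Triples serialization are illustrative assumptions.

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def same_as_triple(lsid: str, proxy_prefix: str = "http://zoobank.org/") -> str:
    """Build one N-Triples line equating an LSID with its HTTP-proxied form."""
    proxy = proxy_prefix + lsid  # the proxy is simply prefix + LSID
    return f"<{lsid}> <{OWL_SAME_AS}> <{proxy}> ."


if __name__ == "__main__":
    print(same_as_triple("urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523"))
```

One such statement per record covers the standard proxy; the open question in this thread is whether every additional resolution syntax would need a statement of its own.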
I don't think you were seriously suggesting that all 12 of the identifiers on the list would actually be used in "real life". You were making a point about how a UUID could be made actionable.
In part yes. But what I was really saying is that it's silly to think of all of those different metadata resolution services as distinct GUIDs (even though in the broad sense, all HTTP URIs are technically GUIDs). Also, it depends on what you mean by "used in real life". They should certainly not be used in "real life" as identifiers of the sort you gave examples for. But they may well be "used" in other real-life contexts.
But my point is that you simply cannot meet the requirements of the GUID A.S. with ONLY a UUID.
We may be quibbling about semantics here. I never said that the TGAS was met with ONLY a UUID. My point was, the UUID *is* the identifier, and it can meet the TGAS requirements and recommendations *provided* that there is an appropriate HTTP GET resolution service for it, and provided that the UUID is exposed externally only in the context of the relevant resolution metadata. In other words, I *COMPLETELY* agree with you (and have tried to make this clear all along) that one would never see something like "<dc:identifier>A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523</dc:identifier>" in an RDF (or other similar) document. But I do believe that something like "<dc:identifier>http://zoobank.org/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523</dc:identifier>" *would* be compliant.
You MUST have an HTTP proxied version of it in order to "do the right thing" (i.e. GUID A.S. rec 10) and provide metadata in the form of RDF serialized as XML.
Yes, exactly.
That HTTP proxied version isn't just going to be seen as a "resolution mechanism".
But my point is that it *should* be. In other words, our community should rise to that level of sophistication, because it would, I am quite certain, benefit us in the long run.
If you and GNUB are going to participate in BiSciCol as I understand it to be developing (and I believe that you are), you will HAVE to have an HTTP URI version of your UUIDs and in that context the raw UUID will be relatively irrelevant.
Of course! And if you ever thought otherwise, then obviously I am not expressing myself well. Maybe part of our argument is that you are focused on implementation, and I am speaking more on principle. I thought I made it clear in my first post on this thread that a UUID by itself is not actionable (recall my example of walking through the park and discovering a UUID written on a slip of paper), and therefore not, by itself, functional as a persistent actionable identifier (sensu TDWG/GBIF). My only point in all of this is that identification and resolution are two separate functions, and we should be sophisticated enough to recognize the distinction. I don't know if it's feasible, but I think one way that it could be made feasible comes back to my suggestion of a registry of resolution services. This is not going backward; it's going forward. However, our community may have its hands full with just implementing the things we most need to implement, and may not have the luxury of time and resources to implement a standard acknowledgement of the distinction between resolution services and object identification -- but my contention is that we ignore that distinction at our peril.
My point is that you should decide on just one of these HTTP URIs and use that as your identifier when you communicate with the outside world.
That is already the case (has been the case ever since July 2007, when Kevin Richards set up our LSID resolution service).
My preference would be "http://zoobank.org/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523" as the shortest and least complex one that would do everything that needs to get done.
Well, for various reasons we went with the LSID version: "http://zoobank.org/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF4..."
Or, as RDF in accordance with the LSID spec:
http://zoobank.org/authority/metadata/?lsid=urn:lsid:zoobank.org:act:A9F435E...
I guess that there isn't a problem with the other nine existing, but from my point of view there is nothing but harm to be done by exposing them to the outside world.
I guess that depends on what you mean by "exposing" them. In my mind, they are already "exposed" because they work. However, I don't think anyone would (or should) embed them in semantic documents as though they were TDWG-style GUIDs. HOWEVER, the point I was originally making is that if we could (rightly) recognize the different roles of identification and resolution, then we wouldn't have a problem. You could very easily use your preferred "short" version of "http://zoobank.org/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523", and a reasoning service would have no difficulty recognizing it as identifying the same object as urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523, or http://zoobank.org/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF4.... I realize there is no elegant way to do this using existing RDF syntax, which is why this is *really* a much more fundamental argument than just TDWG-space. But in my extremely naïve way of representing it, it might look something like:
<rdf:Description rdf:about="http://zoobank.org/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523">
  <dc:identifier>A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523</dc:identifier>
  <xxx:resolutionService>http://zoobank.org/</xxx:resolutionService>
</rdf:Description>
...which would have no trouble combining with a document that had something like this:
<rdf:Description rdf:about="http://zoobank.org/?uuid=A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523">
  <dc:identifier>A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523</dc:identifier>
  <xxx:resolutionService>http://zoobank.org/?uuid=</xxx:resolutionService>
</rdf:Description>
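The two descriptions above differ only in resolution syntax, not in the object identified. A rough sketch of how a consumer might recognize that, assuming the embedded UUID is the real identifier: the Python below pulls the UUID out of whatever wrapper it arrives in and compares on that alone. The helper names extract_uuid and same_object are hypothetical, and pattern-matching on the UUID layout is a stand-in for whatever a real reasoning service or registry of resolution services would do.

```python
import re
import uuid
from typing import Optional

# Matches the canonical 8-4-4-4-12 hexadecimal UUID layout anywhere in a string.
UUID_PATTERN = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def extract_uuid(reference: str) -> Optional[uuid.UUID]:
    """Return the UUID embedded in a bare UUID, an LSID, or an HTTP URI, if any."""
    match = UUID_PATTERN.search(reference)
    return uuid.UUID(match.group(0)) if match else None


def same_object(a: str, b: str) -> bool:
    """True when both references carry the same embedded UUID (case-insensitive)."""
    ua, ub = extract_uuid(a), extract_uuid(b)
    return ua is not None and ua == ub


# Forms of the same ZooBank identifier that appear in this thread.
references = [
    "A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523",
    "urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523",
    "http://zoobank.org/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523",
    "http://zoobank.org/?uuid=A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523",
]

if __name__ == "__main__":
    for ref in references[1:]:
        print(ref, same_object(references[0], ref))
```

This only works for identifiers that embed a UUID; something like "http://biocol.org/35115" carries no such anchor, so a general solution would still need an explicit mapping of some kind.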
The other point which I was trying to make is: why would you choose to expose to the outside world an identifier that only does part of the desirable things that we want (i.e. my list of 8 desirable attributes of a GUID), when you could use a modification of that identifier that would do everything you want?
I would *never* "choose" to do that. However, I may very well be stuck with that due to insufficient resources and expertise. *That* is what I intend to fix now that I (finally) have both resources and expertise.
But with virtually no additional cost (15 minutes of time from somebody who knows how to create a single 3 kB XSLT file)
Ah....if only I had 15 minutes of such a person's time before now! :-)
I would assert the same thing about LSIDs. Why would you create an identifier that is part of (what seems to me to be universally recognized as) a dead technology when you could create a simpler HTTP URI that would do the same thing and potentially more?
The answer to that is much easier, and should be self-evident when you consider what I already mentioned previously: that the service was established in the summer of 2007. At that time, LSID was absolutely NOT dead, and indeed was actively being promoted by both TDWG and GBIF. This was the outcome of the two GUID workshops those organizations sponsored. There certainly were detractors to LSIDs back then, making the same arguments they are making now. To the extent that LSIDs are currently perceived as "dead" by some, that is due largely to the self-fulfilling prophecy of those detractors.
But in any case, regardless of whether LSIDs really are dead or not, and regardless of why that may be so (if it is so), there were very good reasons why ZooBank went with LSIDs. And while I realize that the four years since then are a veritable EON in IT contexts, keep in mind that ZooBank has to think in terms of centuries. In that context, the HTTP protocol is not guaranteed to be persistent, and things like DOI are pretty much downright ephemeral. In fact, this is exactly why I went with UUIDs in the first place. As long as electronic data are stored in binary form, 128 bits will have mathematical stability. *That's* why I realized that UUIDs were the only defensible choice for the "real" identifier, and that is the identifier that ZooBank will persist. The choice of LSID as a resolution protocol was, as already stated, influenced by the thinking of our community at the time. *My* thinking at the time was that the only thing with any real plausibility of ICZN-scale longevity was binary data encoding (even that may not withstand more than a few decades), so I embraced UUIDs (which is to say, I embraced 128-bit identity). Everything else (LSID protocol, HTTP protocol, etc.) could be regarded as no more than the "resolution mechanism du jour". Perhaps this starts to explain why I keep emphasizing the distinction between identity and metadata resolution. The ZooBank registry has to think in terms of long-term identity, and assume that resolution mechanisms will continue to change as the technological wind blows.
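The "128-bit identity" point can be made concrete with Python's standard uuid module. This is just a sketch: the variable names are illustrative, and the LSID and proxy strings are rebuilt using the "urn:lsid:zoobank.org:act:" and "http://zoobank.org/" forms quoted in this thread, treated here as wrappers that can be regenerated from the 16 raw bytes at any time.

```python
import uuid

# The example identifier used throughout this thread.
zoobank_id = uuid.UUID("A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523")

raw = zoobank_id.bytes   # the identity itself: 16 bytes, i.e. 128 bits
as_int = zoobank_id.int  # the same 128 bits viewed as one integer

# The raw bytes round-trip without loss, independent of any protocol.
restored = uuid.UUID(bytes=raw)

# Resolution "wrappers" are derived text; they can be regenerated from the bytes.
canonical = str(restored).upper()
lsid = f"urn:lsid:zoobank.org:act:{canonical}"
http_proxy = f"http://zoobank.org/{lsid}"

if __name__ == "__main__":
    print(len(raw) * 8, "bits")
    print(lsid)
    print(http_proxy)
```

If the LSID or HTTP conventions change, only the last three assignments would need to change; the stored bytes would not.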
In the case of uBio and Biodiversity Collections Index, they were set up when LSIDs were believed to be the "Next Big Thing".
Actually, all of us were implementing them at the same time. I think IPNI was one of the first; BCI came later. This all emerged from the two TDWG/GBIF GUID workshops.
That did not turn out to be the case, so those organizations are stuck with painful HTTP URIs like "http://biocol.org/urn:lsid:biocol.org:col:35115" and "http://www.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank:9..." when they could have had "http://biocol.org/35115" and "http://www.ubio.org/9479554". I would say "lesson learned" -
Ha! Hardly! We are only just now beginning to start learning lessons. Let's revisit this conversation again in a couple of decades and see how many more lessons are yet in store for us.
In any case, my family just woke up from their nap, so I'll have to look at the rest of your message later, after some time with Mickey and the gang.
Aloha, Rich
Richard L. Pyle, PhD
Database Coordinator for Natural Sciences
Associate Zoologist in Ichthyology
Dive Safety Officer
Department of Natural Sciences, Bishop Museum
1525 Bernice St., Honolulu, HI 96817
Ph: (808)848-4115, Fax: (808)847-8252
email: deepreef@bishopmuseum.org
http://hbs.bishopmuseum.org/staff/pylerichard.html
Answering your question about whether the GUID applicability statements are "ratified" standards...
To be honest, I am not sure of the difference between an un-ratified standard and a ratified standard. My impression was that we have done all we need to with the applicability statements in the standards process, so perhaps they are ratified??
An email from 22 February (ironic date - the date of our devastating earthquake) about the applicability statements:
"The TDWG Executive Committee has approved the Life Sciences Identifiers Applicability Statement (LSID_AS) and the Globally Unique Identifiers (GUID_AS) Applicability Statement as new TDWG standards.
The Executive committee acknowledges Kevin Richards of Landcare New Zealand as author of the GUID Applicability Statement. Likewise, Kevin Richards, Ricardo Pereira (TDWG Infrastructure Project), Donald Hobern (Atlas of Living Australia), Roger Hyam (TDWG Infrastructure Project), Lee Belbin (TDWG Infrastructure Project) and Stan Blum (California Academy of Sciences) as co-authors of the LSID Applicability Statement.
The committee also greatly appreciated the patience and perseverance of Ben Richardson of the Department of Environment and Conservation of Western Australia who was the Review Manager for these standards. The process, as can be seen from the institutional associations, was in this case longer than all would have liked, but we hope that the standards will prove useful to the Biodiversity Informatics community.
We would also thank all those who were involved as formal or public reviewers of these standards. Your input was greatly appreciated and was in various ways, incorporated into the final standards.
These standards can be downloaded from http://www.tdwg.org/standards/150/download/.
Chuck Miller TDWG Chair On behalf of the TDWG Executive Committee"
Kevin
-----Original Message-----
From: tdwg-content-bounces@lists.tdwg.org [mailto:tdwg-content-bounces@lists.tdwg.org] On Behalf Of Richard Pyle
Sent: Thursday, 9 June 2011 8:46 a.m.
To: 'Steve Baskauf'
Cc: tdwg-content@lists.tdwg.org
Subject: Re: [tdwg-content] Why UUIDs alone are not adequate as GUIDs, was Re: ITIS TSNID to uBio NamebankIDs mapping
Hi Steve,
First of all, I owe you (and the list at large) a sincere apology for my excessively long and largely discombobulated email. I was distracted by many things, and I ended up writing it in chunks over the course of a full day. You were very clear on who you were responding to in each section, but I probably lost track of that because of the discontinuous mode of my response. Another problem is that I converted my reply to plain text, and that caused me to lose track in a few places whom I was responding to. Again, my sincere apologies.
For the purposes of clarity, any time I say "GUID" here, I intend it in the sense of the TDWG GUID Applicability Statement.
OK thanks. That became clear as I responded, but somehow I didn't pick up on that when I first started responding. But even the TDWG GUID Applicability Statement (TGAS) is not perfectly clear or consistent in its use of the term GUID. In some cases, the term implies self-actionability; in other cases, it says what to do when GUIDs are not self-actionable.
In the GBIF "Adoption of Persistent Identifiers for Biodiversity Informatics" document (http://www2.gbif.org/Persistent-Identifiers.pdf), the term "persistent actionable identifiers" is used instead of GUID, but in the interest of brevity I'll use GUID.
OK, fair enough. The GBIF document was the most recent one I contributed to, so I was thinking in those terms for using the qualified "persistent actionable identifiers" language in contrast to "GUID"; but I'm perfectly happy using the term "GUID" now that we have it (reasonably) well-defined.
Thanks for taking the time to explain more about how GNUB will work. I am anxious to see it come to fruition and to use it.
I'm hoping that by late summer we'll have it functioning with several core services, and perhaps you and others on this list can help test those services and provide suggestions for new services. Before that can be a productive use of everyone's time, though, we need to hammer out some technical documentation. As I am writing this from my hotel room at Disney's Caribbean Beach Resort in Orlando (while my family naps after a long flight in preparation for some serious Magic Kingdom action tonight), I'm not really in a position to delve into this in too much detail right now. But I'll take a stab at it.
First a word about the TDWG GUID Applicability Statement. You were expressing some reservations about calling it a "standard". If you go to http://www.tdwg.org/standards/, you will find it listed under "Current Standards".
My reservations were mostly about calling it a "ratified standard". I honestly don't know if it is or isn't, but I don't remember a vote on it (like there was for TCS and for the "ratified" DwC). Perhaps Kevin Richards or someone else at TDWG can clarify (for both of us).
So an understanding of the "appropriate" way to apply something like a UUID must be inferred from the general statements and examples about UUIDs, by "reading between the lines" by considering how general recommendations about GUIDs would impact the handling of UUIDs, and by analogy to how LSIDs (another non-HTTP URI-based GUID) are handled.
Perhaps instead of reading between the lines, the discussion surrounding the drafting of the "TGAS" is available online somewhere. That would include details about the thinking behind the final wording.
So based on this, you are correct to call a UUID a GUID. However, the part that I disagree with is:
... I think it's foolish to regard all of these different resolution mechanisms as distinct "identifiers". There is *ONE* GUID. It is: A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523. There are ten different ways to make it actionable. It therefore meets the recommendations of the applicability statement.
You are not alone in disagreeing with me on this.
The problem is that when you create an HTTP URI out of a UUID, you are creating an identifier whether you think you are or not.
Fair enough; but by that definition & logic, *every* HTTP URI (sensu the "Contemporary View" explained at http://www.w3.org/TR/uri-clarification/; i.e., inclusive of things we sometimes call URN or URL) is an identifier. But I think that goes well beyond the scope of the discussion we're having here about GUIDs.
I suppose as a matter of semantics, you could say "I don't intend for the ten ways I showed of making my UUID actionable to be GUIDs", but if I encounter one of them, how am I supposed to know that?
That is *exactly* the point I was trying to get at in my earlier message. Right now, everything that resolves via HTTP GET must be treated as a GUID. But it's not guaranteed to be persistent (thinking again in terms of the more explicit "persistent actionable identifiers"). I think our community can do better than that. The problem is not the resolution -- I can (and intend to) persist all ten service syntax forms, so they will all fit the TGAS recommendation as GUIDs. But that doesn't do you any good if you're trying to compare cited objects in two different datasets that each happened to use different syntax for the resolution mechanism.
A little more context might be helpful here. Those ten different mechanisms to resolve ZooBank identifiers existed before the drafting of the TGAS document. I assumed, at the time I established them, that everyone would see as clearly as I do that the need for identification is different from the need for "resolution" (=actionability). So strong was the opposition to what seemed obvious to me, that I followed my normal pattern in such cases, which is to assume that I was wrong. But the unsettling part is that the more carefully I thought about it, the more obvious it became that I was right, and the opposing viewpoint was wrong (despite the inherent assumption by various big-name web luminaries, who I otherwise hold enormous respect for). So, through the early TDWG/GBIF discussions, and both TDWG/GBIF GUID workshops, and the drafting of the various TDWG and GBIF documents, I stubbornly maintained this perspective (that identification and resolution should not be conflated). I believe that it was my stubbornness that accounts for the acknowledgement of the distinction between identification and resolution in TGAS and other documents.
Now, the easy way out would be to throw in the towel and terminate 9 of those resolution services, and make everyone happy with a single ZooBank URI that can be actioned via HTTP GET. But to do so instills in me the same sort of lack of conviction that I would feel if I confessed to a crime I did not commit just because it was the easy way out. On this issue, I'm not ready to do that, because it is so glaringly obvious to me that we *must* maintain a distinction between identification and resolution.
You may not think that an HTTP proxied non-HTTP URI GUID (e.g. an HTTP proxied UUID) is a GUID, but anyone who is interested in describing the properties of the identified resource in RDF (which should be everyone, GUID A.S. recommendation 10) will think so.
Not everyone. But I concede that most would. And this is what I want to fix.
Another part of the TGAS that I quoted was this part (p 11):
"For non-self-resolving GUIDs, such as UUIDs, resolution of that GUID via the HTTP protocol’s GET method (the standard method by which a resource is retrieved on the web) must be implemented. This ensures that the data for the object being identified can be obtained from the provider of that GUID with tools that a majority of Internet users and developers already understand and use."
This, I believe, is one of the paragraphs inserted because of my insistence that the roles of identification and actionability be distinguished. Nothing in that statement -- or anywhere else in the TGAS that I am aware of -- suggests that HTTP-proxied "non-self-resolving GUIDs" themselves represent distinct GUIDs. Nor does it say that multiple mechanisms for establishing that HTTP-proxied actionability function represent a violation of Recommendation 4.
The GUID A.S. does not contain any RDF examples (unfortunately) but the LSID Applicability Statement talks in detail about how LSIDs should be used in RDF. Recommendation 29 of the LSID A.S. states that "objects must be identified by an LSID in its standard form using the rdf:about attribute". You can do this with an LSID because it is a urn (subset of the more generic URI) and therefore a describable thing in RDF. However, a UUID cannot be used similarly in an rdf:about attribute because it is not any kind of URI. It is just a globally unique string.
Right -- which is exactly why ZooBank identifiers are presented publicly as LSIDs (with proper resolution mechanisms), rather than simply as UUIDs. But that doesn't change the fact that the UUID is the "real" identifier, and is simply "wrapped" in LSID-compliant resolution metadata. But I will say that I also regard the LSID as a bona-fide "identifier" in and of itself, because that's how the LSID spec is written. So I (grudgingly) admit that our minting of LSIDs commits us to treating the full-context LSID as though it is a distinct identifier from the UUID that it encapsulates. However, I don't think this applies to all the flavors of HTTP proxying, because there is no spec (that I am aware of) that says "all HTTP URIs should be treated as though they are GUIDs" -- even though, by some definitions, they technically are.
Recommendation 31 says "All references to objects identified by LSIDs using the rdf:resource attribute must use a proxy version of the LSID."
Right, and this is where I think I dropped the ball on ZooBank LSID resolution. At the moment, resolving a ZooBank LSID directly (e.g., via Rod Page's LSID tester, or TDWG's LSID resolver service) returns the proper RDF (thanks to Kevin Richards, who set that service up). However, the HTTP proxy version returns HTTP by default. I needed to do this because I didn't (and still don't) know enough about applying style sheets to RDF to render them in a human-friendly form. I spoke with Rob Whitton about this last week, and he will have this fixed soon.
Recommendation 30 says that the description of all objects identified by an LSID must contain an owl:sameAs, owl:equivalentProperty or owl:equivalentClass statement expressing the equivalence between the object identifier in its standard form and its proxy version.
Ahh!! OK, this may be the fatal bullet to my argument. But let me explain a bit further:
The "true" GUID for a ZooBank record is the UUID. The standard form of presenting this UUID to the public is as an LSID. I'm happy with saying that the LSID *is* the TDWG-context GUID for the record (calling the UUID the "true" GUID is just a semantic technicality that has no real bearing in the context of TDWG standards). The standard http proxy for ZooBank LSIDs is "http://zoobank.org/[LSID]" -- that is, the LSID appended to a "http://zoobank.org" prefix.
I have no argument with the Recommendation 30 that says there should be an owl:sameAs, owl:equivalentProperty or owl:equivalentClass statement expressing the equivalence between the LSID and its proxy version.
But I do have an argument against the notion that *any* web service that can resolve the LSID into its constituent metadata (whether HTTP, RDF, or whatever) must be treated as a distinct GUID, with a similar need for the owl:sameAs [etc.] statement. Perhaps this, ultimately, is the crux of our argument.
I don't think you were seriously suggesting that all 12 of the identifiers on the list would actually be used in "real life". You were making a point about how a UUID could be made actionable.
In part yes. But what I was really saying is that it's silly to think of all of those different metadata resolution services as distinct GUIDs (even though in the broad sense, all HTTP URIs are technically GUIDs). Also, it depends on what you mean by "used in real life". They should certainly not be used in "real life" as identifiers of the sort you gave examples for. But they may well be "used" in other real-life contexts.
But my point is that you simply cannot meet the requirements of the GUID A.S. with ONLY a UUID.
We may be quibbling about semantics here. I never said that the TGAS was met with ONLY a UUID. My point was, the UUID *is* the identifier, and it can meet the TGAS requirements and recommendations *provided* that there is an appropriate HTTP GET resolution service for it, and provided that the UUID is exposed externally only in the context of the relevant resolution metadata. In other words, I *COMPLETELY* agree with you (and have tried to make this clear all along) that one would never see something like "<dc:identifier>A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523</dc:identifier>" in an RDF (or other similar) document. But I do believe that something like "<dc:identifier>http://zoobank.org/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523</dc:identifier>" *would* be compliant.
You MUST have an HTTP proxied version of it in order to "do the right thing" (i.e. GUID A.S. rec 10) and provide metadata in the form of RDF serialized as XML.
Yes, exactly.
That HTTP proxied version isn't just going to be seen as a "resolution mechanism".
But my point is that it *should* be. In other words, our community should rise to that level of sophistication, because it would, I am quite certain, benefit us in the long run.
If you and GNUB are going to participate in BiSciCol as I understand it to be developing (and I believe that you are), you will HAVE to have an HTTP URI version of your UUIDs and in that context the raw UUID will be relatively irrelevant.
Of course! And if you ever thought otherwise, then obviously I am not expressing myself well. Maybe part of our argument is that you are focused on implementation, and I am speaking more on principle. I thought I made it clear in my first post on this thread that a UUID by itself is not actionable (recall my example of walking through the park and discovering a UUID written on a slip of paper), and therefore not, by itself, functional as a persistent actionable identifier (sensu TDWG/GBIF). My only point in all of this is that identification and resolution are two separate functions, and we should be sophisticated enough to recognize the distinction. I don't know if it's feasible, but I think one way that it could be made feasible comes back to my suggestion of a registry of resolution services. This is not going backward; it's going forward. However, our community may have its hands full with just implementing the things we most need to implement, and may not have the luxury of time and resources to implement a standard acknowledgement of the distinction between resolution services and object identification -- but my contention is that we ignore that distinction at our peril.
My point is that you should decide on just one of these HTTP URIs and use that as your identifier when you communicate with the outside world.
That is already the case (has been the case ever since July 2007, when Kevin Richards set up our LSID resolution service).
My preference would be "http://zoobank.org/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523" as the shortest and least complex one that would do everything that needs to get done.
Well, for various reasons we went with the LSID version: "http://zoobank.org/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF4..."
Or, as RDF in accordance with the LSID spec:
http://zoobank.org/authority/metadata/?lsid=urn:lsid:zoobank.org:act:A9F435E...
I guess that there isn't a problem with the other nine existing, but from my point of view there is nothing but harm to be done by exposing them to the outside world.
I guess that depends on what you mean by "exposing" them. In my mind, they are already "exposed" because they work. However, I don't think anyone would (or should) embed them in semantic documents as though they were TDWG-style GUIDs. HOWEVER, the point I was originally making is that if we could (rightly) recognize the different roles of identification and resolution, then we wouldn't have a problem. You could very easily use your preferred "short" version of "http://zoobank.org/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523", and a reasoning service would have no difficulty recognizing it as identifying the same object as urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523, or http://zoobank.org/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF4.... I realize there is no elegant way to do this using existing RDF syntax, which is why this is *really* a much more fundamental argument than just TDWG-space. But in my extremely naïve way of representing it, it might look something like:
<rdf:Description rdf:about="http://zoobank.org/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523">
  <dc:identifier>A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523</dc:identifier>
  <xxx:resolutionService>http://zoobank.org/</xxx:resolutionService>
</rdf:Description>
...which would have no trouble combining with a document that had something like this:
<rdf:Description rdf:about="http://zoobank.org/?uuid=A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523">
  <dc:identifier>A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523</dc:identifier>
  <xxx:resolutionService>http://zoobank.org/?uuid=</xxx:resolutionService>
</rdf:Description>
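The separation of identity from resolution argued for above can be sketched in a few lines of Python. This is only an illustration, not ZooBank's implementation: the function name `split_identifier` and the regular expression are assumptions introduced here, and the URIs are the ones quoted in this thread. Each URI is split into a resolution prefix and a UUID, and only the UUID is compared.

```python
import re
import uuid

# Canonical 8-4-4-4-12 hexadecimal layout of a UUID, matched anywhere in a string.
UUID_PATTERN = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)

def split_identifier(uri):
    """Split a URI into (resolution prefix, UUID); return None if no UUID is found.

    Hypothetical helper: it treats everything before the UUID as the
    'resolution service' and the UUID itself as the identity.
    """
    match = UUID_PATTERN.search(uri)
    if match is None:
        return None
    return uri[:match.start()], uuid.UUID(match.group(0))

uris = [
    "http://zoobank.org/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523",
    "http://zoobank.org/?uuid=A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523",
    "urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523",
]
parts = [split_identifier(u) for u in uris]
prefixes = [p[0] for p in parts]      # three different resolution mechanisms
identities = {p[1] for p in parts}    # one shared 128-bit identity
```

Under these assumptions the three strings yield three different prefixes but a single `uuid.UUID` value, which is the distinction being drawn between resolution and identification.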
The other point which I was trying to make is: why would you choose to expose to the outside world an identifier that only does part of the desirable things that we want (i.e. my list of 8 desirable attributes of a GUID), when you could use a modification of that identifier that would do everything you want?
I would *never* "choose" to do that. However, I may very well be stuck with that due to insufficient resources and expertise. *That* is what I intend to fix now that I (finally) have both resources and expertise.
But with virtually no additional cost (15 minutes of time from somebody who knows how to create a single 3 kB XSLT file)
Ah....if only I had 15 minutes of such a person's time before now! :-)
I would assert the same thing about LSIDs. Why would you create an identifier that is part of (what seems to me to be universally recognized as) a dead technology when you could create a simpler HTTP URI that would do the same thing and potentially more?
The answer to that is much easier, and should be self-evident when you consider what I already mentioned previously: that the service was established in the summer of 2007. At that time, LSID was absolutely NOT dead, and indeed was actively being promoted by both TDWG and GBIF. This was the outcome of the two GUID workshops those organizations sponsored. There certainly were detractors to LSIDs back then, making the same arguments they are making now. To the extent that LSIDs are currently perceived as "dead" by some, is due largely to the self-fulfilling prophecy of those detractors.
But in any case, regardless of whether LSIDs really are dead or not, and regardless of why that may be so (if it is so), there were very good reasons why ZooBank went with LSIDs. And while I realize that the four years since then are a veritable EON in IT contexts, keep in mind that ZooBank has to think in terms of centuries. In that context, the HTTP protocol is not guaranteed to be persistent, and things like DOI are pretty-much downright ephemeral. In fact, this is exactly why I went with UUIDs in the first place. As long as electronic data are stored in binary form, 128 bits will have mathematical stability. *That's* why I realized that UUIDs were the only defensible choice for the "real" identifier, and is the identifier that ZooBank will persist. The choice of LSID as a resolution protocol was, as already stated, influenced by the thinking of our community at the time. *My* thinking at the time was that the only thing with any real plausibility of ICZN-scale longevity was binary data encoding (even that may not withstand more than a few decades), so I embraced UUIDs (which is to say, I embraced 128-bit identity). Everything else (LSID protocol, HTTP protocol, etc.) could be regarded as no more than the "resolution mechanism du jour". Perhaps this starts to explain why I keep emphasizing the distinction between identity and metadata resolution. The ZooBank registry has to think in terms of long-term identity, and assume that resolution mechanisms will continue to change as the technological wind blows.
In the case of uBio and Biodiversity Collections Index, they were set up when LSIDs were believed to be the "Next Big Thing".
Actually, all of us were implementing them at the same time. I think IPNI was one of the first; BCI came later. This all emerged from the two TDWG/GBIF GUID workshops.
That did not turn out to be the case, so those organizations are stuck with painful HTTP URIs like "http://biocol.org/urn:lsid:biocol.org:col:35115" and "http://www.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank:9..." when they could have had "http://biocol.org/35115" and "http://www.ubio.org/9479554". I would say "lesson learned" -
Ha! Hardly! We are only just now beginning to start learning lessons. Let's revisit this conversation again in a couple of decades and see how many more lessons are yet in store for us.
In any case, my family just woke up from their nap, so I'll have to look at the rest of your message later, after some time with Mickey and the gang.
Aloha, Rich
Richard L. Pyle, PhD Database Coordinator for Natural Sciences Associate Zoologist in Ichthyology Dive Safety Officer Department of Natural Sciences, Bishop Museum 1525 Bernice St., Honolulu, HI 96817 Ph: (808)848-4115, Fax: (808)847-8252 email: deepreef@bishopmuseum.org http://hbs.bishopmuseum.org/staff/pylerichard.html
_______________________________________________ tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content
Please consider the environment before printing this email Warning: This electronic message together with any attachments is confidential. If you receive it in error: (i) you must not read, use, disclose, copy or retain it; (ii) please contact the sender immediately by reply email and then delete the emails. The views expressed in this email may not be those of Landcare Research New Zealand Limited. http://www.landcareresearch.co.nz
On 08/06/2011, at 8:05 AM, Steve Baskauf wrote:
... I think it's foolish to regard all of these different resolution mechanisms as distinct "identifiers". There is *ONE* GUID. It is: A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523. There are ten different ways to make it actionable. It therefore meets the recommendations of the applicability statement. The problem is that when you create an HTTP URI out of a UUID, you are creating an identifier whether you think you are or not.
Jumping in again, but perhaps RFC 4122 might help a little here.
A GUID (or UUID) is a set of 128 bits, 16 octets, 32 hex digits, 5 inches of punched paper tape. However you choose to write or express it, there is indeed "*ONE* GUID".
A URI is not a GUID. This: http://example.org/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
is a different URI to this http://my.organisation.org/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
is a different URI to this: http://example.org/A9F435E08ED746DDBAB4EA8E5BF41523
Furthermore, these uris have nothing whatever to do with the guid - apart from the fact that it's obvious to we humans that they do.
Fortunately, there is a standard for expressing a guid/uuid as a URI, and it is the "uuid" urn namespace, defined in RFC-4122. Thus:
urn:uuid:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
is a URI that - according to a w3c standard - corresponds to the 128-bit guid. This:
urn:uuid:A9F435E08ED746DDBAB4EA8E5BF41523
is *not valid* - it doesn't conform to the schema. There is one unique (case insensitive) uuid urn for any guid, and a defined equivalence between them. These are not "cool uris", but guids are inherently uncool so that's to be expected.
If you want to use GUIDs for identifiers and need equivalent URIs (for use in RDF and the semweb), then urn:uuid:<the guid> might be a good way to go.
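A short Python sketch of the point about the "uuid" URN namespace follows. The helper names `is_valid_uuid_urn` and `to_uuid_urn` are assumptions made for illustration. Python's standard `uuid.UUID` constructor is lenient about input spelling (it accepts the hyphen-free hex string), so the strict check on the URN form is done separately with a regular expression.

```python
import re
import uuid

# Strict urn:uuid form: hyphenated 8-4-4-4-12 hex digits, compared case-insensitively.
URN_UUID = re.compile(
    r"urn:uuid:[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)

def is_valid_uuid_urn(text):
    """Return True only if text has the hyphenated urn:uuid layout."""
    return URN_UUID.fullmatch(text) is not None

def to_uuid_urn(text):
    """Return the lowercase urn:uuid form for any spelling that uuid.UUID accepts."""
    return uuid.UUID(text).urn

hyphenated = "A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523"
compact = "A9F435E08ED746DDBAB4EA8E5BF41523"

# Both spellings denote the same 128 bits, so they normalise to the same URN,
# even though only the hyphenated spelling is itself a well-formed urn:uuid.
canonical = to_uuid_urn(hyphenated)
same_bits = to_uuid_urn(compact) == canonical
```

The design choice here is to normalise on input (any spelling becomes one canonical URN) and to validate on output (only the hyphenated form is accepted as a URN).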
If you have received this transmission in error please notify us immediately by return e-mail and delete all copies. If this e-mail or any attachments have been sent to you in error, that error does not constitute waiver of any confidentiality, privilege or copyright in respect of information in the e-mail or attachments.
Please consider the environment before printing this email.
I don't think any URNs work with typical semantic web tools, do they? I.e., they don't know how to resolve them. Kevin
From: tdwg-content-bounces@lists.tdwg.org [mailto:tdwg-content-bounces@lists.tdwg.org] On Behalf Of Paul Murray Sent: Thursday, 9 June 2011 2:56 p.m. To: Steve Baskauf Cc: tdwg-content@lists.tdwg.org Subject: Re: [tdwg-content] Why UUIDs alone are not adequate as GUIDs, was Re: ITIS TSNID to uBio NamebankIDs mapping [SEC=UNCLASSIFIED]
On 09/06/2011, at 1:01 PM, Kevin Richards wrote:
I don’t think any URNs work with typical semantic web tools do they? Ie they don’t know how to resolve them.
A tool does not need to be able to resolve a uri to be able to treat it as an identifier about which it knows facts. We can say
http://example.org/Fred http://example.org/brother-of http://example.org/Sue
And let a reasoner deduce that Sue is sibling-of Fred, even with no http server at example.org. Of course, most tools on seeing a URI that is a HTTP URL will (or can) also take that additional step and pull down data about Fred and Sue - that's what "linked data" is about. But that's an additional step and is not a requirement. Without resolution, the tool knows nothing about the things identified by those ids other than what it is explicitly given. But that's valid.
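As a rough illustration of reasoning without resolution, the sketch below treats the three URIs purely as names and applies one hand-written rule. The rule (brother-of implies a symmetric sibling-of) and the `sibling-of` URI are assumptions for this example; a real reasoner would take such rules from an ontology rather than from code.

```python
EX = "http://example.org/"

# Facts are (subject, predicate, object) triples; the URIs are never dereferenced.
facts = {(EX + "Fred", EX + "brother-of", EX + "Sue")}

def infer_siblings(triples):
    """Apply a hypothetical rule: brother-of implies sibling-of in both directions."""
    inferred = set(triples)
    for s, p, o in triples:
        if p == EX + "brother-of":
            inferred.add((s, EX + "sibling-of", o))
            inferred.add((o, EX + "sibling-of", s))
    return inferred

closure = infer_siblings(facts)
sue_is_sibling_of_fred = (EX + "Sue", EX + "sibling-of", EX + "Fred") in closure
```

No network access happens anywhere in this code, which is the point: the deduction needs only the given triples.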
Valid, but often useless. :-)
From: Paul Murray [mailto:pmurray@anbg.gov.au] Sent: Thursday, 9 June 2011 3:22 p.m. To: Kevin Richards Cc: Steve Baskauf; tdwg-content@lists.tdwg.org Subject: Re: [tdwg-content] Why UUIDs alone are not adequate as GUIDs, was Re: ITIS TSNID to uBio NamebankIDs mapping [SEC=UNCLASSIFIED]
On 09/06/2011, at 1:23 PM, Kevin Richards wrote:
Valid, but often useless. :-)
The only use I can think of is if you want to give something a logical id (that is, you don't want to use a database row id for whatever reason) but that is never accessed as a thing in its own right.
For instance, you might want to identify taxon relationship objects in such a way that other inhabitants of the semantic web can annotate them, but you might not wish to serve up those objects as isolated atoms. Additionally, it may be the case that these relationships in your database do not have stable ids. Or, they might be indexed with a compound key that you do not want to issue some clunky compound URI or LSID for.
To accomplish this, you could give each relationship record a guid which would "follow it around". The relationship records are served up as part of a taxon concept - never on their own; perhaps they might have an rdfs:isDefinedBy property pointing to the taxon concept that they are primarily a part of.
In this case, naming these records with a urn:uuid: guid might make sense. A third party could keep the guid urn, the fact that it isDefinedBy the taxon id, and whatever information that that third party might want to attach to it, and expose that information to the web.
Hi all,
Another use for a GUID (if it has not previously been mentioned) even where not resolvable is to identify two items as clones of one another (for example data or metadata records residing on different systems in multiple copies) rather than different, so that for example they are only actioned or counted once (an increasing issue with proliferation of data and metadata aggregators around the place...)
Of course if the GUID is an LSID then it can indicate the issuing/custodian agency as well, which can certainly assist in determining the "point of truth" for such data items.
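A minimal sketch of that de-duplication use, assuming each harvested record carries its GUID in a `guid` field. The field names, the source labels and the second all-zero-style GUID are made up for illustration. Records are counted once per 128-bit value, however the string is spelled.

```python
import uuid

def count_distinct(records):
    """Count records once per GUID, regardless of case or urn:uuid prefix."""
    seen = set()
    for record in records:
        # uuid.UUID normalises case, hyphens and a leading 'urn:uuid:'.
        seen.add(uuid.UUID(record["guid"]))
    return len(seen)

harvested = [
    {"source": "aggregator-a", "guid": "A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523"},
    {"source": "aggregator-b", "guid": "a9f435e0-8ed7-46dd-bab4-ea8e5bf41523"},
    {"source": "aggregator-b", "guid": "urn:uuid:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523"},
    # Placeholder GUID standing in for a genuinely different record.
    {"source": "aggregator-a", "guid": "00000000-0000-0000-0000-000000000001"},
]

distinct = count_distinct(harvested)
```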
Cheers
Tony
________________________________ From: tdwg-content-bounces@lists.tdwg.org [tdwg-content-bounces@lists.tdwg.org] On Behalf Of Paul Murray [pmurray@anbg.gov.au] Sent: Thursday, 9 June 2011 5:46 PM To: Kevin Richards Cc: tdwg-content@lists.tdwg.org Subject: Re: [tdwg-content] Why UUIDs alone are not adequate as GUIDs, was Re: ITIS TSNID to uBio NamebankIDs mapping [SEC=UNCLASSIFIED]
On 09/06/2011, at 12:56 PM, Paul Murray wrote:
A URI is not a GUID. This: http://example.org/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
Thus: urn:uuid:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523 is a URI that - according to a w3c standard - corresponds to the 128-bit guid.
To make the semantic web "go" and work with the cool uris, each data source would need to declare that its cool uri is "same as" the urn:uuid: form. If a third party pulls together an ontology that includes both of those data sources, then that is enough to identify those two cool uris as being same as each other.
If the result of this is that that third party gets semantic inconsistencies (for instance, the two different data sources declaring those objects to be of incompatible types), then so be it: the data sources treat the object identified by that guid in incompatible ways. This is something that must be resolved at the human level.
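The merging step described here can be sketched with a small union-find over owl:sameAs statements. This is an illustration only: the two source lists reuse the example URIs from earlier in this thread, and `same_as_groups` is a hypothetical helper, not part of any RDF library.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
URN = "urn:uuid:a9f435e0-8ed7-46dd-bab4-ea8e5bf41523"

# Each source declares its own HTTP URI to be the same as the shared urn:uuid form.
source_a = [("http://example.org/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523", OWL_SAME_AS, URN)]
source_b = [("http://my.organisation.org/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523", OWL_SAME_AS, URN)]

def same_as_groups(triples):
    """Group URIs into equivalence classes using only the owl:sameAs statements."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for s, p, o in triples:
        if p == OWL_SAME_AS:
            parent[find(s)] = find(o)

    groups = {}
    for uri in list(parent):
        groups.setdefault(find(uri), set()).add(uri)
    return list(groups.values())

# A third party combining both sources ends up with one equivalence class.
groups = same_as_groups(source_a + source_b)
```

Neither source mentions the other's URI; the equivalence between the two HTTP URIs follows only from each being declared the same as the shared URN.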
is a URI that - according to a w3c standard - corresponds to the 128-bit guid. This:
A correction: the RFC is not a w3c standard, as such. I think RFCs belong to the IETF.
participants (16)
- Bob Morris
- Cynthia Parr
- David Remsen (GBIF)
- Dmitry Mozzherin
- greg whitbread
- Gregor Hagedorn
- Kevin Richards
- Nicolson, David
- Paul Murray
- Peter DeVries
- Richard Pyle
- Robert Huber
- Roderic Page
- Steve Baskauf
- Steven J. Baskauf
- Tony.Rees@csiro.au