[tdwg-content] ITIS TSNID to uBio NamebankIDs mapping

Steve Baskauf steve.baskauf at vanderbilt.edu
Tue May 31 14:07:36 CEST 2011


I had actually written a response to this thread about a week ago in 
which I tried to clarify why I wanted to connect the ITIS and uBio 
identifiers, but I decided that the email was too cynical and not 
helpful, so I erased it.  Still, I think that a couple of the points I 
had in that email probably should have been made, so I will try to state 
them again in a more constructive manner. 

My reason for wanting to connect the uBio and ITIS identifiers really 
had nothing to do with making use of any of the tools or services that 
either group provides.  Rather it has to do with my desire to follow the 
best practices for GUIDs as laid out in the TDWG GUID Applicability 
Statement (now an official standard).  In particular, I have in mind 
Recommendations 2 and 8, which I paraphrase here as: "make HTTP URIs out 
of your identifiers" and "stop making up new identifiers when somebody 
else already has one for the thing you are talking about".  I suppose 
Recommendation 10 should also be mentioned, which I paraphrase as 
"provide RDF/XML to users that want it". 

I am actually using ITIS TSNs internally in my database.  However, last 
time I checked there were no GUIDs based on TSNs that met the 
recommendations I've paraphrased above.  (The ITIS website does mention 
"LSIDs" in the context of web services, but they don't follow either 
recommendation 2 or 10.)  However outdated they are, uBio identifiers do 
actually meet recommendations 2 and 10 and that is why I wanted to use 
them (although the http proxied forms are unnecessarily ugly and long).  
So that explains in a nutshell the reason for my request.  If ITIS would 
provide a simple http URI form of their TSNs which could resolve via 
content negotiation to either HTML or RDF/XML, it would be much easier 
for me to just use them.
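
To make the content negotiation idea concrete, here is a minimal Python 
sketch of how a server might pick between HTML and RDF/XML from the HTTP 
Accept header.  It is an illustration only: ITIS does not offer URIs of 
this form, and the example.org base, the .html/.rdf targets, and the 
function names are all hypothetical.

```python
# Sketch only: ITIS does not publish URIs of this form. The example.org
# base and the .html/.rdf targets below are hypothetical.
SUPPORTED = {
    "text/html": "html",
    "application/rdf+xml": "rdf",
}
DEFAULT_TYPE = "text/html"


def parse_accept(header):
    """Return the media types in an Accept header, most preferred first."""
    prefs = []
    for position, part in enumerate(header.split(",")):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # a missing q parameter means full preference
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        if q > 0:  # q=0 marks a type the client refuses
            prefs.append((-q, position, media_type))
    return [media_type for _, _, media_type in sorted(prefs)]


def negotiate(accept_header):
    """Pick the best supported media type, falling back to HTML."""
    for media_type in parse_accept(accept_header or ""):
        if media_type in SUPPORTED:
            return media_type
        if media_type == "*/*":
            return DEFAULT_TYPE
    # A stricter server might answer 406 Not Acceptable here instead.
    return DEFAULT_TYPE


def representation_url(tsn, accept_header):
    """Return the document URL a request for a TSN URI would be sent to."""
    extension = SUPPORTED[negotiate(accept_header)]
    return "http://example.org/tsn/{0}.{1}".format(tsn, extension)


print(representation_url(12345, "application/rdf+xml"))
print(representation_url(12345, "text/html,*/*;q=0.8"))
```

A browser sending a typical Accept header would be pointed at the HTML 
page, while a semantic web client asking for application/rdf+xml would 
get the RDF document, both from the same identifier.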

OK, here is where I risk stepping on people's toes.  So I'll try to 
stomp gently.  I think that the area of taxon names is one where the 
TDWG community fails miserably at recommendation 8.  I've lost count of 
the number of different kinds of identifiers that are available for 
referring to taxon names (this issue was discussed previously in the 
thread that starts with 
http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002231.html so 
I won't repeat it here).  I don't know about (nor particularly care 
about) "turf" in this area, but I would challenge the community to get 
serious about recommendation 8 and come up with some consensus about a 
single, universal set of GUIDs for taxon names.  Those identifiers 
should (in my opinion, which stems from the GUID recommendations):
- be http URIs (rec 2)
- be based on an existing identifier (rec 8)
- return RDF/XML when a client requests it (rec 10)
- not change (rec 4)
I do not like proxied LSIDs (unnecessarily long with many useless 
characters) and I despise UUIDs (what is the point of creating a long, 
un-typeable string to replace a serial number that is already globally 
unique if appended to a domain name?).  Why not just register something 
like "http://purl.org/tn/" (with "tn" representing "taxon name") and 
stick one of the existing serial numbers onto it?  The domain name would 
be "turf-neutral" and anybody (GBIF, TDWG, or another organization) 
could manage the actual resolution through redirection from that 
domain.  Somebody else could take over the management of the GUIDs if 
the first group got tired of it or ran out of money.  The result would 
be a short and simple URI like "http://purl.org/tn/12345".  What would 
be wrong with that?  This is not rocket science and could be easily 
accomplished by a few tech-savvy people if the will were there.
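
As a rough sketch of how little machinery this would take, the Python 
below mints URIs under the proposed prefix and resolves them by 
redirection.  The "http://purl.org/tn/" prefix comes from the suggestion 
above and is not a registered namespace; the Resolver class, the 
example.org and example.net targets, and the choice of a 303 redirect 
are assumptions made for illustration.

```python
# Sketch only: "http://purl.org/tn/" is the prefix proposed in this message,
# not a registered PURL domain. Target hosts below are hypothetical.
PURL_PREFIX = "http://purl.org/tn/"


def mint_uri(serial_number):
    """Append an existing serial number to the turf-neutral prefix."""
    serial = str(serial_number).strip()
    if not (serial.isascii() and serial.isdigit()):
        raise ValueError("serial number must be all digits: %r" % serial)
    return PURL_PREFIX + serial


def extract_serial(uri):
    """Recover the serial number from a URI minted by mint_uri."""
    if not uri.startswith(PURL_PREFIX):
        raise ValueError("not a taxon name URI: %r" % uri)
    serial = uri[len(PURL_PREFIX):]
    if not (serial.isascii() and serial.isdigit()):
        raise ValueError("malformed taxon name URI: %r" % uri)
    return serial


class Resolver:
    """Redirects the stable URI to whoever currently hosts the data."""

    def __init__(self, target_template):
        # Template with an {id} placeholder, supplied by the managing group.
        self.target_template = target_template

    def resolve(self, uri):
        """Return an (HTTP status, Location) pair for the redirect."""
        serial = extract_serial(uri)
        return 303, self.target_template.format(id=serial)


uri = mint_uri(12345)
first_manager = Resolver("http://example.org/names/{id}")
second_manager = Resolver("http://example.net/taxa/{id}")

# The identifier stays the same; only the redirect target changes.
print(uri, first_manager.resolve(uri))
print(uri, second_manager.resolve(uri))
```

The point of the design is that the URI people cite never changes when 
the managing organization does; only the redirect table behind it is 
swapped out.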

Steve

Nicolson, David wrote:
> Hi Steve (and Dave),
>
> [NB: After having composed the email below, just before sending it, I re-read your initial email more carefully and realized that you said you already had the ITIS TSNs, and were looking to add the NamebankIDs! Doh! Well, in case you (or anyone else) is interested in methods of matching names to get TSNs, I'll go ahead and send this anyway. But do note the comments below about the ITIS "versions" and ongoing overhaul of the vascular plant data in ITIS!!! -Dave]
>
> I noticed this just before leaving work last week, and was out yesterday, but I wanted to chime in on this. I'm glad the uBio tools are meeting your needs (they do have some cool stuff!), but it should be noted that those tools are using a static snapshot of ITIS data from January 2009, and we have added about 50,000 additional scientific names, and updated tens of thousands of names beyond that (most of that in the last 6 months, as the frequency of loads dropped off in 2009-2010 due to technical issues).
>
> I also want to note that ITIS is right in the middle of a full update of the vascular plant data in ITIS, and we're loading updated families on a monthly basis... and at long last we are tackling all the leftover issues from several bulk loads from USDA PLANTS data that left unreconciled bits of ITIS' older vascular plant data in various confusing states... so it is a VAST improvement that is underway.
>
> There are several options for bouncing your names off the current version of ITIS.
>
> One is to automate a matching process using the live ITIS data, based on the existing ITIS Web Services. I am CC'ing Alan Hampson, our IT fellow who built the Web Services ( http://www.itis.gov/web_service.html ), in case you'd like to follow up with him on that option. The advantage is that once you have a process in place it is completely self-serve and can always utilize the current ITIS data. If you have the resources to do this I think it would be greatly to your advantage to use this approach. 
>
> You can explore some ideas for client software to use the services at: 
> http://www.itis.gov/ws_develop.html
>
> And for more information on ITIS web services try 
> http://www.itis.gov/ws_description.html
> http://www.itis.gov/ITISWebService.xml
>
> The ability to flag multiply-matched names (as you noted) should probably be considered, so that appropriate manual steps can be taken. This solution will allow you to take advantage of subsequent updates to ITIS with a minimum of additional effort, and given that the plant data are in the middle of a major overhaul, this bears consideration!
>
> Another possibility is to grab a full snapshot of the ITIS data, and load it into a database so you can do what you wish. The obvious drawback is that it goes out of date, as with the ITIS snapshot uBio is currently using. But it puts you in the driver's seat re what to do & getting new versions of ITIS. Some general information about the full exports is in the following page, although conspicuously absent is any mention of the MySQL version which (assuming you have the free MySQL properly installed & configured) can be loaded with just a few clicks or a few command lines (depending on your platform):
> http://www.itis.gov/ftp_download.html
> And the current ITIS data are all here for downloading:
> http://www.itis.gov/downloads/
>
> A third option, which I note with some trepidation, is the old "Compare Nomenclature/Taxonomy" function on the ITIS site:
> http://www.itis.gov/taxmatch_ftp.html
> This is a VERY old function that we do plan on replacing (timeframe not yet certain), and it is vulnerable to timeouts, etc., which is why it notes to limit the number of names per pass. But with smaller chunks of names it does work quite well. The caveat is that I would make sure to choose the 4th option in Step 4, as it is at least aware (unlike the 3 other options) of multiply-matched name cases, and lists them separately at the bottom of the report. Just a bare listing of the scientific names, with the word "name" at the top, saved as plain text, is all that is needed for input.
>
> A final option would be to ask someone at ITIS to handle the matching for you (leaving you to decide re the multiply-matched names). This might be simple from your end, but is suboptimal as it leaves you in the same position as you are now should you want or need to compare names again in the future (whether due to acquiring new names in your system, or wanting to check against a later updated version of ITIS), and it pulls someone here (probably me) off of the push to get more updates into ITIS. But in a pinch, I'm certainly willing to try to help you, should it come down to that! I would just ask that you seriously consider the web services option (in particular) or the others above first.
>
> I hope this helps some. If you have already run all your matches against the old "ITIS" data via uBio then you might consider re-running (against the current ITIS data) at least the leftover names that you did not yet get matched. Let us know if you have questions (the itiswebmaster at itis.gov address goes to myself and Alan and several others, so that might be the best bet for a follow-up unless you have a question specifically for me).
>
> Regards,
> Dave
>
> David Nicolson
> Data Development Coordinator, Integrated Taxonomic Information System
> Biologist, USGS Core Science Systems, Biological Informatics Program
> nicolsod at si.edu     Office 202-633-2149    Fax 202-786-2934
> http://www.itis.gov/
> http://www.cbif.gc.ca/itis/
> "Nihil sumas necesse est..."
>
>
> -----Original Message-----
> Date: Fri, 20 May 2011 05:42:03 -0500
> From: Steve Baskauf <steve.baskauf at vanderbilt.edu>
> Subject: Re: [tdwg-content] ITIS TSNID to uBio NamebankIDs mapping
> To: "David Remsen (GBIF)" <dremsen at gbif.org>
> Cc: "tdwg-content at lists.tdwg.org" <tdwg-content at lists.tdwg.org>
> Message-ID: <4DD6457B.2080204 at vanderbilt.edu>
> Content-Type: text/plain; charset="iso-8859-1"
>
> Thanks, all, for the responses.  The "Compare to ITIS" function does 
> just what I want.  I did a test run of 1000 names and it worked like a 
> charm.  I will need to do a little massaging because sometimes two or 
> more ITIS IDs come back for each uBio ID.  But I can handle that.
> Steve
>
> David Remsen (GBIF) wrote:
>   
>> Steve
>>
>> Have you tried this?
>> http://www.ubio.org/clients/ITIS/index.php
>>
>> or this?
>> http://www.ubio.org/services/mapper/index2.php
>>
>> All this ubio talk makes me think we were on to something.  Worth a thought about adopting the new standards and tools and making it really smooth.
>>
>> DR
>>
>>
>> On 20 May 2011, at 04:46, Steve Baskauf wrote:
>>
>>   
>>     
>>> I have generated a csv spreadsheet of about 39 000 plant names for the 
>>> U.S. which has the ITIS TSNIDs for the names in a column.  I would like 
>>> to have the uBio Namebank IDs in another column of the table.  I have 
>>> been looking them up on the uBio website by typing in the names as I 
>>> need to know the IDs, but after doing about 300 of them, I'm getting 
>>> tired of it.  Does anybody have a clever idea of a way to get the other 
>>> 38 000 Namebank IDs without looking them up?  I'm sure that it would be 
>>> possible to find this out because uBio gets names from ITIS.  However, I 
>>> haven't seen any clues about how to do it in an automated fashion.  I'm 
>>> guessing that there might be some way to use the uBio web services, but 
>>> if so, it isn't obvious and I probably don't have the skills to carry it 
>>> out anyway. 
>>>
>>> Any ideas?
>>> Steve
>>>
>>> -- 
>>> Steven J. Baskauf, Ph.D., Senior Lecturer
>>> Vanderbilt University Dept. of Biological Sciences
>>>
>>> postal mail address:
>>> VU Station B 351634
>>> Nashville, TN  37235-1634,  U.S.A.
>>>
>>> delivery address:
>>> 2125 Stevenson Center
>>> 1161 21st Ave., S.
>>> Nashville, TN 37235
>>>
>>> office: 2128 Stevenson Center
>>> phone: (615) 343-4582,  fax: (615) 343-6707
>>> http://bioimages.vanderbilt.edu
>>>
>>> _______________________________________________
>>> tdwg-content mailing list
>>> tdwg-content at lists.tdwg.org
>>> http://lists.tdwg.org/mailman/listinfo/tdwg-content

-- 
Steven J. Baskauf, Ph.D., Senior Lecturer
Vanderbilt University Dept. of Biological Sciences

postal mail address:
VU Station B 351634
Nashville, TN  37235-1634,  U.S.A.

delivery address:
2125 Stevenson Center
1161 21st Ave., S.
Nashville, TN 37235

office: 2128 Stevenson Center
phone: (615) 343-4582,  fax: (615) 343-6707
http://bioimages.vanderbilt.edu
