[tdwg-content] ITIS TSNID to uBio NamebankIDs mapping

Tue May 31 21:16:24 CEST 2011

Dear Pete and Steve,

I cannot comment on the technical content of your emails (sorry, I'm a content guy!), but I do note this comment by Pete:
"Most of the plants also have the USDA Plants identifier. In fact you might be able to get the ITIS numbers via the USDA Plants Database."

I would not recommend getting ITIS TSNs from any other source than ITIS (see my prior email for some how-to ideas).

Firstly, the PLANTS Symbol-to-TSN matches were not always well managed due to some technical issues (involving early changes to ITIS, some due to problematic bulk-updates of ITIS from the PLANTS data, or sometimes due to other artifacts). In MOST cases the TSNs they list will be fine, but a silent subset will not.

Secondly, as I noted, we are mid-stream in a full overhaul of the vascular plant data in ITIS, in almost every case using cooperatively-produced data sets that have also been made available to PLANTS. When/whether they use them to update that database is another question, but the ITIS updates are proceeding full-steam, with additional improvements where needed.

Finally, at least when dealing with non-static data sets, I feel it is just 'best practice' to get them from the source wherever feasible, rather than from other places.

Best,
Dave

David Nicolson
Data Development Coordinator, Integrated Taxonomic Information System
Biologist, USGS Core Science Systems, Biological Informatics Program
nicolsod at si.edu    Office 202-633-2149    Fax 202-786-2934
http://www.itis.gov/
http://www.cbif.gc.ca/itis/
"Nihil sumas necesse est..."

From: Peter DeVries [mailto:pete.devries at gmail.com]
Sent: Tuesday, May 31, 2011 2:48 PM
To: Steve Baskauf
Cc: Nicolson, David; tdwg-content at lists.tdwg.org; Gerald Guala; Orrell, Thomas; Alan J Hampson
Subject: Re: [tdwg-content] ITIS TSNID to uBio NamebankIDs mapping

Hi Steve et al.,

I agree. This was one of the reasons that I set up TaxonConcept the way I did. It attempts to connect both the LOD entities and the foreign key based entities.

For example:

The Raccoon http://lod.taxonconcept.org/ses/CTZ8z.html

Has links to many other URLs and URIs as well as the integer IDs for:

EoL
NCBI
ITIS
BOLD

* For some of these it might be best to represent these as a one to many since there are often many names for each concept.

I have uBio ID's in GeoSpecies but I thought that would be eventually pulled in via the GNI.

I also have a small set of other foreign keys for things like the Hymenoptera name server, FishBase, Mushroom Observer and Tropicos.

Since these are specific to specific subsets of organisms, and came on later in my project I thought it would be best to use a separate RDF file to map to those.

For instance with Fishbase http://assets.taxonconcept.org/fb/index.rdf

Insects like this one http://lod.taxonconcept.org/ses/ICmLC.html also have the id for bugguide if it exists there and I have found it under the same name or a synonym.

Of the ~105,000 concepts I have about 47,000 with ITIS ID's. This may be useful for your plant list and I can send you a spreadsheet if that is easier.

Most of the plants also have the USDA Plants identifier. In fact you might be able to get the ITIS numbers via the USDA Plants Database.

I have come to realize that many other groups see a custom API as the solution to data access, but this requires understanding and debugging your code for each API.

Once the data is available in RDF it is one API for everything. Some issues like what to call each field can be overcome by simply rewriting (converting) the RDF.

This is easy as long as you have equivalent semantics in the meaning of the field.

For instance, it does not really matter if this name is represented as

<txn:hasScientificName>Procyon lotor</txn:hasScientificName> or <dwc:scientificName>Procyon lotor</dwc:scientificName>

The important thing to understand is that in my model this field does not include the authorship string.

This makes it easier to map this to other datasets and publications that don't include the authorship string.

<txn:scientificNameAuthorship>(Linnaeus 1758)</txn:scientificNameAuthorship>

 * The scientificNameAuthorship should eventually be mapped to a publication or a list of probable publications. As a bare string, it is too ambiguous.
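
The field rewriting described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: triples are held as plain tuples rather than parsed from RDF/XML, the `txn` namespace URI is assumed (only the Darwin Core terms namespace is known to be the published one), the subject URI is a placeholder, and `rewrite_predicates` is a hypothetical helper name. The two name fields are treated as equivalent only in the sense given in the message, i.e. a name without the authorship string.

```python
# DWC is the Darwin Core terms namespace. TXN is an ASSUMED value for the
# TaxonConcept vocabulary, used here purely for illustration.
TXN = "http://lod.taxonconcept.org/ontology/txn.owl#"
DWC = "http://rs.tdwg.org/dwc/terms/"

# Predicates treated as semantically equivalent for this sketch.
PREDICATE_MAP = {
    TXN + "hasScientificName": DWC + "scientificName",
    TXN + "scientificNameAuthorship": DWC + "scientificNameAuthorship",
}


def rewrite_predicates(triples, mapping=PREDICATE_MAP):
    """Return new (subject, predicate, object) triples with mapped predicates swapped.

    Predicates that are not in the mapping are passed through unchanged.
    """
    return [(s, mapping.get(p, p), o) for s, p, o in triples]


# Placeholder subject URI; not a real TaxonConcept identifier.
SUBJECT = "http://example.org/concept/raccoon"

source_triples = [
    (SUBJECT, TXN + "hasScientificName", "Procyon lotor"),
    (SUBJECT, TXN + "scientificNameAuthorship", "(Linnaeus 1758)"),
]

converted = rewrite_predicates(source_triples)
```

The rewrite is a pure dictionary lookup, which is why it only works when the two fields really do carry the same meaning; a field that bundles name and authorship together would need to be split first.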

There was a debate about <scientificName> earlier on the list which seemed to go back and forth.

I got tired of rewriting my examples each time and decided to use my own vocabulary that works in my example queries and has fields that map as closely to dwc as possible.

- Pete
On Tue, May 31, 2011 at 7:07 AM, Steve Baskauf <steve.baskauf at vanderbilt.edu> wrote:
I had actually written a response to this thread about a week ago in which I tried to clarify why I wanted to connect the ITIS and uBio identifiers.  However, I decided that the email was too cynical and not helpful, so I erased it.  Still, I think that a couple of the points I had in that email probably should have been made, so I will try to state them again in a more constructive manner.

My reason for wanting to connect the uBio and ITIS identifiers really had nothing to do with making use of any of the tools or services that either group provides.  Rather it has to do with my desire to follow the best practices for GUIDs as laid out in the TDWG GUID Applicability Statement (now an official standard).  In particular, I have in mind Recommendations 2 and 8, which I paraphrase here as: "make HTTP URIs out of your identifiers" and "stop making up new identifiers when somebody else already has one for the thing you are talking about".  I suppose Recommendation 10 should also be mentioned, which I paraphrase as "provide RDF/XML to users that want it".

I am actually using ITIS TSNs internally in my database.  However, last time I checked there were no GUIDs based on TSNs that met the recommendations I've paraphrased above.  (The ITIS website does mention "LSIDs" in the context of web services, but they don't follow either recommendation 2 or 10.)  However outdated they are, uBio identifiers do actually meet recommendations 2 and 10 and that is why I wanted to use them (although the http proxied forms are unnecessarily ugly and long).  So that explains in a nutshell the reason for my request.  If ITIS would provide a simple http URI form of their TSNs which could resolve via content negotiation to either HTML or RDF/XML, it would be much easier for me to just use them.

OK, here is where I risk stepping on people's toes.  So I'll try to stomp gently.  I think that the area of taxon names is one where the TDWG community fails miserably at recommendation 8.  I've lost count of the number of different kinds of identifiers that are available for referring to taxon names (this issue was discussed previously in the thread that starts with http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002231.html so I won't repeat it here).  I don't know about (nor particularly care about) "turf" in this area, but I would challenge the community to get serious about recommendation 8 and come up with some consensus about a single, universal set of GUIDs for taxon names.  Those identifiers should (in my opinion which stems from the GUID recommendations):
- be http URIs (rec 2)
- be based on an existing identifier (rec 8)
- return RDF/XML when a client requests it (rec 10)
- not change (rec 4)
I do not like proxied LSIDs (unnecessarily long with many useless characters) and I despise UUIDs (what is the point of creating a long, un-typeable string to replace a serial number that is already globally unique if appended to a domain name?).  Why not just register something like "http://purl.org/tn/" (with "tn" representing "taxon name") and stick one of the existing serial numbers onto it?  The domain name would be "turf-neutral" and anybody (GBIF, TDWG, or another organization) could manage the actual resolution through redirection from that domain.  Somebody else could take over the management of the GUIDs if the first group got tired of it or ran out of money.  The result would be a short and simple URI like "http://purl.org/tn/12345".  What would be wrong with that?  This is not rocket science and could be easily accomplished by a few tech-savvy people if the will were there.
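
As a rough sketch of the proposal above, the two moving parts are building the URI from an existing serial number and choosing a representation from the client's Accept header. Everything here is hypothetical: "http://purl.org/tn/" is only the example prefix suggested in the message, the function names are made up, and the negotiation is deliberately simplified (wildcards such as `*/*` are ignored and HTML is the fallback).

```python
# Example prefix taken from the message; it is a proposal, not a registered service.
BASE_URI = "http://purl.org/tn/"

# Representations the hypothetical resolver can serve.
SUPPORTED_TYPES = ("text/html", "application/rdf+xml")


def taxon_name_uri(serial):
    """Append an existing integer serial number to the neutral prefix."""
    return BASE_URI + str(int(serial))


def choose_representation(accept_header):
    """Pick the supported media type with the highest q-value.

    Simplified content negotiation: unsupported types and wildcards are
    skipped, ties go to the type listed first, and text/html is returned
    when nothing supported is requested.
    """
    best_type, best_q = "text/html", 0.0
    for part in accept_header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if media_type not in SUPPORTED_TYPES:
            continue
        q = 1.0  # default quality when no q parameter is given
        for field in fields[1:]:
            if field.startswith("q="):
                try:
                    q = float(field[2:])
                except ValueError:
                    q = 0.0
        if q > best_q:
            best_type, best_q = media_type, q
    return best_type
```

A browser sending `text/html,application/xhtml+xml,...` would be given the HTML page, while a client sending `Accept: application/rdf+xml` would be given RDF/XML for the same URI.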

Steve

Nicolson, David wrote:

Hi Steve (and Dave),

[NB: After having composed the email below, just before sending it, I re-read your initial email more carefully and realized that you said you already had the ITIS TSNs, and were looking to add the NamebankIDs! Doh! Well, in case you (or anyone else) are interested in methods of matching names to get TSNs, I'll go ahead and send this anyway. But do note the comments below about the ITIS "versions" and ongoing overhaul of the vascular plant data in ITIS!!! -Dave]

I noticed this just before leaving work last week, and was out yesterday, but I wanted to chime in on this. I'm glad the uBio tools are meeting your needs (they do have some cool stuff!), but it should be noted that those tools are using a static snapshot of ITIS data from January 2009, and we have added about 50,000 additional scientific names, and updated tens of thousands of names beyond that (most of that in the last 6 months, as the frequency of loads dropped off in 2009-2010 due to technical issues).

I also want to note that ITIS is right in the middle of a full update of the vascular plant data in ITIS, and we're loading updated families on a monthly basis... and at long last we are tackling all the leftover issues from several bulk loads from USDA PLANTS data that left unreconciled bits of ITIS' older vascular plant data in various confusing states... so it is a VAST improvement that is underway.

There are several options for bouncing your names off the current version of ITIS.

One is to automate a matching process using the live ITIS data, based on the existing ITIS Web Services. I am CC'ing Alan Hampson, our IT fellow who built the Web Services ( http://www.itis.gov/web_service.html ), in case you'd like to follow up with him on that option. The advantage is that once you have a process in place it is completely self-serve and can always utilize the current ITIS data. If you have the resources to do this I think it would be greatly to your advantage to use this approach.

You can explore some ideas for client software to use the services at:

http://www.itis.gov/ws_develop.html

And for more information on ITIS web services try

http://www.itis.gov/ws_description.html

http://www.itis.gov/ITISWebService.xml

The ability to flag multiply-matched names (as you noted) should probably be considered, so that appropriate manual steps can be taken. This solution will allow you to take advantage of subsequent updates to ITIS with a minimum of additional effort, and given that the plant data are in the middle of a major overhaul, this bears consideration!
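
The flagging of multiply-matched names mentioned above might look something like the following. It is a sketch only: `lookup` stands for whatever function actually queries the ITIS Web Services (no real service call is shown or implied), `classify_matches` and `fake_lookup` are hypothetical names, and the TSNs in the stand-in index are invented.

```python
def classify_matches(names, lookup):
    """Sort names into single matches, multiple matches, and no match.

    `lookup(name)` must return an iterable of TSNs for that name.
    Returns (matched, ambiguous, unmatched):
      matched   - dict of name -> the single TSN found
      ambiguous - dict of name -> sorted list of TSNs, for manual review
      unmatched - list of names with no TSN at all
    """
    matched, ambiguous, unmatched = {}, {}, []
    for name in names:
        tsns = sorted(set(lookup(name)))
        if len(tsns) == 1:
            matched[name] = tsns[0]
        elif tsns:
            ambiguous[name] = tsns
        else:
            unmatched.append(name)
    return matched, ambiguous, unmatched


# Stand-in for a live query; the TSN values are made up for illustration.
FAKE_INDEX = {
    "Acer rubrum": [111],
    "Quercus alba": [222, 333],
}


def fake_lookup(name):
    return FAKE_INDEX.get(name, [])


matched, ambiguous, unmatched = classify_matches(
    ["Acer rubrum", "Quercus alba", "Nomen ignotum"], fake_lookup
)
```

Keeping the lookup as a parameter means the same classification can be re-run unchanged against a later version of the data.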

Another possibility is to grab a full snapshot of the ITIS data, and load it into a database so you can do what you wish. The obvious drawback is that it goes out of date, as with the ITIS snapshot uBio is currently using. But it puts you in the driver's seat re what to do & getting new versions of ITIS. Some general information about the full exports is in the following page, although conspicuously absent is any mention of the MySQL version which (assuming you have the free MySQL properly installed & configured) can be loaded with just a few clicks or a few command lines (depending on your platform):

http://www.itis.gov/ftp_download.html

And the current ITIS data are all here for downloading:

http://www.itis.gov/downloads/
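
For the snapshot route, the matching itself reduces to a join between your own name list and the loaded ITIS table. The sketch below uses Python's built-in sqlite3 as a stand-in for the MySQL load mentioned above, and it assumes a table `taxonomic_units` with columns `tsn` and `complete_name`; check those names against the actual export before relying on them. The rows inserted here are invented, and `match_names` is a hypothetical helper.

```python
import sqlite3


def match_names(conn, names):
    """Return (name, tsn) rows; tsn is None where the snapshot has no match.

    A name that matches several rows in the snapshot appears once per TSN,
    so multiply-matched names stay visible.
    """
    conn.execute("CREATE TEMP TABLE IF NOT EXISTS my_names (name TEXT PRIMARY KEY)")
    conn.execute("DELETE FROM my_names")
    conn.executemany(
        "INSERT OR IGNORE INTO my_names VALUES (?)", [(n,) for n in names]
    )
    return conn.execute(
        """
        SELECT m.name, t.tsn
        FROM my_names AS m
        LEFT JOIN taxonomic_units AS t ON t.complete_name = m.name
        ORDER BY m.name, t.tsn
        """
    ).fetchall()


# Tiny in-memory stand-in for a loaded snapshot (assumed schema, made-up rows).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE taxonomic_units (tsn INTEGER PRIMARY KEY, complete_name TEXT)"
)
conn.executemany(
    "INSERT INTO taxonomic_units VALUES (?, ?)",
    [(111, "Acer rubrum"), (222, "Quercus alba"), (333, "Quercus alba")],
)

rows = match_names(conn, ["Acer rubrum", "Quercus alba", "Nomen ignotum"])
```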

A third option, which I note with some trepidation, is the old "Compare Nomenclature/Taxonomy" function on the ITIS site:

http://www.itis.gov/taxmatch_ftp.html

This is a VERY old function that we do plan on replacing (timeframe not yet certain), and it is vulnerable to timeouts, etc., which is why it notes to limit the number of names per pass. But with smaller chunks of names it does work quite well. The caveat is that I would make sure to choose the 4th option in Step 4, as it is at least aware (unlike the 3 other options) of multiply-matched name cases, and lists them separately at the bottom of the report. Just a bare listing of the scientific names, with the word "name" at the top, saved as plain text, is all that is needed for input.
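
The input format described above (a bare list of names under the word "name", as plain text, in smaller chunks) is easy to generate. A minimal sketch, assuming a chunk size of 1000, which is an arbitrary default here rather than a limit stated by the ITIS page; `build_input_files` is a hypothetical helper name.

```python
def build_input_files(names, chunk_size=1000):
    """Return a list of plain-text file bodies, each headed by the word 'name'.

    Names are whitespace-normalised, blank entries are dropped, and
    duplicates are removed while keeping the original order.
    """
    cleaned, seen = [], set()
    for raw in names:
        name = " ".join(raw.split())
        if name and name not in seen:
            seen.add(name)
            cleaned.append(name)

    chunks = []
    for start in range(0, len(cleaned), chunk_size):
        lines = ["name"] + cleaned[start:start + chunk_size]
        chunks.append("\n".join(lines) + "\n")
    return chunks


example = build_input_files(
    ["Acer rubrum", " Quercus  alba ", "Acer rubrum", ""], chunk_size=1
)
```

Each returned string can be written to its own text file and submitted in a separate pass.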

A final option would be to ask someone at ITIS to handle the matching for you (leaving you to decide re the multiply-matched names). This might be simple from your end, but is suboptimal as it leaves you in the same position as you are now should you want or need to compare names again in the future (whether due to acquiring new names in your system, or wanting to check against a later updated version of ITIS), and it pulls someone here (probably me) off of the push to get more updates into ITIS. But in a pinch, I'm certainly willing to try to help you, should it come down to that! I would just ask that you seriously consider the web services option (in particular) or the others above first.

I hope this helps some. If you have already run all your matches against the old "ITIS" data via uBio then you might consider re-running (against the current ITIS data) at least the leftover names that you did not yet get matched. Let us know if you have questions (the itiswebmaster at itis.gov address goes to myself and Alan and several others, so that might be the best bet for a follow-up unless you have a question specifically for me).

Regards,

Dave

David Nicolson
Data Development Coordinator, Integrated Taxonomic Information System
Biologist, USGS Core Science Systems, Biological Informatics Program
nicolsod at si.edu    Office 202-633-2149    Fax 202-786-2934
http://www.itis.gov/
http://www.cbif.gc.ca/itis/
"Nihil sumas necesse est..."

-----Original Message-----
Date: Fri, 20 May 2011 05:42:03 -0500
From: Steve Baskauf <steve.baskauf at vanderbilt.edu>
Subject: Re: [tdwg-content] ITIS TSNID to uBio NamebankIDs mapping
To: "David Remsen (GBIF)" <dremsen at gbif.org>
Cc: "tdwg-content at lists.tdwg.org" <tdwg-content at lists.tdwg.org>
Message-ID: <4DD6457B.2080204 at vanderbilt.edu>
Content-Type: text/plain; charset="iso-8859-1"

Thanks, all, for the responses.  The "Compare to ITIS" function does just what I want.  I did a test run of 1000 names and it worked like a charm.  I will need to do a little massaging because sometimes two or more ITIS IDs come back for each uBio ID.  But I can handle that.
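
One way the "massaging" step could go is sketched here: invert the returned (NamebankID, TSN) pairs so they are keyed by TSN, then add a new column to the spreadsheet. The column names `tsn` and `namebank_id`, the `|` separator for multiple hits, and all IDs in the example are assumptions made for illustration; `add_namebank_column` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict


def add_namebank_column(csv_text, pairs, tsn_field="tsn", new_field="namebank_id"):
    """Return CSV text with a NamebankID column appended.

    `pairs` is an iterable of (namebank_id, tsn). A TSN with several
    NamebankIDs gets them all, joined by '|', so those rows can be reviewed
    by hand; a TSN with none gets an empty cell.
    """
    by_tsn = defaultdict(set)
    for namebank_id, tsn in pairs:
        by_tsn[str(tsn)].add(str(namebank_id))

    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=list(reader.fieldnames) + [new_field], lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        row[new_field] = "|".join(sorted(by_tsn.get(row[tsn_field], ())))
        writer.writerow(row)
    return out.getvalue()


# Made-up names and identifiers, for illustration only.
table = "name,tsn\nAcer rubrum,111\nQuercus alba,222\nNomen ignotum,999\n"
pairs = [("9001", 111), ("9002", 222), ("9003", 222)]
result = add_namebank_column(table, pairs)
```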

Steve

David Remsen (GBIF) wrote:

Steve

Have you tried this?

http://www.ubio.org/clients/ITIS/index.php

or this?

http://www.ubio.org/services/mapper/index2.php

All this uBio talk makes me think we were on to something.  Worth a thought about adopting the new standards and tools and making it really smooth.

DR

On 20 May 2011, at 04:46, Steve Baskauf wrote:

I have generated a CSV spreadsheet of about 39 000 plant names for the U.S. which has the ITIS TSNIDs for the names in a column.  I would like to have the uBio Namebank IDs in another column of the table.  I have been looking them up on the uBio website by typing in the names as I need to know the IDs, but after doing about 300 of them, I'm getting tired of it.  Does anybody have a clever idea of a way to get the other 38 000 Namebank IDs without looking them up one at a time?  I'm sure that it would be possible to find this out because uBio gets names from ITIS.  However, I haven't seen any clues about how to do it in an automated fashion.  I'm guessing that there might be some way to use the uBio web services, but if so, it isn't obvious and I probably don't have the skills to carry it out anyway.

Any ideas?

Steve

--
Steven J. Baskauf, Ph.D., Senior Lecturer
Vanderbilt University Dept. of Biological Sciences

postal mail address:
VU Station B 351634
Nashville, TN  37235-1634,  U.S.A.

delivery address:
2125 Stevenson Center
1161 21st Ave., S.
Nashville, TN 37235

office: 2128 Stevenson Center
phone: (615) 343-4582,  fax: (615) 343-6707
http://bioimages.vanderbilt.edu

_______________________________________________
tdwg-content mailing list
tdwg-content at lists.tdwg.org
http://lists.tdwg.org/mailman/listinfo/tdwg-content


--
------------------------------------------------------------------------------------
Pete DeVries
Department of Entomology
University of Wisconsin - Madison
445 Russell Laboratories
1630 Linden Drive
Madison, WI 53706
Email: pdevries at wisc.edu
TaxonConcept (http://www.taxonconcept.org/)  &  GeoSpecies (http://about.geospecies.org/) Knowledge Bases
A Semantic Web, Linked Open Data (http://linkeddata.org/) Project
--------------------------------------------------------------------------------------