Re: [tdwg-content] ITIS TSNID to uBio NamebankIDs mapping

2 Jun 2011

      My email access has been sporadic since this thread developed, so at 
this point I'll respond to points made in several of the messages.

First, I should note that there has been previous discussion on this 
list on a similar topic from 
http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002231.html 
through 
http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002231.html.  
You can review what was said at that time fairly quickly by starting on
the first linked message and clicking the "Next Message" link until
you reach the end of the range given above.

My reason for the request for information that started this thread was 
that I wanted to link to a URI that would anchor the name portion of a 
name/sensu pair (TNU or Taxon Concept a la TCS if you prefer) as in this 
RDF snippet:

    <tc:nameString>Quercus rubra L.</tc:nameString>
    <tc:hasName rdf:resource="http://www.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank:448439"/>

At this point in the discussion, I'm not actually talking about creating 
a link to a taxon concept but rather to a taxon name, so some of the 
issues Pete raised don't apply here (e.g. what's the "right" name for a 
concept - the question here is simply what's a stable identifier for the 
name).  In principle, I could probably just provide the name string and 
be done with it.  However, having some degree of faith that Smart, 
Computer Savvy People might some day be able to use the metadata 
returned by the URI (or perhaps metadata which they already have in a 
triple store onsite) to do cool things like knowing that my name is the 
same as an orthographic variant or that "Quercus rubra L." is basically 
the same thing as "Quercus rubra", I would like to also provide a 
functional URI.
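
For anyone who wants to generate that pair of elements by script, here
is a minimal sketch in Python.  The proxy prefix and the LSID pattern
are taken from the uBio URI shown above; the function names are made up
for illustration and are not part of any uBio tool.  The sketch writes
the link with rdf:resource, which is the attribute RDF/XML uses on a
property element to point at another resource.

```python
from xml.sax.saxutils import escape, quoteattr

# Prefix of the proxied-LSID URI quoted in this message.
UBIO_PROXY = "http://www.ubio.org/authority/metadata.php?lsid="


def ubio_lsid(namebank_id: int) -> str:
    """Build the uBio NameBank LSID for a numeric NameBank ID."""
    return f"urn:lsid:ubio.org:namebank:{namebank_id}"


def ubio_uri(namebank_id: int) -> str:
    """Build the http URI that proxies the NameBank LSID."""
    return UBIO_PROXY + ubio_lsid(namebank_id)


def name_snippet(name_string: str, namebank_id: int) -> str:
    """Return the two RDF/XML elements: the literal name and the name link.

    The name string is XML-escaped and the URI is quoted as an attribute,
    so characters such as '&' do not break the markup.
    """
    lines = [
        f"<tc:nameString>{escape(name_string)}</tc:nameString>",
        f"<tc:hasName rdf:resource={quoteattr(ubio_uri(namebank_id))}/>",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(name_snippet("Quercus rubra L.", 448439))
```

The only inputs are the name string and the NameBank ID, so the same
function could be run over a whole spreadsheet of names.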

As an end user who isn't very interested in the technical issues 
involving names, I don't really care what URI I use.  I would prefer for 
it to be widely recognized and for it to "work" (i.e. be resolvable).  
In the earlier (January) thread, there was discussion about existing 
identifiers.  There were a number of posts, but in particular
http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002258.html
and
http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002259.html
discussed the relative merits of ITIS and uBio ID numbers.  My take-home 
message from this was that uBio represented the largest single set of 
names with assigned identifiers (see 
http://gni.globalnames.org/data_sources cited in Pete's email) and that 
uBio metadata provides useful references.  Hence my interest in 
referencing uBio ids as a URI.  However, as a practical matter, the 
organizations that I share images with either want ITIS TSNs (EOL and 
Morphbank) or just names (Discover Life).  Nobody is asking for uBio 
identifiers or any other identifier.

I found Kevin's comment at 
http://lists.tdwg.org/pipermail/tdwg-content/2011-May/002486.html very 
thought-provoking: "My thoughts are that the most likely way this will 
be solved is by standard market type pressures - ie the best 
solution/IDs will be used the most and 'float' to the top."  I'm not 
going to make a judgment about what is the "best" solution or ID.  But I 
would say that in "computer" history, being the "best" doesn't 
necessarily mean that something will be used.  Take for example, the 
FOAF vocabulary.  What the heck is Friend of a Friend?  I would venture 
to say that most of the people using the FOAF vocabulary don't know or 
care.  The FOAF vocabulary was the one that people started to use and 
once that happened, people didn't switch even if there was something 
better.  I'm not familiar with the history of other stuff like YouTube 
and Craigslist, but I would guess that they weren't necessarily "the 
best" systems - they were just the ones that the most people started 
using first and once that happened, people didn't switch.  I'm using 
ITIS IDs because they are easy to get and the people I communicate with 
want them.  Whether they are the "best" or "done correctly" doesn't 
matter to me as much as the fact that they are widely recognized 
and stable (and that thus far every name that I've looked for has been 
in their database).

I think that one reason why this question has been on my mind is that 
I've been waiting for GNUB (Global Name Use Bank) to come out.  I'm not 
really up on how it is going to work, but my impression is that it was 
going to be based on the Global Name Index (GNI) which was mentioned in 
that earlier January thread.  At that point, the GNI names didn't have 
any identifiers that were exposed to the public as permanent GUIDs.  I'm 
assuming that if GNUB refers to GNI names, they will have some kind of 
identifiers.  So if that happens, how is recommendation 8 of the GUID 
applicability guide going to be followed?  As Kevin said in 
http://lists.tdwg.org/pipermail/tdwg-content/2011-June/002499.html "What 
I take from recommendation 8 of the GUID applicability guide ... is that 
if you DON'T already have a record in your own database for a taxon 
name/concept, then reuse an existing one."  What we have here with GNI 
is a situation where none of the records have identifiers.  In my mind, 
the "best practice" according to recommendation 8 would be for the GNI 
to reuse existing identifiers where they exist and NOT make up new 
ones.  This is a bit more complicated because the ITIS identifiers 
(which are in common use) don't have an http URI version that is 
resolvable, and while the uBio identifiers have a resolvable http URI, 
it's in the form of a proxied LSID, which I've already complained is 
very ugly.  So I'd like to hear some ideas about how to have "reused" 
identifiers in the GNI.

One thing that comes to my mind would be to have a "domain name" like 
"http://purl.org/gni/" or "http://purl.org/tn/" ("tn" for "taxon name") 
and to follow it with a namespace/id combination similar to what is done 
with LSIDs.  So for example "itis/19408" and "ubio/448439" could be 
appended, creating http://purl.org/gni/itis/19408 and 
http://purl.org/gni/ubio/448439 for "Quercus rubra L."  Both URIs could 
point to the same RDF and that RDF could indicate that the two 
identifiers are owl:sameAs.  I realize from what Bob Morris has 
cautioned in the past that there are problems with owl:sameAs when the 
two things aren't actually the same thing (e.g. if the uBio ID refers to 
a name string only but the ITIS TSN refers to the name plus an 
"accepted" status and a relationship to parent taxa).  However, if there 
were an understanding that the GNI only refers to name strings, then one 
could still refer to http://purl.org/gni/itis/19408 as an identifier for 
the name string of the thing (whatever it is) that is referred to by an 
ITIS TSN of 19408.  I don't think there would be a problem saying that 
it and the uBio ID were "owl:sameAs".  Some kind of solution like this 
would allow people to easily generate a resolvable URI for a name if 
they were using ITIS TSNs or uBio IDs.  If the name that one wanted to 
use was so obscure that it was one of the 9.5 million names that uBio 
has that ITIS doesn't have, then that name would only have the uBio 
version.  I have no idea whether this would be a good idea or not, but I 
was really cringing to think about 19 million newly minted UUIDs 
appended to "http://gni.globalnames.org/" and figuring out how to 
connect those horrid things to the names and ITIS TSNs that I'm already 
using.  I think that I said this before, but using the purl.org domain 
rather than one like http://gni.globalnames.org/ would in the future 
allow somebody else to take over management of providing the metadata 
when the GUIDs are resolved without having to deal with issues of who 
"owns" the domain name.

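To make the namespace/id idea a bit more concrete, here is a rough
sketch in Python of how such URIs might be assembled and how the
owl:sameAs statements could be written out as N-Triples.  The
http://purl.org/gni/ base is only the hypothetical domain suggested
above (nothing is set up there as far as I know), and the function
names are invented for illustration.

```python
# Hypothetical base URI from the proposal above; not an existing service.
PURL_BASE = "http://purl.org/gni/"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def name_uri(namespace: str, local_id) -> str:
    """Build a namespace/id style URI, e.g. itis/19408 or ubio/448439."""
    ns = namespace.strip().lower()
    ident = str(local_id).strip()
    # Keep both parts to plain letters and digits so the URI needs no escaping.
    if not ns.isalnum() or not ident.isalnum():
        raise ValueError(f"unexpected namespace or id: {namespace!r}, {local_id!r}")
    return f"{PURL_BASE}{ns}/{ident}"


def same_as_triples(uris) -> list:
    """Return N-Triples lines linking the first URI to each of the others."""
    uris = list(uris)
    if len(uris) < 2:
        return []  # nothing to link
    first = uris[0]
    return [f"<{first}> <{OWL_SAME_AS}> <{other}> ." for other in uris[1:]]


if __name__ == "__main__":
    # The two identifiers given above for "Quercus rubra L."
    itis = name_uri("itis", 19408)
    ubio = name_uri("ubio", 448439)
    for line in same_as_triples([itis, ubio]):
        print(line)
```

A name that uBio has but ITIS lacks would simply yield a single URI and
no owl:sameAs statement.  Whether owl:sameAs is the right predicate
still depends on the caveat above, that both URIs are understood to
identify only the name string.
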
Steve

Kevin Richards wrote:
...
Pete,
I’m not trying to say what you are doing is a waste of 
time/impossible.  I actually think RDF + semantics are a good way 
forward, but this really implies that we need to rely on the semantics 
and linkages rather than having a SINGLE ID for a taxon name.  (which 
is what I thought Steve was getting at).  Each instance of a taxon 
name can have its own ID and then all these instances are connected 
via ontology defined semantic links.  This seems more appropriate to 
me than insisting everyone uses the “Global Taxon Name ID X”.
In your example of /Aedes triseriatus/ and /Ochlerotatus triseriatus/– 
these are two different names so they need two different IDs, they may 
be linked by a single taxon concept, but they are separate names.  So 
which of these now 3 IDs do you expect people to use, and according to 
what source??
For example if we have a name, eg the Robin, Erithacus rubecula, 
mentioned in ITIS (TSN: 559964) and also in EOL 
(http://www.eol.org/pages/1051567), also 
in GBIF (http://data.gbif.org/species/21266780), also in Avibase 
(http://avibase.bsc-eoc.org/species.jsp?avibaseid=C809B2B90399A43D), 
which ID are you hoping people will use??  Would you put the ITIS ID 
in your own dataset as the ID for that name – unlikely.  Or would it 
be better to link them up with semantic linkages?
What I take from recommendation 8 of the GUID applicability guide (as 
Steve puts it, “stop making up new identifiers when somebody else 
already has one for the thing you are talking about”) is that if you 
DON’T already have a record in your own database for a taxon 
name/concept, then reuse an existing one.  NOT ditch all your current 
IDs and adopt someone else’s (especially hard considering it is so 
hard to work out which of the multitude of name and concept IDs 
directly relates to your taxon name).
I am all for limiting the number of IDs for the “same” thing, but in 
some cases it is more useful to build linkages than force this tight 
integration of data and IDs.  Especially for taxon names and concepts, 
where it is complex to define if you are even talking about the “same” 
thing or not.
Kevin
*From:*Peter DeVries [mailto:pete.devries@gmail.com]
*Sent:* Wednesday, 1 June 2011 12:38 p.m.
*To:* Kevin Richards
*Cc:* Steve Baskauf; tdwg-content@lists.tdwg.org; Gerald Guala; 
Nicolson, David; Alan J Hampson; Orrell, Thomas
*Subject:* Re: [tdwg-content] ITIS TSNID to uBio NamebankIDs mapping
Hi Kevin,
I forgot to mention some other things that are different about my 
project.
You can write a simple SPARQL query to get a list of all the 
TaxonConcepts that have ITIS IDs, or all those that have ITIS and 
NCBI IDs etc.
You can do this on any SPARQL endpoint that hosts the data.
You can download the entire data set and run the queries on your own 
endpoint.
You can write a script that runs the query and downloads the ITIS 
numbers and exports them to CSV etc.
- Pete
On Tue, May 31, 2011 at 5:16 PM, Peter DeVries <pete.devries@gmail.com> wrote:
Hi Kevin,
On Tue, May 31, 2011 at 3:27 PM, Kevin Richards 
<RichardsK@landcareresearch.co.nz> wrote:
This is exactly why this problem still exists and will be very complex 
to solve - everyone says "we should have a single ID for a specific 
taxon name, there seems to be several IDs 'out there' that refer to 
the same taxon name, so Im going to create another ID to link them all 
up" - yet another ID that no one will particularly want to follow - 
you would have to get everyone to agree that your 
combinations/integration of taxon names is the best one and hope 
everyone follows it - unlikely in this domain.
Isn't this kind of what The Plant List and eBird already do?
A difference being that they tie these to a specific name and specific 
classification.
The Plant List is not really even open, so it is difficult for people to 
adopt it en masse.
For instance, if I manage a herbarium, how do I easily reconcile my 
species list with the entities represented in the Plant List?
eBird has millions of records which implies that they have been able 
to convince the observers in the field to adopt their system. You are 
correct in that there are probably a lot of taxonomists that don't 
like their list.
It differs from many of the other classifications, but remember the 
system rewards them for not agreeing. Note the difference between the 
microbial taxonomists and other taxonomists. In the case of the microbial
workers, the system rewards them for solving problems not debating 
alternatives. Also, if a good idea comes out that will make it easier 
for the microbiologists to solve the problems they are rewarded for 
solving, they are less likely to care whose idea it is.
Like the microbiologists, there are lots of biologists that work with 
species with the goal of addressing some non-taxonomic problem.
They don't really care if the name is /Aedes triseriatus/ or 
/Ochlerotatus triseriatus/, but they do care that the identifier that 
they connect their data to is stable.
In regards to the issue of market forces, I suspect (but have no 
direct knowledge) that there were probably decisions made in devising 
these lists that have more to do with appeasing certain personalities 
than creating the best list. With the way this system rewards people it is 
likely that the "correct" version will float to the top only after 
that person has passed away. I don't have much faith that the best 
system will always float to the top. That has a lot to do with the 
personalities and how the system rewards are set up. Theoretically, it 
is possible for one strong personality or group to force others to 
adopt their less than optimal solution - at least this seems to happen 
in other environments.
Also, there are all sorts of ways that people can use the publication 
record to rewrite history. Simply cite the review paper that cites the 
original paper. Or don't cite it at all.
I would have used only the ITIS TSN but if the name changes the ID 
changes. This isn't "wrong", it just does not solve my problem.
* ITIS also should add the spiders from the World Spider Catalog.
Another issue that I think has inhibited adoption of a common list is 
that people can't agree on a particular name or a particular 
classification.
Since you can model a species concept as having many names and many 
classifications why not do so?
If this idea was originally accepted, I would not have needed to 
create TaxonConcept.org.
My plan has always been to get something that works to solve some 
problems and then let some larger group take it over.
In a sense, I am more like the microbiologists in that I am not being 
paid to solve this or debate this problem.
I am doing it because I think something like this is needed, and it is 
an interesting and personally rewarding puzzle.
- Pete
My thoughts are that the most likely way this will be solved is by
    standard market type pressures - ie the best solution/IDs will be
    used the most and "float" to the top.  It is easy to say that the
    global taxon name data is a mess, but if you think about it 30
    years ago taxon name data were very disparate, duplicated,
    unconnected, many with NO IDs at all.  So I believe we are making
    progress and that we will continue to do so albeit at a fairly
    slow rate.
Kevin
"I agree. This was one of the reasons that I setup TaxonConcept
    the way I did. It attempts to connect both the LOD entities and
    the foreign key based entities."
Please consider the environment before printing this email
    Warning:  This electronic message together with any attachments is
    confidential. If you receive it in error: (i) you must not read,
    use, disclose, copy or retain it; (ii) please contact the sender
    immediately by reply email and then delete the emails.
    The views expressed in this email may not be those of Landcare
    Research New Zealand Limited. http://www.landcareresearch.co.nz
-- 
------------------------------------------------------------------------------------
Pete DeVries
Department of Entomology
University of Wisconsin - Madison
445 Russell Laboratories
1630 Linden Drive
Madison, WI 53706
Email: pdevries@wisc.edu <mailto:pdevries@wisc.edu>
TaxonConcept <http://www.taxonconcept.org/> & GeoSpecies 
<http://about.geospecies.org/> Knowledge Bases
A Semantic Web, Linked Open Data <http://linkeddata.org/>  Project
--------------------------------------------------------------------------------------
-- 
Steven J. Baskauf, Ph.D., Senior Lecturer
Vanderbilt University Dept. of Biological Sciences

postal mail address:
VU Station B 351634
Nashville, TN  37235-1634,  U.S.A.

delivery address:
2125 Stevenson Center
1161 21st Ave., S.
Nashville, TN 37235

office: 2128 Stevenson Center
phone: (615) 343-4582,  fax: (615) 343-6707
http://bioimages.vanderbilt.edu