[tdwg-content] ITIS TSNID to uBio NamebankIDs mapping

David Remsen (GBIF) dremsen at gbif.org
Fri Jun 3 13:59:39 CEST 2011


Why not use the name as the basis for the resolvable identifier instead of
a UUID?  Isn't there a 1:1 cardinality between the name and the UUID in the
GNI?  Doesn't that mean that

http://gni.globalnames.org/name_strings/4ef223c4-0c3e-5e84-ace9-755c34c601ec
and
http://gni.globalnames.org/name_strings/Danaus_plexippus_(Linnaeus_1758)

are equally unique?  The latter is certainly more readable.  In those
cases where the namestring is a homonym like

http://gni.globalnames.org/name_strings/Oenanthe

couldn't you just return the addresses of the two globally unique forms of
the name when you resolve it?

http://gni.globalnames.org/name_strings/Oenanthe_Smith_1899

http://gni.globalnames.org/name_strings/Oenanthe_Jones_1900

Wouldn't those be as globally unique and easier to read and adjust to?  Or
am I missing something?  I always wanted to do that with uBio IDs after a
back and forth with Gregor Hagedorn and wished we hadn't exposed those
integers.
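
To make the idea concrete, here is a rough sketch of what I mean, not a
description of how the GNI actually works. The base URL is taken from the
examples above; the homonym table and the function names are made up for
illustration.

```python
from typing import Dict, List
from urllib.parse import quote

# Base URL copied from the example links above.
BASE = "http://gni.globalnames.org/name_strings/"

# Hypothetical index: an ambiguous name string mapped to its unique forms.
HOMONYMS: Dict[str, List[str]] = {
    "Oenanthe": ["Oenanthe Smith 1899", "Oenanthe Jones 1900"],
}


def name_to_uri(name_string: str) -> str:
    """Build a readable URI: whitespace becomes underscores, the rest is
    percent-encoded except for parentheses."""
    slug = "_".join(name_string.split())
    return BASE + quote(slug, safe="()")


def resolve(name_string: str) -> List[str]:
    """Return one URI for a unique name, or one URI per unique form when
    the name string is a known homonym."""
    forms = HOMONYMS.get(name_string, [name_string])
    return [name_to_uri(form) for form in forms]


if __name__ == "__main__":
    print(name_to_uri("Danaus plexippus (Linnaeus 1758)"))
    for uri in resolve("Oenanthe"):
        print(uri)
```

Resolving "Oenanthe" here simply hands back the two longer addresses, which
is all I am suggesting the service would need to do for a homonym.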

DR

> Hi Steve,
>
> I don't have time to go through this in detail, and I can't speak for the
> GNI, but I can tell you about how the GNI URIs work at least for now.
>
> A while back Dima Mozzherin and I were looking into how triples etc. might
> be of use to the GNI.
>
> We needed a way to generate unique URIs for each name.
>
> We wanted to avoid having to keep these in sync and not require everyone
> to look each ID up through some service.
>
> Dima came up with the following plan. We use the namestring as seed to
> generate a unique UUID.
>
> Basically this is a shared algorithm which the GNI and TaxonConcept both
> use. But it could be used by anyone.
>
> You feed the name string to the algorithm and it spits out a UUID. We
> then append that to a URI and web service so it is resolvable.
>
> So the name Danaus plexippus (Linnaeus 1758)
> => 4ef223c4-0c3e-5e84-ace9-755c34c601ec
>
> So if the GNI and another group have the same namestring they have the
> same UUID.
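
For anyone who wants to try this outside Ruby, the following is a minimal
sketch of a name-seeded UUID. The example UUID above has a version digit of
5 (the third group starts with "5"), which is consistent with name-based
SHA-1 UUIDs. The namespace used below is an assumption on my part, so the
output is not guaranteed to match the value the GNI produces, and the
function names are hypothetical.

```python
import uuid

# Assumption: the namespace is itself a version-5 UUID derived from the
# domain "globalnames.org" in the standard DNS namespace. The real GNI
# namespace may differ, in which case the resulting UUIDs will differ too.
GN_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "globalnames.org")

# Base URL copied from the example link in the message.
BASE = "http://gni.globalnames.org/name_strings/"


def name_uuid(name_string: str) -> uuid.UUID:
    """Derive a deterministic version-5 UUID from a name string."""
    return uuid.uuid5(GN_NAMESPACE, name_string)


def name_uri(name_string: str) -> str:
    """Append the UUID to the base URL to get a resolvable-looking URI."""
    return BASE + str(name_uuid(name_string))


if __name__ == "__main__":
    print(name_uri("Danaus plexippus (Linnaeus 1758)"))
```

Two parties get the same UUID only if they feed in exactly the same
characters, so spacing and punctuation in the name string matter.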
>
> People can then link their data set to the GNI with the following URI
>
> http://gni.globalnames.org/name_strings/4ef223c4-0c3e-5e84-ace9-755c34c601ec
>
> RDF
> http://gni.globalnames.org/name_strings/4ef223c4-0c3e-5e84-ace9-755c34c601ec.rdf
>
> If you think of your data set as one table and the GNI as another, this URI
> serves as the foreign key that connects them together.
>
> Some on the list don't like how these look, but there is a tremendous
> advantage in not having to worry about syncing two large data sets and
> determining if a given integer is already in use.
>
> Also Rod Page has written recently about UUIDs.
> http://iphylo.blogspot.com/2011/05/zoobank-on-couchdb-uuids-replication.html
>
> There may be a way to do something similar with bit.ly-like identifiers
> that are shorter (mCcSp), but I think the general idea is a good one.
>
> If you recall from my talk at TDWG, I was able to use these to make
> statements that one namestring was a synonym etc. of another etc.
>
> The algorithm we use is written in Ruby but it could be ported to many
> different languages since UUIDs are widely supported.
>
> Respectfully,
>
> - Pete
>
>
>
> On Thu, Jun 2, 2011 at 11:41 PM, Steven J. Baskauf <
> steve.baskauf at vanderbilt.edu> wrote:
>
>> My email access has been sporadic since this thread developed, so at
>> this point I'll respond to points made in several of the messages.
>>
>> First, I should note that there has been previous discussion on this
>> list on a similar topic from
>> http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002231.html
>> through
>> http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002231.html.
>> One can review what was said at that time rather quickly by starting on
>> the first linked message and clicking on the "Next Message" link until
>> you get to the end of the range I gave above.
>>
>> My reason for the request for information that started this thread was
>> that I wanted to link to a URI that would anchor the name portion of a
>> name/sensu pair (TNU or Taxon Concept a la TCS if you prefer) as in this
>> RDF snippet:
>>
>>    <tc:nameString>Quercus rubra L.</tc:nameString>
>>    <tc:hasName
>> rdf:resource="http://www.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank:448439"/>
>>
>>
>> At this point in the discussion, I'm not actually talking about creating
>> a link to a taxon concept but rather to a taxon name, so some of the
>> issues Pete raised don't apply here (e.g. what's the "right" name for a
>> concept - the question here is simply what's a stable identifier for the
>> name).  In principle, I could probably just provide the name string and
>> be done with it.  However, having some degree of faith that Smart,
>> Computer Savvy People might some day be able to use the metadata returned
>> by the URI (or perhaps metadata which they already have in a triple store
>> onsite) to do cool things like knowing that my name is the same as an
>> orthographic variant or that "Quercus rubra L." is basically the same
>> thing as "Quercus rubra", I would like to also provide a functional URI.
>>
>> As an end-user who isn't very interested in the technical issues
>> involving names, I don't really care what URI I use.  I would prefer for
>> it to be widely recognized and for it to "work" (i.e. be resolvable).  In
>> the earlier (January) thread, there was discussion about existing
>> identifiers.  There were a number of posts, but in particular
>> http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002258.html
>> and
>> http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002259.html
>> discussed the relative merits of ITIS and uBio ID numbers.  My take-home
>> message from this was that uBio represented the largest single set of
>> names with assigned identifiers (see
>> http://gni.globalnames.org/data_sources cited in Pete's email) and that
>> uBio metadata provides useful references.  Hence my interest in
>> referencing uBio ids as a URI.  However, as a practical matter, the
>> organizations that I share images with either want ITIS TSNs (EOL and
>> Morphbank) or just names (Discover Life).  Nobody is asking for uBio
>> identifiers or any other identifier.
>>
>> I found Kevin's comment at
>> http://lists.tdwg.org/pipermail/tdwg-content/2011-May/002486.html very
>> thought-provoking: "My thoughts are that the most likely way this will
>> be solved is by standard market type pressures - ie the best
>> solution/IDs will be used the most and 'float' to the top."  I'm not
>> going to make a judgment about what is the "best" solution or ID.  But I
>> would say that in "computer" history, being the "best" doesn't
>> necessarily mean that something will be used.  Take for example, the
>> FOAF vocabulary.  What the heck is Friend of a Friend?  I would venture
>> to say that most of the people using the FOAF vocabulary don't know or
>> care.  The FOAF vocabulary was the one that people started to use and
>> once that happened, people didn't switch even if there was something
>> better.  I'm not familiar with the history of other stuff like YouTube
>> and Craig's List, but I would guess that they weren't necessarily "the
>> best" systems - they were just the ones that the most people started
>> using first and once that happened, people didn't switch.  I'm using
>> ITIS IDs because they are easy to get and the people I communicate with
>> want them.  Whether they are the "best" or "done correctly" doesn't
>> matter to me as much as the fact that they are widely recognized and
>> stable (and that thus far every name that I've looked for has been in
>> their database).
>>
>> I think that one reason why this question has been on my mind is that
>> I've been waiting for GNUB (Global Name Use Bank) to come out.  I'm not
>> really up on how it is going to work, but my impression is that it was
>> going to be based on the Global Name Index (GNI) which was mentioned in
>> that earlier January thread.  At that point, the GNI names didn't have
>> any identifiers that were exposed to the public as permanent GUIDs.  I'm
>> assuming that if GNUB refers to GNI names, they will have some kind of
>> identifiers.  So if that happens how is the GUID recommendation 8 going
>> to be followed?  As Kevin said in
>> http://lists.tdwg.org/pipermail/tdwg-content/2011-June/002499.html "What
>> I take from recommendation 8 of the GUID applicability guide ... is that
>> if you DON'T already have a record in your own database for a taxon
>> name/concept, then reuse an existing one."  What we have here with GNI
>> is a situation where none of the records have identifiers.  In my mind,
>> the "best practice" according to recommendation 8 would be for the GNI
>> to reuse existing identifiers where they exist and NOT make up new
>> ones.  This is a bit more complicated because the ITIS identifiers
>> (which are in common use) don't have an http URI version that is
>> resolvable, and while the uBio identifiers have a resolvable http URI,
>> it's in the form of a proxied LSID, which I've already complained is
>> very ugly.  So I'd like to hear some ideas about how to have "reused"
>> identifiers in the GNI.
>>
>> One thing that comes to my mind would be to have a "domain name" like
>> "http://purl.org/gni/" or "http://purl.org/tn/" ("tn" for "taxon name")
>> and to follow it with a namespace/id combination similar to what is done
>> with lsids.  So for example "itis/19408" and "ubio/448439" could be
>> appended, creating http://purl.org/gni/itis/19408 and
>> http://purl.org/gni/ubio/448439 for "Quercus rubra L."  Both URIs could
>> point to the same RDF and that RDF could indicate that the two
>> identifiers are owl:sameAs.  I realize from what Bob Morris has
>> cautioned in the past that there are problems with owl:sameAs when the
>> two things aren't actually the same thing (e.g. if the uBio ID refers to
>> a name string only but the ITIS TSN refers to the name plus an
>> "accepted" status and a relationship to parent taxa).  However, if there
>> were an understanding that the GNI only refers to name strings, then one
>> could still refer to http://purl.org/gni/itis/19408 as an identifier for
>> the name string of the thing (whatever it is) that is referred to by an
>> ITIS TSN of 19408.  I don't think there would be a problem saying that
>> it and the uBio ID were "owl:sameAs".  Some kind of solution like this
>> would allow people to easily generate a resolvable URI for a name if
>> they were using ITIS TSNs or uBio IDs.  If the name that one wanted to
>> use was so obscure that it was one of the 9.5 million names that uBio
>> has that ITIS doesn't have, then that name would only have the uBio
>> version.  I have no idea whether this would be a good idea or not, but I
>> was really cringing to think about 19 million newly minted UUIDs
>> appended to "http://gni.globalnames.org/" and figuring out how to
>> connect those horrid things to the names and ITIS TSNs that I'm already
>> using.  I think that I said this before, but using the purl.org domain
>> rather than one like http://gni.globalnames.org/ would in the future
>> allow somebody else to take over management of providing the metadata
>> when the GUIDs are resolved without having to deal with issues of who
>> "owns" the domain name.
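
The namespace/id scheme proposed above is simple enough to sketch. This is
only an illustration of the proposal: http://purl.org/gni/ is the suggested
prefix, not an existing service, and the function names are hypothetical.
The identifiers are the ones given for "Quercus rubra L." in the message.

```python
from typing import List

# Prefix proposed in the message; not known to be a registered PURL domain.
PURL_BASE = "http://purl.org/gni/"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def purl(namespace: str, local_id: str) -> str:
    """Combine a source namespace and its local ID into one URI."""
    return f"{PURL_BASE}{namespace}/{local_id}"


def same_as_triples(uris: List[str]) -> List[str]:
    """Emit N-Triples lines chaining each URI to the next with owl:sameAs."""
    return [
        f"<{left}> <{OWL_SAME_AS}> <{right}> ."
        for left, right in zip(uris, uris[1:])
    ]


if __name__ == "__main__":
    uris = [purl("itis", "19408"), purl("ubio", "448439")]
    for line in same_as_triples(uris):
        print(line)
```

A name known only to uBio would get a single URI and no owl:sameAs
statement, which matches the case described above.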
>>
>> Steve
>>
>>
>>
>> Kevin Richards wrote:
>>
>>  Pete,
>>
>> I’m not trying to say what you are doing is a waste of time/impossible.
>> I actually think RDF + semantics are a good way forward, but this really
>> implies that we need to rely on the semantics and linkages rather than
>> having a SINGLE ID for a taxon name (which is what I thought Steve was
>> getting at).  Each instance of a taxon name can have its own ID and then
>> all these instances are connected via ontology-defined semantic links.
>> This seems more appropriate to me than insisting everyone uses the
>> “Global Taxon Name ID X”.
>>
>>
>>
>> In your example of *Aedes triseriatus* and *Ochlerotatus triseriatus* –
>> these are two different names so they need two different IDs; they may
>> be linked by a single taxon concept, but they are separate names.  So
>> which of these now 3 IDs do you expect people to use, and according to
>> what source??
>>
>>
>>
>> For example if we have a name, e.g. the Robin, Erithacus rubecula,
>> mentioned in ITIS (TSN: 559964) and also in EOL
>> (www.eol.org/pages/1051567), also in GBIF
>> (http://data.gbif.org/species/21266780), also in Avibase
>> (http://avibase.bsc-eoc.org/species.jsp?avibaseid=C809B2B90399A43D),
>> which ID are you hoping people will use??  Would you put the ITIS ID in
>> your own dataset as the ID for that name – unlikely.  Or would it be
>> better to link them up with semantic linkages?
>>
>>
>>
>> What I take from recommendation 8 of the GUID applicability guide (as
>> Steve puts it, “stop making up new identifiers when somebody else already
>> has one for the thing you are talking about”) is that if you DON’T
>> already have a record in your own database for a taxon name/concept,
>> then reuse an existing one.  NOT ditch all your current IDs and adopt
>> someone else’s (especially hard considering it is so hard to work out
>> which of the multitude of names and concept IDs directly relates to your
>> taxon name).
>>
>>
>>
>> I am all for limiting the number of IDs for the “same” thing, but in
>> some cases it is more useful to build linkages than force this tight
>> integration of data and IDs.  Especially for taxon names and concepts,
>> where it is complex to define if you are even talking about the “same”
>> thing or not.
>>
>>
>>
>> Kevin
>>
>>
>>
>> *From:* Peter DeVries [mailto:pete.devries at gmail.com]
>> *Sent:* Wednesday, 1 June 2011 12:38 p.m.
>> *To:* Kevin Richards
>> *Cc:* Steve Baskauf; tdwg-content at lists.tdwg.org; Gerald Guala;
>> Nicolson, David; Alan J Hampson; Orrell, Thomas
>> *Subject:* Re: [tdwg-content] ITIS TSNID to uBio NamebankIDs mapping
>>
>>
>>
>> Hi Kevin,
>>
>>
>>
>> I forgot to mention some other things that are different about my
>> project.
>>
>>
>>
>> You can write a simple SPARQL query to get a list of all the
>> TaxonConcepts that have ITIS IDs, or all those that have ITIS and NCBI
>> IDs, etc.
>>
>>
>>
>> You can do this on any SPARQL endpoint that hosts the data.
>>
>>
>>
>> You can download the entire data set and run the queries on your own
>> endpoint.
>>
>>
>>
>> You can write a script that runs the query and downloads the ITIS
>> numbers and exports them to CSV etc.
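
A script of that kind might look roughly like the sketch below. It only
covers the export step: it takes a result document in the standard SPARQL
JSON results format and writes CSV. The query text, the ex: prefix, the
ex:itisTSN predicate and the sample row are all invented for illustration;
the real TaxonConcept vocabulary and data will differ.

```python
import csv
import io
import json

# Hypothetical query: the prefix and predicate are placeholders, not the
# actual TaxonConcept terms.
QUERY = """
PREFIX ex: <http://example.org/terms#>
SELECT ?concept ?tsn
WHERE { ?concept ex:itisTSN ?tsn . }
"""


def results_to_csv(results_json: str) -> str:
    """Convert a SPARQL JSON results document into CSV text."""
    doc = json.loads(results_json)
    columns = doc["head"]["vars"]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(columns)
    for binding in doc["results"]["bindings"]:
        # An unbound variable is absent from the binding; write it as empty.
        writer.writerow([binding.get(c, {}).get("value", "") for c in columns])
    return out.getvalue()


# Made-up sample of what an endpoint could return for QUERY.
SAMPLE = json.dumps({
    "head": {"vars": ["concept", "tsn"]},
    "results": {"bindings": [
        {"concept": {"type": "uri", "value": "http://example.org/concept/1"},
         "tsn": {"type": "literal", "value": "19408"}},
    ]},
})

if __name__ == "__main__":
    print(results_to_csv(SAMPLE), end="")
```

Fetching the JSON from an endpoint is left out on purpose, since that part
depends on which endpoint hosts the data.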
>>
>>
>>
>> - Pete
>>
>>
>>
>> On Tue, May 31, 2011 at 5:16 PM, Peter DeVries <pete.devries at gmail.com>
>> wrote:
>>
>> Hi Kevin,
>>
>> On Tue, May 31, 2011 at 3:27 PM, Kevin Richards <
>> RichardsK at landcareresearch.co.nz> wrote:
>>
>> This is exactly why this problem still exists and will be very complex
>> to solve - everyone says "we should have a single ID for a specific
>> taxon name, there seem to be several IDs 'out there' that refer to the
>> same taxon name, so I'm going to create another ID to link them all up"
>> - yet another ID that no one will particularly want to follow - you
>> would have to get everyone to agree that your combinations/integration
>> of taxon names is the best one and hope everyone follows it - unlikely
>> in this domain.
>>
>>
>>
>> Isn't this kind of what The Plant List and eBird already do?
>>
>>
>>
>> A difference being that they tie these to a specific name and specific
>> classification.
>>
>>
>>
>> The Plant List is not really even open so it is difficult for people to
>> adopt it en masse.
>>
>>
>>
>> For instance, if I manage a herbarium, how do I easily reconcile my
>> species list with the entities represented in The Plant List?
>>
>>
>>
>> eBird has millions of records which implies that they have been able to
>> convince the observers in the field to adopt their system. You are
>> correct in that there are probably a lot of taxonomists that don't like
>> their list.
>>
>> It differs from many of the other classifications, but remember the
>> system rewards them for not agreeing. Note the difference between the
>> microbial taxonomists and other taxonomists. In the case of the
>> microbial workers, the system rewards them for solving problems not
>> debating alternatives. Also, if a good idea comes out that will make it
>> easier for the microbiologists to solve the problems they are rewarded
>> for solving, they are less likely to care whose idea it is.
>>
>>
>>
>> Like the microbiologists, there are lots of biologists that work with
>> species with the goal of addressing some non-taxonomic problem.
>>
>>
>>
>> They don't really care if the name is *Aedes triseriatus* or
>> *Ochlerotatus triseriatus*, but they do care that the identifier that
>> they connect their data to is stable.
>>
>>
>>
>> In regards to the issue of market forces, I suspect (but have no
>> knowledge of) that there were probably decisions made in devising these
>> lists that have more to do with appeasing certain personalities than
>> creating the best list. With the way this system rewards people it is
>> likely that the "correct" version will float to the top only after that
>> person has passed away. I don't have much faith that the best system
>> will always float to the top. That has a lot to do with the
>> personalities and how the system rewards are set up. Theoretically, it
>> is possible for one strong personality or group to force others to adopt
>> their less than optimal solution - at least this seems to happen in
>> other environments.
>>
>>
>>
>> Also, there are all sorts of ways that people can use the publication
>> record to rewrite history. Simply cite the review paper that cites the
>> original paper. Or don't cite it at all.
>>
>>
>>
>> I would have used only the ITIS TSN but if the name changes the ID
>> changes. This isn't "wrong", it just does not solve my problem.
>>
>>
>>
>> * ITIS also should add the spiders from the World Spider Catalog.
>>
>>
>>
>> Another issue that I think has inhibited adoption of a common list is
>> that people can't agree on a particular name or a particular
>> classification.
>>
>>
>>
>> Since you can model a species concept as having many names and many
>> classifications why not do so?
>>
>>
>>
>> If this idea was originally accepted, I would not have needed to create
>> TaxonConcept.org.
>>
>>
>>
>> My plan has always been to get something that works to solve some
>> problems and then let some larger group take it over.
>>
>>
>>
>> In a sense, I am more like the microbiologists in that I am not being
>> paid to solve this or debate this problem.
>>
>>
>>
>> I am doing it because I think something like this is needed, and it is
>> an interesting and personally rewarding puzzle.
>>
>>
>>
>> - Pete
>>
>>
>>
>>
>>
>>
>> My thoughts are that the most likely way this will be solved is by
>> standard market type pressures - ie the best solution/IDs will be used
>> the most and "float" to the top.  It is easy to say that the global
>> taxon name data is a mess, but if you think about it 30 years ago taxon
>> name data were very disparate, duplicated, unconnected, many with NO IDs
>> at all.  So I believe we are making progress and that we will continue
>> to do so albeit at a fairly slow rate.
>>
>> Kevin
>>
>>
>>
>> "I agree. This was one of the reasons that I set up TaxonConcept the way
>> I did. It attempts to connect both the LOD entities and the foreign key
>> based entities."
>>
>>  Please consider the environment before printing this email
>> Warning:  This electronic message together with any attachments is
>> confidential. If you receive it in error: (i) you must not read, use,
>> disclose, copy or retain it; (ii) please contact the sender immediately
>> by
>> reply email and then delete the emails.
>> The views expressed in this email may not be those of Landcare Research
>> New
>> Zealand Limited. http://www.landcareresearch.co.nz
>>
>>
>>
>>
>> --
>>
>> ------------------------------------------------------------------------------------
>> Pete DeVries
>> Department of Entomology
>> University of Wisconsin - Madison
>> 445 Russell Laboratories
>> 1630 Linden Drive
>> Madison, WI 53706
>> Email: pdevries at wisc.edu
>> TaxonConcept <http://www.taxonconcept.org/>  &
>> GeoSpecies<http://about.geospecies.org/> Knowledge
>> Bases
>> A Semantic Web, Linked Open Data <http://linkeddata.org/>  Project
>>
>> --------------------------------------------------------------------------------------
>>
>>
>>
>>
>>
>>
>> --
>> Steven J. Baskauf, Ph.D., Senior Lecturer
>> Vanderbilt University Dept. of Biological Sciences
>>
>> postal mail address:
>> VU Station B 351634
>> Nashville, TN  37235-1634,  U.S.A.
>>
>> delivery address:
>> 2125 Stevenson Center
>> 1161 21st Ave., S.
>> Nashville, TN 37235
>>
>> office: 2128 Stevenson Center
>> phone: (615) 343-4582,  fax: (615) 343-6707
>> http://bioimages.vanderbilt.edu
>>
>>
>
>



----------------------------------------------------------------------------
David Remsen, Senior Programme Officer
Electronic Catalog of Names of Known Organisms
Global Biodiversity Information Facility Secretariat
Universitetsparken 15, DK-2100 Copenhagen, Denmark
Tel: +45-35321472   Fax: +45-35321480
Skype: dremsen
----------------------------------------------------------------------------




