Producing a global taxon register (was: ITIS TSNID to uBio NamebankIDs mapping)
Hi all (jumping in with some trepidation...)
It's good to hear that some ramp-up of activity may be coming in the GNUB space (congratulations, Rich et al.). My main concern, however, is that it does not solve my particular problem - which is, in a nutshell: given "any" cited taxonomic name, what can we tell about it - with regard to its classification, nomenclatural and taxonomic/synonym status, and certain attributes (initially for my use case, simple geologic time - is it extant or not - and simple habitat classification - is it marine or not - though of course infinitely expandable from there).
To me the vision of GNUB is too grand - to index all usages of all names in all sources - and the vision of GNI is too limited - to index the names but not actually record/harmonise/verify/manage (in a structured way) any associated information. I'm after something in between - what I have tentatively previously called HCAL - a hierarchical catalogue of all life (presuming that at least one "management" hierarchy is incorporated) - or maybe just a GTR - global taxon register. Sort of, waiting for the Catalogue of Life and/or ITIS to be complete, for both extant and fossil taxa, and also incorporate selected "taxon attributes" as above. (This is the space into which my IRMNG database is cast as a preliminary/"working for now" solution, but obviously without the significant resourcing / community cooperation required to build and sustain the thing for the long term).
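To make the "what can we tell about it" question concrete, here is a minimal sketch of what one register entry and one lookup might look like. The class, its field names and the placeholder taxa are all assumptions made for illustration; nothing here reflects the actual IRMNG schema.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TaxonRecord:
    """One row of a hypothetical global taxon register (field names are illustrative)."""
    name: str                       # the cited name string
    status: str                     # e.g. "accepted" or "synonym"
    accepted_name: Optional[str]    # target name when status is "synonym"
    classification: list = field(default_factory=list)  # management hierarchy, root first
    is_extant: Optional[bool] = None    # None means "not recorded"
    is_marine: Optional[bool] = None

def describe(register, cited_name):
    """Return what the register can say about a cited name, following one synonym link."""
    record = register.get(cited_name)
    if record is None:
        return None
    if record.status == "synonym" and record.accepted_name in register:
        return register[record.accepted_name]
    return record

# Placeholder entries, not real taxa.
register = {
    "Aus bus": TaxonRecord("Aus bus", "accepted", None, ["Animalia", "Aidae", "Aus"], True, True),
    "Aus cus": TaxonRecord("Aus cus", "synonym", "Aus bus"),
}
```

Looking up "Aus cus" returns the record for "Aus bus", so the extant/marine flags only need to be managed once, on the accepted name.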
So my question is, how can such a product emerge from ongoing developments in GN* space, or other...
Over to the experts,
Best - Tony
________________________________________ From: tdwg-content-bounces@lists.tdwg.org [tdwg-content-bounces@lists.tdwg.org] On Behalf Of Richard Pyle [deepreef@bishopmuseum.org] Sent: Saturday, 4 June 2011 8:48 AM To: tdwg-content@lists.tdwg.org Subject: Re: [tdwg-content] ITIS TSNID to uBio NamebankIDs mapping
Working backwards through this thread...
I hadn't read Dima's post until just now, and I see that at least a couple of his points (i.e., #2, #5, #6) apply to exposing the UUIDs externally. However, I think that a simple protocol (such as replacing spaces with "_", and avoiding characters that look the same but are different -- such as the Cyrillic 'a') could go a long way to mitigating those problems.
On the other hand, it really depends on what the identifier is for. The string "Danaus_plexippus_(Linnaeus_1758)" may be more friendly to our eyes, but "A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523" is definitely more friendly to a computer (Dima's points 1, 3 & 4, among others). My feeling is that the push for GUIDs is more about enabling computer-computer conversations, than it is about enabling human-human or human-computer interactions; and therefore we should not get bogged down in the "ugliness" of the identifiers. In the context of electronic data services, the "ugliness" potential of the "Danaus_plexippus_(Linnaeus_1758)" approach to identifiers is far greater than the ugliness potential of "A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523", when it comes to interlinking electronic biodiversity data. It is nothing for a computer to render relevant metadata of the object identified by "A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523" into "Danaus plexippus (Linnaeus_1758)" on a computer screen or piece of paper for human-eyeball consumption. But there are many pitfalls (some noted by Dima) for a computer to unambiguously resolve "Danaus_plexippus_(Linnaeus_1758)" back to a meaningful data object.
I guess my revised point is: GNI (and uBio/NameBank) are essentially the only taxonomic databases out there where a human-friendly persistent/actionable identifier of the sort being discussed is even plausible as an option. It may not even be wise in this context (as per Dima's points), but it *might* be, depending on the need for a human-friendly identifier.
Maybe the simplest thing to do would be to not regard "http://gni.globalnames.org/name_strings/Danaus_plexippus_(Linnaeus_1758)" as an identifier per se, but rather as a protocol for a web service. In other words, if you append a text string to the root URL "http://gni.globalnames.org/name_strings/", GNI would run that text string against its index and return whatever metadata based on a text-string match. This is not mutually exclusive with an "identifier" in the form of "http://gni.globalnames.org/name_strings/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523", that would less ambiguously resolve a known record in GNI. At this point, the line between "identifier" and "service" gets fuzzy, of course. But the analogy is true in ZooBank:
The persistent "Identifier" looks like this: A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
One way that this identifier can be represented as an *actionable* identifier is this: urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
Another "actionable" form of the identifier might be this: http://zoobank.org/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF4...
or this: http://zoobank.org/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
or even this(?): http://lsid.tdwg.org/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5B...
(all of which work, by the way)
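As a rough sketch of how the bare identifier relates to its actionable forms, the snippet below rebuilds two of the forms quoted above from the UUID alone. The function name is made up for illustration, and only the two patterns that appear untruncated in this message are used.

```python
import uuid

def actionable_forms(identifier):
    """Build two of the actionable forms quoted above from a bare UUID string."""
    # uuid.UUID() raises ValueError if the string is not a well-formed UUID.
    canonical = str(uuid.UUID(identifier)).upper()
    return {
        "lsid": f"urn:lsid:zoobank.org:act:{canonical}",
        "http": f"http://zoobank.org/{canonical}",
    }

forms = actionable_forms("A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523")
```

The point of the sketch is that every actionable form is a mechanical wrapping of the same persistent identifier.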
However, the following are examples of what I would think of as *services*:
http://www.google.com/search?q=Danaus+plexippus+(Linnaeus+1758)
http://lsid.tdwg.org/summary/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB...
http://darwin.zoology.gla.ac.uk/~rpage/lsid/tester/?q=urn:lsid:zoobank.org:a...
But really, from the perspective of the end-user, does it matter if it's an identifier or a service? Ultimately, they ask the questions, and the answers appear on their computer screens.
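One way to picture the fuzzy line between "identifier" and "service" is a single entry point that decides which of the two it has been handed. This is only a sketch of the idea; the function name and the rule "anything that is not a UUID is a search string" are assumptions, not a description of how GNI behaves.

```python
import uuid

ROOT = "http://gni.globalnames.org/name_strings/"

def interpret(url):
    """Classify the last path segment as an identifier lookup or a text search."""
    segment = url[len(ROOT):] if url.startswith(ROOT) else url
    try:
        return ("identifier", str(uuid.UUID(segment)))
    except ValueError:
        # Anything that is not a UUID is treated as a name string to match.
        return ("search", segment.replace("_", " "))
```

From the end-user's side both calls look the same; only the first is guaranteed to land on exactly one record.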
Aloha, Rich
-----Original Message----- From: tdwg-content-bounces@lists.tdwg.org [mailto:tdwg-content-bounces@lists.tdwg.org] On Behalf Of Dmitry Mozzherin Sent: Friday, June 03, 2011 4:34 AM To: David Remsen (GBIF) Cc: tdwg-content@lists.tdwg.org; Dmitry Mozzherin; Orrell, Thomas; Alan J Hampson; Nicolson, David; Gerald Guala Subject: Re: [tdwg-content] ITIS TSNID to uBio NamebankIDs mapping
In my opinion UUIDs have a few advantages over strings --
1. It is a UUID, so it will work with UUID tools (current and future ones).
2. It is less ambiguous. For example, what is the difference between Betulа and Betula to your eyes? (One of them has a Cyrillic 'a'.)
3. Database-wise it is faster to search, because it is just a 128-bit number, while a name is at least a 245-byte varchar; in relational databases the size of the keys directly affects search speed.
4. UUID v. 5 (http://en.wikipedia.org/wiki/Universally_unique_identifier) allows a UUID to be generated algorithmically without looking up a database (no need for a network connection).
5. Links like http://gni.globalnames.org/name_strings/Danaus_plexippus_(Linnaeus_1758) might be ambiguous -- I can think of several ways to represent the name-string part in the URL, and they will all resolve to the same thing in GNI.
6. Unescaped Unicode characters in URLs containing literal name strings (people will forget to escape them) will depend on the implementation of the URL resolver.
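Point 2 can be checked mechanically. The sketch below flags letters whose Unicode character name is not LATIN; the function name is invented here, and a real service would probably need a fuller confusables check than this.

```python
import unicodedata

def non_latin_letters(name):
    """Return the letters in a name string whose Unicode name is not LATIN."""
    return [
        ch for ch in name
        if ch.isalpha() and not unicodedata.name(ch, "").startswith("LATIN")
    ]

latin = "Betula"
mixed = "Betul\u0430"  # final letter is CYRILLIC SMALL LETTER A
```

The two strings look identical on screen but compare as different, which is exactly the ambiguity a name-based identifier inherits.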
Having said this, links like http://gni.globalnames.org/name_strings/Danaus_plexippus_(Linnaeus_1758) are definitely attractive, and it is good to have them as another way to access a name! My personal preference would be not to use them as the main identifier, because of reasons 1, 2, 3 and 5.
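Points 5 and 6 are easy to demonstrate: several different URL segments can all stand for the same name string. The normalization rule below (percent-decode, treat "_" as a space, collapse whitespace) is a hypothetical one chosen for the sketch, not the rule GNI applies.

```python
from urllib.parse import quote, unquote

def normalize_segment(segment):
    """Reduce a URL path segment to one canonical name string (hypothetical rule)."""
    text = unquote(segment).replace("_", " ")
    return " ".join(text.split())  # collapse runs of whitespace

name = "Danaus plexippus (Linnaeus 1758)"
variants = [
    "Danaus_plexippus_(Linnaeus_1758)",
    quote(name),                           # percent-encoded spaces and parentheses
    "Danaus%20plexippus_(Linnaeus_1758)",  # a mixture of both conventions
]
```

All three variants normalize to the same name, so none of them can serve as the one canonical identifier without an agreed convention.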
Dima
On Fri, Jun 3, 2011 at 7:59 AM, David Remsen (GBIF) dremsen@gbif.org wrote:
Why not use the name as the basis for the resolvable identifier instead of a UUID? Isn't there a 1:1 cardinality between the name and the UUID in the GNI? Doesn't that mean that
http://gni.globalnames.org/name_strings/4ef223c4-0c3e-5e84-ace9-755c34c601ec and
http://gni.globalnames.org/name_strings/Danaus_plexippus_(Linnaeus_175
are equally unique? The latter is certainly more readable. In those cases where the namestring is a homonym like
http://gni.globalnames.org/name_strings/Oenanthe
couldn't you just return the addresses of the two globally unique forms of the name when you resolve it?
http://gni.globalnames.org/name_strings/Oenanthe_Smith_1899
http://gni.globalnames.org/name_strings/Oenanthe_Jones_1900
Wouldn't those be as globally unique and easier to read and adjust to? Or am I missing something. I always wanted to do that with ubio IDs after a back and forth with Gregor Hagedorn and wished we hadn't exposed those integers.
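A small sketch of the homonym behaviour suggested here: a bare name resolves to the list of its fully qualified forms. The index is a placeholder built from the example names in this message, and `resolve` is a made-up function, not a GNI call.

```python
ROOT = "http://gni.globalnames.org/name_strings/"

# Placeholder index taken from the example names in the message above.
INDEX = {
    "Oenanthe": ["Oenanthe_Smith_1899", "Oenanthe_Jones_1900"],
}

def resolve(segment):
    """Return every URL a segment could mean; more than one signals a homonym."""
    full_forms = INDEX.get(segment, [segment])
    return [ROOT + form for form in full_forms]
```

A caller that gets back more than one URL knows it has to pick among homonyms before linking any data.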
DR
Hi Steve,
I don't have time to go through this in detail, and I can't speak for the GNI, but I can tell you about how the GNI URI's work at least for now.
A while back Dima Mozzherin and I were looking into how triples etc. might be of use to the GNI.
We needed a way to generate unique URI's for each name.
We wanted to avoid having to keep these in sync and not require everyone to look each ID up through some service.
Dima came up with the following plan. We use the namestring as seed to generate a unique UUID.
Basically this is a shared algorithm which the GNI and TaxonConcept both use. But it could be used by anyone.
You feed the name string to the algorithm and it spits out a UUID. We then append that to a URI and web service so it is resolvable.
So the name Danaus plexippus (Linnaeus 1758) => 4ef223c4-0c3e-5e84-ace9-755c34c601ec
So if the GNI and another group have the same namestring they have the same UUID.
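The general technique can be sketched with a version 5 (name-based) UUID, which is what Dima's point 4 elsewhere in this thread refers to. The namespace below is an assumption: the thread does not state which namespace the shared algorithm uses, so the values produced here should not be expected to match the UUID quoted above.

```python
import uuid

# Assumption: a namespace derived from a DNS name. The namespace the GNI
# actually uses is not stated in this thread, so the UUIDs produced here
# are not expected to match the ones quoted in the messages.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "globalnames.org")

def name_uuid(name_string):
    """Derive a version 5 UUID from a name string, with no database lookup."""
    return uuid.uuid5(NAMESPACE, name_string)

a = name_uuid("Danaus plexippus (Linnaeus 1758)")
b = name_uuid("Danaus plexippus (Linnaeus 1758)")
c = name_uuid("Danaus plexippus")
```

Two parties running this independently get the same UUID for the same string, while any difference in the string (here, the missing authorship) gives a different one.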
People can then link their data set to the GNI with the following URI
http://gni.globalnames.org/name_strings/4ef223c4-0c3e-5e84-ace9-755c34c601ec
RDF http://gni.globalnames.org/name_strings/4ef223c4-0c3e-5e84-ace9-755c34c601ec.rdf
If you think of your data set as one table and the GNI as another, this URI serves as the foreign key that connects them together.
Some on the list don't like how these look, but there is a tremendous advantage in not having to worry about syncing two large data sets and determining if a given integer is already in use.
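The foreign-key analogy can be sketched with two small in-memory tables. The specimen rows are invented placeholders and `join` is a hypothetical helper; the only thing taken from the thread is the URI itself.

```python
GNI_ROOT = "http://gni.globalnames.org/name_strings/"
MONARCH = GNI_ROOT + "4ef223c4-0c3e-5e84-ace9-755c34c601ec"

# A local data set that stores the shared URI alongside its own columns.
# The specimen rows are invented placeholders.
local_rows = [
    {"specimen": "S-001", "name_uri": MONARCH},
    {"specimen": "S-002", "name_uri": MONARCH},
]

# A stand-in for the remote table, keyed by the same URI.
gni_rows = {MONARCH: {"name_string": "Danaus plexippus (Linnaeus 1758)"}}

def join(rows, remote):
    """Attach the remote name string to each local row via the shared URI."""
    return [
        {**row, **remote[row["name_uri"]]}
        for row in rows
        if row["name_uri"] in remote
    ]

joined = join(local_rows, gni_rows)
```

Neither side had to ask the other for a key in advance, which is the advantage being described.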
Also, Rod Page has written recently about UUIDs: http://iphylo.blogspot.com/2011/05/zoobank-on-couchdb-uuids-replication.html
There may be a way to do something similar with bit.ly-like identifiers that are shorter (mCcSp), but I think the general idea is a good one.
If you recall from my talk at TDWG, I was able to use these to make statements that one namestring was a synonym etc. of another etc.
The algorithm we use is written in Ruby, but it could be ported to many different languages since UUIDs are widely supported.
Respectfully,
- Pete
On Thu, Jun 2, 2011 at 11:41 PM, Steven J. Baskauf < steve.baskauf@vanderbilt.edu> wrote:
My email access has been sporadic since this thread developed, so at this point I'll respond to points made in several of the messages.
First, I should note that there has been previous discussion on this list on a similar topic from http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002231.html through http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002231.html.
One can review what was said at that time rather quickly by starting on the first linked message and clicking on the "Next Message" link until you get to the end of the range I gave above.
My reason for the request for information that started this thread was that I wanted to link to a URI that would anchor the name portion of a name/sensu pair (TNU or Taxon Concept a la TCS if you prefer) as in this RDF snippet:
<tc:nameString>Quercus rubra L.</tc:nameString>
<tc:hasName rdf:about="http://www.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank:448439"/>
At this point in the discussion, I'm not actually talking about creating a link to a taxon concept but rather to a taxon name, so some of the issues Pete raised don't apply here (e.g. what's the "right" name for a concept - the question here is simply what's a stable identifier for the name). In principle, I could probably just provide the name string and be done with it. However, having some degree of faith that Smart, Computer Savvy People might some day be able to use the metadata returned by the URI (or perhaps metadata which they already have in a triple store onsite) to do cool things like knowing that my name is the same as an orthographic variant or that "Quercus rubra L." is basically the same thing as "Quercus rubra", I would like to also provide a functional URI.
As an end-user who isn't very interested in the technical issues involving names, I don't really care what URI I use. I would prefer for it to be widely recognized and for it to "work" (i.e. be resolvable). In the earlier (January) thread, there was discussion about existing identifiers. There were a number of posts, but in particular http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002258.html http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002259.html discussed the relative merits of ITIS and uBio ID numbers. My take-home message from this was that uBio represented the largest single set of names with assigned identifiers (see http://gni.globalnames.org/data_sources cited in Pete's email) and that uBio metadata provides useful references. Hence my interest in referencing uBio IDs as a URI. However, as a practical matter, the organizations that I share images with either want ITIS TSNs (EOL and Morphbank) or just names (Discover Life). Nobody is asking for uBio identifiers or any other identifier.
I found Kevin's comment at http://lists.tdwg.org/pipermail/tdwg-content/2011-May/002486.html very thought-provoking: "My thoughts are that the most likely way this will be solved is by standard market type pressures - ie the best solution/IDs will be used the most and 'float' to the top." I'm not going to make a judgment about what is the "best" solution or ID. But I would say that in "computer" history, being the "best" doesn't necessarily mean that something will be used. Take, for example, the FOAF vocabulary. What the heck is Friend of a Friend? I would venture to say that most of the people using the FOAF vocabulary don't know or care. The FOAF vocabulary was the one that people started to use, and once that happened, people didn't switch even if there was something better. I'm not familiar with the history of other stuff like YouTube and Craig's List, but I would guess that they weren't necessarily "the best" systems - they were just the ones that the most people started using first, and once that happened, people didn't switch. I'm using ITIS IDs because they are easy to get and the people I communicate with want them. Whether they are the "best" or "done correctly" doesn't matter to me as much as the fact that they are widely recognized and stable (and that thus far every name that I've looked for has been in their database).
I think that one reason why this question has been on my mind is that I've been waiting for GNUB (Global Name Use Bank) to come out. I'm not really up on how it is going to work, but my impression is that it was going to be based on the Global Name Index (GNI) which was mentioned in that earlier January thread. At that point, the GNI names didn't have any identifiers that were exposed to the public as permanent GUIDs. I'm assuming that if GNUB refers to GNI names, they will have some kind of identifiers. So if that happens how is the GUID recommendation 8 going to be followed? As Kevin said in http://lists.tdwg.org/pipermail/tdwg-content/2011-June/002499.html "What I take from recommendation 8 of the GUID applicability guide ... is that if you DON'T already have a record in your own database for a taxon name/concept, then reuse an existing one. " What we have here with GNI is a situation where none of the records have identifiers. In my mind, the "best practice" according to recommendation 8 would be for the GNI to reuse existing identifiers where they exist and NOT make up new ones. This is a bit more complicated because the ITIS identifiers (which are in common use) don't have an http URI version that is resolvable, and while the uBio identifiers have a resolvable http URI, it's in the form of a proxied LSID, which I've already complained is very ugly. So I'd like to hear some ideas about how to have "reused" identifiers in the GNI.
One thing that comes to my mind would be to have a "domain name" like "http://purl.org/gni/" or "http://purl.org/tn/" ("tn" for "taxon name") and to follow it with a namespace/id combination similar to what is done with LSIDs. So for example "itis/19408" and "ubio/448439" could be appended, creating http://purl.org/gni/itis/19408 and http://purl.org/gni/ubio/448439 for "Quercus rubra L." Both URIs could point to the same RDF and that RDF could indicate that the two identifiers are owl:sameAs. I realize from what Bob Morris has cautioned in the past that there are problems with owl:sameAs when the two things aren't actually the same thing (e.g. if the uBio ID refers to a name string only but the ITIS TSN refers to the name plus an "accepted" status and a relationship to parent taxa). However, if there were an understanding that the GNI only refers to name strings, then one could still refer to http://purl.org/gni/itis/19408 as an identifier for the name string of the thing (whatever it is) that is referred to by an ITIS TSN of 19408. I don't think there would be a problem saying that it and the uBio ID were "owl:sameAs". Some kind of solution like this would allow people to easily generate a resolvable URI for a name if they were using ITIS TSNs or uBio IDs. If the name that one wanted to use was so obscure that it was one of the 9.5 million names that uBio has that ITIS doesn't have, then that name would only have the uBio version. I have no idea whether this would be a good idea or not, but I was really cringing to think about 19 million newly minted UUIDs appended to "http://gni.globalnames.org/" and figuring out how to connect those horrid things to the names and ITIS TSNs that I'm already using.
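The proposed scheme is simple enough to sketch. The purl.org/gni root is only the proposal made in this message, not a service known to exist, and the two helper functions are invented for illustration; the owl:sameAs property URI is the standard one.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
PURL_ROOT = "http://purl.org/gni/"  # proposed in the message, not an existing service

def purl(namespace, local_id):
    """Build a URI of the proposed form <root><namespace>/<id>."""
    return f"{PURL_ROOT}{namespace}/{local_id}"

def same_as(subject, obj):
    """Serialize one owl:sameAs statement as an N-Triples line."""
    return f"<{subject}> <{OWL_SAME_AS}> <{obj}> ."

triple = same_as(purl("itis", 19408), purl("ubio", 448439))
```

Anyone holding an ITIS TSN or a uBio ID could build the URI locally, with no lookup, which is the attraction of reusing existing identifiers.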
I think that I said this before, but using the purl.org domain rather than one like http://gni.globalnames.org/ would in the future allow somebody else to take over management of providing the metadata when the GUIDs are resolved without having to deal with issues of who "owns" the domain name.
Steve
Kevin Richards wrote:
Pete,
I'm not trying to say what you are doing is a waste of time/impossible. I actually think RDF + semantics are a good way forward, but this really implies that we need to rely on the semantics and linkages rather than having a SINGLE ID for a taxon name. (which is what I thought Steve was getting at). Each instance of a taxon name can have its own ID and then all these instances are connected via ontology defined semantic links. This seems more appropriate to me than insisting everyone uses the "Global Taxon Name ID X".
In your example of *Aedes triseriatus* and *Ochlerotatus triseriatus* - these are two different names so they need two different IDs, they may be linked by a single taxon concept, but they are separate names. So which of these now 3 IDs do you expect people to use, and according to what source??
For example if we have a name, eg the Robin, Erithacus rubecula, mentioned in ITIS (TSN: 559964) and also in EOL (www.eol.org/pages/1051567), also in GBIF (http://data.gbif.org/species/21266780), also in avibase (http://avibase.bsc-eoc.org/species.jsp?avibaseid=C809B2B90399A43D), which ID are you hoping people will use?? Would you put the ITIS ID in your own dataset as the ID for that name - unlikely. Or would it be better to link them up with semantic linkages?
What I take from recommendation 8 of the GUID applicability guide (as Steve puts it, "stop making up new identifiers when somebody else already has one for the thing you are talking about") is that if you DON'T already have a record in your own database for a taxon name/concept, then reuse an existing one. NOT ditch all your current IDs and adopt someone else's (especially hard considering it is so hard to work out which of the multitude of name and concept IDs directly relates to your taxon name).
I am all for limiting the number of IDs for the "same" thing, but in some cases it is more useful to build linkages than force this tight integration of data and IDs. Especially for taxon names and concepts, where it is complex to define if you are even talking about the "same" thing or not.
Kevin
*From:* Peter DeVries [mailto:pete.devries@gmail.com]
*Sent:* Wednesday, 1 June 2011 12:38 p.m. *To:* Kevin Richards *Cc:* Steve Baskauf; tdwg-content@lists.tdwg.org; Gerald Guala; Nicolson, David; Alan J Hampson; Orrell, Thomas *Subject:* Re: [tdwg-content] ITIS TSNID to uBio NamebankIDs mapping
Hi Kevin,
I forgot to mention some other things that are different about my project.
You can write a simple SPARQL query to get a list of all the TaxonConcept's that have ITIS ids, or all those that have ITIS and NCBI ID's etc.
You can do this on any SPARQL endpoint that hosts the data.
You can download the entire data set and run the queries on your own endpoint.
You can write a script that runs the query and downloads the ITIS numbers and exports them to CSV etc.
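A minimal sketch of the "run the query and export to CSV" step, working on a canned response so that it needs no network access. The prefix and the `ex:hasITISid` property are placeholders, not terms from the TaxonConcept vocabulary, and the sample concept URI is invented; the response layout follows the standard SPARQL JSON results format.

```python
import csv
import io

# Hypothetical query: the prefix and property name are placeholders, not
# terms taken from the TaxonConcept vocabulary.
QUERY = """
PREFIX ex: <http://example.org/terms#>
SELECT ?concept ?itis WHERE { ?concept ex:hasITISid ?itis }
"""

def bindings_to_csv(result):
    """Convert a SPARQL JSON result document into CSV text."""
    columns = result["head"]["vars"]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(columns)
    for row in result["results"]["bindings"]:
        writer.writerow([row.get(col, {}).get("value", "") for col in columns])
    return out.getvalue()

# A canned response in the standard SPARQL JSON results layout.
sample = {
    "head": {"vars": ["concept", "itis"]},
    "results": {"bindings": [
        {"concept": {"type": "uri", "value": "http://example.org/concept/1"},
         "itis": {"type": "literal", "value": "19408"}},
    ]},
}
```

In practice the `sample` dictionary would be replaced by the JSON returned from whichever endpoint hosts the data.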
- Pete
On Tue, May 31, 2011 at 5:16 PM, Peter DeVries wrote:
Hi Kevin,
On Tue, May 31, 2011 at 3:27 PM, Kevin Richards < RichardsK@landcareresearch.co.nz> wrote:
This is exactly why this problem still exists and will be very complex to solve - everyone says "we should have a single ID for a specific taxon name, there seem to be several IDs 'out there' that refer to the same taxon name, so I'm going to create another ID to link them all up" - yet another ID that no one will particularly want to follow - you would have to get everyone to agree that your combinations/integration of taxon names is the best one and hope everyone follows it - unlikely in this domain.
Isn't this kind of what The Plant List and eBird already do?
A difference being that they tie these to a specific name and specific classification.
The Plant List is not really even open, so it is difficult for people to adopt it en masse.
For instance, if I manage a herbarium, how do I easily reconcile my species list with the entities represented in the Plant List?
eBird has millions of records which implies that they have been able to convince the observers in the field to adopt their system. You are correct in that there are probably a lot of taxonomists that don't like their list.
It differs from many of the other classifications, but remember the system rewards them for not agreeing. Note the difference between the microbial taxonomists and other taxonomists. In the case of the microbial workers, the system rewards them for solving problems, not debating alternatives. Also, if a good idea comes out that will make it easier for the microbiologists to solve the problems they are rewarded for solving, they are less likely to care whose idea it is.
Like the microbiologists, there are lots of biologists that work with species with the goal of addressing some non-taxonomic problem.
They don't really care if the name is *Aedes triseriatus* or *Ochlerotatus triseriatus*, but they do care that the identifier that they connect their data to is stable.
In regard to the issue of market forces, I suspect (but have no knowledge) that there were probably decisions made in devising these lists that had more to do with appeasing certain personalities than with creating the best list. With the way this system rewards people, it is likely that the "correct" version will float to the top only after that person has passed away. I don't have much faith that the best system will always float to the top. That has a lot to do with the personalities and how the system rewards are set up. Theoretically, it is possible for one strong personality or group to force others to adopt their less-than-optimal solution - at least this seems to happen in other environments.
Also, there are all sorts of ways that people can use the publication record to rewrite history. Simply cite the review paper that cites the original paper. Or don't cite it at all.
I would have used only the ITIS TSN but if the name changes the ID changes. This isn't "wrong", it just does not solve my problem.
- ITIS also should add the spiders from the World Spider Catalog.
Another issue that I think has inhibited adoption of a common list is that people can't agree on a particular name or a particular classification.
Since you can model a species concept as having many names and many classifications why not do so?
If this idea was originally accepted, I would not have needed to create TaxonConcept.org.
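The modelling idea described above, one stable concept identifier carrying many names and many classifications, might look something like this. The class, the identifier "concept-0001" and the source labels are placeholders made up for the sketch; this is not the TaxonConcept.org data model.

```python
from dataclasses import dataclass, field

@dataclass
class SpeciesConcept:
    """A concept with one stable identifier and any number of names."""
    concept_id: str                                       # stays fixed when names change
    names: list = field(default_factory=list)
    classifications: dict = field(default_factory=dict)   # source -> lineage

    def add_name(self, name):
        if name not in self.names:
            self.names.append(name)

# "concept-0001" and the source labels are placeholders.
mosquito = SpeciesConcept("concept-0001")
mosquito.add_name("Aedes triseriatus")
mosquito.add_name("Ochlerotatus triseriatus")
mosquito.classifications["source A"] = ["Culicidae", "Aedes"]
mosquito.classifications["source B"] = ["Culicidae", "Ochlerotatus"]
```

Data linked to `concept_id` is unaffected when the preferred name or the genus placement changes.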
My plan has always been to get something that works to solve some problems and then let some larger group take it over.
In a sense, I am more like the microbiologists in that I am not being paid to solve this or debate this problem.
I am doing it because I think something like this is needed, and it is an interesting and personally rewarding puzzle.
- Pete
My thoughts are that the most likely way this will be solved is by standard market type pressures - ie the best solution/IDs will be used the most and "float" to the top. It is easy to say that the global taxon name data is a mess, but if you think about it, 30 years ago taxon name data were very disparate, duplicated, unconnected, many with NO IDs at all. So I believe we are making progress and that we will continue to do so, albeit at a fairly slow rate.
Kevin
"I agree. This was one of the reasons that I setup TaxonConcept the way I did. It attempts to connect both the LOD entities and the foreign key based entities."
Please consider the environment before printing this email Warning: This electronic message together with any attachments is confidential. If you receive it in error: (i) you must not read, use, disclose, copy or retain it; (ii) please contact the sender immediately by reply email and then delete the emails. The views expressed in this email may not be those of Landcare Research New Zealand Limited. http://www.landcareresearch.co.nz
--
Pete DeVries Department of Entomology University of Wisconsin - Madison 445 Russell Laboratories 1630 Linden Drive Madison, WI 53706 Email: pdevries@wisc.edu TaxonConcept http://www.taxonconcept.org/ & GeoSpecies http://about.geospecies.org/ Knowledge Bases A Semantic Web, Linked Open Data http://linkeddata.org/ Project
-- Steven J. Baskauf, Ph.D., Senior Lecturer Vanderbilt University Dept. of Biological Sciences
postal mail address: VU Station B 351634 Nashville, TN 37235-1634, U.S.A.
delivery address: 2125 Stevenson Center 1161 21st Ave., S. Nashville, TN 37235
office: 2128 Stevenson Center phone: (615) 343-4582, fax: (615) 343-6707 http://bioimages.vanderbilt.edu
tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content
David Remsen, Senior Programme Officer Electronic Catalog of Names of Known Organisms Global Biodiversity Information Facility Secretariat Universitetsparken 15, DK-2100 Copenhagen, Denmark Tel: +45-35321472 Fax: +45-35321480 Skype: dremsen
Hi Tony,
This is probably best handled with a coordinated set of vocabularies. Some could apply to a large set of taxa while others would need to be somewhat specific probably at the level of family or lower.
I think that, with the instability of classification and the uncertainty of reliably reasoning over millions of records, it might be best to apply these properties at the level of species rather than something like Order.
For instance do all Diptera have two wings?
I made up an example for testing etc. under GeoSpecies that you can see here: http://about.geospecies.org/sparql.xhtml#example_8
Some things to think about are attributes like "Common" or "Rare". How does one apply these to taxa as different as mosquitoes and elephants?
I did not mention this in my previous sparql example, but there is a specific species of mosquito that seems to be dependent on a specific pitcher plant to reproduce.
One could infer that the distribution of this species in Wisconsin is limited by the distribution of that Pitcher Plant.
There are a whole host of other relationships such as pollinators/plants, pathogen/vectors, predator/prey that could be modeled and tested using these tools.
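The pitcher-plant inference can be sketched as a set intersection: an obligate dependent can only occur where its host does. The region labels are invented for illustration and stand in for real occurrence data; the function name is likewise made up.

```python
# Placeholder occurrence data: the region labels are invented for illustration.
host_plant_regions = {"region A", "region B"}
suitable_climate_regions = {"region A", "region B", "region C"}

def potential_range(climate_regions, host_regions):
    """Regions where an obligate dependent can occur: its host must be present."""
    return climate_regions & host_regions

mosquito_range = potential_range(suitable_climate_regions, host_plant_regions)
```

The same pattern would apply to pollinator/plant or pathogen/vector pairs, provided the dependency is recorded at species level.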
Respectfully,
- Pete
2011/6/3 Tony.Rees@csiro.au
Hi all (jumping in with some trepidation...)
It's good to hear some ramp-up may be coming of activity in the GNUB space (congratulations, Rich et al.). My main concern, however is that it does not solve my particular problem - which is in a nutshell, given "any" cited taxonomic name, what can we tell about it - with regard to its classification, nomenclatural and taxonomic/synonym status, and certain attributes (initially for my use case, simple geologic time - is it extant or not - and simple habitat classification - is it marine or not - though of course infinitely expandable from there).
To me the vision of GNUB is too grand - to index all usages of all names in all sources - and the vision of GNI is too limited - to index the names but not actually record/harmonise/verify/manage (in a structured way) any associated information. I'm after something in between - what I have tentatively previously called HCAL - a hierarchical catalogue of all life (presuming that at least one "management" hierarchy is incorporated) - or maybe just a GTR - global taxon register. Sort of, waiting for the Catalogue of Life and/or ITIS to be complete, for both extant and fossil taxa, and also incorporate selected "taxon attributes" as above. (This is the space into which my IRMNG database is cast as a preliminary/"working for now" solution, but obviously without the significant resourcing / community cooperation required to build and sustain the thing for the long term).
So my question is, how can such a product emerge from ongoing developments in GN* space, or other...
Over to the experts,
Best - Tony
From: tdwg-content-bounces@lists.tdwg.org [ tdwg-content-bounces@lists.tdwg.org] On Behalf Of Richard Pyle [ deepreef@bishopmuseum.org] Sent: Saturday, 4 June 2011 8:48 AM To: tdwg-content@lists.tdwg.org Subject: Re: [tdwg-content] ITIS TSNID to uBio NamebankIDs mapping
Working backwards through this thread...
I hadn't read Dima's post until just now, and I see that at least a couple of his points (i.e., #2, #5, #6) apply to exposing the UUIDs externally. However, I think that a simple protocol (such as replacing spaces with "_", and avoiding characters that look the same but are different -- such as the Cyrillic 'a') could go a long way to mitigating those problems.
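A minimal sketch of such a protocol, in Python, is given here for illustration only. The function names `has_non_latin_letters` and `name_to_slug` are hypothetical, and rejecting every non-Latin letter is an assumption chosen for simplicity; nothing here is taken from GNI's actual code.

```python
import unicodedata


def has_non_latin_letters(name: str) -> bool:
    """Return True if any alphabetic character is not a Latin-script letter."""
    for ch in name:
        # unicodedata.name() gives e.g. "LATIN SMALL LETTER A" or
        # "CYRILLIC SMALL LETTER A"; the default "" covers unnamed code points.
        if ch.isalpha() and not unicodedata.name(ch, "").startswith("LATIN"):
            return True
    return False


def name_to_slug(name: str) -> str:
    """Collapse runs of whitespace and join the parts with underscores."""
    if has_non_latin_letters(name):
        raise ValueError("name string contains non-Latin letters")
    normalized = unicodedata.normalize("NFC", name)
    return "_".join(normalized.split())


print(name_to_slug("Danaus plexippus (Linnaeus 1758)"))
```

Rejecting the string outright is the bluntest option; a real service might instead map known look-alikes to their Latin counterparts.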
On the other hand, it really depends on what the identifier is for. The string "Danaus_plexippus_(Linnaeus_1758)" may be more friendly to our eyes, but "A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523" is definitely more friendly to a computer (Dima's points 1, 3 & 4, among others). My feeling is that the push for GUIDs is more about enabling computer-computer conversations, than it is about enabling human-human or human-computer interactions; and therefore we should not get bogged down in the "ugliness" of the identifiers. In the context of electronic data services, the "ugliness" potential of the "Danaus_plexippus_(Linnaeus_1758)" approach to identifiers is far greater than the ugliness potential of "A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523", when it comes to interlinking electronic biodiversity data. It is nothing for a computer to render relevant metadata of the object identified by "A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523" into "Danaus plexippus (Linnaeus_1758)" on a computer screen or piece of paper for human-eyeball consumption. But there are many pitfalls (some noted by Dima) for a computer to unambiguously resolve "Danaus_plexippus_(Linnaeus_1758)" back to a meaningful data object.
I guess my revised point is: GNI (and uBio/NameBank) are essentially the only taxonomic databases out there where a human-friendly persistent/actionable identifier of the sort being discussed is even plausible as an option. It may not even be wise in this context (as per Dima's points), but it *might* be, depending on the need for a human-friendly identifier.
Maybe the simplest thing to do would be to not regard "http://gni.globalnames.org/name_strings/Danaus_plexippus_(Linnaeus_1758)" as an identifier per se, but rather as a protocol for a web service. In other words, if you append a text string to the root URL "http://gni.globalnames.org/name_strings/", GNI would run that text string against its index and return whatever metadata based on a text-string match. This is not mutually exclusive with an "identifier" in the form of "http://gni.globalnames.org/name_strings/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523", that would less ambiguously resolve a known record in GNI. At this point, the line between "identifier" and "service" gets fuzzy, of course. But the analogy is true in ZooBank:
The persistent "Identifier" looks like this: A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
One way that this identifier can be represented as an *actionable* identifier is this: urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
Another "actionable" form of the identifier might be this:
http://zoobank.org/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF4...
or this: http://zoobank.org/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
or even this(?):
http://lsid.tdwg.org/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5B...
(all of which work, by the way)
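The relationship between the bare identifier and its actionable forms can be sketched in a few lines of Python. This is only an illustration of the string patterns listed above; the function name `zoobank_forms` and the dictionary keys are hypothetical, and the sketch does not check whether any of the URLs still resolve.

```python
import uuid


def zoobank_forms(identifier: str) -> dict:
    """Derive the actionable forms of a ZooBank identifier from the bare UUID."""
    # uuid.UUID() raises ValueError if the string is not a well-formed UUID.
    canonical = str(uuid.UUID(identifier)).upper()
    lsid = f"urn:lsid:zoobank.org:act:{canonical}"
    return {
        "bare": canonical,
        "lsid": lsid,
        "http_uuid": f"http://zoobank.org/{canonical}",
        "http_lsid": f"http://zoobank.org/{lsid}",
    }


for label, value in zoobank_forms("A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523").items():
    print(label, value)
```

Every form is a pure function of the same 128-bit value, which is why the choice between them matters more to people than to machines.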
However, the following are examples of what I would think of as *services*: http://www.google.com/search?q=Danaus+plexippus+(Linnaeus+1758)
http://lsid.tdwg.org/summary/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB...
http://darwin.zoology.gla.ac.uk/~rpage/lsid/tester/?q=urn:lsid:zoobank.org:a...
But really, from the perspective of the end-user, does it matter if it's an identifier or a service? Ultimately, they ask the questions, and the answers appear on their computer screens.
Aloha, Rich
-----Original Message----- From: tdwg-content-bounces@lists.tdwg.org [mailto:tdwg-content-bounces@lists.tdwg.org] On Behalf Of Dmitry Mozzherin Sent: Friday, June 03, 2011 4:34 AM To: David Remsen (GBIF) Cc: tdwg-content@lists.tdwg.org; Dmitry Mozzherin; Orrell, Thomas; Alan J Hampson; Nicolson, David; Gerald Guala Subject: Re: [tdwg-content] ITIS TSNID to uBio NamebankIDs mapping
In my opinion UUIDs have a few advantages over strings --
1. It is a uuid, so it will work with uuid tools (current and future ones)
2. It is less ambiguous -- for example, what is the difference between Betulа and Betula for your eyes? (one of them has a Cyrillic 'a')
3. Database-wise it is faster to search because it is just a 128-bit number, while a name is at least 245 byte varchar -- in relational databases the size of the keys directly affects search speed
4. UUID v. 5 (http://en.wikipedia.org/wiki/Universally_unique_identifier) allows a UUID to be generated algorithmically without looking up a database (no need for a network connection)
5. Links like http://gni.globalnames.org/name_strings/Danaus_plexippus_(Linnaeus_1758) might be ambiguous -- I can think of several ways I can represent the name string part in the url and they will all resolve to the same thing in GNI.
6. Unescaped unicode characters in urls containing literal name strings (people will forget to escape them) will depend on the implementation of the url resolver
Having said this, links like http://gni.globalnames.org/name_strings/Danaus_plexippus_(Linnaeus_1758) are definitely attractive and it is good to have them as another way to access a name! My personal preference would be not to use them as the main identifier because of reasons 1, 2, 3 and 5.
Dima
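Points 3, 5 and 6 can be illustrated with a short Python sketch using only the standard library. The helper names are hypothetical, and the sketch says nothing about how GNI's resolver actually treats these URLs; it only shows that several distinct URL spellings decode to one name string, that a Cyrillic look-alike changes the escaped form, and how a UUID key compares in size with a name string.

```python
import uuid
from urllib.parse import quote, unquote

ROOT = "http://gni.globalnames.org/name_strings/"
NAME = "Danaus_plexippus_(Linnaeus_1758)"


def url_spellings(name: str) -> list:
    """Three different URL spellings of the same name string (point 5)."""
    return [
        ROOT + name,                      # literal parentheses
        ROOT + quote(name, safe=""),      # both parentheses percent-encoded
        ROOT + name.replace("(", "%28"),  # only the opening one encoded
    ]


def key_sizes(name: str, identifier: str) -> tuple:
    """Byte length of the UTF-8 name versus the binary UUID (point 3)."""
    return len(name.encode("utf-8")), len(uuid.UUID(identifier).bytes)


spellings = url_spellings(NAME)
decoded = {unquote(s) for s in spellings}
print(len(set(spellings)), "spellings,", len(decoded), "decoded name string")

# Point 6: a Cyrillic 'a' (U+0430) looks like Latin 'a' but escapes differently.
print(quote("Betula"), quote("Betul\u0430"))

print(key_sizes(NAME, "4ef223c4-0c3e-5e84-ace9-755c34c601ec"))
```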
On Fri, Jun 3, 2011 at 7:59 AM, David Remsen (GBIF) dremsen@gbif.org wrote:
Why not use the name as the basis for the resolvable identifier instead of a uuid? Isn't there a 1:1 cardinality between the name and the uuid in the GNI? Doesn't that mean that
http://gni.globalnames.org/name_strings/4ef223c4-0c3e-5e84-ace9-755c34c601ec and
http://gni.globalnames.org/name_strings/Danaus_plexippus_(Linnaeus_1758)
are equally unique? The latter is certainly more readable. In those cases where the namestring is a homonym like
http://gni.globalnames.org/name_strings/Oenanthe
couldn't you just return the addresses of the two globally unique forms of the name when you resolve it?
http://gni.globalnames.org/name_strings/Oenanthe_Smith_1899
http://gni.globalnames.org/name_strings/Oenanthe_Jones_1900
Wouldn't those be as globally unique and easier to read and adjust to? Or am I missing something? I always wanted to do that with uBio IDs after a back and forth with Gregor Hagedorn and wished we hadn't exposed those integers.
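The homonym behaviour suggested here could be sketched as follows. This is a toy illustration in Python, not a description of any GNI service: the function `resolve` and the `RECORDS` table are hypothetical, and the two qualified forms are simply the placeholder examples from the message.

```python
def resolve(slug: str, records: dict) -> list:
    """Return the fully qualified forms that a name-string slug could refer to.

    `records` maps each fully qualified form to its bare name.
    """
    if slug in records:
        # Already a unique, fully qualified form.
        return [slug]
    # A bare name: return every qualified form that shares it.
    return sorted(full for full, bare in records.items() if bare == slug)


RECORDS = {
    "Oenanthe_Smith_1899": "Oenanthe",
    "Oenanthe_Jones_1900": "Oenanthe",
}

print(resolve("Oenanthe", RECORDS))
print(resolve("Oenanthe_Smith_1899", RECORDS))
```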
DR
Hi Steve,
I don't have time to go through this in detail, and I can't speak for the GNI, but I can tell you about how the GNI URIs work, at least for now.
A while back Dima Mozzherin and I were looking into how triples etc. might be of use to the GNI.
We needed a way to generate unique URIs for each name.
We wanted to avoid having to keep these in sync and not require everyone to look each ID up through some service.
Dima came up with the following plan. We use the namestring as seed to generate a unique UUID.
Basically this is a shared algorithm which the GNI and TaxonConcept both use. But it could be used by anyone.
You feed the name string to the algorithm and it spits out a UUID. We then append that to a URI and web service so it is resolvable.
So the name Danaus plexippus (Linnaeus 1758) => 4ef223c4-0c3e-5e84-ace9-755c34c601ec
So if the GNI and another group have the same namestring they have the same UUID.
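The idea of seeding a UUID from the name string can be sketched with Python's standard `uuid` module, which implements name-based (version 5) UUIDs. The namespace below is an assumption made for illustration, so the output is not guaranteed to match the UUID quoted above; the function names are hypothetical as well.

```python
import uuid

# Assumption: a namespace derived from the DNS name "globalnames.org".
# The thread does not state which namespace the shared algorithm uses.
ASSUMED_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "globalnames.org")


def name_string_uuid(name_string: str) -> uuid.UUID:
    """Derive a version 5 UUID from a name string, with no database lookup."""
    return uuid.uuid5(ASSUMED_NAMESPACE, name_string)


def name_string_uri(name_string: str) -> str:
    """Append the derived UUID to the GNI root URL used in this thread."""
    return f"http://gni.globalnames.org/name_strings/{name_string_uuid(name_string)}"


print(name_string_uri("Danaus plexippus (Linnaeus 1758)"))
```

Because the UUID is a function of the namespace and the string alone, two groups that agree on both will compute the same identifier independently, without any network connection.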
People can then link their data set to the GNI with the following URI
http://gni.globalnames.org/name_strings/4ef223c4-0c3e-5e84-ace9-755c34c601ec
RDF http://gni.globalnames.org/name_strings/4ef223c4-0c3e-5e84-ace9-755c34c601ec.rdf
If you think of your data set as one table and the GNI as another, this URI serves as the foreign key that connects them together.
Some on the list don't like how these look, but there is a tremendous advantage in not having to worry about syncing two large data sets and determining if a given integer is already in use.
Also Rod Page has written recently about UUIDs. http://iphylo.blogspot.com/2011/05/zoobank-on-couchdb-uuids-replication.html
There may be a way to do something similar with bit.ly like identifiers that are shorter (mCcSp), but I think the general idea is a good one.
If you recall from my talk at TDWG, I was able to use these to make statements that one namestring was a synonym etc. of another etc.
The algorithm we use is written in Ruby but it could be ported to many different languages since UUIDs are widely supported.
Respectfully,
- Pete
On Thu, Jun 2, 2011 at 11:41 PM, Steven J. Baskauf <steve.baskauf@vanderbilt.edu> wrote:
My email access has been sporadic since this thread developed, so at this point I'll respond to points made in several of the messages.
First, I should note that there has been previous discussion on this list on a similar topic from http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002231.html through http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002231.html.
One can review what was said at that time rather quickly by starting on the first linked message and clicking on the "Next Message" link until you get to the end of the range I gave above.
My reason for the request for information that started this thread was that I wanted to link to a URI that would anchor the name portion of a name/sensu pair (TNU or Taxon Concept a la TCS if you prefer) as in this RDF snippet:
<tc:nameString>Quercus rubra L.</tc:nameString>
<tc:hasName rdf:about="http://www.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank:448439"/>
At this point in the discussion, I'm not actually talking about creating a link to a taxon concept but rather to a taxon name, so some of the issues Pete raised don't apply here (e.g. what's the "right" name for a concept; the question here is simply what's a stable identifier for the name).
In principle, I could probably just provide the name string and be done with it. However, having some degree of faith that Smart, Computer Savvy People might some day be able to use the metadata returned by the URI (or perhaps metadata which they already have in a triple store onsite) to do cool things like knowing that my name is the same as an orthographic variant or that "Quercus rubra L." is basically the same thing as "Quercus rubra", I would like to also provide a functional URI.
As an end-user who isn't very interested in the technical issues involving names, I don't really care what URI I use. I would prefer for it to be widely recognized and for it to "work" (i.e. be resolvable). In the earlier (January) thread, there was discussion about existing identifiers. There were a number of posts, but in particular http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002258.html and http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002259.html discussed the relative merits of ITIS and uBio ID numbers. My take-home message from this was that uBio represented the largest single set of names with assigned identifiers (see http://gni.globalnames.org/data_sources cited in Pete's email) and that uBio metadata provides useful references. Hence my interest in referencing uBio ids as a URI. However, as a practical matter, the organizations that I share images with either want ITIS TSNs (EOL and Morphbank) or just names (Discover Life). Nobody is asking for uBio identifiers or any other identifier.
I found Kevin's comment at http://lists.tdwg.org/pipermail/tdwg-content/2011-May/002486.html very thought-provoking: "My thoughts are that the most likely way this will be solved is by standard market type pressures - ie the best solution/IDs will be used the most and 'float' to the top." I'm not going to make a judgment about what is the "best" solution or ID. But I would say that in "computer" history, being the "best" doesn't necessarily mean that something will be used. Take for example, the FOAF vocabulary. What the heck is Friend of a Friend? I would venture to say that most of the people using the FOAF vocabulary don't know or care. The FOAF vocabulary was the one that people started to use and once that happened, people didn't switch even if there was something better. I'm not familiar with the history of other stuff like YouTube and Craig's List, but I would guess that they weren't necessarily "the best" systems - they were just the ones that the most people started using first and once that happened, people didn't switch. I'm using ITIS IDs because they are easy to get and the people I communicate with want them. Whether they are the "best" or "done correctly" doesn't matter to me as much as the fact that they are widely recognized and stable (and that thus far every name that I've looked for has been in their database).
I think that one reason why this question has been on my mind is that I've been waiting for GNUB (Global Name Use Bank) to come out. I'm not really up on how it is going to work, but my impression is that it was going to be based on the Global Name Index (GNI) which was mentioned in that earlier January thread. At that point, the GNI names didn't have any identifiers that were exposed to the public as permanent GUIDs. I'm assuming that if GNUB refers to GNI names, they will have some kind of identifiers. So if that happens how is the GUID recommendation 8 going to be followed? As Kevin said in http://lists.tdwg.org/pipermail/tdwg-content/2011-June/002499.html "What I take from recommendation 8 of the GUID applicability guide ... is that if you DON'T already have a record in your own database for a taxon name/concept, then reuse an existing one. " What we have here with GNI is a situation where none of the records have identifiers. In my mind, the "best practice" according to recommendation 8 would be for the GNI to reuse existing identifiers where they exist and NOT make up new ones. This is a bit more complicated because the ITIS identifiers (which are in common use) don't have an http URI version that is resolvable, and while the uBio identifiers have a resolvable http URI, it's in the form of a proxied LSID, which I've already complained is very ugly. So I'd like to hear some ideas about how to have "reused" identifiers in the GNI.
One thing that comes to my mind would be to have a "domain name" like "http://purl.org/gni/" or "http://purl.org/tn/" ("tn" for "taxon name") and to follow it with a namespace/id combination similar to what is done with lsids. So for example "itis/19408" and "ubio/448439" could be appended, creating http://purl.org/gni/itis/19408 and http://purl.org/gni/ubio/448439 for "Quercus rubra L." Both URIs could point to the same RDF and that RDF could indicate that the two identifiers are owl:sameAs. I realize from what Bob Morris has cautioned in the past that there are problems with owl:sameAs when the two things aren't actually the same thing (e.g. if the uBio ID refers to a name string only but the ITIS TSN refers to the name plus an "accepted" status and a relationship to parent taxa). However, if there were an understanding that the GNI only refers to name strings, then one could still refer to http://purl.org/gni/itis/19408 as an identifier for the name string of the thing (whatever it is) that is referred to by an ITIS TSN of 19408. I don't think there would be a problem saying that it and the uBio ID were "owl:sameAs". Some kind of solution like this would allow people to easily generate a resolvable URI for a name if they were using ITIS TSNs or uBio IDs. If the name that one wanted to use was so obscure that it was one of the 9.5 million names that uBio has that ITIS doesn't have, then that name would only have the uBio version. I have no idea whether this would be a good idea or not, but I was really cringing to think about 19 million newly minted UUIDs appended to "http://gni.globalnames.org/" and figuring out how to connect those horrid things to the names and ITIS TSNs that I'm already using.
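The proposed namespace/id pattern can be sketched in a few lines of Python. Note that http://purl.org/gni/ is only the proposal made in this message, not an existing service, and the function names `purl_uri` and `same_as_triple` are hypothetical. The statement is written in N-Triples syntax using the standard owl:sameAs IRI.

```python
PURL_ROOT = "http://purl.org/gni/"  # proposed in this thread, not a live service
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def purl_uri(namespace: str, local_id: str) -> str:
    """Build a URI of the form <root><namespace>/<id>, e.g. itis/19408."""
    return f"{PURL_ROOT}{namespace}/{local_id}"


def same_as_triple(subject_uri: str, object_uri: str) -> str:
    """Serialize an owl:sameAs statement between two URIs as one N-Triples line."""
    return f"<{subject_uri}> <{OWL_SAME_AS}> <{object_uri}> ."


itis_uri = purl_uri("itis", "19408")
ubio_uri = purl_uri("ubio", "448439")
print(same_as_triple(itis_uri, ubio_uri))
```

Anyone holding an ITIS TSN or a uBio ID could build the URI locally; whether owl:sameAs is the right predicate depends on the caveat about name strings discussed above.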
I think that I said this before, but using the purl.org domain rather than one like http://gni.globalnames.org/ would in the future allow somebody else to take over management of providing the metadata when the GUIDs are resolved without having to deal with issues of who "owns" the domain name.
Steve
Kevin Richards wrote:
Pete,
I'm not trying to say what you are doing is a waste of time/impossible.
I actually think RDF + semantics are a good way forward, but this really implies that we need to rely on the semantics and linkages rather than having a SINGLE ID for a taxon name. (which is what I thought Steve was getting at). Each instance of a taxon name can have its own ID and then all these instances are connected via ontology defined semantic links. This seems more appropriate to me than insisting everyone uses the "Global Taxon Name ID X".
In your example of *Aedes triseriatus* and *Ochlerotatus triseriatus* - these are two different names so they need two different IDs, they may be linked by a single taxon concept, but they are separate names. So which of these now 3 IDs do you expect people to use, and according to what source??
For example if we have a name, e.g. the Robin, Erithacus rubecula, mentioned in ITIS (TSN: 559964) and also in EOL (www.eol.org/pages/1051567), also in GBIF (http://data.gbif.org/species/21266780), also in Avibase (http://avibase.bsc-eoc.org/species.jsp?avibaseid=C809B2B90399A43D), which ID are you hoping people will use? Would you put the ITIS ID in your own dataset as the ID for that name? Unlikely. Or would it be better to link them up with semantic linkages?
What I take from recommendation 8 of the GUID applicability guide (as Steve puts it, "stop making up new identifiers when somebody else already has one for the thing you are talking about") is that if you DON'T already have a record in your own database for a taxon name/concept, then reuse an existing one. NOT ditch all your current IDs and adopt someone else's (especially hard considering it is so hard to work out which of the multitude of name and concept IDs directly relates to your taxon name).
I am all for limiting the number of IDs for the "same" thing, but in some cases it is more useful to build linkages than force this tight integration of data and IDs. Especially for taxon names and concepts, where it is complex to define if you are even talking about the "same" thing or not.
Kevin
*From:* Peter DeVries [mailto:pete.devries@gmail.com]
*Sent:* Wednesday, 1 June 2011 12:38 p.m. *To:* Kevin Richards *Cc:* Steve Baskauf; tdwg-content@lists.tdwg.org; Gerald Guala; Nicolson, David; Alan J Hampson; Orrell, Thomas *Subject:* Re: [tdwg-content] ITIS TSNID to uBio NamebankIDs mapping
Hi Kevin,
I forgot to mention some other things that are different about my project.
You can write a simple SPARQL query to get a list of all the TaxonConcepts that have ITIS ids, or all those that have ITIS and NCBI IDs etc.
You can do this on any SPARQL endpoint that hosts the data.
You can download the entire data set and run the queries on your own endpoint.
You can write a script that runs the query and downloads the ITIS numbers and exports them to CSV etc.
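A script of that kind might look roughly like the following Python sketch. The predicate `ex:itisTsn`, the `example.org` URIs and the sample result are all placeholders invented for illustration, since the thread does not give the actual vocabulary; the query is shown but not executed. The only real convention relied on is the standard SPARQL JSON results layout (`head.vars` and `results.bindings`).

```python
import csv
import io
import json

# Hypothetical query: ex:itisTsn is a placeholder predicate, not a real term.
QUERY = """
PREFIX ex: <http://example.org/terms#>
SELECT ?concept ?tsn WHERE { ?concept ex:itisTsn ?tsn . }
"""

# Placeholder response in the standard SPARQL JSON results format.
SAMPLE_RESPONSE = json.dumps({
    "head": {"vars": ["concept", "tsn"]},
    "results": {"bindings": [
        {"concept": {"type": "uri", "value": "http://example.org/concept/1"},
         "tsn": {"type": "literal", "value": "19408"}},
    ]},
})


def bindings_to_csv(results_json: str) -> str:
    """Convert a SPARQL JSON result set into CSV text, one row per binding."""
    data = json.loads(results_json)
    variables = data["head"]["vars"]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(variables)
    for row in data["results"]["bindings"]:
        # Unbound variables are written as empty cells.
        writer.writerow([row.get(v, {}).get("value", "") for v in variables])
    return out.getvalue()


print(bindings_to_csv(SAMPLE_RESPONSE))
```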
- Pete
On Tue, May 31, 2011 at 5:16 PM, Peter DeVries wrote:
Hi Kevin,
On Tue, May 31, 2011 at 3:27 PM, Kevin Richards <RichardsK@landcareresearch.co.nz> wrote:
This is exactly why this problem still exists and will be very complex to solve - everyone says "we should have a single ID for a specific taxon name, there seems to be several IDs 'out there' that refer to the same taxon name, so I'm going to create another ID to link them all up" - yet another ID that no one will particularly want to follow - you would have to get everyone to agree that your combinations/integration of taxon names is the best one and hope everyone follows it - unlikely in this domain.
Isn't this kind of what The Plant List and eBird already do?
A difference being that they tie these to a specific name and specific classification.
The Plant List is not really even open, so it is difficult for people to adopt it en masse.
For instance, if I manage a herbarium, how do I easily reconcile my species list with the entities represented in the Plant List?
eBird has millions of records, which implies that they have been able to convince the observers in the field to adopt their system. You are correct in that there are probably a lot of taxonomists that don't like their list.
It differs from many of the other classifications, but remember the system rewards them for not agreeing. Note the difference between the microbial taxonomists and other taxonomists. In the case of the microbial workers, the system rewards them for solving problems not debating alternatives. Also, if a good idea comes out that will make it easier for the microbiologists to solve the problems they are rewarded for solving, they are less likely to care whose idea it is.
Like the microbiologists, there are lots of biologists that work with species with the goal of addressing some non-taxonomic problem.
They don't really care if the name is *Aedes triseriatus* or *Ochlerotatus triseriatus*, but they do care that the identifier that they connect their data to is stable.
In regards to the issue of market forces, I suspect (but have no knowledge of) that there were probably decisions made in devising these lists that have more to do with appeasing certain personalities than creating the best list. With the way this system rewards people it is likely that the "correct" version will float to the top only after that person has passed away. I don't have much faith that the best system will always float to the top. That has a lot to do with the personalities and how the system rewards are set up. Theoretically, it is possible for one strong personality or group to force others to adopt their less than optimal solution - at least this seems to happen in other environments.
Also, there are all sorts of ways that people can use the publication record to rewrite history. Simply cite the review paper that cites the original paper. Or don't cite it at all.
I would have used only the ITIS TSN but if the name changes the ID changes. This isn't "wrong", it just does not solve my problem.
- ITIS also should add the spiders from the World Spider Catalog.
Another issue that I think has inhibited adoption of a common list is that people can't agree on a particular name or a particular classification.
Since you can model a species concept as having many names and many classifications why not do so?
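One way to picture that model is a single stable concept identifier that carries any number of names and classifications. The Python sketch below is illustrative only: the class `SpeciesConcept`, its fields and the identifier `example:concept-1` are hypothetical and are not taken from TaxonConcept.org. The two names are the mosquito example used elsewhere in this thread.

```python
from dataclasses import dataclass, field


@dataclass
class SpeciesConcept:
    """One stable identifier with many names and many classifications."""
    concept_id: str
    names: list = field(default_factory=list)
    classifications: list = field(default_factory=list)

    def known_as(self, name: str) -> bool:
        """True if the given name string is attached to this concept."""
        return name in self.names


concept = SpeciesConcept(
    concept_id="example:concept-1",  # placeholder identifier
    names=["Aedes triseriatus", "Ochlerotatus triseriatus"],
    classifications=[
        ["Culicidae", "Aedes"],
        ["Culicidae", "Ochlerotatus"],
    ],
)

# Data linked to concept_id stays valid whichever name is preferred.
print(concept.concept_id, concept.known_as("Ochlerotatus triseriatus"))
```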
If this idea was originally accepted, I would not have needed to create TaxonConcept.org.
My plan has always been to get something that works to solve some problems and then let some larger group take it over.
In a sense, I am more like the microbiologists in that I am not being paid to solve this or debate this problem.
I am doing it because I think something like this is needed, and it is an interesting and personally rewarding puzzle.
- Pete
My thoughts are that the most likely way this will be solved is by standard market type pressures - ie the best solution/IDs will be used the most and "float" to the top. It is easy to say that the global taxon name data is a mess, but if you think about it 30 years ago taxon name data were very disparate, duplicated, unconnected, many with NO IDs at all. So I believe we are making progress and that we will continue to do so albeit at a fairly slow rate.
Kevin
"I agree. This was one of the reasons that I set up TaxonConcept the way I did. It attempts to connect both the LOD entities and the foreign key based entities."
Please consider the environment before printing this email Warning: This electronic message together with any attachments is confidential. If you receive it in error: (i) you must not read, use, disclose, copy or retain it; (ii) please contact the sender immediately by reply email and then delete the emails. The views expressed in this email may not be those of Landcare Research New Zealand Limited. http://www.landcareresearch.co.nz
--
Pete DeVries Department of Entomology University of Wisconsin - Madison 445 Russell Laboratories 1630 Linden Drive Madison, WI 53706 Email: pdevries@wisc.edu TaxonConcept http://www.taxonconcept.org/ & GeoSpecies http://about.geospecies.org/ Knowledge Bases A Semantic Web, Linked Open Data http://linkeddata.org/ Project
-- Steven J. Baskauf, Ph.D., Senior Lecturer Vanderbilt University Dept. of Biological Sciences
postal mail address: VU Station B 351634 Nashville, TN 37235-1634, U.S.A.
delivery address: 2125 Stevenson Center 1161 21st Ave., S. Nashville, TN 37235
office: 2128 Stevenson Center phone: (615) 343-4582, fax: (615) 343-6707 http://bioimages.vanderbilt.edu
tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content
David Remsen, Senior Programme Officer Electronic Catalog of Names of Known Organisms Global Biodiversity Information Facility Secretariat Universitetsparken 15, DK-2100 Copenhagen, Denmark Tel: +45-35321472 Fax: +45-35321480 Skype: dremsen
Hi Tony,
I see I need to clarify a few things. The vision of GNUB is most decidedly *not* to "index all usages of all names in all sources". Rather, it is to develop an infrastructure that assigns permanent, resolvable GUIDs to name-usage instances, with appropriate amounts of related metadata, that does not explicitly *exclude* any class of usage instance by its structure. The infrastructure and the content are two different things. The content population will most likely begin with usage instances of relevance to the Codes; particularly those that represent the establishment of new names and other Code-relevant nomenclatural acts. As services are built to manage taxon concepts, then I suppose the next priority will be major taxonomic revisions. Other content will likely flow in from cyber-aware journals as their articles are published and marked up with XML. Different communities might use the infrastructure for their own purposes, e.g. to build a checklist of taxa from a particular geographic region, or to index field guides to reference when conducting ecological surveys. The point is, the content will populate as priorities drive content into it.
So, just to be clear on what the vision of GNUB (and GNA more generally) really is: to develop a common *architecture* to enable cross-linking of data via taxon names; not to index all usages of all names in all sources.
Yes, what you describe as your need fits very well into my understanding of what the roles of ITIS and CoL are. I don't know that I would characterize it as this service you need emerging from GN* space. Rather, I believe that GN* infrastructure can both facilitate the growth in content of initiatives like ITIS/CoL, and allow that content to be leveraged for a broader array of purposes. GN* is not about replacing the function of existing initiatives. It's about allowing existing initiatives to work more effectively with each other, and with end-users, to better leverage the real value of the content they manage.
Aloha, Rich
-----Original Message----- From: Tony.Rees@csiro.au [mailto:Tony.Rees@csiro.au] Sent: Friday, June 03, 2011 1:05 PM To: Richard Pyle; tdwg-content@lists.tdwg.org Subject: Producing a global taxon register (was: ITIS TSNID to uBio NamebankIDs mapping)
From: tdwg-content-bounces@lists.tdwg.org [tdwg-content- bounces@lists.tdwg.org] On Behalf Of Richard Pyle [deepreef@bishopmuseum.org] Sent: Saturday, 4 June 2011 8:48 AM To: tdwg-content@lists.tdwg.org Subject: Re: [tdwg-content] ITIS TSNID to uBio NamebankIDs mapping
Working backwards through this thread...
I hadn't read Dima's post until just now, and I see that at least a couple
of his
points (i.e., #2, #5, #6) apply to exposing the UUIDs externally. However,
I
think that a simple protocol (such as replacing spaces with "_", and
avoiding
characters that look the same but are different -- such as the Cyrillic
'a') could
go a long way to mitigating those problems.
On the other hand, it really depends on what the identifier is for. The string "Danaus_plexippus_(Linnaeus_1758)" may be more friendly to our eyes, but "A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523" is definitely more friendly to a computer (Dima's points 1, 3 & 4, among others). My feeling is that the push for GUIDs is more about enabling computer-computer conversations, than it is about enabling human-human or human-computer interactions; and therefore we should not get bogged down in the "ugliness" of the identifiers. In the context of electronic data services, the "ugliness" potential of the "Danaus_plexippus_(Linnaeus_1758)" approach to identifiers is far greater than the ugliness potential of "A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523", when it comes to interlinking electronic biodiversity data. It is nothing for a computer to render relevant metadata of the object identified by "A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523" into "Danaus plexippus (Linnaeus_1758)" on a computer screen or piece of paper for human-eyeball consumption. But there are many pitfalls (some noted by Dima) for a computer to unambiguously resolve "Danaus_plexippus_(Linnaeus_1758)" back to a meaningful data object.
I guess my revised point is: GNI (and uBio/NameBank) are essentially the only taxonomic databases out there where a human-friendly persistent/actionable identifier of the sort being discussed is even plausible as an option. It may not even be wise in this context (as per Dima's points), but it *might* be, depending on the need for a human-friendly identifier.
Maybe the simplest thing to do would be to not regard "http://gni.globalnames.org/name_strings/Danaus_plexippus_(Linnaeus_1758)" as an identifier per se, but rather as a protocol for a web service. In other words, if you append a text string to the root URL "http://gni.globalnames.org/name_strings/", GNI would run that text string against its index and return whatever metadata based on a text-string match. This is not mutually exclusive with an "identifier" in the form of "http://gni.globalnames.org/name_strings/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523", that would less ambiguously resolve a known record in GNI. At this point, the line between "identifier" and "service" gets fuzzy, of course. But the analogy is true in ZooBank:
The persistent "Identifier" looks like this: A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
One way that this identifier can be represented as an *actionable* identifier is this: urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
Another "actionable" form of the identifier might be this: http://zoobank.org/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
or this: http://zoobank.org/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
or even this(?): http://lsid.tdwg.org/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
(all of which work, by the way)
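The forms listed above are all derived mechanically from the same bare UUID. As an illustrative sketch (the function name `actionable_forms` and the dictionary keys are hypothetical; the URL patterns are copied from the message and are not checked against any live resolver):

```python
import uuid


def actionable_forms(identifier):
    """Derive the LSID and HTTP forms quoted in the message from a bare UUID."""
    # uuid.UUID validates the string; ZooBank writes the UUID in upper case.
    canonical = str(uuid.UUID(identifier)).upper()
    lsid = f"urn:lsid:zoobank.org:act:{canonical}"
    return {
        "uuid": canonical,
        "lsid": lsid,
        "zoobank_lsid_url": f"http://zoobank.org/{lsid}",
        "zoobank_uuid_url": f"http://zoobank.org/{canonical}",
        "tdwg_proxy_url": f"http://lsid.tdwg.org/{lsid}",
    }


if __name__ == "__main__":
    for label, value in actionable_forms("A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523").items():
        print(f"{label}: {value}")
```

Because every form can be regenerated from the UUID alone, only the UUID needs to be stored; the actionable wrappers can change with the resolver.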
However, the following are examples of what I would think of as *services*:
http://www.google.com/search?q=Danaus+plexippus+(Linnaeus+1758)
http://lsid.tdwg.org/summary/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
http://darwin.zoology.gla.ac.uk/~rpage/lsid/tester/?q=urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523&submit=Go
But really, from the perspective of the end-user, does it matter if it's an identifier or a service? Ultimately, they ask the questions, and the answers appear on their computer screens.
Aloha, Rich
-----Original Message----- From: tdwg-content-bounces@lists.tdwg.org [mailto:tdwg-content-bounces@lists.tdwg.org] On Behalf Of Dmitry Mozzherin Sent: Friday, June 03, 2011 4:34 AM To: David Remsen (GBIF) Cc: tdwg-content@lists.tdwg.org; Dmitry Mozzherin; Orrell, Thomas; Alan J Hampson; Nicolson, David; Gerald Guala Subject: Re: [tdwg-content] ITIS TSNID to uBio NamebankIDs mapping
In my opinion UUIDs have a few advantages over strings --
1. It is a UUID, so it will work with UUID tools (current and future ones).
2. It is less ambiguous. For example, what is the difference between Betulа and Betula to your eyes? (One of them has a Cyrillic 'а'.)
3. Database-wise it is faster to search, because it is just a 128-bit number, while a name is at least a 245-byte varchar; in relational databases key size directly affects search speed.
4. UUID v. 5 (http://en.wikipedia.org/wiki/Universally_unique_identifier) allows a UUID to be generated algorithmically without looking up a database (no need for a network connection).
5. Links like http://gni.globalnames.org/name_strings/Danaus_plexippus_(Linnaeus_1758) might be ambiguous -- I can think of several ways to represent the name string part in the URL, and they will all resolve to the same thing in GNI.
6. Unescaped Unicode characters in URLs containing literal name strings (people will forget to escape them) will depend on the implementation of the URL resolver.
That said, links like
http://gni.globalnames.org/name_strings/Danaus_plexippus_(Linnaeus_1758)
are definitely attractive, and it is good to have them as another way to access a name! My personal preference would be not to use them as the main identifier, for reasons 1, 2, 3 and 5.
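Points 5 and 6 can be made concrete with the standard library. The sketch below assumes a hypothetical convention in which "_" and "+" both stand for a space; `canonical_key` is an illustrative helper, not part of GNI. It shows three URL tails that would all denote the same name string, and how a Cyrillic 'а' changes the escaped form even though it looks identical on screen.

```python
from urllib.parse import quote, unquote


def canonical_key(url_tail):
    """Map a URL tail back to a name string (assumed convention).

    Percent-escapes are decoded, then '_' and '+' are treated as spaces
    and whitespace runs are collapsed.
    """
    decoded = unquote(url_tail).replace("_", " ").replace("+", " ")
    return " ".join(decoded.split())


# Three spellings a client might plausibly send for the same name string.
variants = [
    "Danaus_plexippus_(Linnaeus_1758)",
    "Danaus%20plexippus%20(Linnaeus%201758)",
    "Danaus+plexippus+%28Linnaeus+1758%29",
]

if __name__ == "__main__":
    print({canonical_key(v) for v in variants})
    # Latin 'a' needs no escaping; Cyrillic 'а' (U+0430) becomes two UTF-8 bytes.
    print(quote("Betula"), quote("Betul\u0430"))
```

A resolver that accepts name strings in URLs has to pick one such convention and apply it consistently, which is exactly the implementation dependence point 6 warns about.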
Dima
On Fri, Jun 3, 2011 at 7:59 AM, David Remsen (GBIF) dremsen@gbif.org wrote:
Why not use the name as the basis for the resolvable identifier instead of a UUID? Isn't there a 1:1 cardinality between the name and the UUID in the GNI? Doesn't that mean that
http://gni.globalnames.org/name_strings/4ef223c4-0c3e-5e84-ace9-755c34c601ec
and
http://gni.globalnames.org/name_strings/Danaus_plexippus_(Linnaeus_1758)
are equally unique? The latter is certainly more readable. In those cases where the namestring is a homonym like
http://gni.globalnames.org/name_strings/Oenanthe
couldn't you just return the addresses of the two globally unique forms of the name when you resolve it?
http://gni.globalnames.org/name_strings/Oenanthe_Smith_1899
http://gni.globalnames.org/name_strings/Oenanthe_Jones_1900
Wouldn't those be as globally unique and easier to read and adjust to? Or am I missing something? I always wanted to do that with uBio IDs after a back and forth with Gregor Hagedorn, and wished we hadn't exposed those integers.
DR
Hi Steve,
I don't have time to go through this in detail, and I can't speak for the GNI, but I can tell you about how the GNI URIs work at least for now.
A while back Dima Mozzherin and I were looking into how triples etc. might be of use to the GNI.
We needed a way to generate unique URIs for each name.
We wanted to avoid having to keep these in sync and not require everyone to look each ID up through some service.
Dima came up with the following plan. We use the namestring as seed to generate a unique UUID.
Basically this is a shared algorithm which the GNI and TaxonConcept both use. But it could be used by anyone.
You feed the name string to the algorithm and it spits out a UUID. We then append that to a URI and web service so it is resolvable.
So the name Danaus plexippus (Linnaeus 1758) => 4ef223c4-0c3e-5e84-ace9-755c34c601ec
So if the GNI and another group have the same namestring they have the same UUID.
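The shared algorithm is written in Ruby according to the message; a rough Python equivalent, assuming it is a version 5 (SHA-1, name-based) UUID as Dima's point 4 suggests, might look like this. The namespace below, derived from the DNS name "globalnames.org", is an assumption and is not stated in this thread, so the output should not be expected to reproduce the UUID quoted above unless both the namespace and any name-string normalisation match what GNI actually uses.

```python
import uuid

# ASSUMPTION: a namespace derived from the DNS name "globalnames.org".
# The thread does not say which namespace the GNI algorithm uses.
GN_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "globalnames.org")

ROOT = "http://gni.globalnames.org/name_strings/"


def name_uuid(name_string):
    """Deterministically derive a version 5 UUID from a name string."""
    return uuid.uuid5(GN_NAMESPACE, name_string)


def name_uri(name_string):
    """Append the derived UUID to the root URL used in the message."""
    return ROOT + str(name_uuid(name_string))


if __name__ == "__main__":
    print(name_uri("Danaus plexippus (Linnaeus 1758)"))
```

Two parties that run the same function on the same string get the same UUID without any lookup or synchronisation, which is the property described above. The flip side is that any difference in the string (an extra space, a Cyrillic look-alike) yields an unrelated UUID.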
People can then link their data set to the GNI with the following URI
http://gni.globalnames.org/name_strings/4ef223c4-0c3e-5e84-ace9-755c34c601ec
RDF http://gni.globalnames.org/name_strings/4ef223c4-0c3e-5e84-ace9-755c34c601ec.rdf
If you think of your data set as one table and the GNI as another, this URI serves as the foreign key that connects them together.
Some on the list don't like how these look, but there is a tremendous advantage in not having to worry about syncing two large data sets and determining if a given integer is already in use.
Also, Rod Page has written recently about UUIDs. http://iphylo.blogspot.com/2011/05/zoobank-on-couchdb-uuids-replication.html
There may be a way to do something similar with bit.ly-like identifiers that are shorter (mCcSp), but I think the general idea is a good one.
If you recall from my talk at TDWG, I was able to use these to make statements that one namestring was a synonym etc. of another etc.
The algorithm we use is written in Ruby but it could be ported to many different languages since UUIDs are widely supported.
Respectfully,
- Pete
On Thu, Jun 2, 2011 at 11:41 PM, Steven J. Baskauf < steve.baskauf@vanderbilt.edu> wrote:
My email access has been sporadic since this thread developed, so at this point I'll respond to points made in several of the messages.
First, I should note that there has been previous discussion on this list on a similar topic from http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002231.html through http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002231.html.
One can review what was said at that time rather quickly by starting on the first linked message and clicking on the "Next Message" link until you get to the end of the range I gave above.
My reason for the request for information that started this thread was that I wanted to link to a URI that would anchor the name portion of a name/sensu pair (TNU or Taxon Concept a la TCS if you prefer) as in this RDF snippet:
<tc:nameString>Quercus rubra L.</tc:nameString>
<tc:hasName rdf:about="http://www.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank:448439"/>
At this point in the discussion, I'm not actually talking about creating a link to a taxon concept but rather to a taxon name, so some of the issues Pete raised don't apply here (e.g. what's the "right" name for a concept; the question here is simply what's a stable identifier for the name).
In principle, I could probably just provide the name string and be done with it. However, having some degree of faith that Smart, Computer Savvy People might some day be able to use the metadata returned by the URI (or perhaps metadata which they already have in a triple store onsite) to do cool things like knowing that my name is the same as an orthographic variant or that "Quercus rubra L." is basically the same thing as "Quercus rubra", I would like to also provide a functional URI.
As an end-user who isn't very interested in the technical issues involving names, I don't really care what URI I use. I would prefer for it to be widely recognized and for it to "work" (i.e. be resolvable). In the earlier (January) thread, there was discussion about existing identifiers. There were a number of posts, but in particular http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002258.html and http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002259.html discussed the relative merits of ITIS and uBio ID numbers. My take-home message from this was that uBio represented the largest single set of names with assigned identifiers (see http://gni.globalnames.org/data_sources cited in Pete's email) and that uBio metadata provides useful references. Hence my interest in referencing uBio IDs as a URI. However, as a practical matter, the organizations that I share images with either want ITIS TSNs (EOL and Morphbank) or just names (Discover Life). Nobody is asking for uBio identifiers or any other identifier.
I found Kevin's comment at http://lists.tdwg.org/pipermail/tdwg-content/2011-May/002486.html very thought-provoking: "My thoughts are that the most likely way this will be solved is by standard market type pressures - ie the best solution/IDs will be used the most and 'float' to the top." I'm not going to make a judgment about what is the "best" solution or ID. But I would say that in "computer" history, being the "best" doesn't necessarily mean that something will be used. Take for example, the FOAF vocabulary. What the heck is Friend of a Friend? I would venture to say that most of the people using the FOAF vocabulary don't know or care. The FOAF vocabulary was the one that people started to use and once that happened, people didn't switch even if there was something better. I'm not familiar with the history of other stuff like YouTube and Craig's List, but I would guess that they weren't necessarily "the best" systems - they were just the ones that the most people started using first and once that happened, people didn't switch. I'm using ITIS IDs because they are easy to get and the people I communicate with want them. Whether they are the "best" or "done correctly" doesn't matter to me as much as the fact that they are widely recognized and stable (and that thus far every name that I've looked for has been in their database).
I think that one reason why this question has been on my mind is that I've been waiting for GNUB (Global Name Use Bank) to come out. I'm not really up on how it is going to work, but my impression is that it was going to be based on the Global Name Index (GNI) which was mentioned in that earlier January thread. At that point, the GNI names didn't have any identifiers that were exposed to the public as permanent GUIDs. I'm assuming that if GNUB refers to GNI names, they will have some kind of identifiers. So if that happens how is the GUID recommendation 8 going to be followed? As Kevin said in http://lists.tdwg.org/pipermail/tdwg-content/2011-June/002499.html "What I take from recommendation 8 of the GUID applicability guide ... is that if you DON'T already have a record in your own database for a taxon name/concept, then reuse an existing one. " What we have here with GNI is a situation where none of the records have identifiers. In my mind, the "best practice" according to recommendation 8 would be for the GNI to reuse existing identifiers where they exist and NOT make up new ones. This is a bit more complicated because the ITIS identifiers (which are in common use) don't have an http URI version that is resolvable, and while the uBio identifiers have a resolvable http URI, it's in the form of a proxied LSID, which I've already complained is very ugly. So I'd like to hear some ideas about how to have "reused" identifiers in the GNI.
One thing that comes to my mind would be to have a "domain name" like "http://purl.org/gni/" or "http://purl.org/tn/" ("tn" for "taxon name") and to follow it with a namespace/id combination similar to what is done with LSIDs. So for example "itis/19408" and "ubio/448439" could be appended, creating http://purl.org/gni/itis/19408 and http://purl.org/gni/ubio/448439 for "Quercus rubra L." Both URIs could point to the same RDF and that RDF could indicate that the two identifiers are owl:sameAs. I realize from what Bob Morris has cautioned in the past that there are problems with owl:sameAs when the two things aren't actually the same thing (e.g. if the uBio ID refers to a name string only but the ITIS TSN refers to the name plus an "accepted" status and a relationship to parent taxa). However, if there were an understanding that the GNI only refers to name strings, then one could still refer to http://purl.org/gni/itis/19408 as an identifier for the name string of the thing (whatever it is) that is referred to by an ITIS TSN of 19408. I don't think there would be a problem saying that it and the uBio ID were "owl:sameAs". Some kind of solution like this would allow people to easily generate a resolvable URI for a name if they were using ITIS TSNs or uBio IDs. If the name that one wanted to use was so obscure that it was one of the 9.5 million names that uBio has that ITIS doesn't have, then that name would only have the uBio version. I have no idea whether this would be a good idea or not, but I was really cringing to think about 19 million newly minted UUIDs appended to "http://gni.globalnames.org/" and figuring out how to connect those horrid things to the names and ITIS TSNs that I'm already using. I think that I said this before, but using the purl.org domain rather than one like http://gni.globalnames.org/ would in the future allow somebody else to take over management of providing the metadata when the GUIDs are resolved without having to deal with issues of who "owns" the domain name.
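A small sketch of how such URIs could be generated from identifiers people already hold. The purl.org/gni root is the one proposed in this message and is not known to be a live service; the helper names `purl` and `same_as_triple` are hypothetical, and the output is a single N-Triples statement.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Root proposed in the message; treated here as a proposal, not a real resolver.
PURL_ROOT = "http://purl.org/gni/"


def purl(namespace, local_id):
    """Build a URI of the form <root><namespace>/<id>, e.g. .../itis/19408."""
    return f"{PURL_ROOT}{namespace}/{local_id}"


def same_as_triple(uri_a, uri_b):
    """Serialise an owl:sameAs statement between two URIs as one N-Triples line."""
    return f"<{uri_a}> <{OWL_SAME_AS}> <{uri_b}> ."


if __name__ == "__main__":
    itis = purl("itis", 19408)    # ITIS TSN quoted for "Quercus rubra L."
    ubio = purl("ubio", 448439)   # uBio NameBank ID quoted for the same name
    print(same_as_triple(itis, ubio))
```

Whether owl:sameAs is the right predicate depends on the caveat raised above: it only holds if both URIs are understood to denote the name string and nothing more.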
Steve
Kevin Richards wrote:
Pete,
I'm not trying to say what you are doing is a waste of time/impossible.
I actually think RDF + semantics are a good way forward, but this really implies that we need to rely on the semantics and linkages rather than having a SINGLE ID for a taxon name. (which is what I thought Steve was getting at). Each instance of a taxon name can have its own ID and then all these instances are connected via ontology defined semantic links. This seems more appropriate to me than insisting everyone uses the "Global Taxon Name ID X".
In your example of *Aedes triseriatus* and *Ochlerotatus triseriatus* - these are two different names so they need two different IDs, they may be linked by a single taxon concept, but they are separate names. So which of these now 3 IDs do you expect people to use, and according to what source??
For example if we have a name, eg the Robin, Erithacus rubecula, mentioned in ITIS (TSN: 559964) and also in EOL (www.eol.org/pages/1051567), also in GBIF (http://data.gbif.org/species/21266780), also in Avibase (http://avibase.bsc-eoc.org/species.jsp?avibaseid=C809B2B90399A43D), which ID are you hoping people will use?? Would you put the ITIS ID in your own dataset as the ID for that name - unlikely. Or would it be better to link them up with semantic linkages?
What I take from recommendation 8 of the GUID applicability guide (as Steve puts it, "stop making up new identifiers when somebody else already has one for the thing you are talking about") is that if you DON'T already have a record in your own database for a taxon name/concept, then reuse an existing one. NOT ditch all your current IDs and adopt someone else's (especially hard considering it is so hard to work out which of the multitude of name and concept IDs directly relates to your taxon name).
I am all for limiting the number of IDs for the "same" thing, but in some cases it is more useful to build linkages than force this tight integration of data and IDs. Especially for taxon names and concepts, where it is complex to define if you are even talking about the "same" thing or not.
Kevin
*From:* Peter DeVries [mailto:pete.devries@gmail.com]
*Sent:* Wednesday, 1 June 2011 12:38 p.m. *To:* Kevin Richards *Cc:* Steve Baskauf; tdwg-content@lists.tdwg.org; Gerald Guala; Nicolson, David; Alan J Hampson; Orrell, Thomas *Subject:* Re: [tdwg-content] ITIS TSNID to uBio NamebankIDs mapping
Hi Kevin,
I forgot to mention some other things that are different about my project.
You can write a simple SPARQL query to get a list of all the TaxonConcepts that have ITIS IDs, or all those that have ITIS and NCBI IDs etc.
You can do this on any SPARQL endpoint that hosts the data.
You can download the entire data set and run the queries on your own endpoint.
You can write a script that runs the query and downloads the ITIS numbers and exports them to CSV etc.
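The thread does not give the TaxonConcept vocabulary, so the sketch below avoids guessing at real predicates or endpoints. It mimics, in plain Python over an in-memory list of triples, what such a query and export script would do: select the concepts that carry every requested identifier and write them out as CSV. The predicate names `ex:hasItisTsn` and `ex:hasNcbiTaxId`, the concept URIs, and the NCBI values are all made-up placeholders; only the two ITIS TSNs are taken from this thread.

```python
import csv
import io

# Hypothetical predicates; the real TaxonConcept vocabulary is not given here.
HAS_ITIS = "ex:hasItisTsn"
HAS_NCBI = "ex:hasNcbiTaxId"

# Placeholder data. The TSNs 19408 and 559964 appear in this thread;
# the concept URIs and NCBI values are invented for illustration.
triples = [
    ("ex:concept/1", HAS_ITIS, "19408"),
    ("ex:concept/1", HAS_NCBI, "ncbi-placeholder-1"),
    ("ex:concept/2", HAS_ITIS, "559964"),
    ("ex:concept/3", HAS_NCBI, "ncbi-placeholder-3"),
]


def concepts_with(triples, *predicates):
    """Return {subject: {predicate: value}} for subjects having every predicate."""
    by_subject = {}
    for subject, predicate, value in triples:
        by_subject.setdefault(subject, {})[predicate] = value
    return {
        subject: props
        for subject, props in by_subject.items()
        if all(p in props for p in predicates)
    }


def to_csv(rows, predicates):
    """Write the selected concepts as CSV text, one row per concept."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["concept", *predicates])
    for subject in sorted(rows):
        writer.writerow([subject, *(rows[subject][p] for p in predicates)])
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_csv(concepts_with(triples, HAS_ITIS), [HAS_ITIS]))
```

Against a real SPARQL endpoint the selection step would be a query with one triple pattern per required identifier; the CSV step would stay much the same.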
- Pete
On Tue, May 31, 2011 at 5:16 PM, Peter DeVries wrote:
Hi Kevin,
On Tue, May 31, 2011 at 3:27 PM, Kevin Richards < RichardsK@landcareresearch.co.nz> wrote:
This is exactly why this problem still exists and will be very complex to solve - everyone says "we should have a single ID for a specific taxon name, there seem to be several IDs 'out there' that refer to the same taxon name, so I'm going to create another ID to link them all up" - yet another ID that no one will particularly want to follow - you would have to get everyone to agree that your combination/integration of taxon names is the best one and hope everyone follows it - unlikely in this domain.
Isn't this kind of what The Plant List and eBird already do?
A difference being that they tie these to a specific name and specific classification.
The Plant List is not really even open, so it is difficult for people to adopt it en masse.
For instance, if I manage a herbarium, how do I easily reconcile my species list with the entities represented in the Plant List?
eBird has millions of records which implies that they have been able to convince the observers in the field to adopt their system. You are correct in that there are probably a lot of taxonomists that don't like their list.
It differs from many of the other classifications, but remember the system rewards them for not agreeing. Note the difference between the microbial taxonomists and other taxonomists. In the case of the microbial workers, the system rewards them for solving problems not debating alternatives. Also, if a good idea comes out that will make it easier for the microbiologists to solve the problems they are rewarded for solving, they are less likely to care whose idea it is.
Like the microbiologists, there are lots of biologists that work with species with the goal of addressing some non-taxonomic problem. They don't really care if the name is *Aedes triseriatus* or *Ochlerotatus triseriatus*, but they do care that the identifier that they connect their data to is stable.
In regards to the issue of market forces, I suspect (but have no knowledge) that there were probably decisions made in devising these lists that have more to do with appeasing certain personalities than with creating the best list. With the way this system rewards people it is likely that the "correct" version will float to the top only after that person has passed away. I don't have much faith that the best system will always float to the top. That has a lot to do with the personalities and how the system rewards are set up. Theoretically, it is possible for one strong personality or group to force others to adopt their less than optimal solution - at least this seems to happen in other environments.
Also, there are all sorts of ways that people can use the publication record to rewrite history. Simply cite the review paper that cites the original paper. Or don't cite it at all.
I would have used only the ITIS TSN but if the name changes the ID changes. This isn't "wrong", it just does not solve my problem.
- ITIS also should add the spiders from the World Spider Catalog.
Another issue that I think has inhibited adoption of a common list is that people can't agree on a particular name or a particular classification.
Since you can model a species concept as having many names and many classifications why not do so?
If this idea was originally accepted, I would not have needed to create TaxonConcept.org.
My plan has always been to get something that works to solve some problems and then let some larger group take it over.
In a sense, I am more like the microbiologists in that I am not being paid to solve this or debate this problem.
I am doing it because I think something like this is needed, and it is an interesting and personally rewarding puzzle.
- Pete
My thoughts are that the most likely way this will be solved is by standard market type pressures - ie the best solution/IDs will be used the most and "float" to the top. It is easy to say that the global taxon name data is a mess, but if you think about it 30 years ago taxon name data were very disparate, duplicated, unconnected, many with NO IDs at all. So I believe we are making progress and that we will continue to do so albeit at a fairly slow rate.
Kevin
"I agree. This was one of the reasons that I setup TaxonConcept the way I did. It attempts to connect both the LOD entities and the foreign key based entities."
Please consider the environment before printing this email Warning: This electronic message together with any attachments is confidential. If you receive it in error: (i) you must not read, use, disclose, copy or retain it; (ii) please contact the sender immediately by reply email and then delete the emails. The views expressed in this email may not be those of Landcare Research New Zealand Limited. http://www.landcareresearch.co.nz
--
Pete DeVries Department of Entomology University of Wisconsin - Madison 445 Russell Laboratories 1630 Linden Drive Madison, WI 53706 Email: pdevries@wisc.edu TaxonConcept http://www.taxonconcept.org/ & GeoSpecies http://about.geospecies.org/ Knowledge Bases A Semantic Web, Linked Open Data http://linkeddata.org/ Project
-- Steven J. Baskauf, Ph.D., Senior Lecturer Vanderbilt University Dept. of Biological Sciences
postal mail address: VU Station B 351634 Nashville, TN 37235-1634, U.S.A.
delivery address: 2125 Stevenson Center 1161 21st Ave., S. Nashville, TN 37235
office: 2128 Stevenson Center phone: (615) 343-4582, fax: (615) 343-6707 http://bioimages.vanderbilt.edu
tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content
-------- David Remsen, Senior Programme Officer Electronic Catalog of Names of Known Organisms Global Biodiversity Information Facility Secretariat Universitetsparken 15, DK-2100 Copenhagen, Denmark Tel: +45-35321472 Fax: +45-35321480 Skype: dremsen
I agree that it is important to have clarity of what the goal of a project is
* a HCAL - a hierarchical catalogue of all life - is a very popular type of project; Catalogue of Life, ITIS, NCBI, Wikispecies, etc all pursue this.
* a GTR - global taxon register - is something else entirely, at least if the term is taken literally. It would be indispensable if the purpose "to index all usages of all names in all sources" is to be realized. I don't know of any project that pursues this in a systematic way (I suppose the French Wikipedia rates a mention, at least making some attempt).
and of course there are projects that focus on names, but at the moment we still don't have something like a complete nomenclatural index (inventorying all nomenclatural acts), and are just moving towards lists of currently accepted names (closely connected to the HCAL). For information on biodiversity the latter is only marginally relevant, and the GNI is much less so.
Names and taxa are quite different things and they are interconnected in a complex way.
Paul
-----Original message----- From: tdwg-content-bounces@lists.tdwg.org on behalf of Tony.Rees@csiro.au Sent: Sat 4-6-2011 1:04 To: deepreef@bishopmuseum.org; tdwg-content@lists.tdwg.org Subject: [tdwg-content] Producing a global taxon register (was: ITIS TSNID to uBio NamebankIDs mapping)
Hi all (jumping in with some trepidation...)
It's good to hear some ramp-up may be coming of activity in the GNUB space (congratulations, Rich et al.). My main concern, however is that it does not solve my particular problem - which is in a nutshell, given "any" cited taxonomic name, what can we tell about it - with regard to its classification, nomenclatural and taxonomic/synonym status, and certain attributes (initially for my use case, simple geologic time - is it extant or not - and simple habitat classification - is it marine or not - though of course infinitely expandable from there).
To me the vision of GNUB is too grand - to index all usages of all names in all sources - and the vision of GNI is too limited - to index the names but not actually record/harmonise/verify/manage (in a structured way) any associated information. I'm after something in between - what I have tentatively previously called HCAL - a hierarchical catalogue of all life (presuming that at least one "management" hierarchy is incorporated) - or maybe just a GTR - global taxon register. Sort of, waiting for the Catalogue of Life and/or ITIS to be complete, for both extant and fossil taxa, and also incorporate selected "taxon attributes" as above. (This is the space into which my IRMNG database is cast as a preliminary/"working for now" solution, but obviously without the significant resourcing / community cooperation required to build and sustain the thing for the long term).
So my question is, how can such a product emerge from ongoing developments in GN* space, or other...
Over to the experts,
Best - Tony
________________________________________ From: tdwg-content-bounces@lists.tdwg.org [tdwg-content-bounces@lists.tdwg.org] On Behalf Of Richard Pyle [deepreef@bishopmuseum.org] Sent: Saturday, 4 June 2011 8:48 AM To: tdwg-content@lists.tdwg.org Subject: Re: [tdwg-content] ITIS TSNID to uBio NamebankIDs mapping
Working backwards through this thread...
I hadn't read Dima's post until just now, and I see that at least a couple of his points (i.e., #2, #5, #6) apply to exposing the UUIDs externally. However, I think that a simple protocol (such as replacing spaces with "_", and avoiding characters that look the same but are different -- such as the Cyrillic 'a') could go a long way to mitigating those problems.
On the other hand, it really depends on what the identifier is for. The string "Danaus_plexippus_(Linnaeus_1758)" may be more friendly to our eyes, but "A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523" is definitely more friendly to a computer (Dima's points 1, 3 & 4, among others). My feeling is that the push for GUIDs is more about enabling computer-computer conversations, than it is about enabling human-human or human-computer interactions; and therefore we should not get bogged down in the "ugliness" of the identifiers. In the context of electronic data services, the "ugliness" potential of the "Danaus_plexippus_(Linnaeus_1758)" approach to identifiers is far greater than the ugliness potential of "A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523", when it comes to interlinking electronic biodiversity data. It is nothing for a computer to render relevant metadata of the object identified by "A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523" into "Danaus plexippus (Linnaeus_1758)" on a computer screen or piece of paper for human-eyeball consumption. But there are many pitfalls (some noted by Dima) for a computer to unambiguously resolve "Danaus_plexippus_(Linnaeus_1758)" back to a meaningful data object.
I guess my revised point is: GNI (and uBio/NameBank) are essentially the only taxonomic databases out there where a human-friendly persistent/actionable identifier of the sort being discussed is even plausible as an option. It may not even be wise in this context (as per Dima's points), but it *might* be, depending on the need for a human-friendly identifier.
Maybe the simplest thing to do would be to not regard "http://gni.globalnames.org/name_strings/Danaus_plexippus_(Linnaeus_1758)" as an identifier per se, but rather as a protocol for a web service. In other words, if you append a text string to the root URL "http://gni.globalnames.org/name_strings/", GNI would run that text string against its index and return whatever metadata based on a text-string match. This is not mutually exclusive with an "identifier" in the form of "http://gni.globalnames.org/name_strings/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523", that would less ambiguously resolve a known record in GNI. At this point, the line between "identifier" and "service" gets fuzzy, of course. But the analogy is true in ZooBank:
The persistent "Identifier" looks like this: A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
One way that this identifier can be represented as an *actionable* identifier is this: urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
Another "actionable" form of the identifier might be this: http://zoobank.org/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5BF4...
or this: http://zoobank.org/A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523
or even this(?): http://lsid.tdwg.org/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB4-EA8E5B...
(all of which work, by the way)
However, the following are examples of what I would think of as *services*: http://www.google.com/search?q=Danaus+plexippus+(Linnaeus+1758) http://lsid.tdwg.org/summary/urn:lsid:zoobank.org:act:A9F435E0-8ED7-46DD-BAB... http://darwin.zoology.gla.ac.uk/~rpage/lsid/tester/?q=urn:lsid:zoobank.org:a...
But really, from the perspective of the end-user, does it matter if it's an identifier or a service? Ultimately, they ask the questions, and the answers appear on their computer screens.
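The identifier-versus-service distinction above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `classify_request` and the rule "treat underscores as spaces" are hypothetical, and nothing here describes how GNI itself routes requests.

```python
import uuid

# Root URL quoted in the message above.
ROOT = "http://gni.globalnames.org/name_strings/"


def classify_request(url):
    """Return ("identifier", uuid_text) or ("search", text) for a URL under ROOT.

    Hypothetical routing rule: if the trailing segment parses as a UUID it is
    treated as an identifier; anything else is treated as a text-string query.
    """
    if not url.startswith(ROOT):
        raise ValueError("not a name_strings URL")
    segment = url[len(ROOT):]
    try:
        # uuid.UUID raises ValueError for anything that is not a UUID.
        return ("identifier", str(uuid.UUID(segment)))
    except ValueError:
        # Assumption: underscores stand in for spaces in a literal name.
        return ("search", segment.replace("_", " "))


by_id = classify_request(ROOT + "A9F435E0-8ED7-46DD-BAB4-EA8E5BF41523")
by_name = classify_request(ROOT + "Danaus_plexippus_(Linnaeus_1758)")
```

Both forms can live under the same root URL, which is why the line between "identifier" and "service" gets fuzzy.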
Aloha, Rich
-----Original Message----- From: tdwg-content-bounces@lists.tdwg.org [mailto:tdwg-content- bounces@lists.tdwg.org] On Behalf Of Dmitry Mozzherin Sent: Friday, June 03, 2011 4:34 AM To: David Remsen (GBIF) Cc: tdwg-content@lists.tdwg.org; Dmitry Mozzherin; Orrell, Thomas; Alan J Hampson; Nicolson, David; Gerald Guala Subject: Re: [tdwg-content] ITIS TSNID to uBio NamebankIDs mapping
In my opinion UUIDs have a few advantages over strings --
- It is uuid, so it will work with uuid tools (current and future ones)
- It is less ambiguous -- For example -- what is the difference between Betul? and
Betula for your eyes? (one of them has a Cyrillic 'a') 3. Database wise it is faster to search because it is just a 128bit number, while a name is at least 245 byte varchar -- it makes searching much faster because in relational databases the size of keys directly proportional to the search speed 4. UUID v. 5 (http://en.wikipedia.org/wiki/Universally_unique_identifier) allows to generate UUID algorithmically without looking up a database (no need for network connection) 5. Links like http://gni.globalnames.org/name_strings/Danaus_plexippus_(Linnaeus_1758) might be ambigous -- I can think of several ways I can represent name string part in the url and they will all resolve to the same thing in GNI. 6. Unescaped unicode characters in url containing literal name strings (people will forget to escape them) will depend on an implementation of a url resolver
That said, links like http://gni.globalnames.org/name_strings/Danaus_plexippus_(Linnaeus_1758) are definitely attractive and it is good to have them as another way to access a name! My personal preference would be not to use them as the main identifier, because of reasons 1, 2, 3 and 5.
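Points 2, 5 and 6 can be made concrete with a short sketch. It uses only the Python standard library; the helper names `describe_non_ascii` and `url_spellings` are hypothetical, and the three URL spellings shown are just some of the variants a resolver might receive.

```python
import unicodedata
from urllib.parse import quote

latin = "Betula"
mixed = "Betul\u0430"  # last letter is U+0430, CYRILLIC SMALL LETTER A


def describe_non_ascii(text):
    """List every non-ASCII character in text with its Unicode name."""
    return [(ch, unicodedata.name(ch)) for ch in text if ord(ch) > 127]


def url_spellings(name):
    """Return several plausible URL path spellings of one name string."""
    underscored = name.replace(" ", "_")
    return [
        underscored,         # spaces as underscores, parentheses left raw
        quote(name),         # fully percent-encoded
        quote(underscored),  # underscores kept, parentheses percent-encoded
    ]


print(latin == mixed)             # the two strings look alike but differ
print(describe_non_ascii(mixed))
spellings = url_spellings("Danaus plexippus (Linnaeus 1758)")
```

A resolver keyed on literal name strings has to decide which of these spellings it accepts and how it treats look-alike characters; a UUID has one spelling.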
Dima
On Fri, Jun 3, 2011 at 7:59 AM, David Remsen (GBIF) dremsen@gbif.org wrote:
Why not use the name as the basis for the resolvable identifier instead of a uuid? Isn't there a 1:1 cardinality between the name and the uuid in the GNI? Doesn't that mean that
http://gni.globalnames.org/name_strings/4ef223c4-0c3e-5e84-ace9-755c34c601ec and
http://gni.globalnames.org/name_strings/Danaus_plexippus_(Linnaeus_175
are equally unique? The latter is certainly more readable. In those cases where the namestring is a homonym like
http://gni.globalnames.org/name_strings/Oenanthe
couldn't you just return the addresses of the two globally unique forms of the name when you resolve it?
http://gni.globalnames.org/name_strings/Oenanthe_Smith_1899
http://gni.globalnames.org/name_strings/Oenanthe_Jones_1900
Wouldn't those be as globally unique and easier to read and adjust to? Or am I missing something. I always wanted to do that with ubio IDs after a back and forth with Gregor Hagedorn and wished we hadn't exposed those integers.
DR
Hi Steve,
I don't have time to go through this in detail, and I can't speak for the GNI, but I can tell you about how the GNI URI's work at least for now.
A while back Dima Mozzherin and I were looking into how triples etc. might be of use to the GNI.
We needed a way to generate unique URI's for each name.
We wanted to avoid having to keep these in sync and not require everyone to look each ID up through some service.
Dima came up with the following plan. We use the namestring as seed to generate a unique UUID.
Basically this is a shared algorithm which the GNI and TaxonConcept both use. But it could be used by anyone.
You feed the name string to the algorithm and it spits out a UUID. We then append that to a URI and web service so it is resolvable.
So the name Danaus plexippus (Linnaeus 1758) => 4ef223c4-0c3e-5e84-ace9-755c34c601ec
So if the GNI and another group have the same namestring they have the same UUID.
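A minimal sketch of the idea, in Python rather than the Ruby mentioned later in this message. The namespace is an assumption: it is derived here from the DNS name "globalnames.org", which may not match the shared algorithm exactly, so this sketch is not claimed to reproduce the UUID quoted above.

```python
import uuid

# Assumption: a namespace UUID derived from a DNS name. The real shared
# algorithm may use a different namespace.
ASSUMED_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "globalnames.org")
ROOT = "http://gni.globalnames.org/name_strings/"


def name_string_uuid(name_string):
    """Derive a version 5 (SHA-1, name-based) UUID from a name string."""
    return uuid.uuid5(ASSUMED_NAMESPACE, name_string)


def name_string_uri(name_string):
    """Append the derived UUID to the root URL so that it is resolvable."""
    return ROOT + str(name_string_uuid(name_string))


first = name_string_uuid("Danaus plexippus (Linnaeus 1758)")
second = name_string_uuid("Danaus plexippus (Linnaeus 1758)")
other = name_string_uuid("Quercus rubra L.")
```

Because the derivation is deterministic, two groups that agree on the namespace and on the exact name string get the same UUID without any lookup or network connection.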
People can then link their data set to the GNI with the following URI
http://gni.globalnames.org/name_strings/4ef223c4-0c3e-5e84-ace9-755c34c601ec
RDF http://gni.globalnames.org/name_strings/4ef223c4-0c3e-5e84-ace9-755c34c601ec.rdf
If you think of your data set as one table and the GNI as another, this URI serves as the foreign key that connects them together.
Some on the list don't like how these look, but there is a tremendous advantage in not having to worry about syncing two large data sets and determining if a given integer is already in use.
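The foreign-key analogy can be sketched as a join between two in-memory tables. Everything here is illustrative: the records, the field names and the namespace (derived from the DNS name "globalnames.org") are assumptions, not GNI data or the actual shared algorithm.

```python
import uuid

# Assumed namespace; see the caveat in the lead-in.
ASSUMED_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "globalnames.org")


def name_key(name_string):
    """Compute the join key locally from the name string alone."""
    return str(uuid.uuid5(ASSUMED_NAMESPACE, name_string))


# Hypothetical local data set: one row per specimen.
my_records = [
    {"specimen": "S-001", "name": "Danaus plexippus (Linnaeus 1758)"},
    {"specimen": "S-002", "name": "Quercus rubra L."},
]

# Hypothetical remote index keyed by the same derived UUID.
remote_index = {
    name_key("Danaus plexippus (Linnaeus 1758)"): {"note": "hypothetical entry"},
}


def join_on_name_key(records, index):
    """Attach the matching index entry (or None) to every local record."""
    joined = []
    for record in records:
        key = name_key(record["name"])
        joined.append({**record, "key": key, "match": index.get(key)})
    return joined


joined = join_on_name_key(my_records, remote_index)
```

Neither side had to ask the other for an ID before the join, which is the syncing advantage described above.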
Also Rod Page has written recently about UUIDs. http://iphylo.blogspot.com/2011/05/zoobank-on-couchdb-uuids-replication.html
There may be a way to do something similar with bit.ly-like identifiers that are shorter (mCcSp), but I think the general idea is a good one.
If you recall from my talk at TDWG, I was able to use these to make statements that one namestring was a synonym etc. of another etc.
The algorithm we use is written in Ruby but it could be ported to many different languages since UUIDs are widely supported.
Respectfully,
- Pete
On Thu, Jun 2, 2011 at 11:41 PM, Steven J. Baskauf < steve.baskauf@vanderbilt.edu> wrote:
My email access has been sporadic since this thread developed, so at this point I'll respond to points made in several of the messages.
First, I should note that there has been previous discussion on this list on a similar topic from http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002231.html through http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002231.html.
One can review what was said at that time rather quickly by starting on the first linked message and clicking on the "Next Message" link until you get to the end of the range I gave above.
My reason for the request for information that started this thread was that I wanted to link to a URI that would anchor the name portion of a name/sensu pair (TNU or Taxon Concept a la TCS if you prefer) as in this RDF snippet:
<tc:nameString>Quercus rubra L.</tc:nameString>
<tc:hasName rdf:resource="http://www.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank:448439"/>
At this point in the discussion, I'm not actually talking about creating a link to a taxon concept but rather to a taxon name, so some of the issues Pete raised don't apply here (e.g. what's the "right" name for a concept; the question here is simply what's a stable identifier for the name). In principle, I could probably just provide the name string and be done with it. However, having some degree of faith that Smart, Computer Savvy People might some day be able to use the metadata returned by the URI (or perhaps metadata which they already have in a triple store onsite) to do cool things like knowing that my name is the same as an orthographic variant or that "Quercus rubra L." is basically the same thing as "Quercus rubra", I would like to also provide a functional URI.
As an end-user who isn't very interested in the technical issues involving names, I don't really care what URI I use. I would prefer for it to be widely recognized and for it to "work" (i.e. be resolvable). In the earlier (January) thread, there was discussion about existing identifiers. There were a number of posts, but in particular http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002258.html and http://lists.tdwg.org/pipermail/tdwg-content/2011-January/002259.html discussed the relative merits of ITIS and uBio ID numbers. My take-home message from this was that uBio represented the largest single set of names with assigned identifiers (see http://gni.globalnames.org/data_sources cited in Pete's email) and that uBio metadata provides useful references. Hence my interest in referencing uBio ids as a URI. However, as a practical matter, the organizations that I share images with either want ITIS TSNs (EOL and Morphbank) or just names (Discover Life). Nobody is asking for uBio identifiers or any other identifier.
I found Kevin's comment at http://lists.tdwg.org/pipermail/tdwg-content/2011-May/002486.html very thought-provoking: "My thoughts are that the most likely way this will be solved is by standard market type pressures - ie the best solution/IDs will be used the most and 'float' to the top." I'm not going to make a judgment about what is the "best" solution or ID. But I would say that in "computer" history, being the "best" doesn't necessarily mean that something will be used. Take for example, the FOAF vocabulary. What the heck is Friend of a Friend? I would venture to say that most of the people using the FOAF vocabulary don't know or care. The FOAF vocabulary was the one that people started to use and once that happened, people didn't switch even if there was something better. I'm not familiar with the history of other stuff like YouTube and Craigslist, but I would guess that they weren't necessarily "the best" systems - they were just the ones that the most people started using first and once that happened, people didn't switch. I'm using ITIS IDs because they are easy to get and the people I communicate with want them. Whether they are the "best" or "done correctly" doesn't matter to me as much as the fact that they are widely recognized and stable (and that thus far every name that I've looked for has been in their database).
I think that one reason why this question has been on my mind is that I've been waiting for GNUB (Global Name Use Bank) to come out. I'm not really up on how it is going to work, but my impression is that it was going to be based on the Global Name Index (GNI) which was mentioned in that earlier January thread. At that point, the GNI names didn't have any identifiers that were exposed to the public as permanent GUIDs. I'm assuming that if GNUB refers to GNI names, they will have some kind of identifiers. So if that happens how is the GUID recommendation 8 going to be followed? As Kevin said in http://lists.tdwg.org/pipermail/tdwg-content/2011-June/002499.html "What I take from recommendation 8 of the GUID applicability guide ... is that if you DON'T already have a record in your own database for a taxon name/concept, then reuse an existing one. " What we have here with GNI is a situation where none of the records have identifiers. In my mind, the "best practice" according to recommendation 8 would be for the GNI to reuse existing identifiers where they exist and NOT make up new ones. This is a bit more complicated because the ITIS identifiers (which are in common use) don't have an http URI version that is resolvable, and while the uBio identifiers have a resolvable http URI, it's in the form of a proxied LSID, which I've already complained is very ugly. So I'd like to hear some ideas about how to have "reused" identifiers in the GNI.
One thing that comes to my mind would be to have a "domain name" like "http://purl.org/gni/" or "http://purl.org/tn/" ("tn" for "taxon name") and to follow it with a namespace/id combination similar to what is done with lsids. So for example "itis/19408" and "ubio/448439" could be appended, creating http://purl.org/gni/itis/19408 and http://purl.org/gni/ubio/448439 for "Quercus rubra L." Both URIs could point to the same RDF and that RDF could indicate that the two identifiers are owl:sameAs. I realize from what Bob Morris has cautioned in the past that there are problems with owl:sameAs when the two things aren't actually the same thing (e.g. if the uBio ID refers to a name string only but the ITIS TSN refers to the name plus an "accepted" status and a relationship to parent taxa). However, if there were an understanding that the GNI only refers to name strings, then one could still refer to http://purl.org/gni/itis/19408 as an identifier for the name string of the thing (whatever it is) that is referred to by an ITIS TSN of 19408. I don't think there would be a problem saying that it and the uBio ID were "owl:sameAs". Some kind of solution like this would allow people to easily generate a resolvable URI for a name if they were using ITIS TSNs or uBio IDs. If the name that one wanted to use was so obscure that it was one of the 9.5 million names that uBio has that ITIS doesn't have, then that name would only have the uBio version. I have no idea whether this would be a good idea or not, but I was really cringing to think about 19 million newly minted UUIDs appended to "http://gni.globalnames.org/" and figuring out how to connect those horrid things to the names and ITIS TSNs that I'm already using.
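The proposed pattern is simple enough to sketch. The root "http://purl.org/gni/" and the "itis" and "ubio" namespaces come from the paragraph above and are only a proposal there; no such PURLs are claimed to exist, and the function names are hypothetical.

```python
# Proposed (not existing) root from the message above.
PURL_ROOT = "http://purl.org/gni/"
# Standard IRI of the owl:sameAs property.
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def purl_for(namespace, local_id):
    """Build a URI of the form <root><namespace>/<id>."""
    return f"{PURL_ROOT}{namespace}/{local_id}"


def same_as_triple(uri_a, uri_b):
    """Serialise one owl:sameAs statement as an N-Triples line."""
    return f"<{uri_a}> <{OWL_SAME_AS}> <{uri_b}> ."


itis_uri = purl_for("itis", 19408)
ubio_uri = purl_for("ubio", 448439)
statement = same_as_triple(itis_uri, ubio_uri)
print(statement)
```

Anyone already holding an ITIS TSN or a uBio ID could build the URI locally; whether asserting owl:sameAs is appropriate depends on the caveat above that both URIs refer to the name string only.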
I think that I said this before, but using the purl.org domain rather than one like http://gni.globalnames.org/ would in the future allow somebody else to take over management of providing the metadata when the GUIDs are resolved without having to deal with issues of who "owns" the domain name.
Steve
Kevin Richards wrote:
Pete,
I'm not trying to say what you are doing is a waste of time/impossible. I actually think RDF + semantics are a good way forward, but this really implies that we need to rely on the semantics and linkages rather than having a SINGLE ID for a taxon name. (which is what I thought Steve was getting at). Each instance of a taxon name can have its own ID and then all these instances are connected via ontology defined semantic links. This seems more appropriate to me than insisting everyone uses the "Global Taxon Name ID X".
In your example of *Aedes triseriatus* and *Ochlerotatus triseriatus* - these are two different names so they need two different IDs, they may be linked by a single taxon concept, but they are separate names. So which of these now 3 IDs do you expect people to use, and according to what source??
For example if we have a name, eg the Robin, Erithacus rubecula, mentioned in ITIS (TSN: 559964) and also in EOL (www.eol.org/pages/1051567), also in GBIF (http://data.gbif.org/species/21266780), also in avibase (http://avibase.bsc-eoc.org/species.jsp?avibaseid=C809B2B90399A43D), which ID are you hoping people will use?? Would you put the ITIS ID in your own dataset as the ID for that name - unlikely. Or would it be better to link them up with semantic linkages.
What I take from recommendation 8 of the GUID applicability guide (as Steve puts it, "stop making up new identifiers when somebody else already has one for the thing you are talking about") is that if you DON'T already have a record in your own database for a taxon name/concept, then reuse an existing one. NOT ditch all your current IDs and adopt someone else's (especially considering it is so hard to work out which of the multitude of name and concept IDs directly relates to your taxon name).
I am all for limiting the number of IDs for the "same" thing, but in some cases it is more useful to build linkages than force this tight integration of data and IDs. Especially for taxon names and concepts, where it is complex to define if you are even talking about the "same" thing or not.
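One way to picture "build linkages rather than force a single ID" is to keep every source's identifier and group them through pairwise link statements. The sketch below uses a small union-find; the prefixed ID strings reuse the Robin identifiers quoted above, while the link statements themselves are hypothetical assertions made for illustration.

```python
def build_clusters(links):
    """Group identifiers connected by pairwise link statements (union-find)."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for left, right in links:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    clusters = {}
    for node in parent:
        clusters.setdefault(find(node), set()).add(node)
    return list(clusters.values())


# Hypothetical link statements between the IDs cited for Erithacus rubecula.
links = [
    ("itis:559964", "eol:1051567"),
    ("eol:1051567", "gbif:21266780"),
    ("gbif:21266780", "avibase:C809B2B90399A43D"),
]
clusters = build_clusters(links)
```

Each database keeps its own ID; the cluster is derived from the links and can be recomputed if someone disagrees with one of them.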
Kevin
*From:* Peter DeVries [mailto:pete.devries@gmail.com]
*Sent:* Wednesday, 1 June 2011 12:38 p.m. *To:* Kevin Richards *Cc:* Steve Baskauf; tdwg-content@lists.tdwg.org; Gerald Guala; Nicolson, David; Alan J Hampson; Orrell, Thomas *Subject:* Re: [tdwg-content] ITIS TSNID to uBio NamebankIDs mapping
Hi Kevin,
I forgot to mention some other things that are different about my project.
You can write a simple SPARQL query to get a list of all the TaxonConcept's that have ITIS ids, or all those that have ITIS and NCBI ID's etc.
You can do this on any SPARQL endpoint that hosts the data.
You can download the entire data set and run the queries on your own endpoint.
You can write a script that runs the query and downloads the ITIS numbers and exports them to CSV etc.
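A sketch of the "run the query and export to CSV" step. The query text uses made-up `ex:` predicates and is not the TaxonConcept vocabulary; the conversion function assumes the endpoint returns the standard SPARQL JSON results layout (`head.vars` and `results.bindings`), and the sample response is fabricated purely so the example runs offline.

```python
import csv
import io

# Hypothetical query: the ex: prefix and predicate are placeholders only.
QUERY = """
PREFIX ex: <http://example.org/vocab#>
SELECT ?concept ?tsn WHERE { ?concept ex:hasItisTsn ?tsn }
"""


def bindings_to_csv(results):
    """Convert a SPARQL JSON result document into CSV text."""
    variables = results["head"]["vars"]
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(variables)
    for binding in results["results"]["bindings"]:
        # Unbound variables are simply absent from a binding.
        writer.writerow([binding.get(v, {}).get("value", "") for v in variables])
    return buffer.getvalue()


# Fabricated sample response, shaped like the standard JSON results format.
sample = {
    "head": {"vars": ["concept", "tsn"]},
    "results": {"bindings": [
        {"concept": {"type": "uri", "value": "http://example.org/concept/1"},
         "tsn": {"type": "literal", "value": "559964"}},
    ]},
}
csv_text = bindings_to_csv(sample)
```

In practice the `sample` dictionary would be replaced by the parsed response from whichever endpoint hosts the data.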
- Pete
On Tue, May 31, 2011 at 5:16 PM, Peter DeVries wrote:
Hi Kevin,
On Tue, May 31, 2011 at 3:27 PM, Kevin Richards < RichardsK@landcareresearch.co.nz> wrote:
This is exactly why this problem still exists and will be very complex to solve - everyone says "we should have a single ID for a specific taxon name, there seem to be several IDs 'out there' that refer to the same taxon name, so I'm going to create another ID to link them all up" - yet another ID that no one will particularly want to follow - you would have to get everyone to agree that your combination/integration of taxon names is the best one and hope everyone follows it - unlikely in this domain.
Isn't this kind of what The Plant List and eBird already do?
A difference being that they tie these to a specific name and specific classification.
The Plant List is not really even open, so it is difficult for people to adopt it en masse.
For instance, if I manage a herbarium, how do I easily reconcile my species list with the entities represented in the Plant List?
eBird has millions of records which implies that they have been able to convince the observers in the field to adopt their system. You are correct in that there are probably a lot of taxonomists that don't like their list.
It differs from many of the other classifications, but remember the system rewards them for not agreeing. Note the difference between the microbial taxonomists and other taxonomists. In the case of the microbial workers, the system rewards them for solving problems not debating alternatives. Also, if a good idea comes out that will make it easier for the microbiologists to solve the problems they are rewarded for solving, they are less likely to care whose idea it is.
Like the microbiologists, there are lots of biologists that work with species with the goal of addressing some non-taxonomic problem.
They don't really care if the name is *Aedes triseriatus* or *Ochlerotatus triseriatus*, but they do care that the identifier that they connect their data to is stable.
In regards to the issue of market forces, I suspect (but have no knowledge) that there were probably decisions made in devising these lists that have more to do with appeasing certain personalities than creating the best list. With the way this system rewards people it is likely that the "correct" version will float to the top only after that person has passed away. I don't have much faith that the best system will always float to the top. That has a lot to do with the personalities and how the system rewards are set up. Theoretically, it is possible for one strong personality or group to force others to adopt their less than optimal solution - at least this seems to happen in other environments.
Also, there are all sorts of ways that people can use the publication record to rewrite history. Simply cite the review paper that cites the original paper. Or don't cite it at all.
I would have used only the ITIS TSN but if the name changes the ID changes. This isn't "wrong", it just does not solve my problem.
- ITIS also should add the spiders from the World Spider Catalog.
Another issue that I think has inhibited adoption of a common list is that people can't agree on a particular name or a particular classification.
Since you can model a species concept as having many names and many classifications why not do so?
If this idea was originally accepted, I would not have needed to create TaxonConcept.org.
My plan has always been to get something that works to solve some problems and then let some larger group take it over.
In a sense, I am more like the microbiologists in that I am not being paid to solve this or debate this problem.
I am doing it because I think something like this is needed, and it is an interesting and personally rewarding puzzle.
- Pete
My thoughts are that the most likely way this will be solved is by standard market type pressures - ie the best solution/IDs will be used the most and "float" to the top. It is easy to say that the global taxon name data is a mess, but if you think about it 30 years ago taxon name data were very disparate, duplicated, unconnected, many with NO IDs at all. So I believe we are making progress and that we will continue to do so albeit at a fairly slow rate.
Kevin
"I agree. This was one of the reasons that I setup TaxonConcept the way I did. It attempts to connect both the LOD entities and the foreign key based entities."
Please consider the environment before printing this email Warning: This electronic message together with any attachments is confidential. If you receive it in error: (i) you must not read, use, disclose, copy or retain it; (ii) please contact the sender immediately by reply email and then delete the emails. The views expressed in this email may not be those of Landcare Research New Zealand Limited. http://www.landcareresearch.co.nz
--
Pete DeVries Department of Entomology University of Wisconsin - Madison 445 Russell Laboratories 1630 Linden Drive Madison, WI 53706 Email: pdevries@wisc.edu TaxonConcept http://www.taxonconcept.org/ & GeoSpecies http://about.geospecies.org/ Knowledge Bases A Semantic Web, Linked Open Data http://linkeddata.org/ Project
-- Steven J. Baskauf, Ph.D., Senior Lecturer Vanderbilt University Dept. of Biological Sciences
postal mail address: VU Station B 351634 Nashville, TN 37235-1634, U.S.A.
delivery address: 2125 Stevenson Center 1161 21st Ave., S. Nashville, TN 37235
office: 2128 Stevenson Center phone: (615) 343-4582, fax: (615) 343-6707 http://bioimages.vanderbilt.edu
tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content
David Remsen, Senior Programme Officer Electronic Catalog of Names of Known Organisms Global Biodiversity Information Facility Secretariat Universitetsparken 15, DK-2100 Copenhagen, Denmark Tel: +45-35321472 Fax: +45-35321480 Skype: dremsen
Good evening Paul, glad to have your input on this topic.
I guess in my thinking the only difference between HCAL and GTR is that the former explicitly incorporates a hierarchy, the latter does not, although maybe you have a different view.
In either case (again to my thinking), the intended unit is that of the taxonomic concept, with both its accepted/valid/current name, and its synonyms (according to some preferred treatment, at least). Of course some concepts are contentious, and others subject to revision through time, but that should not detract from the desirability of the task as a "best effort" representing the state of knowledge at any particular point in time.
Lists of names alone (such as GNI) presently do not go so far as reconciling them to currently accepted concepts, though of course many other taxonomic treatments of either particular taxonomic sectors (as per Catalogue of Life, ITIS, WoRMS and many taxon-specific databases) or regional floras/faunas do so; plus various portions of the palaeontological literature (I am thinking in particular of the numerous volumes of the Treatise on Invertebrate Paleontology here, though not all are recently updated), as well as various more up-to-date reviews and monographs of specific groups.
My interest is in what strategies may be possible to stitch together such activities as are currently in progress as well as fill the gaps between them, preferably in the short term (i.e. my working lifetime!) rather than as an open-ended project with no particular urgency or near term aspiration of completion. One comment I would make is that the Catalogue of Life, at least to date, has concentrated on accessing global species directories (GSDs) which leaves a number of conspicuous gaps at present, whereas for some of the missing groups at least genus-level compendia may be available. Another is the disconnect between databasing projects of extant versus fossil taxa, for which again taxonomic, geographic, and nomenclatural issues know no such bounds.
I would be interested to know more of recent developments e.g. with the 4D4life and i4life initiatives and the extent to which they might accelerate progress towards this goal, although presumably still without the fossil component at this time; a similar comment would apply to the recently released version one of "The Plant List" although of course it is certainly a noteworthy advance on what was available previously.
(And again, with reference to Rich's recent post, whether any of these initiatives are likely to benefit from planned activities in GN* space).
Of course there are two somewhat separate tasks here: one is keeping up with new names/concepts/treatments as they appear, and the other is to organise and make accessible the legacy information as has been published to date. I suspect the strategies (and resources needed) to address these are probably different but in any case, the ultimate goal should be to merge them into a seamless whole - an "integrated taxonomic information system" no less.
Regards - Tony ________________________________ From: tdwg-content-bounces@lists.tdwg.org [tdwg-content-bounces@lists.tdwg.org] On Behalf Of dipteryx@freeler.nl [dipteryx@freeler.nl] Sent: Sunday, 5 June 2011 4:37 PM To: tdwg-content@lists.tdwg.org Subject: Re: [tdwg-content] Producing a global taxon register (was: ITIS TSNID to uBio NamebankIDs mapping)
I agree that it is important to have clarity of what the goal of a project is
* a HCAL - a hierarchical catalogue of all life - is a very popular type of project; Catalogue of Life, ITIS, NCBI, Wikispecies, etc all pursue this.
* a GTR - global taxon register - is something else entirely, at least if the term is taken literally. It would be indispensable if the purpose "to index all usages of all names in all sources" is to be realized. I don't know of any project that pursues this in a systematic way (I suppose the French Wikipedia rates a mention, at least making some attempt).
and of course there are projects that focus on names, but at the moment we still don't have something like a complete nomenclatural index (inventorying all nomenclatural acts), and are just moving towards lists of currently accepted names (closely connected to the HCAL). For information on biodiversity the latter is only marginally relevant, and the GNI is much less so.
Names and taxa are quite different things and they are interconnected in a complex way.
Paul
Hi Tony,
keeping in mind the definition of a taxon (a taxon is what a taxonomist says it is; a good taxon is what a consensus of taxonomists says it is), a taxon register should hold taxon concepts, which by the nature of things will be, in part, mutually exclusive.
A taxon can easily go by half a dozen to a dozen names, even without any change in circumscription. On the other hand a name can easily apply to half a dozen to a dozen differently circumscribed taxa. A regular database is unlikely to be adequate for this.
There are names and there are taxa and there is no, or not necessarily a, 1:1 relationship between them. Even if one had a catalogue of currently accepted taxa, complete with currently accepted names, this would not be enough to map the information from the literature onto.
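The many-to-many relationship described here can be sketched with a plain list of (name, concept) usage records indexed in both directions. The concept labels are hypothetical; the two mosquito names are borrowed from earlier in this thread, and the third record is invented only to show one name applied to a second circumscription.

```python
from collections import defaultdict

# Each record says: this name has been applied to this taxon concept.
usages = [
    ("Aedes triseriatus", "concept-1"),
    ("Ochlerotatus triseriatus", "concept-1"),
    ("Aedes triseriatus", "concept-2"),  # hypothetical second circumscription
]


def index_usages(records):
    """Build both lookups: names per concept and concepts per name."""
    names_of = defaultdict(set)
    concepts_of = defaultdict(set)
    for name, concept in records:
        names_of[concept].add(name)
        concepts_of[name].add(concept)
    return names_of, concepts_of


names_of, concepts_of = index_usages(usages)
```

Neither lookup is a function in the 1:1 sense: a concept maps to a set of names and a name maps to a set of concepts, which is why a catalogue of accepted names alone cannot carry the mapping.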
If fossils are going to be included, matters will become much more complicated, probably by an order of magnitude (just reading the nomenclature proposals for proposed changes in the rules for the naming of plant fossils gave me a headache).
As to the GNI, perhaps this was easy to generate, as it is an index of name strings present in existing databases, but there is no guarantee that all the nomenclaturally correct taxon names are present among the name strings, and it remains unclear to me what purpose it actually serves.
I continue to pin my hopes on the development of bottom-up databases, as the most immediately useful sources.
Paul
-----Oorspronkelijk bericht----- Van: Tony.Rees@csiro.au [mailto:Tony.Rees@csiro.au] Verzonden: zo 5-6-2011 14:18 Aan: dipteryx@freeler.nl; tdwg-content@lists.tdwg.org Onderwerp: RE: [tdwg-content] Producing a global taxon register (was: ITISTSNID to uBio NamebankIDs mapping)
Good evening Paul, glad to have your input on this topic.
I guess in my thinking the only difference between HCAL and GTR is that the former expicitly incorporates a hierarchy, the latter does not, although maybe you have a different view.
In either case (again to my thinking), the intended unit is that of the taxonomic concept, with both its accepted/valid/current name, and its synonyms (according to some preferred treatment, at least). Of course some concepts are contentious, and others subject to revision through time, but that should not detract from the desirability of the task as a "best effort" representing the state of knowledge at any particular point in time.
Lists of names alone (such as GNI) presently do not go so far as reconciling them to currently accepted concepts, though of course many other taxonomic treatments of either particular taxonomic sectors (as per Catalogue of Life, ITIS, WoRMS and many taxon-specific databases) or regional floras/faunas do so; plus various portions of the palaeontological literature (I am thinking in particular of the numerous volumes of the Treatise on Invertebrate Paleontology here, though not all are recently updated), as well as various more up-to-date reviews and monographs of specific groups.
My interest is in what strategies may be possible to stitch together such activities as are currently in progress as well as fill the gaps between them, preferably in the short term (i.e. my working lifetime!) rather than as an open-ended project with no particular urgency or near-term aspiration of completion. One comment I would make is that the Catalogue of Life, at least to date, has concentrated on accessing global species directories (GSDs) which leaves a number of conspicuous gaps at present, whereas for some of the missing groups at least genus-level compendia may be available. Another is the disconnect between databasing projects of extant versus fossil taxa, for which again taxonomic, geographic, and nomenclatural issues know no such bounds.
I would be interested to know more of recent developments e.g. with the 4D4life and i4life initiatives and the extent to which they might accelerate progress towards this goal, although presumably still without the fossil component at this time; a similar comment would apply to the recently released version one of "The Plant List" although of course it is certainly a noteworthy advance on what was available previously.
(And again, with reference to Rich's recent post, whether any of these initiatives are likely to benefit from planned activities in GN* space).
Of course there are two somewhat separate tasks here: one is keeping up with new names/concepts/treatments as they appear, and the other is to organise and make accessible the legacy information as has been published to date. I suspect the strategies (and resources needed) to address these are probably different but in any case, the ultimate goal should be to merge them into a seamless whole - an "integrated taxonomic information system" no less.
Regards - Tony
________________________________ From: tdwg-content-bounces@lists.tdwg.org [tdwg-content-bounces@lists.tdwg.org] On Behalf Of dipteryx@freeler.nl [dipteryx@freeler.nl] Sent: Sunday, 5 June 2011 4:37 PM To: tdwg-content@lists.tdwg.org Subject: Re: [tdwg-content] Producing a global taxon register (was: ITIS TSNID to uBio NamebankIDs mapping)
I agree that it is important to be clear about what the goal of a project is:
* a HCAL - a hierarchical catalogue of all life - is a very popular type of project; Catalogue of Life, ITIS, NCBI, Wikispecies, etc. all pursue this.
* a GTR - global taxon register - is something else entirely, at least if the term is taken literally. It would be indispensable if the purpose "to index all usages of all names in all sources" is to be realized. I don't know of any project that pursues this in a systematic way (I suppose the French Wikipedia rates a mention, at least making some attempt).
And of course there are projects that focus on names, but at the moment we still don't have something like a complete nomenclatural index (inventorying all nomenclatural acts), and are just moving towards lists of currently accepted names (closely connected to the HCAL). For information on biodiversity the latter is only marginally relevant, and the GNI is much less so.
Names and taxa are quite different things and they are interconnected in a complex way.
Paul
- a GTR - global taxon register - is something else entirely, at least if the term is taken literally. It would be indispensable if the purpose "to index all usages of all names in all sources" is to be realized.
Yes, that would be nice. But as Tony indicated, that would be impractical for the foreseeable future. Especially when you consider that "all sources" encompasses not only "all publications" (including popular books and magazine articles, newspaper articles, etc., etc.), but also all unpublished sources (museum specimen labels, field notebooks, personal correspondence, etc., etc.). The GNUB model is designed to accommodate any & all of these, but a proactive attempt to populate it to that extent would represent an unrealistic amount of effort.
However, an enormous benefit would be achieved if a select subset of "all usages of all names in all sources" was targeted. For example, the first priority for populating GNUB will be:
a complete nomenclatural index (inventorying all nomenclatural acts),
And the next step would be:
moving towards lists of currently accepted names
That is, capturing the specific usage instances for each that reflect a modern taxonomic landscape. Of course, there is more than one interpretation of the "modern taxonomic landscape" (i.e., different opinions about how to structure the HCAL). Therefore, you need a spectrum of modern usage instances to capture all of the popular HCAL perspectives.
Names and taxa are quite different things and they are interconnected in a complex way.
I don't think that the interconnection is all that complex. In the same way that nomenclature and biology intersect at the type specimen, names and taxa intersect at the Taxon Name Usage instance. The analogy is reasonably good. A scientific name is "anchored" to the biological world through the type specimen. Likewise, a taxon concept is anchored to a name through a taxon name usage instance. Not all taxon name usage instances rise to the level of an explicit or implicit taxon concept definition. However, all taxon concept definitions exist in the form of a Taxon Name Usage instance.
The problem, as Tony alluded to, is that TNU instances are so abundant that it can be overwhelming to contemplate the TNU universe in its entirety. Dave Remsen referred to TNUs as the "individual molecules" of taxonomy. When we look at a physical object, we don't think of it in terms of an assemblage of individual molecules; we abstract it to the entire object. This is why we have so many databases that focus on the HCAL -- it's much more direct to capture the entire object (in this case, taxon concept), than to enumerate all of the molecules that comprise it.
But unlike physical objects and their constituent molecules, there are "special" TNUs that stand out from all the rest. Capturing a few of these "special" TNUs will allow us to get most of the benefit in representing the parts of the taxon concept we're interested in. As already noted, these "special" TNUs include all the relevant nomenclatural acts for all of the names that have been associated with that taxon concept, as well as the main concept definitions (e.g., published taxonomic treatments that may or may not carry nomenclatural acts with them). In other words, unlike trying to describe a physical object by enumerating its individual molecules, we can capture the majority of our interest in taxon names and concepts by enumerating only a small fraction of the TNUs (i.e., the aforementioned "special" ones).
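To make the "special" TNU idea concrete, here is a rough sketch in which every usage instance records a name, a source, and a kind, and only the nomenclatural acts and concept definitions are kept. The class names, field names, and example records are assumptions made for illustration; this is not the GNUB data model.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class UsageKind(Enum):
    NOMENCLATURAL_ACT = "nomenclatural act"    # e.g. an original description
    CONCEPT_DEFINITION = "concept definition"  # e.g. a published taxonomic treatment
    OTHER = "other"                            # labels, notebooks, passing mentions


@dataclass(frozen=True)
class TaxonNameUsage:
    name: str    # the name as it was used
    source: str  # where it was used (publication, specimen label, ...)
    kind: UsageKind


def special_usages(usages: list[TaxonNameUsage]) -> list[TaxonNameUsage]:
    """Keep only nomenclatural acts and concept definitions, in input order."""
    wanted = {UsageKind.NOMENCLATURAL_ACT, UsageKind.CONCEPT_DEFINITION}
    return [u for u in usages if u.kind in wanted]


# Hypothetical records with placeholder names and sources.
all_usages = [
    TaxonNameUsage("Aus bus", "original description", UsageKind.NOMENCLATURAL_ACT),
    TaxonNameUsage("Aus bus", "museum specimen label", UsageKind.OTHER),
    TaxonNameUsage("Aus bus", "revisionary monograph", UsageKind.CONCEPT_DEFINITION),
]
for usage in special_usages(all_usages):
    print(usage.source, "->", usage.kind.value)
```

The point of the design is that the full list can stay arbitrarily large and uneven while the filtered subset carries most of what is needed to represent a taxon concept.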
Aloha, Rich
Richard L. Pyle, PhD Database Coordinator for Natural Sciences Associate Zoologist in Ichthyology Dive Safety Officer Department of Natural Sciences, Bishop Museum 1525 Bernice St., Honolulu, HI 96817 Ph: (808)848-4115, Fax: (808)847-8252 email: deepreef@bishopmuseum.org http://hbs.bishopmuseum.org/staff/pylerichard.html
From: Richard Pyle [mailto:deepreef@bishopmuseum.org] Sent: Sun 5-6-2011 19:14
- a GTR - global taxon register - is something else entirely, at least if the term is taken literally. It would be indispensable if the purpose "to index all usages of all names in all sources" is to be realized.
Yes, that would be nice. But as Tony indicated, that would be impractical for the foreseeable future. Especially when you consider that "all sources" encompasses not only "all publications" (including popular books and magazine articles, newspaper articles, etc., etc.), but also all unpublished sources (museum specimen labels, field notebooks, personal correspondence, etc., etc.). The GNUB model is designed to accommodate any & all of these, but a proactive attempt to populate it to that extent would represent an unrealistic amount of effort.
*** I don't know about that. Certainly, if you organize this top down it would be a prohibitively massive undertaking. However, if organized bottom up, it may be quite different; of course, coverage would be uneven. * * *
However, an enormous benefit would be achieved if a select subset of "all usages of all names in all sources" was targeted. For example, the first priority for populating GNUB will be:
a complete nomenclatural index (inventorying all nomenclatural acts),
*** This would not be a first step. The first step will be a complete Checklist; likely, this will be ready years before a complete nomenclatural index. * * *
And the next step would be:
moving towards lists of currently accepted names
That is, capturing the specific usage instances for each that reflect a modern taxonomic landscape. Of course, there is more than one interpretation of the "modern taxonomic landscape" (i.e., different opinions about how to structure the HCAL). Therefore, you need a spectrum of modern usage instances to capture all of the popular HCAL perspectives.
Names and taxa are quite different things and they are interconnected in a complex way.
I don't think that the interconnection is all that complex. In the same way that nomenclature and biology intersect at the type specimen, names and taxa intersect at the Taxon Name Usage instance. The analogy is reasonably good. A scientific name is "anchored" to the biological world through the type specimen. Likewise, a taxon concept is anchored to a name through a taxon name usage instance. Not all taxon name usage instances rise to the level of an explicit or implicit taxon concept definition. However, all taxon concept definitions exist in the form of a Taxon Name Usage instance.
The problem, as Tony alluded to, is that TNU instances are so abundant that it can be overwhelming to contemplate the TNU universe in its entirety.
*** Not so sure if it is handy to refer to this as a universe, given that the components conflict, cluster, overlap, etc. * * *
Dave Remsen referred to TNUs as the "individual molecules" of taxonomy. When we look at a physical object, we don't think of it in terms of an assemblage of individual molecules; we abstract it to the entire object. This is why we have so many databases that focus on the HCAL -- it's much more direct to capture the entire object (in this case, taxon concept), than to enumerate all of the molecules that comprise it.
*** I agree that it is more direct, but its popularity will be because of the popularity of shortcuts, the desire for the One Truth or the Latest Thing, or just laziness. * * *
But unlike physical objects and their constituent molecules, there are "special" TNUs that stand out from all the rest. Capturing a few of these "special" TNUs will allow us to get most of the benefit in representing the parts of the taxon concept we're interested in. As already noted, these "special" TNUs include all the relevant nomenclatural acts for all of the names that have been associated with that taxon concept, as well as the main concept definitions (e.g., published taxonomic treatments that may or may not carry nomenclatural acts with them). In other words, unlike trying to describe a physical object by enumerating its individual molecules, we can capture the majority of our interest in taxon names and concepts by enumerating only a small fraction of the TNUs (i.e., the aforementioned "special" ones).
*** Yes, the Law of Diminishing Returns applies. However, there are two issues. Firstly, where to draw the line? Secondly, the marketing pitch. Anything that will offer, say, a 40% usability will be marketed as the Eighth Wonder of the World, and will cause further damage.
Paul
One of the complicating factors in producing a GTR is the human or real-world element. Even if all the correct name-taxon relationships generated by academics and taxonomists could be modeled to the satisfaction of every use case out there (and that seems nigh impossible, given list traffic over the last few months suggesting that a few use cases are orthogonal), the fact remains that only digital facts can be used in a digital/online repository. Mental and print-only concepts are out of reach for computers. The vast majority of taxonomic facts are still in paper form, even after all the work of BHL. The digitized facts are practically never comprehensive, frequently leaving out the really useful details for a taxon concept, and those suboptimal name-taxon facts then have the inevitable human errors randomly mixed in. I think we need techniques and processes that work with the real name and taxon data that are available, with all their imperfections and gaps, to arrive at a "best guess" about taxon relationships, rather than expecting ideal data to be available to populate an ideal GTR. We need better algorithms, models, and services that work with fuzzy concepts, probabilities, quality indices, and possibly sensitivity analyses to deal with missing, inaccurate, and conflicting name and taxon data sources.
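One very small illustration of the "best guess" approach: match an imperfectly cited name against known names and return a similarity score alongside the answer, so that downstream users can see how confident the match is. This sketch uses the standard-library difflib module; the function name, the 0.85 threshold, and the placeholder names are arbitrary assumptions, and string similarity is only one of the signals a real service would need.

```python
from difflib import SequenceMatcher


def best_guess(cited_name, known_names, threshold=0.85):
    """Return (closest known name, similarity score).

    The name is None when no candidate reaches the threshold, so a weak
    match is reported as "unknown" rather than silently accepted.
    """
    def score(candidate):
        return SequenceMatcher(None, cited_name.lower(), candidate.lower()).ratio()

    if not known_names:
        return None, 0.0
    best = max(known_names, key=score)
    best_score = score(best)
    return (best, best_score) if best_score >= threshold else (None, best_score)


# Placeholder names; "Aus buss" stands in for a misspelled citation.
name, confidence = best_guess("Aus buss", ["Aus bus", "Xus cus"])
print(name, round(confidence, 2))
```

Returning the score rather than hiding it leaves room for the quality indices and sensitivity analyses mentioned above.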
Chuck
Chuck Miller VP-IT & CIO Missouri Botanical Garden 4344 Shaw Boulevard Saint Louis, MO 63110 USA
participants (5): Chuck Miller, dipteryx@freeler.nl, Peter DeVries, Richard Pyle, Tony.Rees@csiro.au