Re: [tdwg-content] synonyms in DwC Archives

19 Mar 2014

Hilmar,
I've been in multiple discussions over the last year on how to exchange complex taxonomic data with or without DwCA and we still don't have a consensus.  Complex taxonomic data to me means that the dataset can include many-to-many relationships.  The checklists GBIF is ingesting include one-to-many relations (e.g. many synonyms to one accepted name), and DwCA has been demonstrated to handle that structure.  But, as the many different suggestions show, there are multiple methods that can be employed to do it.  It would be great if we could agree on a "primary" DwCA method for most to follow, and then make exceptions where necessary or where the data are too complex.

But you also touch on another of the ongoing issues in biodiversity information: the desire for a "universal taxonomy" that simply lists all the "taxa" and provides all the "related names" for those taxa. This desire has been repeatedly expressed by many fields outside biological systematics: conservation, ethno-economic use, niche modelling, land use, CITES, and so on.  Taxonomists resist, but I think the closest we have come to that desire so far is Catalogue of Life with 1.5 million species.  Their aim is to create a consensus global checklist for all species, but achieving 100% of all species is extremely challenging when some organismal groups simply don't have a consensus taxonomy yet.

But even if CoL were complete, eternal stability in taxonomic classification simply doesn't exist.  Until thousands of taxonomists stop collecting and analyzing specimens, they will continue to make revisions to taxonomic classifications based on new discoveries, and some names will change from accepted to synonym, or vice versa, for valid scientific reasons.  CoL deals with that change by publishing an Annual and a Dynamic Checklist.  The Annual Checklist provides a stable consensus taxonomy for at least one year, but not eternally.  NCBI's taxonomy contains about 300,000 species and applies only to the sequenced organisms there. And, as they have always done, they give this disclaimer: "The NCBI taxonomy database is not an authoritative source for nomenclature or classification - please consult the relevant scientific literature for the most reliable information." Yet folks continue to use NCBI taxonomy as an authoritative source.

Chuck

From: tdwg-content-bounces@lists.tdwg.org [mailto:tdwg-content-bounces@lists.tdwg.org] On Behalf Of Hilmar Lapp
Sent: Wednesday, March 19, 2014 9:26 AM
To: Markus Döring
Cc: TDWG Content Mailing List; Dan Leehr
Subject: Re: [tdwg-content] synonyms in DwC Archives

On Wed, Mar 19, 2014 at 5:35 AM, Markus Döring <m.doering@mac.com> wrote:
Hilmar,

To include synonyms in the core file, they technically do not have to have IDs of their own.

How would you recommend I do this then, i.e., which column header should be used for that? (I'm assuming you are not requiring that I make up IDs, which to me seems a non-starter, see below.)

That said, there is a desire in DwC archives to require an identifier for a record; see last year's occurrenceID discussion. The best would be if you could make up some identifier, for example by adding a -synX suffix to the accepted taxonID, or even just using the name alone if it's known to be unique within your dataset.
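A minimal sketch of the -synX suggestion above, assuming X is simply a running number per accepted taxon. The function name, the taxonID "T100", and the placeholder names are all hypothetical and not taken from any real dataset:

```python
def mint_synonym_ids(accepted_taxon_id, synonym_names):
    """Pair each synonym with a made-up ID of the form <taxonID>-syn<N>.

    Assumption: N is the 1-based position of the synonym in the input list,
    so the IDs are only as stable as the ordering of that list.
    """
    return [
        (f"{accepted_taxon_id}-syn{i}", name)
        for i, name in enumerate(synonym_names, start=1)
    ]


# Hypothetical accepted taxon "T100" with two placeholder synonyms.
minted = mint_synonym_ids("T100", ["Aus bus", "Aus cus"])
for synonym_id, name in minted:
    print(synonym_id, name)
```

IDs minted this way are local to the archive; nothing in them tells a consumer that they cannot be resolved at the source.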

It is not necessarily unique. And making up identifiers can lead to all kinds of awkward problems downstream (how do consuming applications distinguish real identifiers that can be linked to and/or resolved from those that are simply hacks and otherwise bogus?), so I agree it's a possibility, but it strikes me as a last resort; we ought to be able to do better than that.

Using an extension for synonyms would probably be useful and feel natural for many datasets, but I am concerned we introduce more and more alternative ways of expressing the same kind of data and that becomes a huge burden on the consumer side.

I share that concern in principle. However, DwCA made a deliberate choice to flatten out its core taxon table and force a 1:1 relationship between row and <taxon or whatever other designated type> record. Since there may be multiple synonyms for a taxon, of different types, I'm not sure how you would do this in the core table unless you mint an identifier for each one that doesn't have one from the nomenclator, and cast all records into type Taxon, whether that's what they are or not.

All - to come back to my original question, I fully agree with the concerns re: proper treatment of synonyms for nomenclatural applications. But we also shouldn't forget that there's a wide area of application for taxonomies as an informatics tool, to discover and connect data linked to a taxon. One of the key requirements for such applications when finding data by taxon is to find it by all names that were or are being potentially used to label the taxon, including basionyms, invalid names, misspellings, vernacular names. This may turn up false positives, and I agree the more provenance one has for the synonyms the better the ability to weed those out subsequently. But having an extensive list of synonyms is still critical, even if all one can say is that it's "related". Having a rich set of synonyms was one of the driving use-cases for synthesizing the Vertebrate Taxonomy Ontology [1] (and its predecessor, the Teleost Taxonomy Ontology); it was enough of a pain that we would have gladly used an existing nomenclator's taxonomy.

What has brought me to this in the first place is the use-case of taxonomy synthesis, and the recognition that not only have we not managed in 300 years to converge on a universal taxonomy, but there is also no generally accepted format for exchanging taxonomies. There are several dozen taxonomies, each of which comes in its own idiosyncratic format, often enough a straight database dump. Perhaps there's an opportunity here for the community as a whole to converge on DwCA as a universal taxonomy exchange format. I was going to write up some thoughts on that as a blog post; hopefully I get to that over the next few days, as I think we're really not far away in terms of what's missing.

   -hilmar

[1] Midford, Peter, Thomas Dececchi, James Balhoff, Wasila Dahdul, Nizar Ibrahim, Hilmar Lapp, John Lundberg, et al. 2013. "The Vertebrate Taxonomy Ontology: A Framework for Reasoning across Model Organism and Species Phenotypes." Journal of Biomedical Semantics 4 (1): 34.
http://dx.doi.org/10.1186/2041-1480-4-34

Here are links to the documents I mentioned before:

  Publishing Species Checklists, Best Practices
  http://www.gbif.org/resources/2548

  GBIF GNA Profile Reference Guide for Darwin Core Archive, Core Terms and Extensions:
  http://www.gbif.org/resources/2562

The GBIF documents are from 2011 and likely in need of some updating in specific areas, but they still provide a good overview and lots of details.

In addition, there is a Catalogue of Life document that I cannot find online anymore, so I have uploaded the last version I have here:
  i4Life Darwin Core Archive Profile
  https://dl.dropboxusercontent.com/u/457027/ChecklistExchangeFormat-v1.6.pdf

Markus
...
Markus and all - yes, I realized after I emailed how GBIF does this. I agree that this has advantages. However, this way of doing synonyms requires that there is an identifier for the synonym. For the core use-case I'm interested in, synonyms are metadata of taxon records and do not have their own identifier. For example, synonyms in NCBI don't have identifiers, and they don't in Catalog of Fishes. (I'm not sure they do in PaleoDB.)
One could of course invent identifiers on behalf of the taxonomy providers in these cases, but that's a hack. I think if there is an extension for vernacularNames, there ought to be one as well for synonyms that are simply names.
-hilmar
On Tue, Mar 18, 2014 at 6:35 PM, Markus Döring <m.doering@mac.com> wrote:
Hi Hilmar,
GBIF, Catalogue of Life and others have produced guidelines for how to express taxonomies with synonyms, and these have been in widespread use for over a year already. I will forward links tomorrow when I'm back at my desk.
The common idea is to include synonyms together with accepted taxa in the core file. This allows one to also add extension data to synonyms, for example bibliographic references, type data, etc. The term acceptedNameUsageID is used to link to the accepted record in the core file (targeting taxonID), originalNameUsageID for the basionym, and taxonomicStatus to declare a specific type of synonym such as homo/heterotypic or later/junior synonym. The scientificName is used both for accepted and synonym records.
You should be able to find many DwCA examples in the GBIF dataset search when filtered for checklists: http://www.gbif.org/dataset/search?type=CHECKLIST
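A rough sketch of what such a core file could look like and how a consumer might follow acceptedNameUsageID back to the accepted record. The column headers are the DwC terms named above; the taxonIDs, the placeholder names, and the helper functions are invented for illustration and do not come from any of the guideline documents:

```python
import csv
import io

# Hypothetical core file: one accepted taxon (T1) and two synonyms (T2, T3).
# Synonym rows point at the accepted row through acceptedNameUsageID.
CORE = """taxonID,scientificName,taxonomicStatus,acceptedNameUsageID,originalNameUsageID
T1,Aus bus,accepted,,T3
T2,Aus cus,heterotypic synonym,T1,
T3,Xus bus,homotypic synonym,T1,
"""


def load_core(text):
    """Index the core rows by taxonID."""
    return {row["taxonID"]: row for row in csv.DictReader(io.StringIO(text))}


def accepted_name(core, taxon_id):
    """Return the accepted scientificName for any record, synonym or not."""
    row = core[taxon_id]
    # Assumption: an empty acceptedNameUsageID means the row is itself accepted.
    accepted_id = row["acceptedNameUsageID"] or taxon_id
    return core[accepted_id]["scientificName"]


def names_for(core, accepted_id):
    """Return every name (accepted plus synonyms) attached to one taxon."""
    return sorted(
        row["scientificName"]
        for row in core.values()
        if (row["acceptedNameUsageID"] or row["taxonID"]) == accepted_id
    )


core = load_core(CORE)
print(accepted_name(core, "T2"))
print(names_for(core, "T1"))
```

Note that every synonym row here needs its own taxonID, which is exactly the requirement discussed elsewhere in this thread.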
For example try these:
http://data.canadensys.net/ipt/archive.do?r=vascan
http://ipt.speciesfile.org:8080/archive.do?r=orthoptera
Cheers,
Markus
On 18 Mar 2014, at 22:44, Chuck Miller <Chuck.Miller@mobot.org> wrote:
...
Hilmar,
Sticking strictly to Darwin Core and not adding RDF, I think there are a couple of DwC terms that are attributes that can be used to identify a synonym:
taxonomicStatus - The status of the use of the scientificName as a label for a taxon. Requires taxonomic opinion to define the scope of a taxon. Rules of priority then are used to define the taxonomic status of the nomenclature contained in that scope, combined with the expert's opinion. It must be linked to a specific taxonomic reference that defines the concept. Recommended best practice is to use a controlled vocabulary. Examples: "invalid", "misapplied", "homotypic synonym", "accepted".
relationshipOfResource - The relationship of the resource identified by relatedResourceID to the subject (optionally identified by the resourceID). Recommended best practice is to use a controlled vocabulary. Examples: "duplicate of", "mother of", "endoparasite of", "host to", "sibling of", "valid synonym of", "located within".
There's also acceptedNameUsage and acceptedNameUsageID, which, if used, imply that the name the terms are associated with is a synonym of the accepted name.
But, so far there is no guideline for how to organize synonyms in a Darwin Core Archive.  They can be embedded in the core file using relationshipOfResource from a synonym name to an accepted name in the same file.  Or they can be in an extension file, where the extension file may be called Synonyms and thus define a one-to-many "synonym relationship" from the taxonID in the core file to synonym names in the extension file.  There are probably other ways.  RDF adds the ability to be more explicit about the relationships.
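To make the extension-file option concrete, here is a small sketch under stated assumptions: the extension columns synonymName and synonymType are hypothetical (no such DwC terms are confirmed in this thread), as are the taxonIDs and placeholder names. A real archive would also need a meta.xml descriptor, which is omitted here. The sketch builds a lookup from any name to the taxonIDs it labels:

```python
import csv
import io
from collections import defaultdict

# Hypothetical core file: accepted taxa only.
CORE = """taxonID,scientificName
T1,Aus bus
T2,Aus dus
"""

# Hypothetical "Synonyms" extension: many rows per core taxonID,
# and the synonyms need no identifiers of their own.
SYNONYMS = """taxonID,synonymName,synonymType
T1,Aus cus,related
T1,Xus bus,basionym
T2,Aus cus,misspelling
"""


def build_name_index(core_text, synonym_text):
    """Map every name (accepted or synonym) to the set of taxonIDs it labels."""
    index = defaultdict(set)
    for row in csv.DictReader(io.StringIO(core_text)):
        index[row["scientificName"]].add(row["taxonID"])
    for row in csv.DictReader(io.StringIO(synonym_text)):
        index[row["synonymName"]].add(row["taxonID"])
    return index


index = build_name_index(CORE, SYNONYMS)
print(sorted(index["Aus cus"]))
```

The index maps to a set rather than a single ID because the same name string can be attached to more than one taxon, as "Aus cus" is in this made-up data.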
Rich Pyle has lectured prolifically on this so I'm sure he has good advice to offer.
Chuck
From: tdwg-content-bounces@lists.tdwg.org [mailto:tdwg-content-bounces@lists.tdwg.org] On Behalf Of Hilmar Lapp
Sent: Tuesday, March 18, 2014 2:55 PM
To: TDWG Content Mailing List
Cc: Dan Leehr
Subject: [tdwg-content] synonyms in DwC Archives
I'm looking for recommendations on how best to put synonyms for taxon records into DwC Archive format.
I'm assuming that these would go into an extension file. Do I have this right? What I'm having more trouble with is determining the right column term. There's dwc:vernacularName, which is also in the examples, but what about synonyms of different types that come with taxonomies (such as NCBI's) or that result from merging taxonomies? There isn't an obvious candidate in DwC, and the list at http://rs.gbif.org/core/dwc_taxon.xml doesn't have a suggestion either that would seem pertinent.
Any suggestions, pointers to documentation or examples?
-hilmar
--
Hilmar Lapp -:- informatics.nescent.org/wiki -:- lappland.io
_______________________________________________
tdwg-content mailing list
tdwg-content@lists.tdwg.org
http://lists.tdwg.org/mailman/listinfo/tdwg-content