[tdwg-content] [tdwg-tag] Inclusion of authorship in DwC scientificName: good or bad?

Wed Nov 24 01:05:05 CET 2010

> Thanks, Rich...
> 
> I'll expand your case (A) a bit:
> 
> > A) Representing the canonical (sans-authorship) form of a uninomial 
> > name at a rank not already represented by existing rank-specific DwC 
> > terms (kingdom, phylum, class, order, family, genus)
> 
> *** in an efficient manner for bulk data transfer

Agreed!

> I.e. a single field canonicalName will then obviate the 
> requirements for multiple fields speciesEpithet, genus, 
> family, order, class, phylum, kingdom which otherwise have to 
> be supplied as "placeholders" for every record in a large set 
> even though only one or two will ever be populated at a given rank

I don't follow.  None of the rank-specific terms are required, so they
already can be empty. They are still useful to have, so that basic
classification information can be provided along with a lower-rank name.

The terms "genus", "subgenus", "specificEpithet" and "infraspecificEpithet"
are still needed to provide pre-parsed name bits.

So I don't understand how canonicalName obviates the need for the multiple
fields.

> And a comment on your case (B):
> 
> > B) ...If they concatenate the authorship string with the taxon name 
> > string when populating dwc:scientificName, then the consumer has no 
> > easy way of extracting the name bits from the authorship bits
> 
> Exactly! Wearing my data consumer hat, the first thing I need 
> to do with current dwc:scientificName content from multiple 
> sources is try to generate canonical names by stripping off 
> what appear to be authorities (hopefully successfully but not 
> guaranteed). If there was an extra field populated in all or 
> even a subset of cases, this task would not be required.

This only applies if the provider has not already parsed the authorship
details from the name bits.  If the provider has already parsed them, then
they can be provided separately via the appropriately parsed terms.

Let me ask this:

What value is canonicalName to:
1) providers who have name+authorship data in a single text blob, unparsed?
2) providers who have unparsed name bits, but separate name/authorship text
blobs?
3) providers who have fully parsed name bits, and separate authorship?
4) users of content from any of the above providers?

The only possible value I see is for #2 (I see no value for #1 or #3). But
there are two easy ways to work around this for providers in #2:

Concatenate solution:
dwc:scientificName: "Aus bus Jones 1980"
dwc:scientificNameAuthorship: "Jones 1980"
[users can easily strip the dwc:scientificNameAuthorship from the end of
"Aus bus Jones 1980", to yield "Aus bus"]

Non-concatenate solution:
dwc:scientificName: "Aus bus"
dwc:scientificNameAuthorship: "Jones 1980"
[users can easily concatenate the two, if desired]

Both will work fine with the existing DwC.
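
As a minimal sketch of what the user would do in each case (Python; the
function names strip_authorship and join_authorship are purely illustrative
and not part of DwC, and the sketch assumes the authorship string appears
verbatim at the end of the concatenated name):

```python
def strip_authorship(scientific_name: str, authorship: str) -> str:
    """Remove a trailing authorship string from a name, if it is present."""
    name = scientific_name.strip()
    author = authorship.strip()
    # Only strip when the authorship is literally the suffix of the name.
    if author and name.endswith(author):
        return name[: -len(author)].rstrip()
    return name


def join_authorship(scientific_name: str, authorship: str) -> str:
    """Concatenate name and authorship, skipping whichever part is empty."""
    parts = (scientific_name.strip(), authorship.strip())
    return " ".join(part for part in parts if part)


# Concatenate solution: authorship is repeated inside scientificName.
concatenated = {
    "scientificName": "Aus bus Jones 1980",
    "scientificNameAuthorship": "Jones 1980",
}
# Non-concatenate solution: the two fields are kept apart.
separate = {
    "scientificName": "Aus bus",
    "scientificNameAuthorship": "Jones 1980",
}

canonical = strip_authorship(
    concatenated["scientificName"], concatenated["scientificNameAuthorship"]
)
full = join_authorship(
    separate["scientificName"], separate["scientificNameAuthorship"]
)
print(canonical)  # Aus bus
print(full)       # Aus bus Jones 1980
```

If the authorship is not an exact suffix (different spacing, parentheses,
abbreviations), strip_authorship returns the name unchanged rather than
guessing.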

Don't get me wrong -- I certainly see some value in establishing
canonicalName.  I just don't see quite enough value to overcome the desire
for stability/consistency of DwC, compared with what we can already do with
DwC.  If we're going to make a change to the DwC terms, we should think more
carefully about what the actual problems are, and come up with a fix that is
both substantial and stable.

I can't help but think that a better solution is to modify the definition of
scientificName to explicitly *exclude* authorship, and effectively play the
role of what you want for canonicalName. Then create a new term called
something like verbatimScientificName for people who have unparsed text
blobs that may include both name bits and authorship bits.

The problem with the existing dwc:scientificName is that it is both required
("core"), and loosely defined (with/without authorship, with/without various
qualifiers, etc.)

I think before we make any proposed changes to DwC, we need to identify more
precisely what we want as both providers and users, and figure out what
specific problems the existing DwC terms cause in terms of hassles for the
providers, hassles for the users, and inability to accurately share
information.

> So, I think the main driver for this has to be from the 
> large scale data consumers - GBIF, OBIS (with which I am 
> associated), EOL, ALA etc. - if they would find such a field 
> useful that is the real test. In my other incarnation as a 
> data supplier, I can concatenate everything into 
> scientificName as per the present DwC spec, no problem, it 
> just is a lossy export when it is received as far as I am concerned.

How is it lossy?  If you already have the content parsed, why can't you
provide it parsed?  Other than the specific example already discussed (no
easy way to provide a record for a uninomial of rank not already represented
among the terms), what other examples lead to information loss?

Aloha,
Rich