[tdwg-content] [tdwg-tag] Inclusion of authorship in DwCscientificName: good or bad?

"Markus Döring (GBIF)" mdoering at gbif.org
Wed Nov 24 13:56:51 CET 2010


Chuck,
we see all sorts of things you can imagine in scientificName. For occurrence records, though, the vast majority is the canonical form, with an empty scientificNameAuthorship. I'd think most providers don't have the authorship information captured in their systems.

Some recent statistics I did on the latest 269 million occurrence records for taxonomy can be seen here:
http://code.google.com/p/gbif-occurrencestore/wiki/TaxonomicIntegration#Statistics

We have roughly 3.5 million distinct scientific names.
Parsing them into their canonical form leaves only 2.1 million, only a few of them being monomials (95,000 names representing 14.3 million occurrence records).

Not surprisingly, zoological names often contain the year while botanical ones often contain the authorship.
You will also find four-part names, and multiple authorships within the same name for different parts, e.g. a species authorship and a subspecies one.
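
To give a feel for why parsing shrinks the number of distinct names, here is a minimal Python sketch. It is not the parser behind the statistics above; `naive_canonical`, the rank-marker list, the author-particle list and the sample names are all assumptions made up for illustration, and real name strings need far more care than this.

```python
import re

# Illustrative, incomplete lists (assumptions, not a standard vocabulary).
RANK_MARKERS = {"subsp.", "ssp.", "var.", "f.", "forma"}
AUTHOR_PARTICLES = {"ex", "et", "in", "de", "van", "von", "der"}

def naive_canonical(name: str) -> str:
    """Return a rough canonical form: first token plus lowercase epithets.

    Hypothetical helper: anything that looks like authorship or a year
    (capitalised words, parentheses, digits) is simply dropped.
    """
    tokens = name.split()
    if not tokens:
        return ""
    parts = [tokens[0]]  # uninomial or genus, kept as given
    for tok in tokens[1:]:
        if tok in RANK_MARKERS or tok in AUTHOR_PARTICLES:
            continue  # skip markers and lowercase author particles
        if re.fullmatch(r"[a-z][a-z-]+", tok):
            parts.append(tok)  # looks like an epithet
    return " ".join(parts)

# Made-up sample of name strings as they might arrive from providers.
names = [
    "Puma concolor",
    "Puma concolor (Linnaeus, 1771)",
    "Puma concolor Linnaeus",
    "Pinus nigra Arnold subsp. laricio (Poir.) Maire",
    "Pinus nigra laricio",
    "Felidae",
]

canonical = {naive_canonical(n) for n in names}
monomials = [c for c in canonical if " " not in c]
print(len(set(names)), "distinct strings ->", len(canonical), "canonical forms")
```

Note that the name with both a species and a subspecies authorship collapses onto the same canonical form as its author-free variant, which is the effect described above. A heuristic like this will misread many real strings (hybrids, lowercase author names, abbreviated genera), so treat it as a sketch only.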

Markus


On Nov 24, 2010, at 0:16, Chuck Miller wrote:

> Dave,
> The botanical folks often include the authors with their names. What do
> the data records coming into GBIF from herbarium collections look like?
> Do they mostly include or omit the authors in scientificName? 
> 
> Chuck
> 
> -----Original Message-----
> From: Tony.Rees at csiro.au [mailto:Tony.Rees at csiro.au] 
> Sent: Tuesday, November 23, 2010 5:09 PM
> To: Chuck Miller; deepreef at bishopmuseum.org
> Cc: dremsen at gbif.org; tdwg-content at lists.tdwg.org; dmozzherin at eol.org
> Subject: RE: [tdwg-content] [tdwg-tag] Inclusion of authorship in
> DwCscientificName: good or bad?
> 
> Hi all,
> 
> I just had a quick look at the first few thousand data records coming
> into OBIS for my region (Australia). Just about every supplier who
> includes authority as dwc:scientificNameAuthor has used
> dwc:scientificName "incorrectly" i.e., for the canonical name not the
> canonical name + author. This data then flows into GBIF, ALA, etc. and
> circulates in this form. So "users" are already ignoring the definition
> of dwc:scientificName in practice, it would seem, with no apparent ill
> effects (?) - not sure whether this is good or bad, hence the title of
> my original question which prompted this thread...
> 
> - Tony
> 
> 
>> -----Original Message-----
>> From: Chuck Miller [mailto:Chuck.Miller at mobot.org]
>> Sent: Wednesday, 24 November 2010 9:56 AM
>> To: Richard Pyle
>> Cc: Rees, Tony (CMAR, Hobart); dremsen at gbif.org;
>> tdwg-content at lists.tdwg.org; dmozzherin at eol.org
>> Subject: Re: [tdwg-content] [tdwg-tag] Inclusion of authorship in
>> DwCscientificName: good or bad?
>> 
>> Rich,
>> I gather your reason would be because it's unclear if anyone would 
>> actually use a canonicalName element? That is, it's unneeded. So, 
>> following on, who says they need a dwc:canonicalName element?
>> 
>> You said you worry about feature creep. I suppose I worry about 
>> semantic creep. Extending the meaning of a term makes it more 
>> universal, but in a data world it increases the variability of the 
>> data that may be found attached to the term in some dataset. 
>> Imprecision in terms can create a lot of data quality headaches. Is
>> that acceptable?
>> 
>> Chuck
>> 
>> 
>> 
>> On Nov 23, 2010, at 3:52 PM, "Richard Pyle" 
>> <deepreef at bishopmuseum.org>
>> wrote:
>> 
>>> 
>>>> What is the specific objection to adding canonicalName to DwC as an
>>>> optional element, other than the fact it makes DwC one thing
>>>> larger?
>>> 
>>> I don't have an objection to it per se, but I'd like to feel more
>>> certain that I understand exactly what it is, and what it is intended
>>> to achieve, that is not already achievable with existing terms and/or
>>> couldn't be more achievable with an alternative solution. I think
>>> there is value in avoiding feature-creep with DwC, except when we can
>>> solve a real problem with the existing terms. I agree there is a
>>> problem there, but I'm still struggling to understand exactly what
>>> specific problem that something like canonicalName will solve.
>>> 
>>>> There are databases which do not have their names parsed and 
>>>> provide whatever they have recorded as ScientificName.  But, there 
>>>> are also databases which do have parsed names and could provide 
>>>> this more narrowly defined element, in addition to the 
>>>> ScientificName.  Those databases could make use of a 
>>>> dwc:canonicalName element in their data exchange or query response.
>>> 
>>> Right -- but the point is this: if the data are already parsed,
>>> where is the failure of the existing DwC terms in providing the
>>> desired service? We've already identified one of those: i.e., that
>>> "intermediate" uninomial ranks not supported by existing DwC terms
>>> don't have a place to put the canonical form of the name (other than
>>> scientificName, which isn't currently intended or required to be
>>> canonical). So yes, that's a clear problem in need of a solution. But
>>> is a generic canonicalName term really going to solve that
>>> efficiently/effectively? What other problems might canonicalName
>>> solve?
>>> 
>>>> What we don't have and I think never will have is perfectly
>>>> consistent names data from every database in the world.  One reason
>>>> is a mountain of inconsistently recorded legacy data from decades
>>>> past that stands in the way of perfection.
>>>> Another is variation in convention or tradition for a variety of
>>>> reasons that have been explored in these recent threads.
>>>> So, I think the pragmatic approach is to accept the inconsistencies
>>>> and work around them.
>>> 
>>> Agreed!  And my questions are:
>>> 
>>> 1) What specific problems with existing DwC do we wish to solve?
>>> 2) How best to solve them?
>>> 
>>> I'll list two examples for #1:
>>> 
>>> A) Representing the canonical (sans-authorship) form of a uninomial
>>> name at a rank not already represented by existing rank-specific DwC
>>> terms (kingdom, phylum, class, order, family, genus)
>>> Because the current definition of dwc:scientificName allows
>>> (optionally) the inclusion of authorship information, there is no
>>> clean way to represent a uninomial name in a way that expressly
>>> excludes authorship -- except if the uninomial name happens to be
>>> represented at the rank of kingdom, phylum, class, order, family, or
>>> genus.
>>> 
>>> B) Content providers who have authorship data in a separate field
>>> from taxon name data, but who have not parsed the bits of a taxon
>>> name string
>>> In this case, the provider cannot provide the parsed bits of the
>>> name, but can provide a (sort of) canonicalName string separately
>>> from an authorship string.  If they concatenate the authorship string
>>> with the taxon name string when populating dwc:scientificName, then
>>> the consumer has no easy way of extracting the name bits from the
>>> authorship bits (unless the provider also provides
>>> dwc:scientificNameAuthorship, which could be exactly removed from the
>>> dwc:scientificName value, yielding what the provider would have
>>> otherwise provided as canonicalName). Or, as David suggested, in this
>>> case the Authorship text would not be concatenated with
>>> scientificName.
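
The "exactly removed" idea in case B can be sketched in a few lines of Python. This is only an illustration under the assumption that the provider appended the authorship string verbatim to the end of the name; `strip_authorship` is a hypothetical helper, not part of any DwC tooling.

```python
from typing import Optional

def strip_authorship(scientific_name: str, authorship: Optional[str]) -> str:
    """Remove a trailing authorship string from a full name, if present.

    Hypothetical helper: only an exact suffix match is removed; anything
    else is returned unchanged (apart from surrounding whitespace).
    """
    name = scientific_name.strip()
    author = (authorship or "").strip()
    # Require the name to be longer than the author so we never return "".
    if author and len(name) > len(author) and name.endswith(author):
        return name[: -len(author)].rstrip()
    return name

# Authorship concatenated by the provider: the suffix is stripped.
print(strip_authorship("Abies alba Mill.", "Mill."))
# Authorship supplied but not concatenated: the name is left alone.
print(strip_authorship("Abies alba", "Mill."))
```

An exact-suffix match like this does nothing for names that carry a second authorship in the middle of the string (for example a species author followed by a subspecies epithet), so it covers case B only in its simplest form.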
>>> 
>>> I would like to know some other problems that could be solved with
>>> the addition of a canonicalName term before I start commenting on #2.
>>> 
>>> Aloha,
>>> Rich
>>> 
>>> 


