Re: [tdwg-content] [tdwg-tag] Inclusion of authorship in DwCscientificName: good or bad?

24 Nov 2010

OK, understood.

But I guess my next question would be: is this really "bloat"?  Isn't the
cost of the bloat much less than the value of providing fully parsed
content?

I now understand what I think is a large part of the basis for our (perhaps
non-existent?) disagreement: I'm thinking of dwc terms in the abstract
sense, whereas you are thinking in terms of more practical issues such as
the MB size of your DwCA files.  This also clarifies for me why you keep
saying that it's really a question for the big aggregators (which I now
understand and agree with).

Sorry if I was misunderstanding where you are coming from on this!

Aloha,
Rich
...
-----Original Message-----
From: Tony.Rees@csiro.au [mailto:Tony.Rees@csiro.au] 
Sent: Tuesday, November 23, 2010 2:12 PM
To: Richard Pyle; Chuck.Miller@mobot.org; dremsen@gbif.org
Cc: tdwg-content@lists.tdwg.org; dmozzherin@eol.org
Subject: RE: [tdwg-content] [tdwg-tag] Inclusion of 
authorship in DwCscientificName: good or bad?
Rich Pyle wrote:
...
How is it lossy?  If you already have the content parsed, why can't
you provide it parsed?  Other than the specific example already
discussed (no easy way to provide a record for a uninomial of rank not
already represented among the terms), what other examples lead to
information loss?
It's lossy if I do not wish to add unnecessary "bloat" to my 
(already large) DwCA export file by including dedicated 
fields for the individual values of specificEpithet, genus, 
family, order, class, phylum and kingdom as previously 
mentioned. These (especially the higher taxa for any level) 
are specifically *not* required in the "normalized" example 
given on the TDWG wiki as they can be generated on receipt of 
the data by following the parentID value/s.
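A minimal sketch of what "following the parentID value/s" could look like on the receiving side, assuming each record carries an identifier, a name, a rank and a parentID. The field names, the `higher_taxa` helper and the sample records (including the family name "Aidae") are hypothetical placeholders, not actual DwCA content:

```python
# Ranks that have dedicated DwC classification terms, as listed in this thread.
RANK_FIELDS = ("kingdom", "phylum", "class", "order", "family", "genus")


def higher_taxa(taxon_id, records):
    """Collect rank -> name for every ancestor reachable through parentID."""
    result = {}
    seen = set()  # guards against accidental cycles in the parent links
    current = records[taxon_id].get("parentID")
    while current is not None and current in records and current not in seen:
        seen.add(current)
        parent = records[current]
        if parent["rank"] in RANK_FIELDS:
            result[parent["rank"]] = parent["name"]
        current = parent.get("parentID")
    return result


# Hypothetical normalized records: one row per taxon, linked by parentID.
records = {
    "1": {"name": "Animalia", "rank": "kingdom", "parentID": None},
    "2": {"name": "Aidae", "rank": "family", "parentID": "1"},
    "3": {"name": "Aus", "rank": "genus", "parentID": "2"},
    "4": {"name": "Aus bus", "rank": "species", "parentID": "3"},
}

classification = higher_taxa("4", records)
```

The receiver rebuilds the higher classification for any record this way, so the export does not need to repeat it on every row.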
Cheers - Tony
...
-----Original Message-----
From: Richard Pyle [mailto:deepreef@bishopmuseum.org]
Sent: Wednesday, 24 November 2010 11:05 AM
To: Rees, Tony (CMAR, Hobart); Chuck.Miller@mobot.org; 
dremsen@gbif.org
Cc: tdwg-content@lists.tdwg.org; dmozzherin@eol.org
Subject: RE: [tdwg-content] [tdwg-tag] Inclusion of authorship in
DwCscientificName: good or bad?
...
Thanks, Rich...
I'll expand your case (A) a bit:
...
A) Representing the canonical (sans-authorship) form of a 
uninomial name at a rank not already represented by existing
rank-specific DwC
terms (kingdom, phylum, class, order, family, genus)
*** in an efficient manner for bulk data transfer
Agreed!
...
I.e. a single field canonicalName will then obviate the requirement
for multiple fields specificEpithet, genus, family, order, class,
phylum, kingdom, which otherwise have to be supplied as
"placeholders" for every record in a large set even though only one
or two will ever be populated at a given rank
I don't follow.  None of the rank-specific terms are required, so 
they already can be empty. They are still useful to have, so that 
basic classification information can be provided along with 
a lower-rank name.
The terms "genus", "subgenus", "specificEpithet" and 
"infraspecificEpithet"
are still needed to provide pre-parsed name bits.
So I don't understand how canonicalName obviates the need for the 
multiple fields.
...
And a comment on your case (B):
...
B) ...If they concatenate the authorship string with the taxon
name string when populating dwc:scientificName, then the consumer
has no easy way of extracting the name bits from the authorship
bits
Exactly! Wearing my data consumer hat, the first thing I need to do
with current dwc:scientificName content from multiple sources is try
to generate canonical names by stripping off what appear to be
authorities (hopefully successfully but not guaranteed). If there
was an extra field populated in all or even a subset of cases, this
task would not be required.
This only applies if the provider has not already parsed the 
authorship details from the name bits.  If the provider has already 
parsed them, then they can be provided separately via the 
appropriately parsed terms.
Let me ask this:
What value is canonicalName to:
1) providers who have name+authorship data in a single text blob,
unparsed?
2) providers who have unparsed name bits, but separate
name/authorship text blobs?
3) providers who have fully parsed name bits, and separate
authorship?
4) users of content from any of the above providers?
The only possible value I see is for #2 (I see no value for #1 or #3).
But there are two easy ways to work around this for providers in #2:
Concatenate solution:
dwc:scientificName: "Aus bus Jones 1980"
dwc:scientificNameAuthorship: "Jones 1980"
[users can easily strip the dwc:scientificNameAuthorship from the end
of "Aus bus Jones 1980", to yield "Aus bus"]
Non-concatenate solution:
dwc:scientificName: "Aus bus"
dwc:scientificNameAuthorship: "Jones 1980"
[users can easily concatenate the two, if desired]
Both will work fine with the existing DwC.
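As a rough sketch of how a consumer might handle both conventions, assuming the authorship string (when concatenated) appears verbatim at the end of dwc:scientificName. The helper names `strip_authorship` and `join_authorship` are hypothetical, made up for illustration:

```python
def strip_authorship(scientific_name: str, authorship: str) -> str:
    """Return the name without a trailing authorship string, if present."""
    name = scientific_name.strip()
    author = authorship.strip()
    if author and name.endswith(author):
        return name[: -len(author)].rstrip()
    # Authorship was not concatenated (or does not match verbatim): leave as-is.
    return name


def join_authorship(canonical_name: str, authorship: str) -> str:
    """Concatenate a name and its authorship, skipping empty parts."""
    parts = (canonical_name.strip(), authorship.strip())
    return " ".join(part for part in parts if part)


# The two record shapes from the examples above.
concatenated = {"scientificName": "Aus bus Jones 1980",
                "scientificNameAuthorship": "Jones 1980"}
separate = {"scientificName": "Aus bus",
            "scientificNameAuthorship": "Jones 1980"}

names = [strip_authorship(r["scientificName"], r["scientificNameAuthorship"])
         for r in (concatenated, separate)]
```

Either record shape yields the same authorship-free name, provided the authorship field is populated and matches the tail of the name exactly. Names with differently formatted authorship would still need real parsing.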
Don't get me wrong -- I certainly see some value in establishing
canonicalName.  I just don't see quite enough value to overcome the
desire for stability/consistency of DwC, compared with what we can
already do with DwC.
If we're going to make a change to the DwC terms, we should think more
carefully about what the actual problems are, and come up with a
fix that is substantial and stable.
I can't help but think that a better solution is to modify the
definition of scientificName to explicitly *exclude* authorship, and
effectively play the role of what you want for canonicalName. Then
create a new term called something like verbatimScientificName for
people who have unparsed text blobs that may include both name bits
and authorship bits.
The problem with the existing dwc:scientificName is that it is both 
required ("core"), and loosely defined (with/without authorship, 
with/without various qualifiers, etc.)
I think before we make any proposed changes to DwC, we need to
identify more precisely what we want as both providers and users, and
figure out what specific problems the existing DwC terms cause in
terms of hassles for the providers, hassles for the users, and
inability to accurately share information.
...
So, I think the main driver for this has to be from the large scale
data consumers - GBIF, OBIS (with which I am associated), EOL, ALA
etc. - if they would find such a field useful that is the real test.
In my other incarnation as a data supplier, I can concatenate
everything into scientificName as per the present DwC spec, no
problem, it just is a lossy export when it is received as far as I
am concerned.
How is it lossy?  If you already have the content parsed, why can't
you provide it parsed?  Other than the specific example already
discussed (no easy way to provide a record for a uninomial of rank not
already represented among the terms), what other examples lead to
information loss?
Aloha,
Rich