This all sounds like it's getting terribly complicated, and the combined discussion on atomised parts vs canonical/full-name is confusing me.
You're not alone! :-)
For the first part, I still think available parsing tools make 99.8% of all cases tractable but if we want to be explicit and not run these services all the time then.
I can go with that (basically what I've been advocating all along: that is, we need to identify where it's broke before we try to fix it). At the moment, many data consumers don't have those tools available locally, but I think as more people become familiar with the tools that are out there, this will be less of a problem. In cases where scientificName is concatenated from parsed bits, I suspect the client-side parsing will be closer to 100% effective.
For datasets that separate the name from the authorship we make sure it's clear this separation is retained. The definition for this term must change. It doesn't make sense to me to concatenate two elements that are already split. The parts go into:
1. scientificName + scientificNameAuthorship
For datasets with only a scientificName, the name goes into:
2. scientificName
For datasets with scientificName and authorship in a single field we have two choices:
3a. scientificName # in which case we must be able to detect and split authorship and we need to detect the canonical form in case 2
3b. scientificNameWithAuthorship # rather than a canonicalName term which is confusing we use a less ambiguous term like this.
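
As a rough illustration of the three cases above, here is a minimal Python sketch of how a provider might route its fields under option 3b. The function name map_record and the name_includes_authorship flag are hypothetical, and scientificNameWithAuthorship is only the term proposed in this thread, not an adopted one.

```python
def map_record(name, authorship=None, name_includes_authorship=False):
    """Route a source name into the terms discussed above (option 3b).

    Hypothetical helper: the provider is assumed to know whether its
    single name field already carries the authorship.
    """
    if authorship is not None:
        # Case 1: name and authorship arrive split, and stay split.
        return {
            "scientificName": name,
            "scientificNameAuthorship": authorship,
        }
    if name_includes_authorship:
        # Case 3b: one combined field, kept under the explicit term.
        return {"scientificNameWithAuthorship": name}
    # Case 2: only a bare name is available.
    return {"scientificName": name}


split = map_record("Puma concolor", authorship="(Linnaeus, 1771)")
bare = map_record("Puma concolor")
combined = map_record("Puma concolor (Linnaeus, 1771)",
                      name_includes_authorship=True)
```

Note that in this sketch the combined record carries no scientificName at all, which is where the question about required terms comes in.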
It seems to me the intent of 3b is more explicit as to what we intend by adding canonicalName.
The thing that concerns me about 3a is the conditional definition of scientificName (i.e., it excludes authorship when that information is pre-parsed, but includes it when it's not pre-parsed).
The thing that concerns me about 3b is that scientificName is (supposedly?) required; so if someone with unparsed information provides scientificNameWithAuthorship, then are they supposed to leave scientificName blank?
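
To make both concerns concrete, here is a small Python sketch. It assumes scientificName really is required (which is the open question), and the REQUIRED set and missing_required helper are made up for illustration only.

```python
# Assumption for illustration: scientificName is the only required term.
REQUIRED = {"scientificName"}


def missing_required(record):
    """Return the required terms that are absent or empty in a record."""
    present = {term for term, value in record.items() if value}
    return sorted(REQUIRED - present)


# Option 3a: the same term holds two different kinds of content,
# depending on whether the provider pre-parsed the name.
pre_parsed_3a = {
    "scientificName": "Puma concolor",
    "scientificNameAuthorship": "(Linnaeus, 1771)",
}
unparsed_3a = {"scientificName": "Puma concolor (Linnaeus, 1771)"}

# Option 3b: the combined string goes into the proposed term, leaving
# scientificName empty.
unparsed_3b = {"scientificNameWithAuthorship": "Puma concolor (Linnaeus, 1771)"}

problems_3a = missing_required(unparsed_3a)
problems_3b = missing_required(unparsed_3b)
```

Under 3a both records pass the required-term check, but a consumer cannot tell from the term alone whether authorship is included. Under 3b the meaning is unambiguous, but the record fails the check unless scientificName is also filled in somehow.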
Rich