Re: [tdwg-content] [tdwg-tag] Inclusion of authorship in DwC scientificName: good or bad?
What is the specific objection to adding canonicalName to DwC as an optional element, other than the fact it makes DwC one thing larger?
There are databases which do not have their names parsed and provide whatever they have recorded as ScientificName. But, there are also databases which do have parsed names and could provide this more narrowly defined element, in addition to the ScientificName. Those databases could make use of a dwc:canonicalName element in their data exchange or query response.
What we don't have and I think never will have is perfectly consistent names data from every database in the world. One reason is a mountain of inconsistently recorded legacy data from decades past that stands in the way of perfection. Another is variation in convention or tradition for a variety of reasons that have been explored in these recent threads. So, I think the pragmatic approach is to accept the inconsistencies and work around them.
Chuck
-----Original Message----- From: tdwg-content-bounces@lists.tdwg.org [mailto:tdwg-content-bounces@lists.tdwg.org] On Behalf Of Tony.Rees@csiro.au Sent: Tuesday, November 23, 2010 1:40 PM To: dremsen@gbif.org Cc: tdwg-content@lists.tdwg.org; dmozzherin@eol.org Subject: Re: [tdwg-content] [tdwg-tag] Inclusion of authorship in DwC scientificName: good or bad?
Hi David,
It seems to me that your suggestion is still not quite ideal, in that sometimes just the dwc:scientificName element will be picked up and passed around and the content will not be consistent between those suppliers who concatenate the available authority info and those who do not. That suggests to me that an extra field for known canonicalName if this can be supplied is still desirable - but I am not sure if I am alone in thinking this...
Regards - Tony
From: David Remsen (GBIF) [dremsen@gbif.org] Sent: Tuesday, 23 November 2010 11:15 PM To: Rees, Tony (CMAR, Hobart) Cc: David Remsen (GBIF); deepreef@bishopmuseum.org; m.doering@mac.com; tdwg-content@lists.tdwg.org; dmozzherin@eol.org Subject: Re: [tdwg-content] [tdwg-tag] Inclusion of authorship in DwC scientificName: good or bad?
Tony
I did indeed mean that scientificName and authorship could be used in the following way
1. "Agalinis purpurea" -> scientificName ("Agalinis purpurea") - where a canonical form of the name with no authorship is in the source data
2. "Agalinis purpurea (L.) Pennell" -> scientificName ("Agalinis purpurea (L.) Pennell") - where an unparsed name+author is in the source data
3. "Agalinis purpurea" AND "(L.) Pennell" -> scientificName ("Agalinis purpurea") + scientificNameAuthorship ("(L.) Pennell") - where a semi-parsed name + author is in the source data
4. "Agalinis" AND "purpurea" AND "(L.) Pennell" -> scientificName ("Agalinis purpurea") + scientificNameAuthorship ("(L.) Pennell") - where a fully atomised name is in the source data and the 'name' parts are concatenated to make a proper canonical name.
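The four cases above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function `to_dwc` and its parameter names are hypothetical, and the dictionary keys simply reuse the term names discussed in this thread.

```python
from typing import Dict, Optional


def to_dwc(name: Optional[str] = None,
           authorship: Optional[str] = None,
           genus: Optional[str] = None,
           epithet: Optional[str] = None) -> Dict[str, str]:
    """Map one source record to DwC-style fields (hypothetical helper)."""
    record: Dict[str, str] = {}
    if genus and epithet:
        # Case 4: fully atomised source; concatenate the name parts.
        record["scientificName"] = f"{genus} {epithet}"
    elif name:
        # Cases 1-3: pass the name string through unchanged,
        # whether or not it happens to contain authorship.
        record["scientificName"] = name
    else:
        raise ValueError("no name supplied")
    if authorship:
        # Cases 3 and 4: authorship is held separately in the source.
        record["scientificNameAuthorship"] = authorship
    return record


print(to_dwc(name="Agalinis purpurea"))                             # case 1
print(to_dwc(name="Agalinis purpurea (L.) Pennell"))                # case 2
print(to_dwc(name="Agalinis purpurea", authorship="(L.) Pennell"))  # case 3
print(to_dwc(genus="Agalinis", epithet="purpurea",
             authorship="(L.) Pennell"))                            # case 4
```

Note that the sketch never parses an unparsed name: case 2 is passed through as-is, so a consumer still cannot tell from scientificName alone whether authorship is embedded in it.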
Cases 3 and 4 require modification of the definition at http://rs.tdwg.org/dwc/terms/index.htm#scientificName to be something like
"The full scientific name, which may include authorship and date information if known..." with the implicit intention that it is not REQUIRED to parse or semi-parse an unparsed name in order to properly share it.
David
On Nov 23, 2010, at 12:35 PM, Tony.Rees@csiro.au wrote:
David Remsen wrote:
Maybe we shouldn't add canonical name but rather something more specific to the concatenated form, like dwc:scientificNameWithAuthorshipAndOtherBits, dwc:scientificName, dwc:scientificNameAuthorship
If by "dwc:scientificName" you mean with authorship omitted, that is fine, however it would need the dwc definition to be altered...
Then at least folk would/should know which field to populate. However the mandatory yes/no issue would also have to be addressed - at present I think dwc:scientificName is the only taxonomy related element that is mandatory, all others are optional. Under your scenario it would then maybe be one of either of the first 2 fields, or both as available, I guess?
Regards - Tony
From: David Remsen (GBIF) [dremsen@gbif.org] Sent: Tuesday, 23 November 2010 7:47 PM To: Rees, Tony (CMAR, Hobart) Cc: David Remsen (GBIF); deepreef@bishopmuseum.org; m.doering@mac.com; tdwg-content@lists.tdwg.org; dmozzherin@eol.org Subject: Re: [tdwg-content] [tdwg-tag] Inclusion of authorship in DwC scientificName: good or bad?
While I haven't seen them all, I have seen and had to understand a good number of biodiversity databases, including many focused on managing species lists in one form or another. Names are represented in these three forms.
- Completely unparsed, where the entire verbose name text is in a single field corresponding to dwc:scientificName. In some databases this means just a scientific name, as many databases don't hold authorship information.
- Semi-parsed, where the canonical name is separated from the authorship information, corresponding to the proposed canonicalName and dwc:scientificNameAuthorship.
- Fully parsed into atoms (genus, specific epithet, infraspecific rank, infraspecies, authorship), corresponding to the incomplete set of dwc atomic elements already in existence. This form is the most problematic because 1) it isn't always clear from the parts how the actual complete name is intended to be represented, 2) there are so many structural exceptions and complexities that many more 'atoms' need to be described to effectively enable it to be used, and 3) there is the problematic definition of the use of Genus as described by Markus that conflicts with atomising synonyms.
It makes sense to maintain the separation of name and authorship in data sources that already do this, but I'm not convinced a canonicalName element is required. It seems that it is suggested so that it makes it easier to consume the data, but it also means it's more confusing for a typical data manager or biologist to produce it. I have a database with binomials alone. How many data managers or biologists will map them to canonicalName before scientificName? I know we want to avoid testing different conditions when we use the data, but we will have to in either case.
Maybe we shouldn't add canonical name but rather something more specific to the concatenated form, like
- dwc:scientificNameWithAuthorshipAndOtherBits
- dwc:scientificName
- dwc:scientificNameAuthorship
I'd know what to do then
DR
On Nov 22, 2010, at 11:18 PM, Tony.Rees@csiro.au wrote:
Hi Rich, all,
You wrote:
Otherwise, we could argue forever about which of the dozen possible forms we think DwC needs a term for.
No, I think that is muddying the waters (with respect of course...). I simply made the case for "canonicalName" - aka scientific name without authorship - as a valuable adjunct to "scientificName", for users who can supply both, and consumers who would otherwise have to generate the former from the latter algorithmically. Markus and Dima probably represent the main "consumers" here, and I, if you like, can represent a "provider" (although I wear other "consumer" hats on occasion as well). Basically, if a "canonicalName" field does not exist, I will just omit this information, which seems sub-optimal since it all exists pre-parsed and manually verified in my system, and someone else will then have to do the job again...
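To illustrate what "generating the former from the latter algorithmically" involves for a consumer, here is a deliberately naive Python sketch. It is not the approach of any parser mentioned in this thread; `naive_canonical`, the `RANK_MARKERS` set and the `keep_rank_markers` flag are all assumptions made for the example.

```python
import re

# Assumed, incomplete list of infraspecific rank markers.
RANK_MARKERS = {"subsp.", "ssp.", "var.", "subvar.", "f.", "subf."}
# An epithet is taken to be an all-lowercase word of two or more letters.
EPITHET = re.compile(r"^[a-z][a-z-]+$")


def naive_canonical(scientific_name: str, keep_rank_markers: bool = False) -> str:
    """Guess the authorship-free name by keeping the leading name tokens."""
    tokens = scientific_name.split()
    if not tokens:
        return ""
    kept = [tokens[0]]  # first token is assumed to be the genus or uninomial
    for token in tokens[1:]:
        if token in RANK_MARKERS:
            if keep_rank_markers:
                kept.append(token)
            continue
        if EPITHET.match(token):
            kept.append(token)
        else:
            break  # first token that is not an epithet: assume authorship starts here
    return " ".join(kept)


print(naive_canonical("Agalinis purpurea (L.) Pennell"))  # Agalinis purpurea
```

The heuristic breaks easily: lowercase author particles such as "de" or "von" match the epithet pattern and would be kept as part of the name, and hybrid formulas, qualifiers like "cf." and infrageneric names are not handled at all. That fragility is the argument for letting a provider with manually verified parsed data supply the canonical form directly.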
Regards - Tony
-----Original Message----- From: Richard Pyle [mailto:deepreef@bishopmuseum.org] Sent: Tuesday, 23 November 2010 7:06 AM To: Rees, Tony (CMAR, Hobart); m.doering@mac.com Cc: tdwg-content@lists.tdwg.org; dmozzherin@eol.org Subject: RE: [tdwg-content] [tdwg-tag] Inclusion of authorship in DwC scientificName: good or bad?
"uninomial" would equal "canonicalName" for ranks subgenus and above, but not for species and below, while canonicalName (or scientificNameCanonical if you prefer) covers all cases, which is why I think it is preferable, especially as the majority of names in circulation are at species level and below, I think...
Atomising further, i.e. a binomial or polynomial into genus, species, infraspecies, is actually a separate activity with its own rationale, I would say.
Just my personal view, of course...
The cleanest way to do it is to simply have Rank, NameElement and parentNameUsageID, and be done with it (maybe with the addition of verbatimNameString for purists). But that assumes that providers have parsed data, which they often do not. Maybe with services like those associated with GNI, the time of databases with unparsed names data is drawing to a close. Or, maybe if GNUB gets a foot-hold, we'll solve all the problems via a simply actionable persistent identifier.
But until that time, dwc needs to find a balance between users who want pre-parsed data, and providers who do not have pre-parsed data.
I think dwc *almost* accommodates both worlds, as long as scientificName is defined as "the complete set of textual elements useful for recognizing a unique scientific name"; which is either concatenated by the provider with parsed data, or simply "provided" by the provider with unparsed data.
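As a rough illustration of the Rank / NameElement / parentNameUsageID model mentioned above, the following Python sketch assembles a name by walking parent links. The `NameUsage` class, the `BELOW_GENUS` rank set and the `canonical_name` function are hypothetical, and the sketch assumes every name below genus rank eventually links up to its genus.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Assumed set of ranks whose names are combinations rather than uninomials.
BELOW_GENUS = {"species", "subspecies", "variety", "form"}


@dataclass
class NameUsage:
    usage_id: str
    rank: str
    name_element: str
    parent_id: Optional[str] = None  # plays the role of parentNameUsageID


def canonical_name(usage_id: str, usages: Dict[str, NameUsage]) -> str:
    """Assemble an authorship-free name from a usage and its parents."""
    current: Optional[NameUsage] = usages[usage_id]
    if current.rank not in BELOW_GENUS:
        return current.name_element  # genus and above: a uninomial
    parts = []
    while current is not None:
        # Collect epithets and the genus; skip intermediate ranks such as subgenus.
        if current.rank in BELOW_GENUS or current.rank == "genus":
            parts.append(current.name_element)
        if current.rank == "genus" or current.parent_id is None:
            break
        current = usages.get(current.parent_id)
    return " ".join(reversed(parts))


usages = {
    "g1": NameUsage("g1", "genus", "Agalinis"),
    "s1": NameUsage("s1", "species", "purpurea", parent_id="g1"),
}
print(canonical_name("s1", usages))  # Agalinis purpurea
print(canonical_name("g1", usages))  # Agalinis
```

With this structure a provider never has to choose a formatted form; each consumer concatenates whatever form it needs. The cost is the one already noted: it only works for providers whose data are parsed.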
What we seem to be arguing about now is how many different forms of a "formatted" name do we want?
With or without authorship?
With or without year?
With or without infraspecific prefixes ("var.", "f." etc.)?
With or without infrageneric name(s)?
With or without italics codes?
With or without qualifiers like "cf.", "aff.", etc.?
Etc.
Etc.
Etc.
There are potentially dozens of different terms we could define to accommodate every particular niche-need.
Personally, I think that the existing "scientificName" should be split into two different terms:
fullScientificNameStringWithAuthorship and verbatimNameString
The first would be a concatenated text string assembled from parsed bits, according to a community standard concatenation form.
The second would be the literal text string as it appeared in the original source.
Otherwise, we could argue forever about which of the dozen possible forms we think DwC needs a term for.
Rich
tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content
Aw, Chuck, you are such a kill-joy. We should never do anything until it is perfectly consistent. :-)
On Tue, Nov 23, 2010 at 3:00 PM, Chuck Miller Chuck.Miller@mobot.org wrote:
What is the specific objection to adding canonicalName to DwC as an optional element, other than the fact it makes DwC one thing larger?
There are databases which do not have their names parsed and provide whatever they have recorded as ScientificName. But, there are also databases which do have parsed names and could provide this more narrowly defined element, in addition to the ScientificName. Those databases could make use of a dwc:canonicalName element in their data exchange or query response.
What we don't have and I think never will have is perfectly consistent names data from every database in the world. One reason is a mountain of inconsistently recorded legacy data from decades past that stands in the way of perfection. Another is variation in convention or tradition for a variety of reasons that have been explored in these recent threads. So, I think the pragmatic approach is to accept the inconsistencies and work around them.
Chuck
Yeh. What's the binomial for a kill-joy?
Chuck
-----Original Message----- From: Bob Morris [mailto:morris.bob@gmail.com] Sent: Tuesday, November 23, 2010 2:21 PM To: Chuck Miller Cc: Tony.Rees@csiro.au; dremsen@gbif.org; tdwg-content@lists.tdwg.org; dmozzherin@eol.org Subject: Re: [tdwg-content] [tdwg-tag] Inclusion of authorship in DwC scientificName: good or bad?
Aw, Chuck, you are such a kill-joy. We should never do anything until it is perfectly consistent. :-)
-- Robert A. Morris Emeritus Professor of Computer Science UMASS-Boston 100 Morrissey Blvd Boston, MA 02125-3390 Associate, Harvard University Herbaria email: morris.bob@gmail.com web: http://bdei.cs.umb.edu/ web: http://etaxonomy.org/mw/FilteredPush http://www.cs.umb.edu/~ram phone (+1) 857 222 7992 (mobile)
What is the specific objection to adding canonicalName to DwC as an optional element, other than the fact it makes DwC one thing larger?
I don't have an objection to it per se, but I'd like to feel more certain that I understand exactly what it is, and what it is intended to achieve, that is not already achievable with existing terms and/or couldn't be more achievable with an alternative solution. I think there is value in avoiding feature-creep with DwC, except when we can solve a real problem with the existing terms. I agree there is a problem there, but I'm still struggling to understand exactly what specific problem that something like canonicalName will solve.
There are databases which do not have their names parsed and provide whatever they have recorded as ScientificName. But, there are also databases which do have parsed names and could provide this more narrowly defined element, in addition to the ScientificName. Those databases could make use of a dwc:canonicalName element in their data exchange or query response.
Right -- but the point is this: if the data are already parsed, where is the failure of the existing DwC terms in providing the desired service? We've already identified one of those: i.e., that "intermediate" uninomial ranks not supported by existing DwC terms don't have a place to put the canonical form of the name (other than scientificName, which isn't currently intended or required to be canonical). So yes, that's a clear problem in need of a solution. But is a generic canonicalName term really going to solve that efficiently/effectively? What other problems might canonicalName solve?
What we don't have and I think never will have is perfectly consistent names data from every database in the world. One reason is a mountain of inconsistently recorded legacy data from decades past that stands in the way of perfection. Another is variation in convention or tradition for a variety of reasons that have been explored in these recent threads. So, I think the pragmatic approach is to accept the inconsistencies and work around them.
Agreed! And my questions are:
1) What specific problems with existing DwC do we wish to solve?
2) How best to solve them?
I'll list two examples for #1:
A) Representing the canonical (sans-authorship) form of a uninomial name at a rank not already represented by existing rank-specific DwC terms (kingdom, phylum, class, order, family, genus)
Because the current definition of dwc:scientificName allows (optionally) the inclusion of authorship information, there is no clean way to represent a uninomial name in a way that expressly excludes authorship -- except if the uninomial name happens to be represented at the rank of kingdom, phylum, class, order, family, or genus.
B) Content providers who have authorship data in a separate field from taxon name data, but who have not parsed the bits of a taxon name string
In this case, the provider cannot provide the parsed bits of the name, but can provide a (sort of) canonicalName string separately from an authorship string. If they concatenate the authorship string with the taxon name string when populating dwc:scientificName, then the consumer has no easy way of extracting the name bits from the authorship bits (unless the provider also provides dwc:scientificNameAuthorship, which could be removed exactly from the dwc:scientificName value, yielding what the provider would otherwise have provided as canonicalName). Or, as David suggested, in this case the Authorship text would not be concatenated with scientificName.
I would like to know some other problems that could be solved with the addition of a canonicalName term before I start commenting on #2.
Aloha, Rich
Thanks, Rich...
I'll expand your case (A) a bit:
A) Representing the canonical (sans-authorship) form of a uninomial name at a rank not already represented by existing rank-specific DwC terms (kingdom, phylum, class, order, family, genus)
*** in an efficient manner for bulk data transfer
I.e. a single field canonicalName will then obviate the requirements for multiple fields specificEpithet, genus, family, order, class, phylum, kingdom which otherwise have to be supplied as "placeholders" for every record in a large set even though only one or two will ever be populated at a given rank.
And a comment on your case (B):
B) ...If they concatenate the authorship string with the taxon name string when populating dwc:scientificName, then the consumer has no easy way of extracting the name bits from the authorship bits
Exactly! Wearing my data consumer hat, the first thing I need to do with current dwc:scientificName content from multiple sources is try to generate canonical names by stripping off what appear to be authorities (hopefully successfully but not guaranteed). If there was an extra field populated in all or even a subset of cases, this task would not be required.
So, I think the main driver for this has to be the large-scale data consumers - GBIF, OBIS (with which I am associated), EOL, ALA etc. - and whether they would find such a field useful is the real test. In my other incarnation as a data supplier, I can concatenate everything into scientificName as per the present DwC spec, no problem; it is just a lossy export when it is received, as far as I am concerned.
Regards - Tony
-----Original Message----- From: Richard Pyle [mailto:deepreef@bishopmuseum.org] Sent: Wednesday, 24 November 2010 8:53 AM To: 'Chuck Miller'; Rees, Tony (CMAR, Hobart); dremsen@gbif.org Cc: tdwg-content@lists.tdwg.org; dmozzherin@eol.org Subject: RE: [tdwg-content] [tdwg-tag] Inclusion of authorship in DwCscientificName: good or bad?
Thanks, Rich...
I'll expand your case (A) a bit:
A) Representing the canonical (sans-authorship) form of a uninomial name at a rank not already represented by existing rank-specific DwC terms (kingdom, phylum, class, order, family, genus)
*** in an efficient manner for bulk data transfer
Agreed!
I.e. a single field canonicalName will then obviate the requirements for multiple fields specificEpithet, genus, family, order, class, phylum, kingdom which otherwise have to be supplied as "placeholders" for every record in a large set even though only one or two will ever be populated at a given rank.
I don't follow. None of the rank-specific terms are required, so they already can be empty. They are still useful to have, so that basic classification information can be provided along with a lower-rank name.
The terms "genus", "subgenus", "specificEpithet" and "infraspecificEpithet" are still needed to provide pre-parsed name bits.
So I don't understand how canonicalName obviates the need for the multiple fields.
And a comment on your case (B):
B) ...If they concatenate the authorship string with the taxon name string when populating dwc:scientificName, then the consumer has no easy way of extracting the name bits from the authorship bits
Exactly! Wearing my data consumer hat, the first thing I need to do with current dwc:scientificName content from multiple sources is try to generate canonical names by stripping off what appear to be authorities (hopefully successfully but not guaranteed). If there was an extra field populated in all or even a subset of cases, this task would not be required.
This only applies if the provider has not already parsed the authorship details from the name bits. If the provider has already parsed them, then they can be provided separately via the appropriately parsed terms.
Let me ask this:
What value is canonicalName to:
1) providers who have name+authorship data in a single text blob, unparsed?
2) providers who have unparsed name bits, but separate name/authorship text blobs?
3) providers who have fully parsed name bits, and separate authorship?
4) users of content from any of the above providers?
The only possible value I see is for #2 (I see no value for #1 or #3). But there are two easy ways to work around this for providers in #2:
Concatenate solution:
dwc:scientificName: "Aus bus Jones 1980"
dwc:scientificNameAuthorship: "Jones 1980"
[users can easily strip the dwc:scientificNameAuthorship from the end of "Aus bus Jones 1980", to yield "Aus bus"]

Non-concatenate solution:
dwc:scientificName: "Aus bus"
dwc:scientificNameAuthorship: "Jones 1980"
[users can easily concatenate the two, if desired]
Both will work fine with the existing DwC.
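To make the two workarounds concrete, here is a minimal Python sketch of what a consumer might do. The function names strip_authorship and join_authorship are hypothetical (they are not part of DwC or of any library), and the sketch assumes the provider appended the authorship string verbatim to the end of scientificName.

```python
def strip_authorship(scientific_name, authorship):
    """Remove a trailing authorship string from a concatenated name.

    Assumes dwc:scientificNameAuthorship was appended verbatim to the end
    of dwc:scientificName; if it was not, the name is returned unchanged.
    """
    name = scientific_name.strip()
    author = authorship.strip()
    if author and name.endswith(author):
        return name[: -len(author)].rstrip()
    return name


def join_authorship(scientific_name, authorship):
    """Concatenate name and authorship, skipping an empty authorship."""
    parts = (scientific_name.strip(), authorship.strip())
    return " ".join(part for part in parts if part)


print(strip_authorship("Aus bus Jones 1980", "Jones 1980"))  # Aus bus
print(join_authorship("Aus bus", "Jones 1980"))              # Aus bus Jones 1980
```

Note that the stripping only works when the two fields agree character for character; any difference in spacing or punctuation between them would leave the name untouched.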
Don't get me wrong -- I certainly see some value in establishing canonicalName. I just don't see quite enough value to overcome the desire for stability/consistency of DwC, compared with what we can already do with DwC. If we're going to make a change to the DwC terms, we should think more carefully about what the actual problems are, and come up with a fix that is substantial and stable.
I can't help but think that a better solution is to modify the definition of scientificName to explicitly *exclude* authorship, and effectively play the role of what you want for canonicalName. Then create a new term called something like verbatimScientificName for people who have unparsed text blobs that may include both name bits and authorship bits.
The problem with the existing dwc:scientificName is that it is both required ("core"), and loosely defined (with/without authorship, with/without various qualifiers, etc.)
I think before we make any proposed changes to DwC, we need to identify more precisely what we want as both providers and users, and figure out what specific problems the existing DwC terms cause in terms of hassles for the providers, hassles for the users, and inability to accurately share information.
So, I think the main driver for this has to be the large-scale data consumers - GBIF, OBIS (with which I am associated), EOL, ALA etc. - and whether they would find such a field useful is the real test. In my other incarnation as a data supplier, I can concatenate everything into scientificName as per the present DwC spec, no problem; it is just a lossy export when it is received, as far as I am concerned.
How is it lossy? If you already have the content parsed, why can't you provide it parsed? Other than the specific example already discussed (no easy way to provide a record for a uninomial of rank not already represented among the terms), what other examples lead to information loss?
Aloha, Rich
Rich Pyle wrote:
How is it lossy? If you already have the content parsed, why can't you provide it parsed? Other than the specific example already discussed (no easy way to provide a record for a uninomial of rank not already represented among the terms), what other examples lead to information loss?
It's lossy if I do not wish to add unnecessary "bloat" to my (already large) DwCA export file by including dedicated fields for the individual values of specificEpithet, genus, family, order, class, phylum and kingdom as previously mentioned. These (especially the higher taxa for any level) are specifically *not* required in the "normalized" example given on the TDWG wiki as they can be generated on receipt of the data by following the parentID value/s.
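As a rough sketch of what "following the parentID value/s" could look like on the receiving end, here is a small Python example. The records, their IDs, and the field names other than parentID are made up for illustration; a real normalized export would of course carry more ranks and more fields.

```python
# Hypothetical normalized records: each row carries only its own name,
# its rank, and a pointer to its parent record.
records = {
    1: {"parentID": None, "rank": "kingdom", "name": "Animalia"},
    2: {"parentID": 1, "rank": "family", "name": "Felidae"},
    3: {"parentID": 2, "rank": "genus", "name": "Puma"},
    4: {"parentID": 3, "rank": "species", "name": "Puma concolor"},
}


def classification(record_id, records):
    """Walk the parentID chain and collect the name at each higher rank."""
    higher = {}
    seen = set()
    current = records[record_id]["parentID"]
    while current is not None and current not in seen:
        seen.add(current)  # guard against accidental cycles in the data
        parent = records[current]
        higher[parent["rank"]] = parent["name"]
        current = parent["parentID"]
    return higher


print(classification(4, records))
```

The higher-rank columns are rebuilt by the consumer, so the provider does not have to repeat them on every row.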
Cheers - Tony
-----Original Message----- From: Richard Pyle [mailto:deepreef@bishopmuseum.org] Sent: Wednesday, 24 November 2010 11:05 AM To: Rees, Tony (CMAR, Hobart); Chuck.Miller@mobot.org; dremsen@gbif.org Cc: tdwg-content@lists.tdwg.org; dmozzherin@eol.org Subject: RE: [tdwg-content] [tdwg-tag] Inclusion of authorship in DwCscientificName: good or bad?
On 24/11/2010, at 11:11 AM, Tony.Rees@csiro.au wrote:
It's lossy if I do not wish to add unnecessary "bloat" to my (already large) DwCA export file by including dedicated fields for the individual values of specificEpithet, genus, family, order, class, phylum and kingdom as previously mentioned. These (especially the higher taxa for any level) are specifically *not* required in the "normalized" example given on the TDWG wiki as they can be generated on receipt of the data by following the parentID value/s.
Not sure if I've mentioned it on the list, but one of my solutions was to include a format string indicating how to compose various renderings of the name from its fields. You'd provide a name-only format, a name with authority format, and so on.
Thus: "{genus} {species} {epithet}" for subspecific names, and "{genus} {species} {rank} {epithet}" for form, variant and so on. The difficulty is that it's nonstandard, it requires additional processing at the client end, and hybrids are still a bit of a mess. An advantage is that the process of building these format strings would tell us something about how names are composed.
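A minimal Python sketch of the format-string idea follows. The field names (including "authority") and the two format keys are assumptions made up for this example, not an agreed standard, and hybrids are not handled.

```python
def render(record, format_key):
    """Compose a rendering of the name from the record's own format string."""
    return record[format_key].format(**record)


# Hypothetical record: parsed fields plus the formats that say how to
# assemble them (a name-only format and a name-with-authority format).
record = {
    "genus": "Aus",
    "species": "bus",
    "rank": "var.",
    "epithet": "cus",
    "authority": "Jones 1980",
    "name_format": "{genus} {species} {rank} {epithet}",
    "name_authority_format": "{genus} {species} {rank} {epithet} {authority}",
}

print(render(record, "name_format"))            # Aus bus var. cus
print(render(record, "name_authority_format"))  # Aus bus var. cus Jones 1980
```

The client needs no knowledge of nomenclatural rules here, only the ability to substitute fields into the supplied string, which is the extra processing step mentioned above.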
OK, understood.
But I guess my next question would be: is this really "bloat"? Isn't the cost of the bloat much less than the value of providing fully parsed content?
I now understand what I think is a large part of the basis for our (perhaps non-existent?) disagreement: I'm thinking of dwc terms in the abstract sense, whereas you are thinking in terms of more practical issues such as the MB size of your DwCA files. This also clarifies for me why you keep saying that it's really a question for the big aggregators (which I now understand and agree with).
Sorry if I was misunderstanding where you are coming from on this!
Aloha, Rich
-----Original Message----- From: Tony.Rees@csiro.au [mailto:Tony.Rees@csiro.au] Sent: Tuesday, November 23, 2010 2:12 PM To: Richard Pyle; Chuck.Miller@mobot.org; dremsen@gbif.org Cc: tdwg-content@lists.tdwg.org; dmozzherin@eol.org Subject: RE: [tdwg-content] [tdwg-tag] Inclusion of authorship in DwCscientificName: good or bad?
Rich,
No need to apologise... Actually it affects the aggregators in two respects, one is the larger vs. more compact data representation, the other is the present inconsistency about what is actually expected/supplied in practice by real world data providers in the present "scientificName" element. If it was clearer that this was for sciname + author, and the sciname without author had its own dedicated element, the incoming data would (or at least might) be a lot more consistent.
Basically it is the present "scientificNameAuthorship" element which is clouding the issue - people see this and then think they do not need to add the author in to "scientificName" as well, although as previously stated by Markus this is technically incorrect according to the DwC spec (and I can see the argument for keeping it that way, so as to capture as much info as possible in that field).
Cheers - Tony
-----Original Message----- From: Richard Pyle [mailto:deepreef@bishopmuseum.org] Sent: Wednesday, 24 November 2010 11:27 AM To: Rees, Tony (CMAR, Hobart); Chuck.Miller@mobot.org; dremsen@gbif.org Cc: tdwg-content@lists.tdwg.org; dmozzherin@eol.org Subject: RE: [tdwg-content] [tdwg-tag] Inclusion of authorship in DwCscientificName: good or bad?
My vote would be to clarify the use of scientific name to not include authorship as Rich suggests.
Perhaps a partial solution would be for the GNI or GBIF to provide some web service that end users could use to clean and parse their names into their dwc:scientificName and authorship parts. (They probably have something close to this already.)
For ease of use the system could output something like this:
Puma concolor <tab> (Linnaeus 1771)
In the process they could flag potentially incorrect uses of parentheses etc.
Puma concolor <tab> Linnaeus 1771 <tab> Note: Potentially incorrect authorship - parentheses missing
or
Felis concolor <tab> Linnaeus 1771 <tab> Note: Do you mean "Puma concolor (Linnaeus 1771)"?
A beneficial side effect would be that everyone has a more normalized and accurate species list.
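Just to sketch the kind of output I have in mind, here is a small Python example. It is only a heuristic under stated assumptions: the REFERENCE table is a made-up stand-in for whatever names index a real service would consult, parse_line is a hypothetical name, and the pattern only copes with simple names (a capitalized genus followed by lowercase epithets); it would misread rank markers such as "var." or authors whose names begin in lowercase.

```python
import re

# Hypothetical reference list of name -> expected authorship.
REFERENCE = {"Puma concolor": "(Linnaeus 1771)"}

# Heuristic: capitalized genus plus one or more lowercase epithets;
# whatever follows is treated as the authorship string.
NAME_RE = re.compile(r"^([A-Z][a-z]+(?: [a-z]+)+)\s*(.*)$")


def parse_line(raw):
    """Return 'name<TAB>authorship', with an optional third note column."""
    text = raw.strip()
    match = NAME_RE.match(text)
    if not match:
        return text + "\t\tNote: could not parse"
    name, authorship = match.group(1), match.group(2)
    columns = [name, authorship]
    expected = REFERENCE.get(name)
    # Flag the case where only the parentheses differ from the reference.
    if expected and authorship != expected and authorship == expected.strip("()"):
        columns.append("Note: Potentially incorrect authorship - parentheses missing")
    return "\t".join(columns)


print(parse_line("Puma concolor (Linnaeus 1771)"))
print(parse_line("Puma concolor Linnaeus 1771"))
```

The "Do you mean ..." suggestion for Felis concolor would need synonymy information, which this sketch does not attempt.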
Respectfully,
- Pete
On Tue, Nov 23, 2010 at 6:40 PM, Tony.Rees@csiro.au wrote:
Rich,
No need to apologise... Actually it affects the aggregators in two respects, one is the larger vs. more compact data representation, the other is the present inconsistency about what is actually expected/supplied in practice by real world data providers in the present "scientificName" element. If it was clearer that this was for sciname + author, and the sciname without author had its own dedicated element, the incoming data would (or at least might) be a lot more consistent.
Basically it is the present "scientificNameAuthorship" element which is clouding the issue - people see this and then think they do not need to add the author in to "scientificName" as well, although as previously stated by Markus this is technically incorrect according to the DwC spec (and I can see the argument for keeping it that way, so as to capture as much info as possible in that field).
Cheers - Tony
Several name-parsing services already exist to provide this functionality:
http://tools.gbif.org/nameparser/
http://gni.globalnames.org/parsers/new
My personal philosophy regarding data sharing is that, whenever practical, the burden should be placed on the enabling infrastructure and not the person wanting to share their data. Put enough impediments in the way and the pipes will stay empty.
I prefer a loose definition for using scientificName, that recommends parsing authority information into the authorship element but does not require it. The services for parsing could be made more easily available and even incorporated into publishing tools. I think we will be dealing with this issue whether canonicalName is available or not.
For GBIF at least, if dealing with authorship issues like this were the biggest data processing issue we faced, we would take it. If you want to see what we are dealing with at the moment, take for example the values supplied in the dwc:Family element in specimen and observation data: it's a bit of an eye-opener, and I have a file!
DR
On Nov 24, 2010, at 4:46 AM, Peter DeVries wrote:
My vote would be to clarify the use of scientific name to not include authorship as Rich suggests.
--
Pete DeVries
Department of Entomology
University of Wisconsin - Madison
445 Russell Laboratories
1630 Linden Drive
Madison, WI 53706
TaxonConcept Knowledge Base / GeoSpecies Knowledge Base
About the GeoSpecies Knowledge Base
B) ...If they concatenate the authorship string with the taxon name string when populating dwc:scientificName, then the consumer has no easy way of extracting the name bits from the authorship bits
Exactly! Wearing my data consumer hat, the first thing I need to do with current dwc:scientificName content from multiple sources is try to generate canonical names by stripping off what appear to be authorities (hopefully successfully, but that is not guaranteed). If there were an extra field populated in all or even a subset of cases, this task would not be required.
So, I think the main driver for this has to come from the large-scale data consumers - GBIF, OBIS (with which I am associated), EOL, ALA etc. - whether they would find such a field useful is the real test. In my other incarnation as a data supplier, I can concatenate everything into scientificName as per the present DwC spec, no problem; it is just a lossy export by the time it is received, as far as I am concerned.
From GBIF's point of view there is no problem at all with using the full scientific name as it is.
In fact my preferred solution would be to only have to look into scientificName and nowhere else! Fewer options are better.
Also, nearly all datasets have a mix of canonical and "qualified" scientific names, so I am sure providers will find it hard to populate canonicalName only with canonicals and scientificName only with names with authorship. I bet in the end we would still have to check for all options, dealing with canonicals in scientificName and potential inconsistencies between canonicalName + authorship and scientificName. It would also be harder to define a single required term. If I supply the canonicalName already, do I still have to populate the scientificName? Even if I only have the canonical? If I have a non-parsed full name, how will I be able to fill the canonical? From my point of view it's not getting any easier.
Markus
-----Original Message----- From: Richard Pyle [mailto:deepreef@bishopmuseum.org] Sent: Wednesday, 24 November 2010 8:53 AM To: 'Chuck Miller'; Rees, Tony (CMAR, Hobart); dremsen@gbif.org Cc: tdwg-content@lists.tdwg.org; dmozzherin@eol.org Subject: RE: [tdwg-content] [tdwg-tag] Inclusion of authorship in DwC scientificName: good or bad?
What is the specific objection to adding canonicalName to DwC as an optional element, other than the fact it makes DwC one thing larger?
I don't have an objection to it per se, but I'd like to feel more certain that I understand exactly what it is, and what it is intended to achieve, that is not already achievable with existing terms and/or couldn't be more achievable with an alternative solution. I think there is value in avoiding feature-creep with DwC, except when we can solve a real problem with the existing terms. I agree there is a problem there, but I'm still struggling to understand exactly what specific problem that something like canonicalName will solve.
There are databases which do not have their names parsed and provide whatever they have recorded as ScientificName. But, there are also databases which do have parsed names and could provide this more narrowly defined element, in addition to the ScientificName. Those databases could make use of a dwc:canonicalName element in their data exchange or query response.
Right -- but the point is this: if the data are already parsed, where is the failure of the existing DwC terms in providing the desired service? We've already identified one of those: i.e., that "intermediate" uninomial ranks not supported by existing DwC terms don't have a place to put the canonical form of the name (other than scientificName, which isn't currently intended or required to be canonical). So yes, that's a clear problem in need of a solution. But is a generic canonicalName term really going to solve that efficiently/effectively? What other problems might canonicalName solve?
What we don't have and I think never will have is perfectly consistent names data from every database in the world. One reason is a mountain of inconsistently recorded legacy data from decades past that stands in the way of perfection. Another is variation in convention or tradition for a variety of reasons that have been explored in these recent threads. So, I think the pragmatic approach is to accept the inconsistencies and work around them.
Agreed! And my questions are:
1. What specific problems with existing DwC do we wish to solve?
2. How best to solve them?
I'll list two examples for #1:
A) Representing the canonical (sans-authorship) form of a uninomial name at a rank not already represented by existing rank-specific DwC terms (kingdom, phylum, class, order, family, genus).
Because the current definition of dwc:scientificName allows (optionally) the inclusion of authorship information, there is no clean way to represent a uninomial name in a way that expressly excludes authorship -- except if the uninomial name happens to be represented at the rank of kingdom, phylum, class, order, family, or genus.
B) Content providers who have authorship data in a separate field from taxon name data, but who have not parsed the bits of a taxon name string.
In this case, the provider cannot provide the parsed bits of the name, but can provide a (sort of) canonicalName string separately from an authorship string. If they concatenate the authorship string with the taxon name string when populating dwc:scientificName, then the consumer has no easy way of extracting the name bits from the authorship bits (unless the provider also provides dwc:scientificNameAuthorship, which could be removed exactly from the dwc:scientificName value, yielding what the provider would otherwise have provided as canonicalName). Or, as David suggested, in this case the Authorship text would not be concatenated with scientificName.
I would like to know some other problems that could be solved with the addition of a canonicalName term before I start commenting on #2.
Aloha, Rich
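The parenthetical in example B, where a supplied dwc:scientificNameAuthorship is removed exactly from dwc:scientificName, might look like the following minimal sketch. The function name `canonical_from` is hypothetical, and the sketch assumes the authorship appears verbatim at the end of the name string. It would not handle names where authorship sits in the middle, such as a species author followed by an infraspecific epithet.

```python
def canonical_from(scientific_name, authorship):
    """Strip a trailing authorship string from scientificName, if present.

    Assumes the authorship is repeated verbatim at the end of the name;
    otherwise the name is returned unchanged.
    """
    name = scientific_name.strip()
    author = (authorship or "").strip()
    if author and name.endswith(author):
        return name[: -len(author)].rstrip()
    return name


if __name__ == "__main__":
    print(canonical_from("Puma concolor (Linnaeus 1771)", "(Linnaeus 1771)"))
```

If the provider did not concatenate the authorship in the first place, the function returns the name as supplied, so a consumer could apply it to both kinds of record.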
what you said. yeah, that's what I meant.
On Nov 24, 2010, at 1:20 PM, Markus Döring wrote:
B) ...If they concatenate the authorship string with the taxon name string when populating dwc:scientificName, then the consumer has no easy way of extracting the name bits from the authorship bits
Exactly! Wearing my data consumer hat, the first thing I need to do with current dwc:scientificName content from multiple sources is try to generate canonical names by stripping off what appear to be authorities (hopefully successfully, but that is not guaranteed). If there were an extra field populated in all or even a subset of cases, this task would not be required.
So, I think the main driver for this has to come from the large-scale data consumers - GBIF, OBIS (with which I am associated), EOL, ALA etc. - whether they would find such a field useful is the real test. In my other incarnation as a data supplier, I can concatenate everything into scientificName as per the present DwC spec, no problem; it is just a lossy export by the time it is received, as far as I am concerned.
From GBIF's point of view there is no problem at all with using the full scientific name as it is. In fact my preferred solution would be to only have to look into scientificName and nowhere else! Fewer options are better.
Also, nearly all datasets have a mix of canonical and "qualified" scientific names, so I am sure providers will find it hard to populate canonicalName only with canonicals and scientificName only with names with authorship. I bet in the end we would still have to check for all options, dealing with canonicals in scientificName and potential inconsistencies between canonicalName + authorship and scientificName. It would also be harder to define a single required term. If I supply the canonicalName already, do I still have to populate the scientificName? Even if I only have the canonical? If I have a non-parsed full name, how will I be able to fill the canonical? From my point of view it's not getting any easier.
Markus
Rich, I gather your reason would be because it's unclear if anyone would actually use a canonicalName element? That is, it's unneeded. So, following on, who says they need a dwc:canonicalName element?
You said you worry about feature creep. I suppose I worry about semantic creep. Extending the meaning of a term makes it more universal, but in a data world it increases the variability of the data that may be found attached to the term in some dataset. Imprecision in terms can create a lot of data quality headaches. Is that acceptable?
Chuck
Hi all,
I just had a quick look at the first few thousand data records coming into OBIS for my region (Australia). Just about every supplier who includes authority as dwc:scientificNameAuthor has used dwc:scientificName "incorrectly" i.e., for the canonical name not the canonical name + author. This data then flows into GBIF, ALA, etc. and circulates in this form. So "users" are already ignoring the definition of dwc:scientificName in practice, it would seem, with no apparent ill effects (?) - not sure whether this is good or bad, hence the title of my original question which prompted this thread...
- Tony
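The kind of quick audit Tony describes could be sketched as below. This assumes each record is a dictionary keyed by DwC term names. The key `scientificNameAuthorship` follows the term as Rich writes it elsewhere in this thread, and the functions `classify` and `tally` are hypothetical helpers, not part of OBIS or GBIF tooling.

```python
from collections import Counter


def classify(record):
    """Label how a record uses scientificName relative to its authorship field."""
    name = (record.get("scientificName") or "").strip()
    author = (record.get("scientificNameAuthorship") or "").strip()
    if not author:
        return "no authorship supplied"
    if author in name:
        return "authorship included in scientificName"
    return "authorship only in separate field"


def tally(records):
    """Count records per usage pattern."""
    return Counter(classify(record) for record in records)


if __name__ == "__main__":
    sample = [
        {"scientificName": "Puma concolor", "scientificNameAuthorship": "(Linnaeus 1771)"},
        {"scientificName": "Puma concolor (Linnaeus 1771)", "scientificNameAuthorship": "(Linnaeus 1771)"},
        {"scientificName": "Puma concolor"},
    ]
    for label, count in tally(sample).items():
        print(label, count)
```

The "authorship only in separate field" bucket corresponds to the usage Tony calls "incorrect" under the current definition.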
Dave, The botanical folks often include the authors with their names. What do the data records coming into GBIF from herbarium collections look like? Do they mostly include or omit the authors in scientificName?
Chuck
-----Original Message----- From: Tony.Rees@csiro.au [mailto:Tony.Rees@csiro.au] Sent: Tuesday, November 23, 2010 5:09 PM To: Chuck Miller; deepreef@bishopmuseum.org Cc: dremsen@gbif.org; tdwg-content@lists.tdwg.org; dmozzherin@eol.org Subject: RE: [tdwg-content] [tdwg-tag] Inclusion of authorship in DwCscientificName: good or bad?
Hi all,
I just had a quick look at the first few thousand data records coming into OBIS for my region (Australia). Just about every supplier who includes authority as dwc:scientificNameAuthor has used dwc:scientificName "incorrectly" i.e., for the canonical name not the canonical name + author. This data then flows into GBIF, ALA, etc. and circulates in this form. So "users" are already ignoring the definition of dwc:scientificName in practice, it would seem, with no apparent ill effects (?) - not sure whether this is good or bad, hence the title of my original question which prompted this thread...
- Tony
-----Original Message----- From: Chuck Miller [mailto:Chuck.Miller@mobot.org] Sent: Wednesday, 24 November 2010 9:56 AM To: Richard Pyle Cc: Rees, Tony (CMAR, Hobart); dremsen@gbif.org; tdwg- content@lists.tdwg.org; dmozzherin@eol.org Subject: Re: [tdwg-content] [tdwg-tag] Inclusion of authorship in DwCscientificName: good or bad?
Rich, I gather your reason would be because it's unclear if anyone would actually use a canonicalName element? That is, it's unneeded. So, following on, who says they need a dwc:canonicalName element?
You said you worry about feature creep. I suppose I worry about semantic creep. Extending the meaning of a term makes it more universal, but in a data world it increases the variability of the data that may be found attached to the term in some dataset. Imprecision in terms can create a lot of data quality headaches. Is
that acceptable?
Chuck
On Nov 23, 2010, at 3:52 PM, "Richard Pyle" deepreef@bishopmuseum.org wrote:
What is the specific objection to adding canonicalName to DwC as an
optional element, other than the fact it makes DwC one thing larger?
I don't have an objection to it per se, but I'd like to feel more
certain
that I understand exactly what it is, and what it is intended to
achieve,
that is not already achievable with existing terms and/or couldn't be
more
achievable with an alternative solution. I think there is value in
avoiding
feature-creep with DwC, except when we can solve a real problem with
the existing terms. I agree there is a problem there, but I'm still
struggling
to understand exactly what specific problem that something like canonicalName will solve.
There are databases which do not have their names parsed and provide whatever they have recorded as ScientificName. But, there are also databases which do have parsed names and could provide this more narrowly defined element, in addition to the ScientificName. Those databases could make use of a dwc:canonicalName element in their data exchange or query response.
Right -- but the point is this: if the data are already parsed, where is
the
failure of the existing DwC terms in providing the desired service?
We've
already identified one of those: i.e., that "intermediate" uninomial
ranks
not supported by existing DwC terms don't have a place to put the
canonical
form of the name (other than scientificName, which isn't currently
intended
or required to be canonical). So yes, that's a clear problem in need
of
a
soultion. But is a generic canaonicalName term really going to solve
that
efficiently/effectively? What other problems might canonicalName
solve?
What we don't have and I think never will have is perfectly consistent names data from every database in the world. One reason
is a mountain of inconsistently recorded legacy data from decades past that stands in the way of perfection. Another is variation in convention or tradition for a variety of reasons that have been explored in these recent threads. So, I think the pragmatic approach is to accept the inconsistencies
and work around them.
Agreed! And my questions are:
- What specific problems with existing DwC do we wish to solve?
- How best to solve them?
I'll list two examples for #1:
A) Representing the canonical (sans-authorship) form of a uninomial name
at
a rank not already represented by existing rank-specific DwC terms
(kingdom,
phylum, class, order, family, genus) Because the current definition of dwc:scientificName allows (optionally)
the
inclusion of authorship information, there is no clean way to represent
a
uninomial name in a way that expressly excludes authorship -- except
if
the
uninomial name happens to be represented at the rank of kingdom, phylum, class, order, family, or genus.
B) Content providers who have authorship data in a separate field from taxon name data, but who have not parsed the bits of a taxon name string
In this case, the provider cannot provide the parsed bits of the name, but can provide a (sort of) canonicalName string separately from an authorship string. If they concatenate the authorship string with the taxon name string when populating dwc:scientificName, then the consumer has no easy way of extracting the name bits from the authorship bits (unless the provider also provides dwc:scientificNameAuthorship, which could be exactly removed from the dwc:scientificName value, yielding what the provider would have otherwise provided as canonicalName). Or, as David suggested, in this case the Authorship text would not be concatenated with scientificName.
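To make case B concrete, here is a minimal Python sketch of the "exact removal" idea: if the provider sends both dwc:scientificName and dwc:scientificNameAuthorship, a consumer can try to strip the authorship from the end of the name string. The function name `strip_authorship` and the sample author "Smith, 1900" are made up for illustration; the sketch assumes the authorship appears verbatim at the end of the name string, which will not hold for every dataset.

```python
from typing import Optional


def strip_authorship(scientific_name: str, authorship: str) -> Optional[str]:
    """Return scientific_name with a trailing authorship string removed.

    Returns None when the authorship is not found verbatim at the end,
    so the caller can fall back to a real name parser.
    """
    name = scientific_name.strip()
    author = authorship.strip()
    if not author:
        # Nothing to remove: the name is already without authorship.
        return name
    if name.endswith(author):
        return name[: -len(author)].rstrip()
    return None


# Hypothetical record: the authorship is supplied separately and also concatenated.
print(strip_authorship("Aus bus Smith, 1900", "Smith, 1900"))
```

When the two strings do not line up exactly (different spacing, abbreviations, parentheses), the function returns None rather than guessing.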
I would like to know some other problems that could be solved with the addition of a canonicalName term before I start commenting on #2.
Aloha, Rich
Chuck, we see all sorts of things you can imagine in scientificName. For occurrence records the vast majority is the canonical form though - with an empty scientificNameAuthorship. I'd think they mostly don't have the authorship information captured in their system.
Some recent statistics I did on the latest 269 million occurrence records for taxonomy can be seen here: http://code.google.com/p/gbif-occurrencestore/wiki/TaxonomicIntegration#Stat...
We have roughly 3.5 million distinct scientific names. Parsing them into their canonical form leaves only 2.1 million, only a few of them being monomials (95,000 names representing 14.3 million occurrence records).
Not surprisingly, zoological names often contain the year while botanical ones often contain the authorship. You will find four-part names and multiple authorships in the same name for different parts, e.g., a species authorship and a subspecies one.
Markus
On Nov 24, 2010, at 0:16, Chuck Miller wrote:
Dave, The botanical folks often include the authors with their names. What do the data records coming into GBIF from herbarium collections look like? Do they mostly include or omit the authors in scientificName?
Chuck
-----Original Message----- From: Tony.Rees@csiro.au [mailto:Tony.Rees@csiro.au] Sent: Tuesday, November 23, 2010 5:09 PM To: Chuck Miller; deepreef@bishopmuseum.org Cc: dremsen@gbif.org; tdwg-content@lists.tdwg.org; dmozzherin@eol.org Subject: RE: [tdwg-content] [tdwg-tag] Inclusion of authorship in DwCscientificName: good or bad?
Hi all,
I just had a quick look at the first few thousand data records coming into OBIS for my region (Australia). Just about every supplier who includes authority as dwc:scientificNameAuthor has used dwc:scientificName "incorrectly" i.e., for the canonical name not the canonical name + author. This data then flows into GBIF, ALA, etc. and circulates in this form. So "users" are already ignoring the definition of dwc:scientificName in practice, it would seem, with no apparent ill effects (?) - not sure whether this is good or bad, hence the title of my original question which prompted this thread...
- Tony
-----Original Message----- From: Chuck Miller [mailto:Chuck.Miller@mobot.org] Sent: Wednesday, 24 November 2010 9:56 AM To: Richard Pyle Cc: Rees, Tony (CMAR, Hobart); dremsen@gbif.org; tdwg- content@lists.tdwg.org; dmozzherin@eol.org Subject: Re: [tdwg-content] [tdwg-tag] Inclusion of authorship in DwCscientificName: good or bad?
Rich, I gather your reason would be because it's unclear if anyone would actually use a canonicalName element? That is, it's unneeded. So, following on, who says they need a dwc:canonicalName element?
You said you worry about feature creep. I suppose I worry about semantic creep. Extending the meaning of a term makes it more universal, but in a data world it increases the variability of the data that may be found attached to the term in some dataset. Imprecision in terms can create a lot of data quality headaches. Is that acceptable?
Chuck
On Nov 23, 2010, at 3:52 PM, "Richard Pyle" deepreef@bishopmuseum.org wrote:
What is the specific objection to adding canonicalName to DwC as an optional element, other than the fact it makes DwC one thing larger?
I don't have an objection to it per se, but I'd like to feel more certain that I understand exactly what it is, and what it is intended to achieve, that is not already achievable with existing terms and/or couldn't be more achievable with an alternative solution. I think there is value in avoiding feature-creep with DwC, except when we can solve a real problem with the existing terms. I agree there is a problem there, but I'm still struggling to understand exactly what specific problem that something like canonicalName will solve.
There are databases which do not have their names parsed and provide whatever they have recorded as ScientificName. But, there are also databases which do have parsed names and could provide this more narrowly defined element, in addition to the ScientificName. Those databases could make use of a dwc:canonicalName element in their data exchange or query response.
Right -- but the point is this: if the data are already parsed, where is the failure of the existing DwC terms in providing the desired service? We've already identified one of those: i.e., that "intermediate" uninomial ranks not supported by existing DwC terms don't have a place to put the canonical form of the name (other than scientificName, which isn't currently intended or required to be canonical). So yes, that's a clear problem in need of a solution. But is a generic canonicalName term really going to solve that efficiently/effectively? What other problems might canonicalName solve?
What we don't have and I think never will have is perfectly consistent names data from every database in the world. One reason is a mountain of inconsistently recorded legacy data from decades past that stands in the way of perfection. Another is variation in convention or tradition for a variety of reasons that have been explored in these recent threads. So, I think the pragmatic approach is to accept the inconsistencies and work around them.
Agreed! And my questions are:
- What specific problems with existing DwC do we wish to solve?
- How best to solve them?
I'll list two examples for #1:
A) Representing the canonical (sans-authorship) form of a uninomial name at a rank not already represented by existing rank-specific DwC terms (kingdom, phylum, class, order, family, genus)
Because the current definition of dwc:scientificName allows (optionally) the inclusion of authorship information, there is no clean way to represent a uninomial name in a way that expressly excludes authorship -- except if the uninomial name happens to be represented at the rank of kingdom, phylum, class, order, family, or genus.
B) Content providers who have authorship data in a separate field from taxon name data, but who have not parsed the bits of a taxon name string
In this case, the provider cannot provide the parsed bits of the name, but can provide a (sort of) canonicalName string separately from an authorship string. If they concatenate the authorship string with the taxon name string when populating dwc:scientificName, then the consumer has no easy way of extracting the name bits from the authorship bits (unless the provider also provides dwc:scientificNameAuthorship, which could be exactly removed from the dwc:scientificName value, yielding what the provider would have otherwise provided as canonicalName). Or, as David suggested, in this case the Authorship text would not be concatenated with scientificName.
I would like to know some other problems that could be solved with the addition of a canonicalName term before I start commenting on #2.
Aloha, Rich
tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content
Markus, Very good. I was curious though about the difference in sciname formation between the plant and animal kingdoms. Do you have those counts?
Thanks, Chuck
Not immediately at hand, I am afraid. I'll see what I can do and post them later.
Markus
On Nov 24, 2010, at 14:38, Chuck Miller wrote:
Markus, Very good. I was curious though about the difference in sciname formation between the plant and animal kingdoms. Do you have those counts?
Thanks, Chuck
Markus,
Thanks for the info. I'm slowly getting the impression that there is not a huge appetite for adding canonicalName although it appears folk are supplying this in many cases anyway, with or without separate authorship element populated.
One other thing did occur to me. I am thinking there are actually a couple of different use cases here. One is for occurrence data which then has to be matched against a reference taxonomy, including all sorts of rough stuff. The other is for formatting and transfer of the reference taxonomies themselves (I do both at different times, also you ingest both too). Does that have any impact on the choices / decisions to make here?
Other than that I'll probably just shut up now :)
Cheers - Tony
________________________________________ From: tdwg-content-bounces@lists.tdwg.org [tdwg-content-bounces@lists.tdwg.org] On Behalf Of "Markus Döring (GBIF)" [mdoering@gbif.org] Sent: Wednesday, 24 November 2010 11:56 PM To: Chuck Miller Cc: tdwg-content@lists.tdwg.org; dmozzherin@eol.org Subject: Re: [tdwg-content] [tdwg-tag] Inclusion of authorship in DwCscientificName: good or bad?
Chuck, we see all sorts of things you can imagine in scientificName. For occurrence records the vast majority is the canonical form though - with an empty scientificNameAuthorship. I'd think they mostly don't have the authorship information captured in their system.
Some recent statistics I did on the latest 269 million occurrence records for taxonomy can be seen here: http://code.google.com/p/gbif-occurrencestore/wiki/TaxonomicIntegration#Stat...
We have roughly 3.5 million distinct scientific names. Parsing them into their canonical form leaves only 2.1 million, only a few of them being monomials (95,000 names representing 14.3 million occurrence records).
Not surprisingly, zoological names often contain the year while botanical ones often contain the authorship. You will find four-part names and multiple authorships in the same name for different parts, e.g., a species authorship and a subspecies one.
Markus
One other thing did occur to me. I am thinking there are actually a couple of different use cases here. One is for occurrence data which then has to be matched against a reference taxonomy, including all sorts of rough stuff. The other is for formatting and transfer of the reference taxonomies themselves (I do both at different times, also you ingest both too). Does that have any impact on the choices / decisions to make here?
I think this is a *key* point. DwC (and DwCA) are intended as a vehicle for transfer of names data both as *attributes* of the "real" objects (e.g., occurrences), and as "real" objects unto themselves (as reference taxonomies). Obviously, we'd expect much more noise from the former than the latter, and we can imagine services within GNA to join datasets of the former type to the latter type.
I suspect that the occurrence datasets would rely much more heavily on the "verbatimScientificName" term, whereas the reference taxonomies would lean more towards the parsed sets (although "verbatimScientificName" is still a very useful concept for the reference taxonomies, if those datasets are providing name-usage instances, and wish to provide both the verbatim representation of the name, and a parsed/canonical "code-corrected" representation of the name).
Other than that I'll probably just shut up now :)
Please don't! My mind is just now starting to get around the essence of this issue. I think it's been a very helpful conversation!
Aloha, Rich
I just had a quick look at the first few thousand data records coming into OBIS for my region (Australia). Just about every supplier who includes authority as dwc:scientificNameAuthor has used dwc:scientificName "incorrectly" i.e., for the canonical name not the canonical name + author. This data then flows into GBIF, ALA, etc. and circulates in this form. So "users" are already ignoring the definition of dwc:scientificName in practice, it would seem, with no apparent ill effects (?) - not sure whether this is good or bad, hence the title of my original question which prompted this thread...
OK, so here's the question:
Is it more disruptive to re-define dwc:scientificName to explicitly exclude authorship?
Or, is it more disruptive to leave the existing (loose) definition of scientificName intact, and create more term(s) with more precise meanings, which we feel can help facilitate sharing of information?
Rich
Is it more disruptive to re-define dwc:scientificName to explicitly exclude authorship?
That's definitely something I'd like to avoid! We really need one place to keep the most explicit form of the name.
From seeing real data coming in I would coin the definition for scientificName that it should *contain the most complete, verbatim name string*.
If you happen to have only a canonical, use the canonical. If you happen to have canonical + authorship parsed, join them if you can (it's usually not a simple concatenation, beware).
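A rough Python sketch of that rule, covering only the simplest case: use the canonical string when that is all there is, and append the authorship when it is held separately. The function name `most_complete_name` and the sample values are invented for illustration. It deliberately does not handle the harder cases warned about above (for example, authorships attached to different parts of the same name), where a plain join would give the wrong string.

```python
def most_complete_name(canonical: str, authorship: str = "") -> str:
    """Build the fullest available name string for scientificName.

    Simplest case only: the authorship, if any, belongs at the end.
    """
    # Collapse stray whitespace so the joined string is predictable.
    canonical = " ".join(canonical.split())
    authorship = " ".join(authorship.split())
    if not authorship:
        return canonical
    if canonical.endswith(authorship):
        # The stored name already carries the authorship; do not repeat it.
        return canonical
    return f"{canonical} {authorship}"


print(most_complete_name("Aus bus"))
print(most_complete_name("Aus bus", "Smith, 1900"))
```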
Markus
Or, is it more disruptive to leave the existing (loose) definition of scientificName intact, and create more term(s) with more precise meanings, which we feel can help facilitate sharing of information?
Rich
The basic problem, I think, is that neither the Codes, nor the vast majority of taxonomists, regard the Authorship details as being part of a "scientific name". Hence, the inclusion of name-bits and authorship bits in the same text string does not resonate well with the term "scientificName". I think most of us in the taxonomy world would be willing to overlook this semantic dissonance, except for the fact that "scientificName" is so crucial to any DwC or DwCA record by virtue of it being required (Is this true, BTW? I don't see anywhere in the DwC documentation where this requirement is stated explicitly).
I think it's also true that it's much easier and more reliable to concatenate [scientificName]+' '+[scientificNameAuthorship] at the client side, than it is to parse something like "Aus bus deBruen" into [genus] | [specificEpithet] | [scientificNameAuthorship]; especially when "deBruen" might be misinterpreted as an infraspecificEpithet.
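To illustrate the asymmetry, here is a small Python sketch. The parser is intentionally naive and is not any real name-parsing library; its one rule (a third token starting with a lowercase letter is an infraspecific epithet) is an assumption chosen to show how "deBruen" can be misread, while the concatenation in the other direction is trivial.

```python
def naive_parse(name_string: str) -> dict:
    """Split a name string into DwC-style fields using a deliberately naive rule."""
    tokens = name_string.split()
    fields = {
        "genus": None,
        "specificEpithet": None,
        "infraspecificEpithet": None,
        "scientificNameAuthorship": None,
    }
    if tokens:
        fields["genus"] = tokens[0]
    if len(tokens) > 1:
        fields["specificEpithet"] = tokens[1]
    rest = tokens[2:]
    # Naive rule: a lowercase-initial third token is taken as an epithet.
    if rest and rest[0][0].islower():
        fields["infraspecificEpithet"] = rest[0]
        rest = rest[1:]
    if rest:
        fields["scientificNameAuthorship"] = " ".join(rest)
    return fields


# Parsing goes wrong: the author "deBruen" lands in infraspecificEpithet.
print(naive_parse("Aus bus deBruen"))

# Concatenating the separately held parts is straightforward.
print("Aus bus" + " " + "deBruen")
```

A real parser would use more context than this, but the ambiguity itself comes from the string, not from the parser.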
So, forgetting the existing DwC terms for a moment, I think what we ultimately want is the ability to pass a complete/verbatim name-string, and also pass the parsed bits. As already stated, it's much easier to generate the former from the latter, than the other way around; thus, to be provider-friendly, if either is required, it should be the complete/verbatim version. So, what I think we need is something like:
verbatimScientificName: As I suggested in an earlier post, this would be "the complete set of textual elements useful for recognizing a unique scientific name", exactly as they appear in the original source.
uninomialNameElement: Used for all names at the rank of genus and above; would also replace "genus" in DwC.
infragenericNameElement: Better term for "subgenus".
specificEpithet: As in existing DwC.
infraspecificEpithet: As in existing DwC.
scientificNameAuthorship: As in existing DwC.
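As a sketch of how a record using the term set above might look in code, here is a small Python dataclass. The field names follow the proposed terms; the class name `ParsedName`, the `canonical()` helper, and the choice to write the infrageneric element in parentheses (a zoological convention) are my own assumptions for illustration, not part of DwC.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParsedName:
    # Only the verbatim string is required; the parsed bits are optional.
    verbatimScientificName: str
    uninomialNameElement: Optional[str] = None
    infragenericNameElement: Optional[str] = None
    specificEpithet: Optional[str] = None
    infraspecificEpithet: Optional[str] = None
    scientificNameAuthorship: Optional[str] = None

    def canonical(self) -> str:
        """Join whichever parsed name elements are present, without authorship."""
        parts = [self.uninomialNameElement]
        if self.infragenericNameElement:
            # Assumed formatting: infrageneric element in parentheses.
            parts.append(f"({self.infragenericNameElement})")
        parts += [self.specificEpithet, self.infraspecificEpithet]
        return " ".join(p for p in parts if p)


name = ParsedName(
    verbatimScientificName="Aus bus xus Smith, 1900",  # hypothetical source string
    uninomialNameElement="Aus",
    specificEpithet="bus",
    infraspecificEpithet="xus",
    scientificNameAuthorship="Smith, 1900",
)
print(name.canonical())
```

The verbatim string and the joined parsed bits can legitimately differ, which is the point of carrying both.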
I don't really agree with Tony on the "clutter" argument for introducing a single "canonicalName" term to replace the parsed uninomialNameElement [aka "genus"], infragenericNameElement [aka "subgenus"], specificEpithet, and infraspecificEpithet. (Side question to Tony -- would canonicalName include "var.", "f." etc., hence obviating the need for TaxonRank as well?) After all, "Aus bus xus" requires exactly the same number of bytes as "Aus,bus,xus" in a DwCA file. Of course, if verbatimScientificName [aka scientificName] is required, we'd have redundancy and hence doubling of bytes. However, if defined as verbatimScientificName as above, it would not really be redundant information if the parsed bits were defined as representing the Code-corrected version of the name, and due to the fact that the verbatimScientificName will often be different from a canonical concatenation of the parsed bits according to some standard format/formula.
So, to me, the main questions to answer are:
1) How does the existing DwC/DwCA structure fail to meet the needs of providers and/or users, in terms of loss of information, potential for misrepresentation of information, or inefficient or ineffective transfer of information (i.e. overburdening either the provider or the client)?
2) What are the most effective and least disruptive ways to correct the failures identified in #1 above, in terms of re-defining existing terms, vs. introducing new (and potentially redundant) terms, vs. a complete new set of terms that may be semantically less confusing to taxonomists (as above)?
Aloha, Rich
-----Original Message----- From: Markus Döring [mailto:m.doering@mac.com] Sent: Wednesday, November 24, 2010 2:30 AM To: Richard Pyle Cc: Tony.Rees@csiro.au; Chuck.Miller@mobot.org; tdwg-content@lists.tdwg.org; dmozzherin@eol.org Subject: Re: [tdwg-content] [tdwg-tag] Inclusion of authorship in DwC scientificName: good or bad?
I just had a quick look at the first few thousand data records coming into OBIS for my region (Australia). Just about every supplier who includes authority as dwc:scientificNameAuthor has used dwc:scientificName "incorrectly" i.e., for the canonical name not the canonical name + author. This data then flows into GBIF, ALA, etc. and circulates in this form. So "users" are already ignoring the definition of dwc:scientificName in practice, it would seem, with no apparent ill effects (?) - not sure whether this is good or bad, hence the title of my original question which prompted this thread...
OK, so here's the question:
Is it more disruptive to re-define dwc:scientificName to explicitly exclude authorship?
That's definitely something I'd like to avoid! We really need one place to keep the most explicit form of the name.
From seeing real data coming in, I would coin the definition for scientificName so that it should *contain the most complete, verbatim name string*. If you happen to have only a canonical, use the canonical. If you happen to have canonical + authorship parsed, join them if you can (it's usually not a simple concatenation, beware).
Markus
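To illustrate why joining canonical name and authorship is not always a plain concatenation, here is a minimal sketch. It covers only the plain case and one well-known exception, the botanical autonym, where the author citation follows the species epithet rather than the repeated infraspecific epithet. The function name and its parameters are hypothetical, and real data will need many more rules than this.

```python
def join_name(canonical, authorship, rank_marker=None):
    """Join a canonical name and its authorship into one name string.

    Illustrative only: handles the plain case and the botanical autonym.
    rank_marker (e.g. "var.") is only supplied for botanical-style names.
    """
    if not authorship:
        return canonical
    words = canonical.split()
    if rank_marker and len(words) == 3:
        genus, species, infra = words
        if species == infra:
            # Autonym: the authorship belongs to the species, not the infraspecies.
            return f"{genus} {species} {authorship} {rank_marker} {infra}"
        return f"{genus} {species} {rank_marker} {infra} {authorship}"
    # Plain case: authorship is simply appended.
    return f"{canonical} {authorship}"


print(join_name("Philander opossum", "Linnaeus, 1758"))
print(join_name("Aus bus bus", "L.", "var."))
print(join_name("Aus bus xus", "Smith", "var."))
```

The second call yields "Aus bus L. var. bus", which a naive `canonical + " " + authorship` would get wrong.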
Or, is it more disruptive to leave the existing (loose) definition of scientificName intact, and create more term(s) with more precise meanings, which we feel can help facilitate sharing of information?
Rich
tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content
I think we are hitting the problem now quite well. A quick response to your suggestion, Rich. Mostly agree with your conclusions:
verbatimScientificName As I suggested in an earlier post, this would be "the complete set of textual elements useful for recognizing a unique scientific name", exactly as they appear in the original source.
Yes for the definition, but I'm not sure that removing scientificName from the DwC terms is a real option. It's the best-known term of all...
uninomialNameElement Used for all names at the rank of genus and above; would also replace "genus" in DwC.
Genus will still be needed to represent the denormalised classification, but not for the parsed bits.
infragenericNameElement Better term for "subgenus".
Probably same is true for subgenus
specificEpithet As in existing DwC.
infraspecificEpithet As in existing DwC.
scientificNameAuthorship As in existing DwC.
I guess I need to do some more DwC homework.
Is the Genus element in current DwC to be used for the generic epithet of a binomial/trinomial, or for the Genus classification/rank of the species? If the Genus classification/rank and the generic epithet are different, does current DwC have a way to distinguish them?
Isn't an infragenericNameElement just another uninomialNameElement, like Tribe, Subtribe, Subfamily, and all the rest (i.e. a name at the rank of genus and above)?
A generic way to handle this would be firstEpithet, secondEpithet, thirdEpithet, etc. By using a string like infraspecificEpithet, an attribute is added to the position of the epithet. Or by using the string uninomialNameElement, a different attribute is added to the firstEpithet (which is the only epithet in that case). I'm probably overthinking it.
Chuck
Hi folks,
I'm trying to stay quiet now, obviously with no success...
Rich wrote:
I don't really agree with Tony on the "clutter" argument for introducing a single "canonicalName" term to replace the parsed uninomialNameElement [aka "genus"], infragenericNameElement [aka "subgenus"], specificEpithet, and infraspecificEpithet. (Side question to Tony -- would canonicalName include "var.", "f." etc., hence obviating the need for TaxonRank as well?)
Answers:
1. It's really a question for the data receivers. I.e., which of these is more efficient to transfer/ingest/parse - based on a consistent data structure across all ranks:
Either this (12 elements to ingest and parse):
dwc:taxonID=10400156
dwc:parentNameUsageID=10400152
dwc:scientificName=Philander opossum Linnaeus, 1758
dwc:canonicalName=Philander opossum
dwc:scientificNameAuthorship=Linnaeus, 1758
dwc:taxonRank=species
dwc:taxonomicStatus=valid
dwc:nomenclaturalCode=ICZN
dwc:namePublishedIn=Syst. Nat., 10th ed., 1: 55.
dwc:taxonRemarks=Corbet and Hill (1980), Hall (1981), Husson (1978), and Pine (1973) used Metachirops opossum for this species. Reviewed by Castro-Arellano et al. (2000, Mammalian Species, 638). The name D. larvata Jentink, 1888, is a nomen nudum. Didelphis opossum Linnaeus, 1758, is the type species for Holothylax Cabrera, 1919.
dwc:vernacularName=Gray Four-eyed Opossum
dc:source=http://www.bucknell.edu/msw3/browse.asp?id=10400156
Or this (18 elements to ingest and parse):
dwc:taxonID=10400156
dwc:parentNameUsageID=10400152
dwc:scientificName=Philander opossum Linnaeus, 1758
dwc:genus=Philander opossum
dwc:species=Philander
dwc:scientificNameAuthorship=Linnaeus, 1758
dwc:taxonRank=species
dwc:family=
dwc:order=
dwc:class=
dwc:phylum=
dwc:kingdom=
dwc:taxonomicStatus=valid
dwc:nomenclaturalCode=ICZN
dwc:namePublishedIn=Syst. Nat., 10th ed., 1: 55.
dwc:taxonRemarks=Corbet and Hill (1980), Hall (1981), Husson (1978), and Pine (1973) used Metachirops opossum for this species. Reviewed by Castro-Arellano et al. (2000, Mammalian Species, 638). The name D. larvata Jentink, 1888, is a nomen nudum. Didelphis opossum Linnaeus, 1758, is the type species for Holothylax Cabrera, 1919.
dwc:vernacularName=Gray Four-eyed Opossum
dc:source=http://www.bucknell.edu/msw3/browse.asp?id=10400156
(Now repeat for each of the remaining 2m or so rows)
As stated before, I can generate either format, but the first is more concise for the receiver. (maybe this is not a killer reason though). Of course the parser settings to either generate or just upload canonicalName would be different for the two cases.
Noting that the higher ranks are blank in this example, however they will be needed for at least some other records so they have to be there when passed as DwCA (though not as XML I guess). Also noting that the in-between ranks subfamily/infraorder/subphylum etc. do not have corresponding pre-named elements at this time.
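The two layouts above can be compared with a rough sketch like the following. It is an illustration under stated assumptions: the records are modelled as Python dicts, the long taxonRemarks text is replaced by a short stand-in, genus and specificEpithet are filled as presumably intended, and `row_bytes` is a hypothetical helper that measures one comma-delimited row.

```python
import csv
import io

# Fields shared by both layouts (taxonRemarks shortened for the example).
common = {
    "dwc:taxonID": "10400156",
    "dwc:parentNameUsageID": "10400152",
    "dwc:scientificName": "Philander opossum Linnaeus, 1758",
    "dwc:scientificNameAuthorship": "Linnaeus, 1758",
    "dwc:taxonRank": "species",
    "dwc:taxonomicStatus": "valid",
    "dwc:nomenclaturalCode": "ICZN",
    "dwc:namePublishedIn": "Syst. Nat., 10th ed., 1: 55.",
    "dwc:taxonRemarks": "(remarks omitted here)",
    "dwc:vernacularName": "Gray Four-eyed Opossum",
    "dc:source": "http://www.bucknell.edu/msw3/browse.asp?id=10400156",
}

# Layout 1: a single canonicalName element.
combined = {**common, "dwc:canonicalName": "Philander opossum"}

# Layout 2: parsed parts plus the (here empty) higher-rank columns.
parsed = {
    **common,
    "dwc:genus": "Philander",
    "dwc:specificEpithet": "opossum",
    "dwc:family": "",
    "dwc:order": "",
    "dwc:class": "",
    "dwc:phylum": "",
    "dwc:kingdom": "",
}


def row_bytes(record):
    """Size in bytes of the record written as one CSV data row."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerow(record.values())
    return len(buf.getvalue().encode("utf-8"))


combined_size = row_bytes(combined)
parsed_size = row_bytes(parsed)

print(len(combined), len(parsed))    # 12 18
print(parsed_size - combined_size)   # extra bytes in the parsed row
```

On this single row the element count differs by six, but the byte difference is only the delimiters for the five empty rank columns; whether that matters over a couple of million rows is the judgment call being discussed.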
To this:
(Side question to Tony -- would canonicalName include "var.", "f." etc., hence obviating the need for TaxonRank as well?)
I was hoping you would not ask that!!
I think that canonical names in botany, but not zoology (I don't know about prokaryotes; probably these are like botany?), would keep the infraspecies marker(s) in there, as they are required by the relevant Code (sorry to bring that up again), but I would be happy either way - maybe this has been discussed and resolved elsewhere earlier, e.g. in old Linnean Core/TCS discussions. Personally, if there is a rank element there, I would like to see it filled in all cases for consistency.
A question back: for "genus (subgenus) species" names as commonly found in some groups (molluscs, crustaceans come to mind), is the subgenus omitted to produce the canonical name? I imagine it would, since it is an indicator of taxonomic placement, not a part of the name, but would be happy to hear that confirmed.
Can I stop now?
Cheers - Tony
Sorry, typo in my second example just posted, for
dwc:genus=Philander opossum
Read:
dwc:genus=Philander
(Of course)
- Tony
Hi Tony,
1. It's really a question for the data receivers. I.e., which of these is more efficient to transfer/ingest/parse - based on a consistent data structure across all ranks:
Either this (12 elements to ingest and parse):
dwc:taxonID=10400156
dwc:parentNameUsageID=10400152
dwc:scientificName=Philander opossum Linnaeus, 1758
dwc:canonicalName=Philander opossum
dwc:scientificNameAuthorship=Linnaeus, 1758
dwc:taxonRank=species
dwc:taxonomicStatus=valid
dwc:nomenclaturalCode=ICZN
dwc:namePublishedIn=Syst. Nat., 10th ed., 1: 55.
dwc:taxonRemarks=Corbet and Hill (1980), Hall (1981), Husson (1978), and Pine (1973) used Metachirops opossum for this species. Reviewed by Castro-Arellano et al. (2000, Mammalian Species, 638). The name D. larvata Jentink, 1888, is a nomen nudum. Didelphis opossum Linnaeus, 1758, is the type species for Holothylax Cabrera, 1919.
dwc:vernacularName=Gray Four-eyed Opossum
dc:source=http://www.bucknell.edu/msw3/browse.asp?id=10400156
Or this (18 elements to ingest and parse):
dwc:taxonID=10400156
dwc:parentNameUsageID=10400152
dwc:scientificName=Philander opossum Linnaeus, 1758
dwc:genus=Philander opossum
dwc:species=Philander
dwc:scientificNameAuthorship=Linnaeus, 1758
dwc:taxonRank=species
dwc:family=
dwc:order=
dwc:class=
dwc:phylum=
dwc:kingdom=
dwc:taxonomicStatus=valid
dwc:nomenclaturalCode=ICZN
dwc:namePublishedIn=Syst. Nat., 10th ed., 1: 55.
dwc:taxonRemarks=Corbet and Hill (1980), Hall (1981), Husson (1978), and Pine (1973) used Metachirops opossum for this species. Reviewed by Castro-Arellano et al. (2000, Mammalian Species, 638). The name D. larvata Jentink, 1888, is a nomen nudum. Didelphis opossum Linnaeus, 1758, is the type species for Holothylax Cabrera, 1919.
dwc:vernacularName=Gray Four-eyed Opossum
dc:source=http://www.bucknell.edu/msw3/browse.asp?id=10400156
(Now repeat for each of the remaining 2m or so rows)
Yes, but five of the additional fields are empty in most cases, so there is not really that much savings in terms of bytes. Also, there is an increased client-side cost in terms of reliably parsing canonicalName in the first example.
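The client-side cost mentioned here can be pictured with a deliberately naive sketch of what a receiver would have to do to recover the parsed parts from a canonicalName string. The function name, the output keys and the short list of rank markers are all assumptions made for illustration; a production parser would need far more cases.

```python
# Assumed, incomplete list of infraspecific rank markers.
RANK_MARKERS = {"subsp.", "var.", "subvar.", "f."}


def split_canonical(canonical):
    """Naively split a canonical name string into parsed parts."""
    parts = {
        "genus": None,
        "subgenus": None,
        "specificEpithet": None,
        "rankMarker": None,
        "infraspecificEpithet": None,
    }
    tokens = canonical.split()
    if not tokens:
        return parts
    # First token is treated as the genus, even for a higher-rank uninomial.
    parts["genus"] = tokens.pop(0)
    # Optional "(Subgenus)" in parentheses.
    if tokens and tokens[0].startswith("(") and tokens[0].endswith(")"):
        parts["subgenus"] = tokens.pop(0)[1:-1]
    if tokens:
        parts["specificEpithet"] = tokens.pop(0)
    # Optional rank marker such as "var." before the last epithet.
    if tokens and tokens[0] in RANK_MARKERS:
        parts["rankMarker"] = tokens.pop(0)
    if tokens:
        parts["infraspecificEpithet"] = tokens.pop(0)
    return parts


print(split_canonical("Philander opossum"))
print(split_canonical("Aus (Bus) cus var. dus"))
```

Even this toy version has to guess: a family name passed in as a uninomial ends up labelled as a genus, which is exactly the kind of ambiguity that separately supplied parsed elements avoid.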
Incidentally, I assume you meant:
dwc:genus=Philander
dwc:specificEpithet=opossum
in the second example above?
Also noting that the in-between ranks subfamily/infraorder/subphylum etc. do not have corresponding pre-named elements at this time.
Agreed -- that was my #1 reason why the existing set of terms is "broken", in that there is no way to provide canonical versions of records of intermediate ranks, unless scientificName is used and its definition violated by excluding scientificNameAuthorship.
To this:
(Side question to Tony -- would canonicalName include "var.", "f." etc., hence obviating the need for TaxonRank as well?)
I was hoping you would not ask that!!
:-)
I think that canonical names in botany, but not zoology (I don't know about prokaryotes; probably these are like botany?), would keep the infraspecies marker(s) in there, as they are required by the relevant Code (sorry to bring that up again), but I would be happy either way - maybe this has been discussed and resolved elsewhere earlier, e.g. in old Linnean Core/TCS discussions. Personally, if there is a rank element there, I would like to see it filled in all cases for consistency.
Agreed! Consistency trumps accuracy!
A question back: for "genus (subgenus) species" names as commonly found in some groups (molluscs, crustaceans come to mind), is the subgenus omitted to produce the canonical name? I imagine it would, since it is an indicator of taxonomic placement, not a part of the name, but would be happy to hear that confirmed.
Right -- that would need to be determined and explicitly indicated in a definition for canonicalName.
Can I stop now?
Sure, OK... I think I will too.
Aloha, Rich
verbatimScientificName As I suggested in an earlier post, this would be "the complete set of textual elements useful for recognizing a unique scientific name", exactly as they appear in the original source.
Yes for the definition, but I'm not sure that removing scientificName from the DwC terms is a real option. It's the best-known term of all...
Yes, I know and agree. I figured I'd take a stab at the "ideal" world first, then curb back to reality... :-)
The problem is, scientificName as it currently is defined is not exactly the same thing as verbatimScientificName. The liberal definition of scientificName is both its curse and its blessing: it makes the term very easy to accommodate from the perspective of the provider, but this same liberal definition *can* make it difficult for many clients. Many people do use it as a "verbatim" representation of a string blob in their database. Others generate "clean" concatenated name-string values from their parsed databases. Many, as Tony pointed out, do not include Authorship, even though they have Authorship information (as represented in scientificNameAuthorship). One golden rule of data management that I often tell people is that it's often better to be consistent than correct. That is, something that's consistently incorrect can be corrected easily. But something that is inconsistently correct is often harder to deal with. In the case of scientificName, different people have different ideas of what "should be", but I think the only "correct" answer is the one described in the term definition.
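One way a client might check how consistently a supplier uses scientificName is sketched below. This is a rough illustration, not an agreed procedure: the function name, the three category labels and the sample records are assumptions, and the test is a simple "does the name string end with the supplied authorship" comparison.

```python
from collections import Counter


def authorship_usage(record):
    """Classify how one record treats authorship in scientificName."""
    name = record.get("scientificName", "").strip()
    author = record.get("scientificNameAuthorship", "").strip()
    if not author:
        return "no authorship supplied"
    if name.endswith(author):
        return "authorship included"
    return "authorship omitted"


# Hypothetical sample of incoming records.
records = [
    {"scientificName": "Philander opossum Linnaeus, 1758",
     "scientificNameAuthorship": "Linnaeus, 1758"},
    {"scientificName": "Philander opossum",
     "scientificNameAuthorship": "Linnaeus, 1758"},
    {"scientificName": "Philander opossum",
     "scientificNameAuthorship": ""},
]

summary = Counter(authorship_usage(r) for r in records)
for label, count in summary.items():
    print(label, count)
```

A dataset that falls entirely into one category is easy to normalise in bulk; a mixed result is the "inconsistently correct" case that is harder to deal with.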
uninomialNameElement Used for all names at the rank of genus and above; would also replace "genus" in DwC.
Genus will still be needed to represent the denormalised classification, but not for the parsed bits.
Right -- you mean in the sense of Family, Order, Class, etc. Personally, I think it would be "ideal" to eliminate these individual fields and just use dwc:higherClassification for this purpose. People with normalised data can represent it properly via parentNameUsage[ID] -- with the understanding that all names with a rank lower than genus would include the genus name as uninomial.
There's no elegant solution to this, as far as I can tell.
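As a rough sketch of the normalised route mentioned above, a client could derive a higherClassification-style string by walking parentNameUsageID links. Everything in this example is assumed for illustration: the taxon IDs, the placeholder names, the "uninomial" key, and the " | " separator.

```python
# Hypothetical normalised records keyed by taxonID.
taxa = {
    "1": {"parentNameUsageID": None, "uninomial": "Aidae", "taxonRank": "family"},
    "2": {"parentNameUsageID": "1", "uninomial": "Aus", "taxonRank": "genus"},
    "3": {"parentNameUsageID": "2", "uninomial": "Aus",
          "specificEpithet": "bus", "taxonRank": "species"},
}


def higher_classification(taxon_id, taxa, separator=" | "):
    """List the names of all ancestors, highest rank first."""
    names = []
    seen = set()  # guards against accidental cycles in the parent links
    parent_id = taxa[taxon_id]["parentNameUsageID"]
    while parent_id is not None and parent_id not in seen:
        seen.add(parent_id)
        parent = taxa[parent_id]
        names.append(parent["uninomial"])
        parent_id = parent["parentNameUsageID"]
    return separator.join(reversed(names))


print(higher_classification("3", taxa))  # Aidae | Aus
```

This works for any number of intermediate ranks without needing a pre-named element for each one, at the cost of requiring the receiver to resolve the parent links.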
infragenericNameElement Better term for "subgenus".
Probably same is true for subgenus
I was just suggesting a better label for subgenus, so in this case it would mean exactly the same thing as subgenus does, just spelled differently. The reason the more general term is better than the rank-specific "subgenus" is to accommodate infrageneric Sections as well. Of course, we're screwed in the case where both a subgenus *and* a section are provided; but in that case I would be inclined to rely on a verbatim string to capture that.
At this point, though, I really don't have a good sense for how best to proceed.
Aloha, Rich
Quoting Rich Pyle:
At this point, though, I really don't have a good sense for how best to proceed.
Aloha, Rich
Maybe an answer would be to use TCS not DwC for exchange of purely taxonomic data? How about creating a TCSA format for bulk transfer - or is this not a great thought... (not being that familiar with TCS)
One problem is that (e.g.) it is often desired to include some non-taxonomic information along with the names e.g. distribution/habitat codes, etc.
Just an idea, don't know if it solves the residual DwC issues anyway,
Cheers - Tony
On 25/11/2010, at 12:03 PM, Tony.Rees@csiro.au wrote:
Maybe an answer would be to use TCS not DwC for exchange of purely taxonomic data? How about creating a TCSA format for bulk transfer - or is this not a great thought... (not being that familiar with TCS)
We had all sorts of hideous, intractable problems with TCS, and have abandoned it.
The main issue is that it is quite carefully closed. The only top-level element it defines is "DataSet" - you cannot validly include a single tcs:TaxonConcept element in some other xml - it has to be wrapped in a DataSet and a TaxonConcepts element. Most of its types are defined as in-line anonymous blocks, so you cannot say "my ibis Foo element is of type tcs:TaxonConceptType".
The reason for this, I think, is that one of the ways a TCS element can refer to another is by a "local" reference, which only makes sense inside a coherent dataset. The way TCS is built, you will never see a valid TCS XML element with a local reference unless that reference is inside a TCS DataSet. Which sounds like a good idea at first.
TCS has no extension mechanism other than the "provider specific data" sections. You cannot validly put attributes or elements from other namespaces inside TCS elements except in the ProviderSpecificData section. Now ... this would almost be ok if TCS was very complete. But we kept coming up with things that didn't fit.
TCS, for instance, insists that TaxonRelationshipAssertion blocks have ids. Ours don't. Ok, so we can put our relationship assertion objects inside the ProviderSpecificData section of the taxon. Obviously, we can't use the TCS xml type for these blocks. But that means that we can't use any of the TCS attributes and sub-elements either. So we wound up building an XML schema that essentially paralleled TCS, using the same element names, but with the elements exposed as top-level elements where necessary.
We have many taxon relationship types that are not TCS types. Ok, so these extra relationships go in the provider-specific data section, described with in-house xml. Now ... there are two design choices here, both bad.
These relationships can be not included in the TCS relationships block at all. But that means that something that only understands TCS will simply not be aware of them. It also meant that there were slabs of duplicated code to generate the TCS relationship blocks "where relationship type in ...", and almost identical code differing only in that it uses the ibis xml namespace and "where relationship type NOT in ...". Horrible.
Alternatively, the relationships can be included in the TCS block with our relationship type mapped as closely as possible to a TCS one (needless to say, there's no TCS "other/unspecified" relationship type value), and extra data included elsewhere. But ... TaxonRelationship elements don't have a provider specific data section: only the enclosing TaxonConcept does. We could include all of the extra data for each relationship in an element in this section, but there's no way to map them across to the particular TCS relationship element that they are each about, because relationships that appear inside a taxon concept (ie: not relationship assertions) don't have ids, and there's no way to attach one.
Our taxon ranks don't always match the TCS ranks. So we had to map them across to TCS equivalents. But we still want to keep the ones we have - it's data - and so the verbatim rank goes into an element in the ProviderSpecificData section. Which would be ok ... except that now a TaxonName has two ranks, and the reason that it has two is that the TCS XML schema forces us to include one which is, well, wrong.
Eventually the ProviderSpecificData sections became so large and complex that we had to admit that it was a document in its own right. At which stage ... why TCS at all? As a wrapper, into which we can't fit the data values that we want, around some in-house XML with the good stuff?
We abandoned it. Sorry to go on at such length and detail, but I have such vivid memories of trying to make it work.
How nice it would have been if http://www.tdwg.org/schemas/tcs/1.01 had all its data types, especially the ComplexTypes, at the top level, and in a separate file, and the rest would use mainly those types to define a preferred TCS data set structure. Then there would have been \the/ TCS vocabulary, and \a/ TCS exchange structure, and you could reuse and extend the Types. That's what we did in SDD, though it leads to the need for lots of key and keyref use to put contextual restrictions on the particular exchange structure. People find it hard to understand, and complain that you have introduced needless complexity. (But Computer Science grad students admire it. :-) ) I've fallen out of love with XML-Schema except for the huge set of high quality tools around, and the useful JAXB databinding framework now fully embedded in Java. And it's pretty good for specifying software configuration.
Bob Morris
Robert A. Morris Emeritus Professor of Computer Science UMASS-Boston 100 Morrissey Blvd Boston, MA 02125-3390 Associate, Harvard University Herbaria email: morris.bob@gmail.com web: http://bdei.cs.umb.edu/ web: http://etaxonomy.org/mw/FilteredPush http://www.cs.umb.edu/~ram phone (+1) 857 222 7992 (mobile)
About the problem of being unable to use fragments: SDD provides two root schemata, one that allows only valid "complete" datasets, another allowing fragments ("object interchange..."). Under the assumptions of xml-schema, it is perfectly legal to have schema-validation variants for different purposes.
It was originally planned to merge the xml-standards (TCS, SDD, ABCD) into one unified one, thus introducing such mechanisms (and extension mechanisms like in SDD) into all. This was stopped by the TDWG executive in favor of redesigning everything as RDF vocabularies instead of xml-schema. So the xml-schema standards are somewhat poor orphans... :-)
Gregor
Gregor,
Should we consider a TCS version 2 with this capability?
greg
Gregor, Should we consider a TCS version 2 with this capability? greg
I cannot say. In part I admit to bickering :-). But in part I think yes, because xml schema is closer to the information management capabilities of people like me than RDF and the open world assumption. But then the semantic web is very powerful and enticing and I am hoping one day to be able to use the potential as well.
The decision will be taken by those who invest their energy right now. I wanted to point out that a lot of the limitations of SDD and TCS etc are not a function of using xml-schema, but of the halted process.
Gregor
As people have probably seen on other mailing lists, the group commissioned by GBIF to advise it on its approaches to Knowledge Organization Systems today released its draft report for public comment. (The comment site is at http://bit.ly/GBIFKOS_Comments on community.gbif.org, and it also has a link to the draft.) Among the recommendations are some that GBIF get involved in some of these kinds of issues, sometimes specifically, sometimes generally. Several are recommendations that GBIF spur some joint TDWG/GBIF Task Groups for addressing specific issues. It would be great if people commented on whether this particular question is addressed in the recommendations, and whether the problems giving rise to it are adequately explained. Alas, there are only two weeks for comments to be considered for inclusion in the final report, but the comment site will remain open for ongoing contribution, and available for GBIF's consideration while they consider and hopefully act on the report.
You will see in the report faint echoes of the last few months of the discussions here. That's not an accident.
The recommendations, especially the shorter ones, are written as though you've read the whole report, which is nominally 32 pages, but in fact has very wide margins and several pages of graphs and appendices.
--Bob
This all sounds like it's getting terribly complicated, and the combined discussion of atomised parts vs canonical/full name is confusing me.
For the first part, I still think available parsing tools make 99.8% of all cases tractable, but if we want to be explicit and not run these services all the time, then:
1. For datasets that separate the name from the authorship, we make sure it's clear this separation is retained. The definition for this term must change. It doesn't make sense to me to concatenate two elements that are already split. The parts go into: scientificName + scientificNameAuthorship
2. For datasets with only a scientificName, the name goes into: scientificName
3. For datasets with scientificName and authorship in a single field, we have two choices:
3a. scientificName # in which case we must be able to detect and split authorship, and we need to detect the canonical form in case 2
3b. scientificNameWithAuthorship # rather than a canonicalName term, which is confusing, we use a less ambiguous term like this.
It seems to me the intent of 3b is more explicit as to what we intend by adding canonicalName.
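To make the cases concrete, here is a minimal Python sketch of how a consumer might tell them apart from the fields alone. The function name and the dictionary-based record are assumptions for illustration, and scientificNameWithAuthorship is only the term proposed in this thread, not an existing DwC term.

```python
def describe_name_fields(record):
    """Classify a record by which name fields are populated.

    `record` is a plain dict keyed by term name (an assumption for this
    sketch). scientificNameWithAuthorship is the term proposed in this
    thread; it is not part of Darwin Core.
    """
    name = (record.get("scientificName") or "").strip()
    authorship = (record.get("scientificNameAuthorship") or "").strip()
    merged = (record.get("scientificNameWithAuthorship") or "").strip()

    if merged:
        return "3b"       # explicit merged name+authorship string
    if name and authorship:
        return "1"        # name and authorship kept apart
    if name:
        # The fields alone cannot say whether authorship is embedded:
        # the content of scientificName has to be inspected.
        return "2 or 3a"
    return "empty"
```

The "2 or 3a" branch is the point of the exercise: without a separate term, a bare scientificName has to be tested every time.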
DR
On Nov 25, 2010, at 2:03 AM, Tony.Rees@csiro.au wrote:
Quoting Rich Pyle:
At this point, though, I really don't have a good sense for how best to proceed.
Aloha, Rich
Maybe an answer would be to use TCS not DwC for exchange of purely taxonomic data? How about creating a TCSA format for bulk transfer - or is this not a great thought... (not being that familiar with TCS)
One problem is that (e.g.) it is often desired to include some non-taxonomic information along with the names e.g. distribution/habitat codes, etc.
Just an idea, don't know if it solves the residual DwC issues anyway,
Cheers - Tony
This all sounds like it's getting terribly complicated and the combined discussion on atomised parts vs canonical/full-name are confusing me.
You're not alone! :-)
For the first part, I still think available parsing tools make 99.8% of all cases tractable but if we want to be explicit and not run these services all the time then.
I can go with that (basically what I've been advocating all along: that is, we need to identify where it's broke before we try to fix it). At the moment, many data consumers don't have those tools available locally, but I think as more people become familiar with the tools that are out there, then this will not be so much a problem. In cases where scientificName is concatenated from parsed bits, then I suspect the client-side parsing will be closer to 100% effective.
For datasets that separate the name from the authorship we make sure it's clear this separation is retained. The definition for this term must change. It doesn't make sense to me to concatenate two elements that are already split. The parts go into:
1. scientificName + scientificNameAuthorship
For datasets with only a scientificName. The name goes into:
2. scientificName
For datasets with scientificName and authorship in a single field we have two choices:
3a. scientificName # in which case we must be able to detect and split authorship and we need to detect the canonical form in case 2
3b. scientificNameWithAuthorship # rather than a canonicalName term which is confusing we use a less ambiguous term like this.
It seems to me the intent of 3b is more explicit as to what we intend by adding canonicalName.
The thing that concerns me about 3a is the conditional definition of scientificName (i.e., it excludes authorship when that information is pre-parsed, but includes it when it's not pre-parsed).
The thing that concerns me about 3b is that scientificName is (supposedly?) required; so if someone with unparsed information provides scientificNameWithAuthorship, then are they supposed to leave scientificName blank?
Rich
The thing that concerns me about 3a is the conditional definition of scientificName (i.e., it excludes authorship when that information is pre-parsed, but includes it when it's not pre-parsed).
Right. That is where we are today. We need to test the contents every time. One thing we are developing requirements for is a DarwinCore Archive "Normaliser", which would be a web service/web app client that can accept a DwC-A as input and would output the same data as a new DwC-A that conforms to a set of rules based on those requirements. So one thing would be to split the authorship out of the names into the authorship field for simpler consumption.
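As a rough illustration of that one normalising rule, the sketch below splits a merged string into name and authorship with a deliberately naive heuristic. It is not the planned Normaliser and not any existing parser: the regular expression, the function name and the dict-based record are all assumptions made for this example.

```python
import re

# Naive assumption: the canonical part is a capitalised genus followed by
# zero or more all-lowercase epithets; whatever follows is authorship.
_CANONICAL = re.compile(r"^([A-Z][a-z]+(?: [a-z][a-z-]+)*)\s*(.*)$")


def normalise(record):
    """Return a copy of `record` with authorship moved to its own field.

    Only acts when scientificNameAuthorship is empty and the heuristic
    finds trailing text after the canonical part.
    """
    out = dict(record)
    name = (out.get("scientificName") or "").strip()
    has_authorship = bool((out.get("scientificNameAuthorship") or "").strip())
    if name and not has_authorship:
        match = _CANONICAL.match(name)
        if match and match.group(2):
            out["scientificName"] = match.group(1)
            out["scientificNameAuthorship"] = match.group(2).strip()
    return out
```

A heuristic this simple will misplace lowercase author particles such as "de" or "van", and it ignores hybrids and rank markers, which is exactly why a dedicated name parser is the better tool for real data.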
The thing that concerns me about 3b is that scientificName is (supposedly?) required; so if someone with unparsed information provides scientificNameWithAuthorship, then are they supposed to leave scientificName blank?
The short answer is yes, they would leave it blank. If they could parse it then they would have put the parts into name and author elements in the first place. I also think the inverse is true. If it is already split and they put it into name and author fields - they won't concatenate and put a new, merged copy into a name+authorship field.
In regard to the requirement confusion - either we keep dwc:scientificName alone and have a simple requirement test or we add a new term and make it a conditional requirement. My point was, if we add a new term, then the set (scientificNameWithAuthorship, scientificName, scientificNameAuthorship) is clearer than (canonicalName, scientificName, scientificNameAuthorship).
DR
Rich
Hi David, all,
It is also important to follow the trail of how the name information then gets passed on to other systems e.g. harvested into GNI etc. My impression is that currently, if dwc:scientificName holds a sciname without authorship and the latter information is put into dwc:scientificNameAuthorship, then the version that is harvested into GNI (presuming that happens) loses the authority information, which is definitely a bad thing...
Cheers - Tony
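For what it's worth, a harvester could avoid that loss by reading both fields rather than scientificName alone. The sketch below shows the idea only; the function name and the dict-based record are assumptions, and it says nothing about what GNI actually does.

```python
def full_name(record):
    """Rebuild a name-with-authorship string from the two DwC fields.

    Hypothetical helper: if the authorship is supplied separately and is
    not already the tail of scientificName, append it.
    """
    name = (record.get("scientificName") or "").strip()
    authorship = (record.get("scientificNameAuthorship") or "").strip()
    if authorship and not name.endswith(authorship):
        return f"{name} {authorship}".strip()
    return name
```

The endswith check keeps a supplier who has already concatenated the authorship from getting it twice.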
Tony
Why isn't this something the GNI would address? I don't see the problem with DwC. What is scientificNameAuthorship supposed to be used for if it isn't to store the authorship information from a scientific name?
If we really need to make strict distinctions and can't deal with just the two name parts, then the logical new term needs to be scientificNameVerbatim or scientificNameWithAuthorship, not canonical name.
David
Hi David,
The problem is with the present DwC definition of scientificName: it is expected to contain authorship if the latter is available. (I presume Dima did the last upload into GNI so you could ask him what actually happens).
Cheers - Tony
I stated that I believe this definition should be updated so that either:
1. It may hold the authorship information if the source data has this merged with the name, but the recommendation is that it be split into authorship.
OR
2. dwc:scientificName should explicitly NOT store authorship information, and a new term be created that carries the current definition for dwc:scientificName that does require a merged name+authorship string.
DR
On 25/11/2010, at 11:56 AM, Richard Pyle wrote:
One golden rule of data management that I often tell people is that it's often better to be consistent than correct. That is, something that's consistently incorrect can be corrected easily.
Well put.
Rich
Your two statements below don't jibe well in this case. Putting random concatenations of higher taxa into dwc:higherClassification would make for a real mess. Having only the basic named Linnaean ranks does ignore all of the intermediate ranks but it supports conformity at least for those in a way that higherClassification cannot as you lose the associated rank term. It also supports what I think is a fairly substantial bloc of data that exists in a denormal form with only (or nearly only) the basic Linnaean ranks in named rank columns. Concatenating these into dwc:higherClassification would be lossy in this case.
My real concern, however, would be in trying to subsequently line up multiple datasets where there were omissions in some higher ranks so that the concatenations were abbreviated. In other words:
Bivalvia:Mytildae:Mytilus edulis
Mollusca:Mytiloidea:Mytilus: Mytilus edulis
Animalia: Mollusca:Mytiloidea:Mytildae:Mytilus: Mytilus edulis
See http://code.google.com/p/gbif-ecat/wiki/Nom5ExampleMytilusedulis for a real world example and, ignoring the other inherent messes, imagine trying to deal with this with no higher rank columns for context and all those nulls removed (no fair keeping the delimiters for them either).
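A small Python sketch shows why those three strings cannot be lined up by position once the rank labels are gone. The helper names are made up for the example; the strings are the three quoted above, spelled as given.

```python
PATHS = [
    "Bivalvia:Mytildae:Mytilus edulis",
    "Mollusca:Mytiloidea:Mytilus: Mytilus edulis",
    "Animalia: Mollusca:Mytiloidea:Mytildae:Mytilus: Mytilus edulis",
]


def split_path(path):
    """Split a colon-delimited classification string into its parts."""
    return [part.strip() for part in path.split(":") if part.strip()]


def taxa_by_position(paths):
    """Collect, for each position, the set of taxa found there."""
    split = [split_path(p) for p in paths]
    depth = max(len(parts) for parts in split)
    return [
        {parts[i] for parts in split if i < len(parts)}
        for i in range(depth)
    ]


for position, taxa in enumerate(taxa_by_position(PATHS)):
    print(position, sorted(taxa))
```

Position 0 already mixes a class, a phylum and a kingdom, so nothing can be inferred from position alone; named rank columns keep that context.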
DR
On 25/11/2010, at 11:56 AM, Richard Pyle wrote:
One golden rule of data management that I often tell people is that it's often better to be consistent than correct. That is, something that's consistently incorrect can be corrected easily.
Right -- you mean in the sense of Family, Order, Class, etc. Personally, I think it would be "ideal" to eliminate these individual fields and just use dwc:higherClassification for this purpose. People with normalised data can represent it properly via parentNameUsage[ID] -- with the understanding that all names with a rank lower than genus would include the genus name as uninomial.
The denormalised single Linnean Rank terms are very, very helpful for sharing occurrence data. They are the primary means to distinguish between homonyms when only a canonical name is given. And they are found in many denormalised sources like spreadsheets. No doubt these are needed!
And yes, dwc:genus and dwc:subgenus according to the definition are for the *classification*, not the parsed name (even though this is mostly the same).
As far as I can tell the dwc changes we are discussing are still the same. Either:
A) add a canonicalName term, or
B) add an atomised term for genus/uninomial + infrageneric/uninomial
I think both options are a way to go. A single canonical name, if given correctly, is very straightforward to parse, so personally I think this is easier than having multiple terms. For the name part terms I think I would agree with Chuck that a single uninomial can be used for genus or infrageneric ranks. As a canonical binomial would *not* include a subgenus or section, there is no need to have that parsed information as a term. In case the scientificName actually IS the subgenus, the uninomial can be used.
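To illustrate how little is needed when the canonical name is clean, here is a minimal Python sketch that splits it into parts. It assumes a canonical name of one to three space-separated words with no authorship, rank markers or hybrid signs; "uninomial" is the label used in this thread, and the function name is made up.

```python
def parse_canonical(canonical):
    """Split a clean canonical name into its name parts.

    Assumes one to three space-separated words and nothing else;
    anything longer is rejected rather than guessed at.
    """
    keys = ("uninomial", "specificEpithet", "infraspecificEpithet")
    parts = canonical.split()
    if not parts or len(parts) > len(keys):
        raise ValueError(f"not a plain canonical name: {canonical!r}")
    return dict(zip(keys, parts))
```

The same call on a string that still carries authorship would raise or mislabel the parts, which is the case for keeping the canonical form in a field of its own.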
Markus
On Nov 25, 2010, at 8:36, David Remsen (GBIF) wrote:
Rich
Your two statements below don't jibe well in this case. Putting random concatenations of higher taxa into dwc:higherClassification would make for a real mess. Having only the basic named Linnaean ranks does ignore all of the intermediate ranks but it supports conformity at least for those in a way that higherClassification cannot as you lose the associated rank term. It also supports what I think is a fairly substantial bloc of data that exists in a denormal form with only (or nearly only) the basic Linnaean ranks in named rank columns. Concatenating these into dwc:higherClassification would be lossy in this case.
My real concern, however, would be in trying to subsequently line up multiple datasets where there were omissions in some higher ranks so that the concatenations were abbreviated. In other words:
Bivalvia:Mytildae:Mytilus edulis
Mollusca:Mytiloidea:Mytilus: Mytilus edulis
Animalia: Mollusca:Mytiloidea:Mytildae:Mytilus: Mytilus edulis
See http://code.google.com/p/gbif-ecat/wiki/Nom5ExampleMytilusedulis for a real world example and, ignoring the other inherent messes, imagine trying to deal with this with no higher rank columns for context and all those nulls removed (no fair keeping the delimiters for them either).
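The alignment problem can be made concrete with a small Python sketch. It is an illustration only: the helper names `split_path` and `positions` are made up here, and the three strings are copied as written in the message. Once the rank labels are gone, the same name sits at a different position in each string, or is missing altogether.

```python
# The three concatenated classifications from the message, verbatim.
PATHS = [
    "Bivalvia:Mytildae:Mytilus edulis",
    "Mollusca:Mytiloidea:Mytilus: Mytilus edulis",
    "Animalia: Mollusca:Mytiloidea:Mytildae:Mytilus: Mytilus edulis",
]


def split_path(path):
    """Split a colon-delimited classification string into trimmed names."""
    return [part.strip() for part in path.split(":")]


def positions(name):
    """Index of `name` in each path, or None where the name is absent."""
    result = []
    for path in PATHS:
        parts = split_path(path)
        result.append(parts.index(name) if name in parts else None)
    return result


for path in PATHS:
    print(len(split_path(path)), split_path(path))

print("Mollusca:", positions("Mollusca"))  # [None, 0, 1]
print("Mytildae:", positions("Mytildae"))  # [1, None, 3]
```

With named rank columns, a missing value is simply an empty cell and the remaining values stay aligned.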
DR
On 25/11/2010, at 11:56 AM, Richard Pyle wrote:
One golden rule of data management that I often tell people is that it's often better to be consistent than correct. That is, something that's consistently incorrect can be corrected easily.
Right -- you mean in the sense of Family, Order, Class, etc. Personally, I think it would be "ideal" to eliminate these individual fields and just use dwc:higherClassification for this purpose. People with normalised data can represent it properly via parentNameUsage[ID] -- with the understanding that all names with a rank lower than genus would include the genus name as uninomial.
tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content
... a much, much bigger use case for having linnean rank terms other than homonym disambiguation is actually fuzzy matching of misspelled canonicals!
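A rough Python sketch of that use case, using the standard library's `difflib`. The reference list, the function name `fuzzy_match` and the cutoff value are assumptions for illustration; the point is only that a family value narrows the candidate set before fuzzy matching is attempted.

```python
from difflib import get_close_matches

# Illustrative reference list of (canonical name, family) pairs.
REFERENCE = [
    ("Tylonotus bimaculatus", "Cerambycidae"),
    ("Tylonotus rugicollis", "Miridae"),
    ("Tylonotus fryi", "Carabidae"),
    ("Mytilus edulis", "Mytilidae"),
]


def fuzzy_match(name, family=None, cutoff=0.8):
    """Return close matches for `name`, optionally restricted to one family."""
    candidates = [n for n, f in REFERENCE if family is None or f == family]
    return get_close_matches(name, candidates, n=3, cutoff=cutoff)


# A misspelled canonical name, matched within the family given on the record.
print(fuzzy_match("Tylonotus bimaclatus", family="Cerambycidae"))
# The same string finds nothing when the record's family points elsewhere.
print(fuzzy_match("Tylonotus bimaclatus", family="Mytilidae"))
```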
dwc:Genus
I think the definition "The full scientific name of the genus in which the taxon is classified." is incomplete and only makes sense for valid/accepted taxon names. I think the definition should be changed so that dwc:genus refers to the genus part of the name. For some synonyms, the genus part is different. In this case, why should the genus part refer to the genus of the accepted/valid taxon? It is already linked to that taxon via other methods and would inherit that information through the link. It's just an opportunity to create integrity conflicts as well as an opportunity to lose some valuable additional information.
Consider this record in the WoRMS database concerning my favorite fish:
http://www.marinespecies.org/aphia.php?p=taxdetails&id=301162
Note that the parent (genus) for this synonym is the literal, nominal parent genus, not the genus for the valid name. Given the degree of homonymy among the genera, this could provide useful and explicit linking to a parent genus record, particularly if it were included, as in this case, in the source dataset. The value in either case is limited in the larger aggregate world due to the recommendation that dwc:Genus, like all the named higher taxon elements, be canonical.
On a related note then, I would recommend that for synonyms, the more normal and enriched dwc:parentNameUsageID should be used to retain this information. In other words, normally, a synonym is linked to the accepted/valid taxon via acceptedNameUsageID and dwc:parentNameUsageID is null. In this case, however, it should be used to
DR
Wait, now I'm confused. How else would one represent the genus in the WoRMS example? Surely no one would put Pomatomus as the genus for a record representing the binomial "Gasterosteus saltatrix". Would they???
I agree with Dave -- the definition of the term "genus" is not right. It should follow the template for the definition of specificEpithet:
"The genus part of the scientificName."
Also, I don't like the inclusion of the genus within the definition of subgenus term. I would change it to:
"The subgenus part of the scientificName."
I think the problem is that these definitions do not account for the idea that records representing synonyms would be passed around using these terms. They were originally created as taxonomic attributes of occurrence records, rather than as primary taxon name records.
Obviously, terms like taxonomicStatus and acceptedNameUsage[ID] acknowledge that synonyms would be passed around, but I don't think the definitions of the rank-specific terms were updated accordingly.
It gets a bit fuzzy for names above the rank of genus. For example, if someone passes a family name represented as a synonym of a different family name; what is put for the "family" term? The synonym family name, or the valid family name? At first blush, I would think the synonym (literal) family name. However, that would require a change to the definition of the dwc:family term (and all the other higher-rank terms).
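The proposed reading of the genus term can be sketched in Python. This is an assumption-laden illustration, not part of Darwin Core: the identifiers `syn-1` and `acc-1` are hypothetical, the accepted name is assumed to be Pomatomus saltatrix, and `genus_part` is a made-up helper that takes the first word of scientificName.

```python
# Hypothetical synonym record, modelled on the WoRMS example in the thread.
synonym = {
    "taxonID": "syn-1",              # hypothetical identifier
    "scientificName": "Gasterosteus saltatrix",
    "taxonomicStatus": "synonym",
    "acceptedNameUsageID": "acc-1",  # hypothetical identifier
    "genus": "Gasterosteus",         # literal genus part of the name
}

# Hypothetical accepted record that the synonym points to.
accepted = {
    "taxonID": "acc-1",
    "scientificName": "Pomatomus saltatrix",
    "taxonomicStatus": "accepted",
    "genus": "Pomatomus",
}


def genus_part(scientific_name):
    """First word of a binomial: 'the genus part of the scientificName'."""
    return scientific_name.split()[0]


def genus_is_literal(record):
    """True if the genus field matches the genus part of the record's own name."""
    return record["genus"] == genus_part(record["scientificName"])


print(genus_is_literal(synonym))   # True
print(genus_is_literal(accepted))  # True
# The accepted genus is still reachable through the link, not the genus field.
print(synonym["acceptedNameUsageID"] == accepted["taxonID"])  # True
```

Under the current "classification" definition, the synonym's genus field would instead hold Pomatomus, and the check above would fail for that record.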
Rich
Now that we've been talking about the variations, perhaps we can move toward doing something about it.
I mentioned we have been talking about having a service built that can read and unpack a DarwinCore Archive, evaluate specific things like we have been discussing that may be expressed inconsistently, set them right, and spit back out a new and more consistent archive file. In order to put out a call for a developer to do this, we need to capture some of those things it should do. Markus and I have started this but we have a lot of end-of-year business and less time than money. We still have some funds available to at least start this process.
I'd like to know if anyone is interested in, and feels qualified to develop, a more complete set of requirements for such a service, which we would then try to find a developer to build. We aren't trying to deal with everything at once, mind you. Just some key things that might make ingesting a DarwinCore Archive for either Taxon data or Occurrence data a bit more consistent in regard to the taxonomic elements. I'd need a couple of days or three to do this as completely as I think is needed, so that is the sort of time I'm anticipating you smart people would need as well.
For example,
- Checking the integrity of normal and denormal classifications. What are the steps and conditions to check integrity in normal classifications and to transform a denormal classification into a normal one? In the latter, for example, you have to make sure the same Family value doesn't have two different parents. If so, what then?
- Creating IDs for IDless, normalised records (e.g. parentNameUsage).
- Mapping taxon ranks to our taxon rank vocabulary so that alternative forms (ssp, subspec, ss.) are replaced, when possible, with the standard form.
- Normalising taxon and nomenclatural status.
- Splitting a merged name into name and authorship parts.
- Checking the split version is consistent with the complete one if both are given.
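Three of the checks listed above could be sketched in Python along the following lines. This is only an illustration of the kind of requirement being asked for: the function names, the rank mapping and the sample rows are all assumptions, not the working document's content or an existing GBIF service.

```python
from collections import defaultdict

# Hypothetical mapping from alternative rank spellings to a standard form.
RANK_SYNONYMS = {
    "ssp": "subspecies",
    "ssp.": "subspecies",
    "subspec": "subspecies",
    "ss.": "subspecies",
}


def normalise_rank(raw):
    """Map a raw rank string to the standard form, or pass it through lowercased."""
    key = raw.strip().lower()
    return RANK_SYNONYMS.get(key, key)


def conflicting_parents(rows, child="family", parent="order"):
    """Find child values that appear under more than one parent value."""
    parents = defaultdict(set)
    for row in rows:
        if row.get(child) and row.get(parent):
            parents[row[child]].add(row[parent])
    return {c: sorted(p) for c, p in parents.items() if len(p) > 1}


def split_is_consistent(scientific_name, canonical, authorship):
    """Check that name + authorship rebuild the full string, ignoring extra spaces."""
    rebuilt = " ".join(f"{canonical} {authorship}".split())
    return rebuilt == " ".join(scientific_name.split())


# Illustrative denormalised rows, not taken from a real dataset.
ROWS = [
    {"order": "Mytilida", "family": "Mytilidae", "genus": "Mytilus"},
    {"order": "Mytiloida", "family": "Mytilidae", "genus": "Mytilus"},
]

print(normalise_rank(" SSP "))
print(conflicting_parents(ROWS))
print(split_is_consistent("Mytilus edulis Linnaeus, 1758",
                          "Mytilus edulis", "Linnaeus, 1758"))
```

The sketch only detects a conflict; what to do "if so" remains the open requirement question.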
I put the working doc Markus and I have in Dropbox and put it here: https://docs.google.com/viewer?a=v&pid=explorer&chrome=true&srci...
Again, we are not looking for someone to do the programming but to provide enough detail that the developer has the steps needed to do that work.
So anyone want to help out a data standard over the holidays? We are eager to start because I'd like to get a call for a developer out before the middle of December, when the little bag of gold gets taken away. If multiple people are interested we will have to draw straws or pistols or some other means for making a good decision. But please contact me directly and we can follow up offline.
Best, David Remsen
---------------------------------------------------------------------------- David Remsen, Senior Programme Officer Electronic Catalog of Names of Known Organisms Global Biodiversity Information Facility Secretariat Universitetsparken 15, DK-2100 Copenhagen, Denmark Tel: +45-35321472 Fax: +45-35321480 Mobile +45 28751472 Skype: dremsen ----------------------------------------------------------------------------
Dear David & all,
As a side note from a data user who would like to help avoid errors in the interpretation of taxonomic names (not sure if it helps in planning your future strategies): I'm just wondering, why not explore a few gentle steps towards ... who knows ... something like elements of a future "BioCode" (being aware of the incompatible parts of the current domain-specific IC*N Codes, but making use of uniting principles)?
The usefulness of automated name parsing is limited, - not enough or even misleading in too many cases, e.g. with binomina combined with different homonymous genera, e.g. the following:
Tylonotus bimaculatus
Tylonotus rugicollis
Tylonotus fryi
The first genus is Tylonotus HALDEMAN 1847 (Coleoptera Cerambycidae), second: Tylonotus FIEBER 1858 (Heteroptera Miridae), third: Tylonotus SCHAUM 1863 (Coleoptera Carabidae). Now imagine if we had an 'expanded parsing' strategy (with the assistance of taxon experts!) resulting in unique namestrings that contain all basic information on the nomenclatural status of names. In the above example, unique ID-strings for the nomenclatural content might look like this:
ZS-Tylonotus_bimaculatus
ZS-Lygaeus_rugicollis/2Tylonotus_rugicollis/=Plesiocoris_rugicollis
ZS-3Tylonotus_fryi/=Nototylus_fryi
(prefix 'ZS-' for zoological species-group names; the second string shows we have to do with a new generic combination, etc.) Such strings can be perfectly stable, unique, human and machine-readable. They could serve as a solid basis for interpretation of actual name usages [= the GNUB task?] ... in my imagination.
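The limit of plain name parsing in this example can be shown with a short Python sketch. It does not implement the ID-strings proposed above; it only uses the three genus homonyms named in the message, and the lookup table and function names are assumptions for illustration.

```python
# The three homonymous genera named in the message: (author and year, order, family).
HOMONYMS = {
    "Tylonotus": [
        ("Haldeman 1847", "Coleoptera", "Cerambycidae"),
        ("Fieber 1858", "Heteroptera", "Miridae"),
        ("Schaum 1863", "Coleoptera", "Carabidae"),
    ],
}


def candidate_genera(binomial):
    """All known genera that share the genus string of a parsed binomial."""
    genus = binomial.split()[0]
    return HOMONYMS.get(genus, [])


def resolve(binomial, family):
    """Narrow the candidates using extra context, here a family name."""
    return [g for g in candidate_genera(binomial) if g[2] == family]


# Parsing alone leaves three candidates for each of the binomina.
print(len(candidate_genera("Tylonotus fryi")))
# Extra context, supplied by an expert or a rank column, picks one.
print(resolve("Tylonotus fryi", "Carabidae"))
```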
Best regards, Wolfgang
----------------------------------
Wolfgang Lorenz, Tutzing, Germany
Rich, I gather your reason would be because it's unclear if anyone would actually use a canonicalName element? That is, it's unneeded.
Not exactly. I see a need, but I see other needs as well. I'd rather find a more stable and robust solution that allows people with both parsed and unparsed data to share what they have in a way that allows users to get what they need & want. My problem is not so much with canonicalName per se, but rather that it is unclear where the overlap and non-overlap with scientificName is. It seems to me to be 80% overlap. As I said in my last post to Tony just now, I think the real problem is that scientificName is both core and loosely defined.
My feeling is that there should be *one* way to deal with completely unparsed data (e.g., a term that starts with "verbatim"), and *one* way for people to deal with parsed data. The problem is, there are different levels of parsed out there:
- fully unparsed with name bits and/or authorship in the same text string
- authorship and name bits parsed
- name bits parsed into "NameElement" pieces, with qualifiers stripped
- name bits parsed into "NameElement" pieces, with qualifiers included
- authorship parsed to names and/or years
- authorship parsed to reference citation
- etc.
You said you worry about feature creep. I suppose I worry about semantic creep. Extending the meaning of a term makes it more universal, but in a data world it increases the variability of the data that may be found attached to the term in some dataset. Imprecision in terms can create a lot of data quality headaches. Is that acceptable?
Exactly! This is getting at my point about scientificName being very liberal, but very essential (and overloaded). The addition of canonicalName would deal with one or two problems, but I don't think it would solve the fundamental problem(s). It's more of a band-aid than a cure, I guess.
Rich
participants (12)
- "Markus Döring (GBIF)"
- Bob Morris
- Chuck Miller
- David Remsen (GBIF)
- greg whitbread
- Gregor Hagedorn
- Markus Döring
- Paul Murray
- Peter DeVries
- Richard Pyle
- Tony.Rees@csiro.au
- Wolfgang Lorenz