I think it makes the most sense to model these names on biology and informatics, and then be able to output a code-compliant string.
One reason is that the nomenclatural code can change, so you don't want it to be a fixed part of the most fundamental units of your information systems.
Another advantage is that you don't need different intermediate structures and forms for each of the nomenclatural codes.
Do we want one entity for species, or four or more that have to be duplicated throughout the entire software stack?
Done this way, if a nomenclatural code changes you only need to alter the software that produces the output.
You also don't always know which code is appropriate for a given string until the end.
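A minimal sketch of that idea, assuming Python and using the subgenus examples that appear further down this thread. The class and function names (`InfragenericName`, `render`, `RENDERERS`) are hypothetical, and the two renderers only illustrate the pattern of one code-neutral record with one output routine per nomenclatural code. They are not a complete treatment of either code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InfragenericName:
    """One code-neutral record for a name between genus and species rank."""
    genus: str
    rank: str      # rank marker, e.g. "subgen.", "sect.", "ser."
    epithet: str


def render_botanical(name: InfragenericName) -> str:
    # Botanical style as in the thread's examples: genus, rank marker, epithet.
    return f"{name.genus} {name.rank} {name.epithet}"


def render_zoological(name: InfragenericName) -> str:
    # Zoological style as in the thread's verbatim example: subgenus in parentheses.
    # Only the subgenus rank is sketched here (an assumption of this example).
    if name.rank != "subgen.":
        raise ValueError(f"no zoological rendering sketched for rank {name.rank!r}")
    return f"{name.genus} ({name.epithet})"


# The nomenclatural code is chosen only at output time, not stored in the record.
RENDERERS = {"botanical": render_botanical, "zoological": render_zoological}


def render(name: InfragenericName, code: str) -> str:
    return RENDERERS[code](name)
```

With this layout, `render(InfragenericName("Murex", "subgen.", "Promurex"), "zoological")` gives `Murex (Promurex)`, and a change in one code's formatting rules touches only that code's renderer.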
Respectfully,
- Pete
On Thu, Dec 9, 2010 at 7:52 AM, Bob Morris morris.bob@gmail.com wrote:
Thanks. To me, what is interesting about this thread is that documents whose main(?) audience is authors and publishers do not always address the needs of parser writers. It is a rare and happy circumstance for a programmer to have the document author to consult!
What I \think/ is implied by your answer is (something that requires biological knowledge that I don't have, namely) that there are hybrid names which are not necessarily a cross of two things, but rather only one is mentioned. The distinction then is that "formula" means at least two, but there are uses which do not appear in a formula, right? So a natural language name extractor should follow this rule:
- If the × adjoins text, the token to the left of any preceding white space is not part of a taxon name, but otherwise it is. Example: In the fragment "not unlike ×Agropogon littoralis" the token 'unlike' is not part of a name.
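The rule above, read literally, can be sketched as follows in Python. The function name `left_token_belongs_to_name` is hypothetical, and the sketch looks only at the character that follows the sign. It does not consider names such as "Salix ×capreola" from later in the thread, where the sign adjoins an epithet and the genus to its left does belong to the name.

```python
HYBRID_SIGN = "\u00d7"  # the multiplication sign ×


def left_token_belongs_to_name(text: str, sign_pos: int) -> bool:
    """Literal reading of the rule for the sign found at text[sign_pos].

    If the sign adjoins the text that follows it, the token to its left is
    not part of the taxon name; otherwise (white space after the sign) it is.
    """
    if text[sign_pos] != HYBRID_SIGN:
        raise ValueError("sign_pos does not point at a multiplication sign")
    following = text[sign_pos + 1:sign_pos + 2]  # empty string at end of text
    adjoins_text = following != "" and not following.isspace()
    return not adjoins_text


fragment = "not unlike \u00d7Agropogon littoralis"
formula = "Agrostis stolonifera \u00d7 Polypogon monspeliensis"

# 'unlike' is not part of the name in the fragment ...
unlike_in_name = left_token_belongs_to_name(fragment, fragment.index(HYBRID_SIGN))
# ... but 'stolonifera' is part of the formula.
stolonifera_in_formula = left_token_belongs_to_name(formula, formula.index(HYBRID_SIGN))
```

The check needs one character of lookahead past the sign, which is the context sensitivity discussed in the next paragraphs.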
Believe it or not, I am not complaining about ICBN. No programmer interpreting a document not written for programmers should complain if understanding it assumes knowledge and insight of the intended audience. Nor should they complain if they are raising points that are addressed in other parts of the document that they haven't read--which in this case for me is everything but H.3A.
Robust context sensitive parsers are marginally more complicated to write than those that require no lookahead, but this is surely not the only name parsing issue that requires lookahead, so I can't even complain on that score. In a vaguely related setting, parser writers might see the rather nicely set forth
http://stackoverflow.com/questions/1952931/how-to-rewrite-this-nondeterminis...
Bob Morris
p.s. Hey, I thought of something to complain about, albeit not about ICBN: I sure wish spec writers targeting software would banish "should" from their documents in favor of "must", even if multiple choices are accompanied by "... is preferred". Well, maybe it's a little complaint about the nomenclatural codes, because movement towards born-digital, semantically marked-up systematics literature will bump into it when people try to write semantically enhanced applications. It would be far better if publishers followed a set of rules with no "should" in them, for which compliance could be tested before publication.
Robert A. Morris
Emeritus Professor of Computer Science
UMASS-Boston
100 Morrissey Blvd
Boston, MA 02125-3390
Associate, Harvard University Herbaria
email: morris.bob@gmail.com
web: http://bdei.cs.umb.edu/
web: http://etaxonomy.org/mw/FilteredPush
http://www.cs.umb.edu/~ram
phone (+1) 857 222 7992 (mobile)
On Thu, Dec 9, 2010 at 3:39 AM, dipteryx@freeler.nl wrote:
Having personally written Rec. H.3A.1, I do not see that it offers scope for being misread: the placement of the multiplication sign is a matter of style (and insight). As background information, the ICBN-preferred style is to put it directly in front of the name or epithet (no space whatsoever: ×Agropogon littoralis): just keep it nice together, so as to give computers no chance to mess it up (after all, at a line break, a computer is likely to separate these over more than one line).
Rec. H.3A Note 1 has been put in there (redundantly) for those who are careless readers, just to make sure the matter could not possibly be misunderstood by even the most whimsical. So, in a formula, the parents are separated by: space, multiplication sign, space; Agrostis stolonifera × Polypogon monspeliensis.
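The two placements described here (sign directly in front of a name or epithet, versus space, sign, space between parents) can be told apart mechanically. The following is a rough Python sketch under that assumption only. The function name `classify_sign_usage` and its return labels are hypothetical, and any other spacing is deliberately left unclassified.

```python
import re

# Space, multiplication sign, space between two parents: a hybrid formula.
FORMULA = re.compile(r"\S\s+\u00d7\s+\S")
# Sign directly in front of a name or epithet, at the start or after white space.
NAMED = re.compile(r"(?:^|\s)\u00d7\S")


def classify_sign_usage(text: str) -> str:
    """Return 'formula', 'named hybrid', or 'unclassified' for a name string."""
    if FORMULA.search(text):
        return "formula"
    if NAMED.search(text):
        return "named hybrid"
    # No sign at all, or a spacing this sketch does not try to interpret.
    return "unclassified"
```

For the examples in this thread, "Agrostis stolonifera × Polypogon monspeliensis" is classified as a formula, while "×Agropogon littoralis" and "Salix ×capreola" are classified as named hybrids.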
Paul van Rijckevorsel
-----Original message-----
From: tdwg-content-bounces@lists.tdwg.org on behalf of Bob Morris
Sent: Wed 8-12-2010 20:12
To: Markus Döring (GBIF)
CC: tdwg-content@lists.tdwg.org List
Subject: Re: [tdwg-content] canonical name for named hybrid & infragenericnames
Your placement of the multiplication sign × does not seem code compliant. It looks too close. Maybe. Also there might be a question about whether a TDWG requirement to use the multiplication sign can be easily implemented by all providers.
On these subjects the Appendix on Hybrid Names of ICBN seems contradictory in that H.3A.1 (http://ibot.sav.sk/icbn/frameset/0071AppendixINoHa003.htm, quoted below) seems to allow your placement, but Note 1 there seems to require space. Note 1 would, with H.3A.1, imply that there must be more white space to the left than to the right of the multiplication sign or its surrogate. One spacing that seems to violate all interpretations of H.3A.1 is equal white space around the multiplication sign. My guess is that the overwhelming fraction of printed hybrid names are thereby noncompliant (unless something elsewhere resolves the issue). Making the amount of white space significant in a parsed string is a horrifying thought.
--Bob Morris
"Recommendation H.3A
H.3A.1. The multiplication sign ×, indicating the hybrid nature of a taxon, should be placed so as to express that it belongs with the name or epithet but is not actually part of it. The exact amount of space, if any, between the multiplication sign and the initial letter of the name or epithet should depend on what best serves readability.
Note 1. The multiplication sign × in a hybrid formula is always placed between, and separate from, the names of the parents.
H.3A.2. If the multiplication sign is not available it should be approximated by a lower case letter "x" (not italicized)."
http://ibot.sav.sk/icbn/frameset/0071AppendixINoHa003.htm
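For software that receives the "x" surrogate allowed by H.3A.2, a cautious normalisation might look like the Python sketch below. The function name `restore_sign` and both heuristics are assumptions of this example, not anything stated in the Code. Only two unambiguous-looking cases are rewritten, and an "x" fused to a lower-case epithet is left alone because it cannot be told apart from an epithet that simply begins with "x".

```python
import re

SIGN = "\u00d7"  # the multiplication sign ×


def restore_sign(name: str) -> str:
    """Replace the lower-case 'x' surrogate with × in two conservative cases."""
    # Case 1: a lone "x" with white space on both sides, as in a hybrid formula.
    name = re.sub(r"(?<=\s)x(?=\s)", SIGN, name)
    # Case 2: "x" fused to a capitalised word at the start of the string,
    # as in a named genus hybrid written without the real sign.
    name = re.sub(r"^x(?=[A-Z])", SIGN, name)
    return name
```

A string such as "Salix xcapreola" comes back unchanged, so resolving it would need information beyond the string itself.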
======================
On Wed, Dec 8, 2010 at 1:14 PM, "Markus Döring (GBIF)" mdoering@gbif.org wrote:
Talking about canonical names again, I want to use the opportunity to get rid of another question I have. What is the code-compliant canonical version of named hybrids (not formulas) and infrageneric names?
Are these examples correct?
Botanical section:
verbatim: Maxillaria sect. Multiflorae Christenson
canonical: Maxillaria sect. Multiflorae

Botanical subgenus:
verbatim: Anthemis subgen. Maruta (Cass.) Tzvelev
canonical: Anthemis subgen. Maruta

Botanical series:
verbatim: Artemisia ser. Codonocephalae (Pamp.) Y.R.Ling
canonical: Artemisia ser. Codonocephalae

Zoological subgenus:
verbatim: Murex (Promurex) Ponder & Vokes, 1988
canonical: Murex subgen. Promurex
# if we use parentheses to indicate the subgenus we can only guess if it's an author or subgenus name

Zoological species:
verbatim: Leptochilus (Neoleptochilus) beaumonti Giordani Soika 1953
canonical: Leptochilus beaumonti

Botanical named genus hybrid:
verbatim: ×Agropogon littoralis (Sm.) C. E. Hubb.
canonical: ×Agropogon littoralis

Botanical named infrageneric hybrid:
verbatim: Eryngium nothosect. Alpestria Burdet & Miège
canonical: Eryngium nothosect. Alpestria

Botanical named species hybrid:
verbatim: Salix ×capreola Andersson (1867)
canonical: Salix ×capreola Andersson (1867)

Botanical variety, named species hybrid:
verbatim: Populus ×canadensis var. serotina (R. Hartig) Rehder
canonical: Populus ×canadensis var. serotina

Botanical named infraspecific hybrid:
verbatim: Polypodium vulgare nothosubsp. mantoniae(Rothm.) Schidlay
canonical: Polypodium vulgare nothosubsp. mantoniae
tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content