From: Bob Morris [mailto:morris.bob@gmail.com] Sent: Thu 9-12-2010 14:52
Thanks. To me, what is interesting about this thread is that documents whose main(?) audience is authors and publishers do not always address the needs of parser writers.
*** That depends on how you look at it. The ICBN is mostly written so that nobody who just browses through will make sense of it. It requires any user to read it in some depth, if he is to apply it. Perhaps the parser writer should realize that he is no exception?
But, actually, parsers are not going to be the answer to any question in biodiversity informatics. This is impossible, as the natural laws in a nomenclatural universe are subject to change (almost without notice). What was true ten years ago is not necessarily true now: it may have been retroactively changed. Anybody doing anything in biodiversity informatics should have at least some basic awareness of the natural laws that govern nomenclatural universes. * * *
It is a rare and happy circumstance for a programmer to have the document author to consult!
*** Not the document, just the recommendation (excluding the Note). The ICBN more or less is a wiki (has been for a hundred years). * * *
What I \think/ is implied by your answer is (something that requires biological knowledge that I don't have, namely) that there are hybrid names which are not necessarily a cross of two things, but rather only one is mentioned.
*** No, numbers are irrelevant, provided there are at least two parents involved. * * *
The distinction then is that "formula" means at least two, but there are uses which do not appear in a formula, right?
*** No, the distinction is that a name is a name, while a formula is a summation of (at least two) names.
×Agropogon littoralis is a name, and it is the same as Agropogon littoralis, for most purposes.
Agrostis stolonifera × Polypogon monspeliensis are two names, and the formula indicates their relation, which may be more complex than here: see Rec. H.2A.1; so just lifting a formula in isolation from the literature is out (Mentha longifolia × rotundifolia is an obsolete form).
* * *
So a natural language name extractor should follow this rule: if the × adjoins text, the token to the left of any preceding white space is not part of a taxon name, but otherwise it is. Example: In the fragment "not unlike ×Agropogon littoralis" the token 'unlike' is not part of a name.
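For parser writers following along, the rule just stated could be sketched roughly as below. This is only a literal reading of the rule, not a real name parser: the function name `classify_hybrid_signs` and its tuple layout are made up for illustration, and "adjoins text" is assumed to mean that a non-space character follows the × directly. It decides nothing about where a name ends, and it ignores everything else in the Code.

```python
import re

HYBRID_SIGN = "\u00d7"  # the multiplication sign, ×


def classify_hybrid_signs(text):
    """Apply the rule from the message to every × found in `text`.

    Returns one (kind, left_token, left_token_in_name) tuple per ×.
    `kind` is "name" when the × adjoins the following text and
    "formula" when white space (or end of text) follows it.
    All names here are hypothetical, chosen for this sketch.
    """
    results = []
    for match in re.finditer(HYBRID_SIGN, text):
        # The single character right after the ×, or "" at end of text.
        after = text[match.end():match.end() + 1]
        # The token to the left of any white space preceding the ×.
        left_tokens = text[:match.start()].split()
        left_token = left_tokens[-1] if left_tokens else None
        adjoins_text = bool(after) and not after.isspace()
        if adjoins_text:
            # Rule: the token to the left is not part of a taxon name.
            results.append(("name", left_token, False))
        else:
            # Rule: otherwise the token to the left is part of it.
            results.append(("formula", left_token, left_token is not None))
    return results


if __name__ == "__main__":
    print(classify_hybrid_signs("not unlike \u00d7Agropogon littoralis"))
    print(classify_hybrid_signs(
        "Agrostis stolonifera \u00d7 Polypogon monspeliensis"))
```

With the two examples from this thread, the first call reports 'unlike' as outside the name, and the second reports 'stolonifera' as belonging to the formula.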
Believe it or not, I am not complaining about ICBN. No programmer interpreting a document not written for programmers should complain if understanding it assumes knowledge and insight of the intended audience. Nor should they complain if they are raising points that are addressed in other parts of the document that they haven't read--which in this case for me is everything but H.3A.
Robust context-sensitive parsers are marginally more complicated to write than those that require no lookahead, but this is surely not the only name parsing issue that requires lookahead, so I can't even complain on that score. In a vaguely related setting, parser writers might see the rather nicely set forth http://stackoverflow.com/questions/1952931/how-to-rewrite-this-nondeterminis...
Bob Morris
p.s. Hey, I thought of something to complain about, albeit not about ICBN: I sure wish spec writers targeting software would banish "should" from their documents in favor of "must", even if multiple choices are accompanied by "... is preferred". Well, maybe it's a little complaint about the nomenclatural codes, because movement towards born-digital, semantically marked-up systematics literature will bump into it when people try to write semantically enhanced applications. It would be far better if publishers followed a set of rules with no "should" in them, for which compliance could be tested before publication.
*** There are more distinctions than just "must" and "should" in the ICBN. Eliminating the "should" is not going to happen, but sometimes a "should" will grow up to become a "must".
Paul van Rijckevorsel