SDD standard - purpose

Thu Jul 27 14:34:49 CEST 2000

> From: Kevin Thiele <kevin.thiele at PI.CSIRO.AU>
> To: TDWG-SDD

> Initially, I think we should aim at descriptions and interactive ID. The
> idea of massaging one data file into two or more different products (e.g.
> natural-language and keys) is very attractive, but surprisingly
> problematical, since the structure of data needed for the two purposes is
> often subtly different. Of course, doing just this is the basis of the
> DELTA system, but we may need to do it in a more sophisticated way. ...
> The problems inherent in the multiple-product model become even more
> alarming when you try to maintain one data file for both
> description/identification and phylogenetic analysis. My personal view is
> that we should leave cladistics out of the scope at least for the time
> being.

INTRODUCTION

I think that it's not particularly difficult to accommodate description,
identification, and phenetic or phylogenetic analysis in a single database.
Perhaps some of the people who have done or taught it would care to comment.

By 'not particularly difficult' I don't mean that it's easy, especially
without the help of an experienced teacher. It's of comparable difficulty to
many other aspects of professional work, for which we usually prepare by
undertaking a degree course. Nor should we expect that advances in software
will ever make it easy, in the sense that it could be done well without
aptitude, training, thought, and experience. (In fact, software advances
often make tasks _more_ difficult, as greater capabilities lead to higher
expectations.)

In addition to the obvious benefits of making as much use as possible of
laboriously acquired data, there can be valuable synergies between the
different kinds of application. For example, even if the data are primarily
for phylogenetic analysis, using them for description and identification can
help detect errors. It is not unheard of for published work to contain gross
errors (such as frame shifts caused by the accidental deletion of matrix
elements) which could easily have been detected in this way. Also, the
information-retrieval functions of Intkey can help in exploring patterns and
relationships in the data.

DETAILED DISCUSSION - permission granted to stop reading here :-)

Within a given project, it's possible to define a 'universal' set of
characters which are suitable for all applications. To these can be added
characters designed for particular purposes, which are to be omitted for
other purposes. For example, classification characters (e.g. the family to
which a taxon belongs) and geographical distribution characters (what
countries, states, etc. a taxon occurs in) are useful in description and
identification, but would not normally be used in classification (for want
of a better word, I will use this as an abbreviation for 'phenetic and
phylogenetic analysis').
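
As a rough illustration of the 'universal plus purpose-specific' idea, here is a minimal Python sketch. The character numbers, names, and purpose labels are hypothetical, invented only for this example; they are not taken from any real DELTA data set or program.

```python
# Hypothetical character list: number -> (name, purposes the character suits).
# Character 1 is 'universal'; 2 and 3 are for description/identification only.
CHARACTERS = {
    1: ("leaves <presence>", {"description", "identification", "classification"}),
    2: ("family", {"description", "identification"}),
    3: ("countries of occurrence", {"description", "identification"}),
}


def characters_for(purpose, characters=CHARACTERS):
    """Return the numbers of the characters to be used for one purpose."""
    return [
        number
        for number, (_name, purposes) in sorted(characters.items())
        if purpose in purposes
    ]


print(characters_for("identification"))
print(characters_for("classification"))
```

The point of keeping the purposes with each character, rather than in separate lists, is that the universal characters are stated only once.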

Sometimes it will be necessary to define alternative characters to represent
similar concepts for different purposes. Obviously, efforts should be made
to keep such alternative characters to a minimum. Software can help by
combining character states, converting numeric characters to multistate, and
checking the scoring of characters against relationships defined between
them (not done by any current software as far as I know, except for the
special case of character dependencies).
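
The two conversions mentioned above might look something like the following Python sketch. The cut points and the state mapping are assumptions chosen for illustration, and the function names are hypothetical; this is not how any existing DELTA program does it.

```python
from bisect import bisect_right


def numeric_to_state(value, cut_points):
    """Convert a numeric value to a 1-based state number.

    cut_points must be in ascending order; a value equal to a cut point
    falls into the higher state (an arbitrary choice for this sketch).
    """
    return bisect_right(cut_points, value) + 1


def combine_states(states, mapping):
    """Map original states to combined states, removing duplicates."""
    return sorted({mapping[state] for state in states})


# Hypothetical length character with states: 1 = under 5, 2 = 5 to under 10, 3 = 10 or more.
print(numeric_to_state(8.5, [5, 10]))
# Hypothetical three-state character in which states 1 and 2 are merged.
print(combine_states([1, 3], {1: 1, 2: 1, 3: 2}))
```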

(An aside. While some 'identification' characters are unsuitable for
classification, the converse is not true. To claim that a 'classification'
character is not suitable for identification is tantamount to an admission
that the author's scoring of the character is not reproducible by others. Of
course, I am referring to interactive identification, using a program with a
'best characters' calculation fast enough to be used routinely, and
supporting character weights.)

With a given set of characters, it may be necessary to record attributes
(i.e. the cells of the taxa X characters 'matrix') in different ways for
different purposes. For example, in LucID it is possible to flag state
values as 'present by misinterpretation'. Values so flagged would normally
be used for identification but not for description.

Our proposed enhancements for the DELTA format (see, for example,
http://biodiversity.uno.edu/delta/www/descdata.htm) contain more general
methods of flagging values for use in any number of user-defined
applications. For example, consider the coding

   16,2/1<@only keys> 17,7<@only keys>-8.5-9<@for classification>-10-12<@only keys>
   18,2<@for classification>/3 20,1/2<@not Australia>

For the application 'keys', this would be interpreted as

   16,2/1 17,7-12 18,2/3 20,1/2

and for the application 'classification Australia' as

   16,2 17,9 18,2 20,1
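
To make the example concrete, here is a minimal Python sketch that reproduces the two interpretations above. The rules it applies are only my reading of this one example and should be treated as assumptions, not as the definition of the proposed format: a value flagged 'only X' is used only when application X is active; a value flagged 'not X' is dropped when X is active; if any values are flagged 'for X' and X is active, only those values are used; and a numeric range with several remaining values is shown by its end points. The function name `interpret` is hypothetical.

```python
import re

# One attribute: character number, comma, then values that may carry <@...> flags.
ATTRIBUTE = re.compile(r"(\d+),((?:[^\s<]|<[^>]*>)+)")
# One value with an optional flag such as <@only keys>.
VALUE = re.compile(r"([\d.]+)(?:<@(only|for|not) ([^>]+)>)?")


def interpret(coding, applications):
    """Return the coding as seen by the given set of active applications."""
    active = set(applications)
    result = []
    for number, body in ATTRIBUTE.findall(coding):
        values = VALUE.findall(body)  # tuples: (value, flag kind, flag application)
        # Assumed rule: values flagged 'for' an active application replace the rest.
        kept = [v for v, kind, app in values if kind == "for" and app in active]
        if not kept:
            kept = [
                v for v, kind, app in values
                if not (kind == "only" and app not in active)
                and not (kind == "not" and app in active)
            ]
        is_range = "-" in re.sub(r"<[^>]*>", "", body)
        if is_range and len(kept) > 1:
            # Assumed rule: a numeric range is shown by its end points.
            text = f"{kept[0]}-{kept[-1]}"
        else:
            text = "/".join(kept)
        result.append(f"{number},{text}")
    return " ".join(result)


CODING = (
    "16,2/1<@only keys> "
    "17,7<@only keys>-8.5-9<@for classification>-10-12<@only keys> "
    "18,2<@for classification>/3 20,1/2<@not Australia>"
)
print(interpret(CODING, ["keys"]))
print(interpret(CODING, ["classification", "Australia"]))
```

Under these assumptions the two calls print the 'keys' and 'classification Australia' codings given above. The sketch does not handle negative numbers or text values.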

It is often necessary to use alternative _wordings_ of characters for
different purposes. (This is different from the alternative character
_concepts_ discussed above.) This arises: (1) because of the different
contexts in which the words are used; (2) because of the different audiences
for whom the words are intended (e.g. different native language, different
knowledge of terminology).

The contexts in which the words appear range from full natural-language
descriptions, which may contain all the characters in their natural order,
supplemented by headings, to applications such as conventional keys in which
the characters appear in random order, completely out of the context of
their related characters. Intermediate cases are descriptions in which parts
are omitted because of: missing data; inapplicable characters; inclusion of
only a subset of the characters; or inclusion of only diagnostic attributes.
Other examples of random order are: lists of 'best' characters in interactive
identification; displaying the attributes of a specimen in the order in
which they were entered; displaying diagnostic descriptions in the order in
which the characters were added.

Another requirement is an abbreviated form of the character for displaying
in applications, such as interactive identification, where characters must
be selected from a list. In DELTA, this is achieved by means of comments in
the 'feature' line of a character. For example, with the character
   #10. leaves <presence>/
        1. present/
        2. absent/
Intkey would display 'leaves (presence)' in character-selection lists, but a
natural-language description would read (say) 'leaves absent'.
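
The two renderings of this character can be sketched in a few lines of Python. This only mimics the behaviour described above for this one example; the function names are hypothetical and it is not the actual Intkey or natural-language code.

```python
import re

# A comment is any text in angle brackets, with optional whitespace before it.
COMMENT = re.compile(r"\s*<([^>]*)>")


def selection_label(feature):
    """Keep comments, shown in parentheses, as in a character-selection list."""
    return re.sub(r"<([^>]*)>", r"(\1)", feature)


def description_phrase(feature, state):
    """Drop comments and append the state wording, as in a description."""
    return f"{COMMENT.sub('', feature).strip()} {state}"


print(selection_label("leaves <presence>"))
print(description_phrase("leaves <presence>", "absent"))
```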

It is often possible to meet the requirements of various contexts from a
single character list, though doing so may require some compromise - the
results for some purposes may not be optimal. For the best results, it may
be necessary to have alternative wordings. In the past, we have accommodated
this in DELTA simply by having separate character lists, and invoking the
appropriate one for different applications. This is inefficient, because a
large proportion of the words can usually be used in all applications. We
therefore want to move towards a single list, with groups of words flagged
for use in different applications or contexts. The same mechanism can be
used for different languages, and also in other text such as character
notes, and text in 'item' descriptions (text characters, and comments
associated with attributes). The syntax suggested for this in our Web
publications does not reflect our current preferences - I have not had time
to update these documents.

--

Mike Dallwitz

CSIRO Entomology, GPO Box 1700, Canberra ACT 2601, Australia
Phone: +61 2 6246 4075   Fax: +61 2 6246 4000
Email: md at ento.csiro.au  Internet: biodiversity.uno.edu/delta/