SDD Specifications Document

Thu Feb 24 15:30:25 CET 2000

At 16:02 23/02/2000 +1100, Kevin Thiele wrote:
>Dear list_eners
>
>attached find SDDspecs.rtf. This is my attempt to formalise the discussion
>so far into a set of specifications for a new descriptive data standard.
>

and requested comments on his proposal. I've not had time to think this
through in detail, but I'd nonetheless like to offer a few scattered
observations.

1) One "pattern" that recurs in the document is the use of "attachments" to
entities. These consist of a name, type, path, public notes, and private
notes. Presumably one could define some sort of generic "attachment" object
and avoid the multiple (4) redefinitions currently given in the outline.
This would shorten the outline appreciably.
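
As a rough illustration of what a single generic "attachment" object could
look like, here is a minimal Python sketch. The five fields come from the
outline; the class names (Attachment, Character, Taxon) and the sample
values are hypothetical and are not part of SDDspecs.rtf.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Attachment:
    """One generic attachment, defined once and reused by every entity."""
    name: str
    type: str
    path: str
    public_notes: str = ""
    private_notes: str = ""


@dataclass
class Character:
    # Hypothetical entity: it only needs a list of the shared Attachment type.
    name: str
    attachments: List[Attachment] = field(default_factory=list)


@dataclass
class Taxon:
    # A second hypothetical entity reusing the same definition.
    name: str
    attachments: List[Attachment] = field(default_factory=list)


leaf = Character("leaf shape")
leaf.attachments.append(Attachment("leaf outline", "image", "images/leaf.gif"))
```

Each entity then carries a list of the same object, rather than repeating
the five fields in its own definition.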

2) ID numbers crop up in a number of places. But it's not obvious how
"unique" these various IDs need to be. For example, must character IDs be
unique only within the context of the set of all character IDs, or of all IDs
used anywhere within the treatment? Or perhaps they are intended to be used
across treatments (to facilitate merging, etc.), and must be unique (but
consistent) in an even broader sense? (But perhaps this is an area that is
best left deliberately ambiguous for now.)

3) Character (and taxon) sets - these should probably be defined
hierarchically, so that sets would be able to contain other sets, as well as
the base elements (characters or taxa). Note that there seems to be another
"up vs. down" problem here - is it better to define a set by listing the
members within it, or for each of the members to list the sets to which
it belongs?
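
To make the two alternatives concrete, here is a small Python sketch of the
"down" form (each set lists its members, which may themselves be sets),
together with a function that derives the "up" form (each member lists its
sets) from it. The names ItemSet, flatten and memberships, and the sample
characters, are my own inventions for illustration. The sketch assumes that
no set contains itself, directly or indirectly.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Union


@dataclass
class ItemSet:
    """A named set whose members are base elements (strings) or other sets."""
    name: str
    members: List[Union[str, "ItemSet"]] = field(default_factory=list)

    def flatten(self) -> Set[str]:
        """Return every base element reachable from this set."""
        result: Set[str] = set()
        for member in self.members:
            if isinstance(member, ItemSet):
                result |= member.flatten()
            else:
                result.add(member)
        return result


def memberships(root: ItemSet) -> Dict[str, Set[str]]:
    """Derive the 'up' view: for each base element, the sets it belongs to."""
    index: Dict[str, Set[str]] = {}

    def visit(item_set: ItemSet) -> None:
        for element in item_set.flatten():
            index.setdefault(element, set()).add(item_set.name)
        for member in item_set.members:
            if isinstance(member, ItemSet):
                visit(member)

    visit(root)
    return index


floral = ItemSet("floral", ["petal colour", "petal number"])
reproductive = ItemSet("reproductive", [floral, "fruit type"])
```

Since either view can be computed from the other, the choice is mostly about
which form is easier to author and to merge.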

4) I'm rather confused by footnote 3, regarding the nesting of character
names, and the restriction of "properties" to only the lowest level. What
is the reason for this restriction? But certainly there seems to be merit
in separating the "properties" of a character from its textual
representation. This is almost essential when attempting to generate
natural-language descriptions in multiple languages. Similarly, different
wordings may be appropriate in different application contexts
(natural-language vs. interactive keys vs. conventional keys, or keys for
the layman vs. the specialist).
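
One way to picture this separation is sketched below in Python: the
character keeps its properties in one place and its wordings in a table
keyed by language and application context. Everything here (the Character
class, the "default" context, the property and wording values) is a
hypothetical illustration, not something taken from the draft.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class Character:
    char_id: int
    # Language-independent properties of the character.
    properties: Dict[str, str] = field(default_factory=dict)
    # Wordings keyed by (language, context); contexts are assumed names.
    wordings: Dict[Tuple[str, str], str] = field(default_factory=dict)

    def wording(self, language: str, context: str = "default") -> str:
        """Return the context-specific wording, else the language default."""
        key = (language, context)
        if key in self.wordings:
            return self.wordings[key]
        return self.wordings[(language, "default")]


petal = Character(
    char_id=1,
    properties={"kind": "unordered multistate"},
    wordings={
        ("en", "default"): "petal colour",
        ("en", "interactive-key"): "What colour are the petals?",
        ("fr", "default"): "couleur des pétales",
    },
)
```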

5) This draft allows for a "score" only within the context of a "state
name". It is not obvious how characters with non-discrete values (e.g.
numeric values) would be handled.
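
Purely as a sketch of one possibility, a score could be allowed to be
either a state-based value or a numeric range. The Python below assumes
this; the names StateScore, NumericScore and describe, and the fields on
them, are hypothetical and do not appear in the draft.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class StateScore:
    """A score recorded against a named state, as in the draft."""
    state_name: str
    score: str = "present"  # assumed placeholder for whatever a score holds


@dataclass
class NumericScore:
    """A score for a non-discrete character, stored as a range."""
    low: float
    high: float
    unit: str = ""


Score = Union[StateScore, NumericScore]


def describe(character: str, score: Score) -> str:
    """Render either kind of score as a short phrase."""
    if isinstance(score, NumericScore):
        if score.low == score.high:
            value = f"{score.low:g}"
        else:
            value = f"{score.low:g}-{score.high:g}"
        return f"{character} {value} {score.unit}".strip()
    return f"{character} {score.state_name}"


print(describe("leaf length", NumericScore(20, 35, "mm")))
print(describe("petal colour", StateScore("blue")))
```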

6) I'm intrigued by the notion of a "Progressive Revelation model"
(footnote 5). It sounds terribly theological - or perhaps that's
Thiele-logical? (my apologies to Kevin, but I really can't resist bad puns).

7) For purposes of natural-language generation (and perhaps other
applications), it is desirable to have some sort of "connection operator"
between states within a character (e.g., "flowers blue or violet" vs.
"flowers blue and violet" vs. "flowers blue to violet" all carry slightly
different meanings). This and other requirements of generating
natural-language descriptions might be an argument for generally preferring
a "characters within taxa" representation to "taxa within characters".
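
A minimal Python sketch of such a "connection operator" follows, assuming
it is stored alongside the list of states scored for a character within a
taxon. The names Connector and phrase are hypothetical, and the three
operators are simply the ones from the example above.

```python
from enum import Enum
from typing import List


class Connector(Enum):
    OR = "or"
    AND = "and"
    TO = "to"


def phrase(character: str, states: List[str], connector: Connector) -> str:
    """Join the states of one character using the given connector."""
    if not states:
        raise ValueError("at least one state is required")
    if len(states) == 1:
        joined = states[0]
    elif connector is Connector.TO:
        # A range reads naturally as "a to b to c".
        joined = " to ".join(states)
    else:
        # Comma-separate all but the last state, then add the connector.
        joined = ", ".join(states[:-1]) + f" {connector.value} {states[-1]}"
    return f"{character} {joined}"


print(phrase("flowers", ["blue", "violet"], Connector.OR))
print(phrase("flowers", ["blue", "violet"], Connector.TO))
```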

Cheers,

Eric Zurcher
CSIRO Division of Entomology
Canberra, Australia
E-mail: ericz at ento.csiro.au