SDD Specifications Document

Mon Feb 28 09:47:06 CET 2000

At 15:30 24/02/00 +1100, Eric Zurcher wrote:

>1) One "pattern" that recurs in the document is the use of "attachments" to
>entities. These consist of a name, type, path, public notes, and private
>notes. Presumably one could define some sort of generic "attachment" object
>and avoid the multiple (4) redefinitions currently given in the outline.
>This would shorten the outline appreciably.

This sounds sensible. Should someone edit the document to reflect this, or
should we just leave efficiency dividends like this to Leigh or whoever
creates the final thing?

Two comments though. Firstly, most of these attachment objects will be
unique - presumably it won't often happen that one attachment will be used
for two taxa, or for a taxon and a character state. So do you gain much?

Secondly, ease of reading the text treatment should be a consideration. If
all the information re the attachment stays with the taxon/state etc entry,
this may make the treatment more navigable.
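
To make the trade-off concrete, here is a rough Python sketch of what a
generic attachment object might look like. The five fields come from Eric's
list; the class names (Attachment, Taxon, CharacterState) and the example
values are purely my assumptions for illustration, not anything in the draft.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Attachment:
    """One generic attachment, defined once instead of four times."""
    name: str
    type: str
    path: str
    public_notes: str = ""
    private_notes: str = ""


@dataclass
class Taxon:
    name: str
    attachments: List[Attachment] = field(default_factory=list)


@dataclass
class CharacterState:
    name: str
    attachments: List[Attachment] = field(default_factory=list)


# Hypothetical example data: the same attachment reused by two entities.
leaf_photo = Attachment(name="leaf photo", type="image", path="images/leaf.jpg")
taxon = Taxon(name="Taxon1", attachments=[leaf_photo])
state = CharacterState(name="venation prominent", attachments=[leaf_photo])
```

The definition is shared, but each taxon or state still carries its own
attachment list, so the information can stay with the entry it belongs to.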

>2) ID numbers crop up in a number of places. But it's not obvious how
>"unique" these various IDs need to be. For example, must character IDs be
>unique only with the context of the set of all character IDs, or of all IDs
>used anywhere within the treatment? Or perhaps they are intended to be used
>across treatments (to facilitate merging, etc.), and must be unique (but
>consistent) in an even broader sense? (But perhaps this is an area that is
>best left deliberately ambiguous for now.)

I'd say that IDs need to be unique within their context, e.g. at the character
level. Super-uniqueness beyond the treatment is the lexicon issue and
doesn't need to be addressed here, I think. We should construct the standard
in such a way that lexica can grow, but without a requirement for them. For
instance, there should be no requirement that character IDs form a
contiguous integer set.
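
As a minimal sketch of what "unique within their context" could mean in
practice (the function name and the sample IDs are assumptions of mine, not
part of the draft): each kind of ID is checked only against others of the
same kind, and gaps in the numbering are allowed.

```python
from collections import Counter


def duplicate_ids(ids):
    """Return the IDs that occur more than once in one context, sorted."""
    counts = Counter(ids)
    return sorted(i for i, n in counts.items() if n > 1)


# Hypothetical treatment: character IDs need not be contiguous, and they may
# overlap with taxon IDs because the two lists are separate contexts.
character_ids = [1, 2, 5, 40]
taxon_ids = [1, 2, 3]

assert duplicate_ids(character_ids) == []
assert duplicate_ids(taxon_ids) == []
```

A check across the whole treatment, or across treatments, would be a
different and stricter rule, which is the lexicon question.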

>3) Character (and taxon) sets - these should probably be defined
>hierarchically, so that sets would be able to contain other sets, as well as
>the base elements (characters or taxa). Note that there seems to be another
>"up vs. down" problem here - is it better to define a set by listing the
>members within it, or for each of the members to list the sets to which
>it belongs?

What do people think?

Doesn't the idea of allowing nested sets overlap with the nested character
structure, blur the boundary between the two, and perhaps become redundant?
For instance,
I envisage the characters being nested (as discussed earlier in the list):

Plant
 leaves
  orientation
  venation
   prominence
   reticulation
 flowers
  petals
   colour
   number
.........etc

Sets are perhaps a shortcut across all this. If they were hierarchically
structured, would you just be repeating much of the above without the lowest
level?
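
A small Python sketch may help frame the question. It assumes the nested
character list above is held as a nested dictionary; the helper names and the
two example sets are hypothetical. A set that simply follows the hierarchy
can be derived from the tree, while a set that cuts across it has to be
listed explicitly.

```python
# The nested character list above, as a nested dict (empty dict = lowest level).
character_tree = {
    "Plant": {
        "leaves": {
            "orientation": {},
            "venation": {"prominence": {}, "reticulation": {}},
        },
        "flowers": {
            "petals": {"colour": {}, "number": {}},
        },
    }
}


def leaf_paths(tree, prefix=()):
    """List the full path of every lowest-level character under `tree`."""
    paths = []
    for name, children in tree.items():
        path = prefix + (name,)
        if children:
            paths.extend(leaf_paths(children, path))
        else:
            paths.append(path)
    return paths


def subtree(tree, path):
    """Walk down `tree` along `path` and return the node found there."""
    node = tree
    for name in path:
        node = node[name]
    return node


# A "hierarchical" set is just a subtree, so it repeats the nesting:
leaves_path = ("Plant", "leaves")
leaf_set = leaf_paths(subtree(character_tree, leaves_path), leaves_path)

# A flat set can cut across the hierarchy, which the tree alone cannot express:
field_set = [
    ("Plant", "leaves", "venation", "prominence"),
    ("Plant", "flowers", "petals", "colour"),
]
```

If most sets people want look like leaf_set, nested sets add little; if they
look like field_set, a flat list of members may be all that is needed.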

>4) I'm rather confused by footnote 3, regarding the nesting of character
>names, and the restriction of "properties" to only the lowest level. What
>is the reason for this restriction? But certainly there seems to be merit
>in separating the "properties" of a character from its textual
>representation. This is almost essential when attempting to generate
>natural-language descriptions in multiple languages. Similarly, different
>wordings may be appropriate in different application contexts
>(natural-language vs. interactive keys vs. conventional keys, or keys for
>the layman vs. the specialist).

There are many things in the document that I haven't thought through
properly, but I decided to get it out for comment before labouring further.

I think what I meant was that the higher-level structures in the character
names list do not have any representation in the data "matrix", only the
lowest level.
Consider some data represented in the LucID way, as a taxon-state matrix (you
can do the same thing for a DELTA representation, in which cells of the
matrix hold taxon-character scores):

       123456789
Taxon1 010101000
Taxon2 111010101
Taxon3 000101001

Columns 1-9 represent character states. Columns 1-3 may be the states of
character 1, 4-5 the states of character 2, 6-9 the states of character 3.
Now characters 1 and 2 may both belong to a higher-level structure, but the
properties of this higher-level thing are not equivalent to the properties
of the state (for instance, it can't have a score).
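
To illustrate, here is the matrix above as a short Python sketch. The column
ranges follow the example (1-3, 4-5, 6-9); the variable and function names
are my own assumptions. Only states (columns) hold scores; a character is
just a grouping of columns, and anything above the character has no cells of
its own.

```python
# Taxon-state matrix from the example: one "0"/"1" score per state column.
matrix = {
    "Taxon1": "010101000",
    "Taxon2": "111010101",
    "Taxon3": "000101001",
}

# Which (1-based) state columns belong to which character.
character_columns = {
    1: range(1, 4),   # columns 1-3
    2: range(4, 6),   # columns 4-5
    3: range(6, 10),  # columns 6-9
}


def states_present(row, columns):
    """Return the 1-based columns in `columns` that are scored "1" in `row`."""
    return [col for col in columns if row[col - 1] == "1"]


def by_character(row):
    """Group the states scored in one taxon's row under their characters."""
    return {
        char: states_present(row, cols)
        for char, cols in character_columns.items()
    }


print(by_character(matrix["Taxon2"]))
# {1: [1, 2, 3], 2: [5], 3: [7, 9]}
```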

But perhaps there are properties in common?

>5) This draft allows for a "score" only within the context of a "state
>name". It is not obvious how characters with non-discrete values (e.g.
>numeric values) would be handled.

Just as for Leigh's original XDELTA.

>7) For purposes of natural-language generation (and perhaps other
>applications), it is desirable to have some sort of "connection operator"
>between states within a character (e.g., "flowers blue or violet" vs.
>"flowers blue and violet" vs. "flowers blue to violet" all carry slightly
>different meanings). This and other requirements of generating
>natural-language descriptions might be an argument for generally preferring
>a "characters within taxa" representation to "taxa within characters".

Again, I think Leigh's XDELTA covers this, using nested scores.
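
For anyone following along, the requirement itself can be sketched in a few
lines of Python. This is only an illustration of Eric's three examples, with
a made-up function name; it is not how XDELTA represents the operator.

```python
ALLOWED_CONNECTORS = ("or", "and", "to")


def describe(character, states, connector):
    """Join the scored states of one character with a connection operator."""
    if connector not in ALLOWED_CONNECTORS:
        raise ValueError(f"unknown connector: {connector!r}")
    if not states:
        raise ValueError("at least one state is required")
    return f"{character} " + f" {connector} ".join(states)


for connector in ALLOWED_CONNECTORS:
    print(describe("flowers", ["blue", "violet"], connector))
# flowers blue or violet
# flowers blue and violet
# flowers blue to violet
```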