At 15:30 24/02/00 +1100, Eric Zurcher wrote:
- One "pattern" that recurs in the document is the use of "attachments" to
entities. These consist of a name, type, path, public notes, and private notes. Presumably one could define some sort of generic "attachment" object and avoid the multiple (4) redefinitions currently given in the outline. This would shorten the outline appreciably.
This sounds sensible. Should someone edit the document to reflect this, or should we just leave efficiency dividends like this to Leigh or whoever creates the final thing?
Two comments though. Firstly, most of these attachment objects will be unique - presumably it won't often happen that one attachment will be used for two taxa, or for a taxon and a character state. So do you gain much?
Secondly, ease of reading the text treatment should be a consideration. If all the information re the attachment stays with the taxon/state etc entry, this may make the treatment more navigable.
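To make the trade-off concrete, here is a minimal Python sketch of a generic attachment object. Only the five fields come from the outline; the class names (Attachment, Taxon, CharacterState), the field spellings and the example values are hypothetical. It suggests that defining the object once need not move the information away from its owner: each taxon or state entry still carries its own attachments.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Attachment:
    # The five fields listed in the outline; the names are assumptions.
    name: str
    type: str
    path: str
    public_notes: str = ""
    private_notes: str = ""


@dataclass
class Taxon:
    name: str
    # Attachments stay with the entry that owns them.
    attachments: List[Attachment] = field(default_factory=list)


@dataclass
class CharacterState:
    name: str
    attachments: List[Attachment] = field(default_factory=list)


# Hypothetical example data: one definition, reused by two kinds of entry.
taxon = Taxon("Taxon1", [Attachment("habit", "image", "img/habit.jpg")])
state = CharacterState("blue", [Attachment("blue petal", "image", "img/blue.jpg")])
```

The gain here is one definition instead of four, not shared instances, so it holds even if every attachment is unique.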
- ID numbers crop up in a number of places. But it's not obvious how
"unique" these various IDs need to be. For example, must character IDs be unique only within the context of the set of all character IDs, or of all IDs used anywhere within the treatment? Or perhaps they are intended to be used across treatments (to facilitate merging, etc.), and must be unique (but consistent) in an even broader sense? (But perhaps this is an area that is best left deliberately ambiguous for now.)
I'd say that IDs need to be unique within their context, e.g. at the character level. Super-uniqueness beyond the treatment is the lexicon issue and doesn't need to be addressed here, I think. We should construct the standard in such a way that lexica can grow, but without a requirement for them. For instance, there should be no requirement that character IDs form a contiguous integer set.
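A small Python sketch of what "unique within their context" could mean in practice. The function name and the example ID lists are made up for illustration. Each context is checked on its own, gaps in the numbering are accepted, and the same number may appear in two different contexts.

```python
from collections import Counter


def duplicate_ids(ids):
    """Return the IDs that occur more than once within a single context."""
    return sorted(i for i, n in Counter(ids).items() if n > 1)


# Hypothetical ID lists, one per context.
character_ids = [1, 2, 5, 40]   # not contiguous, which is allowed
taxon_ids = [1, 2, 3]           # reuses numbers also used for characters

# Each context is validated separately, so the overlap above is not an error.
problems = {
    "characters": duplicate_ids(character_ids),
    "taxa": duplicate_ids(taxon_ids),
}
```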
- Character (and taxon) sets - these should probably be defined
hierarchically, so that sets would be able to contain other sets, as well as the base elements (characters or taxa). Note that there seems to be another "up vs. down" problem here - is it better to define a set by listing the members within it, or for each of the members to list the sets to which it belongs?
What do people think?
Does the idea of allowing nested sets overlap with the nested character structure, blurring the boundary between the two, and perhaps become redundant? For instance, I envisage the characters being nested (as discussed earlier in the list):
Plant
  leaves
    orientation
    venation
      prominence
      reticulation
  flowers
    petals
      colour
      number
  .........etc
Sets are perhaps a shortcut across all this. If they were hierarchically structured, would you just be repeating much of the above without the lowest level?
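On the "up vs. down" question, a rough Python sketch may help. It assumes sets are defined "downward" by listing their members, and that a member may be either a base character or the name of another set. The set names and contents are invented for illustration. The "upward" view (which sets a character belongs to) can then be derived rather than stored.

```python
def expand(set_name, sets):
    """Flatten a possibly nested set into its base elements.

    Assumes the definitions contain no cycles.
    """
    result = set()
    for member in sets[set_name]:
        if member in sets:           # the member is itself a set
            result |= expand(member, sets)
        else:                        # the member is a base character
            result.add(member)
    return result


def memberships(sets):
    """Derive the upward view: for each base element, the sets containing it."""
    up = {}
    for name in sets:
        for element in expand(name, sets):
            up.setdefault(element, set()).add(name)
    return up


# Hypothetical set definitions; "field characters" cuts across the hierarchy.
sets = {
    "leaf characters": ["leaf orientation", "leaf venation"],
    "flower characters": ["petal colour", "petal number"],
    "field characters": ["leaf characters", "petal colour"],
}

field_chars = expand("field characters", sets)
up = memberships(sets)
```

Here "field characters" mixes a whole set with a single character from another branch, which is the kind of shortcut a strictly nested character list cannot express on its own.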
- I'm rather confused by footnote 3, regarding the nesting of character
names, and the restriction of "properties" to only the lowest level. What is the reason for this restriction? But certainly there seems to be merit in separating the "properties" of a character from its textual representation. This is almost essential when attempting to generate natural-language descriptions in multiple languages. Similarly, different wordings may be appropriate in different application contexts (natural-language vs. interactive keys vs. conventional keys, or keys for the layman vs. the specialist).
There are many things in the document that I haven't thought through properly, but I decided to get it out for comment before labouring further.
I think what I meant was that the higher-level structures in the character names list do not have any representation in the data "matrix", only the lowest level. Consider some data represented in the LucID way (as a taxon-state matrix) (you can do the same thing for a DELTA representation in which cells of the matrix hold taxon-character scores):
        123456789
Taxon1  010101000
Taxon2  111010101
Taxon3  000101001
Columns 1-9 represent character states. Columns 1-3 may be the states of character 1, 4-5 the states of character 2, 6-9 the states of character 3. Now characters 1&2 may both belong to a higher-level structure, but the properties of this higher-level thing are not equivalent to the properties of the state (for instance, it can't have a score).
But perhaps there are properties in common?
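The taxon-state matrix and the column grouping described above can be sketched in Python as follows. The variable and function names are mine; the data and the column ranges (1-3, 4-5, 6-9) are those given in the example. Scores exist only for the state columns. The characters, and anything above them, appear only as groupings of columns.

```python
# Taxon-state matrix from the example: one column per character state.
matrix = {
    "Taxon1": "010101000",
    "Taxon2": "111010101",
    "Taxon3": "000101001",
}

# Character number -> (first column, last column), 1-based and inclusive.
character_columns = {1: (1, 3), 2: (4, 5), 3: (6, 9)}


def states_scored(taxon, character):
    """Return the state columns of one character that are scored 1 for a taxon."""
    first, last = character_columns[character]
    row = matrix[taxon]
    return [col for col in range(first, last + 1) if row[col - 1] == "1"]
```

For example, states_scored("Taxon2", 3) gives [7, 9]. A higher-level grouping of characters 1 and 2 would just be a wider column range, with no cell of its own in the matrix.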
- This draft allows for a "score" only within the context of a "state
name". It is not obvious how characters with non-discrete values (e.g. numeric values) would be handled.
Just as for Leigh's original XDELTA.
- For purposes of natural-language generation (and perhaps other
applications), it is desirable to have some sort of "connection operator" between states within a character (e.g., "flowers blue or violet" vs. "flowers blue and violet" vs. "flowers blue to violet" all carry slightly different meanings). This and other requirements of generating natural-language descriptions might be an argument for generally preferring a "characters within taxa" representation to "taxa within characters".
Again, I think Leigh's XDELTA covers this, using nested scores.
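To illustrate what a "connection operator" buys you in natural-language generation, here is a minimal Python sketch. It is not how XDELTA represents this. The function name and its parameters are purely illustrative.

```python
def describe(character, states, connector="or"):
    """Build a short phrase such as 'flowers blue or violet'."""
    if not states:
        raise ValueError("at least one state is required")
    if len(states) == 1:
        phrase = states[0]
    else:
        # Earlier states are comma-separated; the connector precedes the last one.
        phrase = ", ".join(states[:-1]) + f" {connector} " + states[-1]
    return f"{character} {phrase}"


either = describe("flowers", ["blue", "violet"], "or")
both = describe("flowers", ["blue", "violet"], "and")
span = describe("flowers", ["blue", "violet"], "to")
```

The same two scored states yield three descriptions with different meanings, so the operator has to be recorded with the scores rather than guessed at output time.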