Special states: stepping back to the generalized character definition and coding model
Steve Shattuck wrote:
For "Special States", another way to look at it is that this doesn't really relate to a character but to the coding of a given character for a given taxon. It is really the status of the coding of that character for a taxon. Instead of storing it as a part of the character (by creating a new state for the character), why not create a new element for this information and attach it to the taxon-by-character intersection (or "cell", or "taxon description")? Call it "Coding Status" with an enumerated list of "Coded" (meaning the taxon has been coded for this character), "Not yet coded" (meaning it will be coded when I get around to it), "Not to be coded" (meaning it can be coded but I am not going to) and "Can't be coded". Then add a text attribute to this element to allow an explanation of what's going on for the coding of this character for this taxon ("I haven't yet coded this character because suitable material of this taxon is currently unavailable"). This would seem to meet the needs of computer processing (through a defined, machine-readable enumerated list), extensibility (by supporting additions to this list such as "Coded but unreliable") and a full, human-readable explanation of what's going on (in the text comment).
I agree on the need to add an option for annotation text and to keep the "coding status/missing data indicator" list relatively short.
I also can basically accept your list, except that the last example usually applies to a coded state, not the character, i.e. one state may be coded unreliably, the other ok. Currently we propose to use uncertainty modifiers (probably, perhaps, etc.) for this case. This is similar to the "by-misinterpretation" mechanism originally proposed by Kevin, now proposed in SDD to be handled through a coding modifier on an existing state.
General question to all: Any examples where having uncertainty/unreliability should apply to the entire character rather than to the states, i.e. where if implemented as a modifier this would always have to be added to all states?
I wonder whether the "Coded" option is redundant. This should be evident if no other CodingStatus is present and data are coded. Except for redundancy (and therefore the need to enforce a rule that "Coded" requires categorical or numerical data to be indeed coded; else it would be a "lie"), I have no problem with it.
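To make the proposal and the redundancy concern concrete, here is a minimal Python sketch. The class and field names (`CodingStatus`, `Cell`, `effective_status`, `is_consistent`) are hypothetical and are not part of SDD; only the four status labels come from the list quoted above. It assumes that an absent status plus coded data implies "Coded", and that an explicit "Coded" without data is the inconsistency to reject.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class CodingStatus(Enum):
    # Labels taken from the proposed enumerated list.
    CODED = "Coded"
    NOT_YET_CODED = "Not yet coded"
    NOT_TO_BE_CODED = "Not to be coded"
    CANNOT_BE_CODED = "Can't be coded"


@dataclass
class Cell:
    """One taxon-by-character intersection (hypothetical structure)."""
    taxon: str
    character: str
    states: List[str] = field(default_factory=list)
    status: Optional[CodingStatus] = None
    note: str = ""  # free-text, human-readable explanation

    def effective_status(self) -> Optional[CodingStatus]:
        # An explicit status wins; otherwise "Coded" is implied by the data.
        if self.status is not None:
            return self.status
        return CodingStatus.CODED if self.states else None

    def is_consistent(self) -> bool:
        # An explicit "Coded" with no data behind it would be a "lie".
        return not (self.status is CodingStatus.CODED and not self.states)
```

With this rule, a cell holding data and no explicit status reports "Coded" by itself, which is the sense in which the explicit option may be redundant.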
We need to take a step back before we can move forward.
I am quite willing to take that step back. Let me try to explain why I think it is a good idea to generalize "coding status/missing data indicators" so that they are handled akin to categorical states:
The current SDD model for categorical characters is:
1. It is possible to define shared state sets globally. Example: color-states: red, green, blue, ...
2. The character definition list allows one to
   a) define local states, and
   b) refer to zero or more shared state sets (= inherit them), either
      alpha) as a whole set, or
      beta) restricted to the states specified in an enumeration.
   (Note: if the shared definition is expanded, alpha automatically inherits the new state, beta does not.)
3. The item descriptions always refer to states defined in the character definition list, never to global/shared definitions directly.
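The difference between the two inheritance modes can be sketched in a few lines of Python. This is an illustration only: `SHARED_STATE_SETS` and `resolve_states` are hypothetical names, and representing "whole set" by `None` is an assumption made for the sketch, not something SDD prescribes.

```python
from typing import Dict, List, Optional

# Globally defined shared state sets (point 1).
SHARED_STATE_SETS: Dict[str, List[str]] = {"color": ["red", "green", "blue"]}


def resolve_states(
    local_states: List[str],
    inherits: Dict[str, Optional[List[str]]],
) -> List[str]:
    """Return the states available to one character definition (point 2).

    `inherits` maps a shared set name to None (alpha: inherit the whole
    set) or to an explicit list of state names (beta: restricted).
    """
    resolved = list(local_states)
    for set_name, restriction in inherits.items():
        shared = SHARED_STATE_SETS[set_name]
        if restriction is None:
            # alpha: whatever the shared set contains at resolution time
            resolved.extend(shared)
        else:
            # beta: only the enumerated states, even if the set grows later
            resolved.extend(s for s in shared if s in restriction)
    return resolved
```

If "yellow" is later appended to the shared "color" set, resolving `{"color": None}` again picks it up, while `{"color": ["red", "green"]}` still yields only red and green.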
----------
I view this as a structural generalization, and tried to model statistical parameters like min/mean/max/s.d./s.e./sample size etc. accordingly:
1. Global measures define the semantics of perhaps 30 parameters for the first version of SDD. To keep the first SDD version simple, the list would not yet be expandable, but it is modeled such that it can easily be expanded later on. Statistical parameters already carry descriptive attributes that allow them to be interpreted for most purposes without referring to the state code.
2. Within a numeric character in Terminology: character definition it is possible to select only specific parameters.
3. In the item description, only parameters selected by the designer of the terminology can be used.
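The same three-level pattern for numeric characters might look as follows. This is a sketch under stated assumptions: the parameter codes and the function names `select_parameters` and `check_description` are invented for illustration, and only six of the "perhaps 30" parameters are shown.

```python
from typing import Dict, Iterable, List

# Point 1: a fixed global list; each code carries a descriptive label.
GLOBAL_PARAMETERS: Dict[str, str] = {
    "min": "minimum",
    "mean": "arithmetic mean",
    "max": "maximum",
    "sd": "standard deviation",
    "se": "standard error",
    "n": "sample size",
}


def select_parameters(codes: Iterable[str]) -> Dict[str, str]:
    """Point 2: a numeric character enables a subset of the global list."""
    codes = list(codes)
    unknown = [c for c in codes if c not in GLOBAL_PARAMETERS]
    if unknown:
        raise ValueError(f"not in the global parameter list: {unknown}")
    return {c: GLOBAL_PARAMETERS[c] for c in codes}


def check_description(enabled: Dict[str, str], values: Dict[str, float]) -> List[str]:
    """Point 3: report parameters used in an item description that the
    character definition did not enable (empty list means valid)."""
    return sorted(code for code in values if code not in enabled)
```

For a character defined with `select_parameters(["min", "mean", "max"])`, a description that supplies `sd` would be reported as using a parameter the terminology designer did not select.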
----------
Now in that admittedly confusing "special state" paper from March I tried to figure out whether we should really (as implemented in Brazil) use this same generalization for "special states" or "coding status/missing data indicators":
1. A shared state set defines special states with fixed codes (i.e. not user-expandable in the first SDD version, but, since all information is already in the data, easily expandable in future versions).
2. Within a character in Terminology: define which of these status/indicator states are to be enabled for a given character.
3. The item descriptions always refer to states defined in the character definition list, never to global/shared definitions directly.
----------
You propose to handle them separately. What would be the advantages and disadvantages? Please add to the list if I overlook something:
Advantages:
a) It avoids confusion in the definition of the terminology between normal categorical states and "coding status/missing data indicators". These things are structurally related in that both are specific to a given character when the character is coded, but they yield information from different (and non-overlapping!) knowledge domains, viz. biology for character states and knowledge metadata for "coding status/missing data indicators".
Disadvantages:
a) The list is fixed, and changes to the software are required if the next SDD version adds another coding status.
b) Reports can be generated only in the language of the program. If the data are in German but the software is English, the software must add a special feature to translate the reporting of this information into German.
c) When collating/compiling data from children, "coding status/missing data indicators" must be handled identically to normal data. In particular, this means that the "coding status/missing data indicators" must support multiple concurrent states:
Item 1, character 1: state 1
Item 2, character 1: coding status "excluded/scoped out"
Item 3, character 1: coding status: "Not applicable"
if these 3 items are species within a genus, the compiled genus description would be:
genus item, character 1: state 1, or "excluded/scoped out", or "Not applicable"
In which type of report this information would be made visible is another question, i.e. a difference between the types of states remains. However, it would be clearly inappropriate to summarize this into just "state 1".
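The compilation step in this example can be sketched as a plain union over the children, treating status indicators exactly like ordinary states. The function name `compile_parent` and the dictionary layout are assumptions made for illustration.

```python
from typing import Dict, List


def compile_parent(children: Dict[str, List[str]]) -> List[str]:
    """Collect every value coded in any child item for one character.

    Ordinary states and coding-status indicators are kept side by side;
    nothing is dropped, and first-seen order is preserved.
    """
    compiled: List[str] = []
    for values in children.values():
        for value in values:
            if value not in compiled:
                compiled.append(value)
    return compiled


species = {
    "Item 1": ["state 1"],
    "Item 2": ["excluded/scoped out"],
    "Item 3": ["Not applicable"],
}
genus_character_1 = compile_parent(species)
```

Here `genus_character_1` holds all three values, so a report can still decide later which of them to show; collapsing the result to just "state 1" would lose that information.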
Note that I was especially delighted to see that probably the entire question of character scoping automatically falls into place, without any additional structural complexity, if "scoped out" can be inherited as a default value from the taxonomic parents! This is my only excuse for including preliminary notes on the issue of default states in the special state paper, which made the paper a lot more confusing!
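One possible reading of this default mechanism, as a hedged Python sketch: a taxon without its own coding for a character takes the value of its nearest taxonomic ancestor that has one. The lookup rule, the function name `effective_values`, and the taxon names are all assumptions; the notes on default states are described above as preliminary.

```python
from typing import Dict, List, Optional, Tuple


def effective_values(
    taxon: str,
    character: str,
    parent: Dict[str, str],
    explicit: Dict[Tuple[str, str], List[str]],
) -> List[str]:
    """Return the taxon's own coding, else the nearest ancestor's default."""
    current: Optional[str] = taxon
    while current is not None:
        values = explicit.get((current, character))
        if values is not None:
            return values
        current = parent.get(current)  # None once the root is passed
    return []


# Hypothetical hierarchy: "character 1" is scoped out at family level.
parent = {"Species A": "Genus X", "Genus X": "Family Y"}
explicit = {
    ("Family Y", "character 1"): ["excluded/scoped out"],
    ("Species A", "character 2"): ["state 1"],
}
inherited = effective_values("Species A", "character 1", parent, explicit)
```

In this sketch `inherited` is the family-level "excluded/scoped out" value, so scoping needs no structure beyond the status indicator itself.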
---------------
This is my list, and on the basis of this list I would prefer to keep the structural generalization, rather than implementing a completely different mechanism. However, it is definitely good to take that step back and rethink this.
To all: please respond if I currently overlook issues here!
However, even if "coding status/missing data indicators" are handled analogously to shared categorical states and statistical parameters, this does not mean they have to be in one XML element. Whether a database would store them in one table (which would make sense to me) is quite independent of this.
So, we could have (as an example, not necessarily the actual schema):
Terminology
  SharedStates
    StatisticalParameters
    MissingDataIndicators / or CodingStatus
    CategoricalStates
  CharacterDefinition
    StatisticalParameters
    MissingDataIndicators / or CodingStatus
    CategoricalStates
Description
  Character keyref=...
    StatisticalParameters
    MissingDataIndicators / or CodingStatus
    CategoricalStates
----
Or:
Terminology
  SharedStates
    StatisticalParameters
    MissingDataIndicators / or CodingStatus
    CategoricalStates
  CharacterDefinition
    <StateDefinition xs:type="StatisticalParametersDef" key=...>
    <StateDefinition xs:type="MissingDataIndicatorsDef" key=...>
    <StateDefinition xs:type="CategoricalStatesDef" key=...>
Description
  Character keyref=...
    <StateReference xs:type="StatisticalParameterRef" keyref=...>
    <StateReference xs:type="MissingDataIndicatorRef" keyref=...>
    <StateReference xs:type="CategoricalStateRef" keyref=...>
------
Which is preferable?
------
Technical aside regarding the second version: I still don't fully understand the implications of using the latter subtyping ("substitution group"?) version with regard to key/keyref constraints. One question is: is it possible to make the schema require xs:type? It seems that this is only optional, and that if it is not present it is up to the consuming application to figure out which subtype is actually used. However, whereas the CharacterDefinition subtypes (e.g. StatisticalParametersDef) differ in their complex content and can be recognized, the reference subtypes (e.g. StatisticalParameterRef) do not differ EXCEPT in the keyref constraint. My guess is that the above therefore does not work in XML Schema, but please comment if you have insight on this!
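To illustrate what "up to the consuming application" would mean in practice, here is a small Python sketch using the standard library's `xml.etree.ElementTree`. It assumes the type attribute in question is the standard `xsi:type` from the XMLSchema-instance namespace; the fragment, the key values, and the function name `group_references` are made up. It does not validate key/keyref constraints; it only shows that references which differ in nothing but their type annotation cannot be told apart once that annotation is missing.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# ElementTree exposes namespaced attributes in {namespace}local form.
XSI_TYPE = "{http://www.w3.org/2001/XMLSchema-instance}type"

# Hypothetical instance fragment modeled on the second outline.
FRAGMENT = """
<Character xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" keyref="c1">
  <StateReference xsi:type="StatisticalParameterRef" keyref="mean"/>
  <StateReference xsi:type="MissingDataIndicatorRef" keyref="not-applicable"/>
  <StateReference xsi:type="CategoricalStateRef" keyref="s1"/>
</Character>
"""


def group_references(xml_text: str) -> Dict[str, List[str]]:
    """Group StateReference keyrefs by their declared subtype."""
    root = ET.fromstring(xml_text.strip())
    grouped: Dict[str, List[str]] = {}
    for ref in root.iter("StateReference"):
        subtype = ref.get(XSI_TYPE)
        if subtype is None:
            # The element content is identical for all subtypes, so the
            # consumer has nothing else to decide on.
            raise ValueError("StateReference without a type annotation")
        grouped.setdefault(subtype, []).append(ref.get("keyref"))
    return grouped
```

A consumer written this way has to reject untyped references outright, which is the practical cost of the second version if the schema cannot make the annotation mandatory.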
Gregor
----------------------------------------------------------
Gregor Hagedorn (G.Hagedorn@bba.de)
Institute for Plant Virology, Microbiology, and Biosafety
Federal Research Center for Agriculture and Forestry (BBA)
Koenigin-Luise-Str. 19           Tel: +49-30-8304-2220
14195 Berlin, Germany            Fax: +49-30-8304-2203
Often wrong but never in doubt!