Special states: stepping back to the generalized character definition and coding model

Wed Sep 3 13:05:41 CEST 2003

I'm afraid this doesn't make much sense to me so I'll have to let the rest of you sort it out.  When you come up with a solution we'll see if we can implement it in the morphology section of BioLink.

Sorry about that, Steve

Steve Shattuck
CSIRO Entomology
steve.shattuck at csiro.au

-----Original Message-----
From: Gregor Hagedorn
To: TDWG-SDD at USOBI.ORG
Sent: 9/2/2003 9:36 PM
Subject: Special states: stepping back to the generalized character definition and coding model

Steve Shattuck wrote:

> For "Special States", another way to look at it is that this doesn't
> really relate to a character but to the coding of a given character
> for a given taxon.  It is really the status of the coding of that
> character for a taxon.  Instead of storing it as a part of the
> character (by creating a new state for the character) why not create a
> new element for this information and attach it to the taxon by
> character intersection (or "cell", or "taxon description").  Call it
> "Coding Status" with an enumerated list of "Coded" (meaning the taxon
> has been coded for this character), "Not yet coded" (meaning it will
> be coded when I get around to it), "Not to be coded" (meaning it can
> be coded but I am not going to) and "Can't be coded".  Then add a text
> attribute to this element to allow an explanation of what's going on
> for the coding of this character for this taxon ("I haven't yet coded
> this character because suitable material of this taxon is currently
> unavailable").  This would seem to meet the needs of computer
> processing (through a defined, machine-readable enumerated list),
> extensibility (by supporting additions to this list such as "Coded but
> unreliable") and a full, human-readable explanation of what's going on
> (in the text comment).
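
For readers who prefer code to prose, the proposal quoted above could
be sketched roughly as follows. This is only an illustration in
Python; the names CodingStatus and Cell, the taxon and the character
are hypothetical and not part of SDD or BioLink.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class CodingStatus(Enum):
    """Enumerated, machine-readable status list from the proposal."""
    CODED = "Coded"
    NOT_YET_CODED = "Not yet coded"
    NOT_TO_BE_CODED = "Not to be coded"
    CANNOT_BE_CODED = "Can't be coded"


@dataclass
class Cell:
    """One taxon-by-character intersection (hypothetical structure)."""
    taxon: str
    character: str
    status: CodingStatus
    note: Optional[str] = None  # free-text, human-readable explanation


# Example data is invented for illustration only.
cell = Cell(
    taxon="Taxon A",
    character="character 1",
    status=CodingStatus.NOT_YET_CODED,
    note="Suitable material of this taxon is currently unavailable.",
)
```

Extending the list (e.g. with "Coded but unreliable") would then mean
adding one member to the enumeration.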

I agree on the need to add an option for annotation text and to keep
the "coding status/missing data indicator" list relatively short.

I also can basically accept your list, except that the last example
usually applies to a coded state, not the character, i.e. one state
may be coded unreliably, the other ok. Currently we propose to use
uncertainty modifiers (probably, perhaps, etc.) for this case. This
is similar to the "by-misinterpretation" mechanism originally
proposed by Kevin, now proposed in SDD to be handled through a coding
modifier on an existing state.

General question to all: Any examples where having
uncertainty/unreliability should apply to the entire character rather
than to the states, i.e. where if implemented as a modifier this
would always have to be added to all states?

I wonder whether the "Coded" option is redundant. This should be
evident if no other CodingStatus is present and data are coded.
Except for redundancy (and therefore the need to enforce a rule that
"Coded" requires categorical or numerical data to be indeed coded;
else it would be a "lie"), I have no problem with it.

> We need to take a step back before we can move forward.

I am quite willing to take that step back. Let me try to explain the
reasoning why I think it is a good idea to generalize "coding
status/missing data indicators" to be handled akin to categorical
states:

The current SDD model for categorical characters is:

1. It is possible to define shared state sets globally. Example:
color-states: red, green, blue, ...

2. The character definition list allows one to
 a) define local states
 b) refer to zero or more shared state sets (= inherit them)
    alpha) as a whole set
    beta) restricting the inheritance to states
          specified in an enumeration
    (Note: if the shared definition is expanded, alpha
      automatically inherits the new state, beta does not)

3. The item descriptions always refer to states defined in
   the character definition list, never to global/shared
   definitions directly.
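
The three points above can be sketched in a few lines of Python. This
is a minimal illustration, not the SDD schema: the dictionary layout
and the names resolve_states, check_description, petal and leaf are
assumptions made up for the example.

```python
# 1. Shared state sets, defined globally.
SHARED_STATE_SETS = {
    "color-states": ["red", "green", "blue"],
}


def resolve_states(char_def, shared_sets):
    """2. States usable for a character: local ones plus inherited ones."""
    states = list(char_def.get("local_states", []))
    for ref in char_def.get("inherits", []):
        only = ref.get("only")  # None = whole set (alpha), list = enumeration (beta)
        for state in shared_sets[ref["set"]]:
            if (only is None or state in only) and state not in states:
                states.append(state)
    return states


def check_description(coded_states, char_def, shared_sets):
    """3. Return coded states that the character definition does not allow."""
    allowed = resolve_states(char_def, shared_sets)
    return [s for s in coded_states if s not in allowed]


# Hypothetical characters: one inherits the whole set, one a restricted part.
petal = {"local_states": ["striped"], "inherits": [{"set": "color-states"}]}
leaf = {"inherits": [{"set": "color-states", "only": ["green"]}]}
```

If "yellow" is later appended to the shared set, resolve_states picks
it up for petal (alpha) but not for leaf (beta), which is the
behaviour described in the note under 2.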

----------

I view this as a structural generalization, and tried to model
statistical parameters like min/mean/max/s.d./s.e./sample size etc.
accordingly:

1. Global measures define the semantics of perhaps 30 parameters for
the first version of SDD. To keep things simple, the list would not
yet be expandable in the first SDD version, but it is modeled such
that it is easily expandable later on. Statistical parameters already
carry descriptive attributes that allow them, for most purposes, to
be interpreted without referring to the state code.

2. Within a numeric character in Terminology: character definition it
is possible to select only specific parameters.

3. In the item description only parameters selected by the designer
of the terminology can be used.
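
As a rough Python sketch of the same three steps for numeric
characters: the parameter codes, labels and function names below are
invented for illustration and cover only the six measures named
above, not the full list envisaged for SDD.

```python
# 1. Global measures: code -> descriptive label (illustrative subset).
GLOBAL_MEASURES = {
    "min": "Minimum",
    "mean": "Mean",
    "max": "Maximum",
    "sd": "Standard deviation",
    "se": "Standard error",
    "n": "Sample size",
}


def define_numeric_character(name, selected):
    """2. A character definition selects a subset of the global measures."""
    unknown = [p for p in selected if p not in GLOBAL_MEASURES]
    if unknown:
        raise ValueError(f"not a global measure: {unknown}")
    return {"name": name, "parameters": list(selected)}


def code_item(char_def, values):
    """3. An item description may only use the selected parameters."""
    not_enabled = [p for p in values if p not in char_def["parameters"]]
    if not_enabled:
        raise ValueError(f"not enabled for {char_def['name']}: {not_enabled}")
    # The descriptive label lets a report be read without the code.
    return {GLOBAL_MEASURES[p]: v for p, v in values.items()}


leaf_length = define_numeric_character("leaf length", ["min", "mean", "max"])
coded = code_item(leaf_length, {"min": 2.0, "max": 5.5})
```

Trying to code "sd" for leaf_length raises an error, because the
terminology designer did not select it.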

----------

Now in that admittedly confusing "special state" paper from March I
tried to figure out whether we should really (as implemented in
Brazil) use this same generalization for "special states" or "coding
status/missing data indicators":

1. A shared state set defining special states with fixed codes (i.e.
not user expandable in the first SDD version, but since all
information is already in the data, easily expandable in future
versions).

2. Within a character in Terminology: define which of these
   status/indicator states are to be enabled for a given character.

3. The item descriptions always refer to states defined in
   the character definition list, never to global/shared
   definitions directly.

----------

You propose to handle them separately. What would be the advantages
and disadvantages? Please add to the list if I overlook something:

Advantages:
a) avoid confusion in the definition of the terminology between
normal categorical states and "coding status/missing data
indicators". These things are structurally related in that both are
specific to a given character when the character is coded, but yield
information from different (although non-overlapping!) knowledge
domains, viz. biology for character states and knowledge meta data
for "coding status/missing data indicators".

Disadvantages:
a) the list is fixed and changes to the software are required if the
next SDD version adds another coding status.
b) reports can be generated only in the language of the program. If
the data are in German, but the software is English, the software
must add a special feature to translate the reporting of this
information into German.
c) when collating/compiling data from children, "coding
status/missing data indicators" must be handled identically to normal
data. That means, in particular, that the "coding status/missing data
indicators" must support multiple concurrent states:

 Item 1, character 1:
   state 1
 Item 2, character 1:
   coding status "excluded/scoped out"
 Item 3, character 1:
   coding status: "Not applicable"

If these 3 items are species within a genus, the compiled genus
description would be:

genus item, character 1:
  state 1, or "excluded/scoped out", or "Not applicable"

In which type of report this information would be made visible is
another question, i.e. a difference between the types of states
remains. However, it would be clearly inappropriate to summarize this
into just "state 1".
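
The compilation step in the example above amounts to an ordered union
over the children. A small Python sketch, in which the tuple layout
(kind, label) and the function name compile_parent are assumptions
for illustration only:

```python
def compile_parent(child_codings):
    """Collate the codings of all children into one list without duplicates.

    Ordinary states and coding-status values are treated alike, so the
    parent can carry several of them concurrently.
    """
    compiled = []
    for coding in child_codings:
        for value in coding:
            if value not in compiled:
                compiled.append(value)
    return compiled


# Character 1 as coded for the three species (items 1 to 3 above).
species = [
    [("state", "state 1")],
    [("status", "excluded/scoped out")],
    [("status", "Not applicable")],
]

genus = compile_parent(species)
summary = " or ".join(label for _kind, label in genus)
```

Keeping the kind next to the label means a report can still decide
later whether to show the status values or only the ordinary states.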

Note that I was especially delighted to see that probably the entire
question of character scoping automatically falls into place, without
any additional structural complexity, if "scoped out" can be
inherited as a default value from the taxonomic parents! This is my
only excuse for including preliminary notes on the issue of
default states in the special state paper, which made the paper a lot
more confusing!
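
To make the inheritance idea concrete, here is a minimal Python
sketch. It assumes a simple parent lookup table and that the nearest
explicit coding wins; the taxon names and the function name
effective_coding are hypothetical, and the real default-state rules
are not yet settled.

```python
def effective_coding(taxon, character, parents, codings):
    """Walk up the taxonomic hierarchy until a coding is found.

    A value coded directly on the taxon overrides anything inherited;
    otherwise the nearest parent's value (e.g. "scoped out") is the default.
    """
    node = taxon
    while node is not None:
        value = codings.get((node, character))
        if value is not None:
            return value
        node = parents.get(node)
    return None  # nothing coded anywhere along the path


# Invented example data.
parents = {"Species a": "Genus A", "Species b": "Genus A", "Genus A": None}
codings = {
    ("Genus A", "character 1"): "scoped out",
    ("Species b", "character 1"): "state 1",
}

inherited = effective_coding("Species a", "character 1", parents, codings)
overridden = effective_coding("Species b", "character 1", parents, codings)
```

Here Species a inherits "scoped out" from the genus, while Species b
keeps its own coding.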

---------------

This is my list, and on the basis of this list I would prefer to keep
the structural generalization, rather than implementing a completely
different mechanism. However, it is definitely good to take that step
back and rethink this.

To all: please respond if I currently overlook issues here!

However, even if "coding status/missing data indicators" are handled
analogously to shared categorical states and statistical parameters,
this does not mean they have to be in one xml element. Whether a
database would store them in one table (would make sense to me) is
quite independent of this.

So, we could have (as an example, not necessarily the actual schema):

Terminology
 SharedStates
   StatisticalParameters
   MissingDataIndicators / or CodingStatus
   CategoricalStates
 CharacterDefinition
   StatisticalParameters
   MissingDataIndicators / or CodingStatus
   CategoricalStates

Description
  Character keyref=...
    StatisticalParameters
    MissingDataIndicators / or CodingStatus
    CategoricalStates

----

Or:

Terminology
 SharedStates
   StatisticalParameters
   MissingDataIndicators / or CodingStatus
   CategoricalStates
 CharacterDefinition
   <StateDefinition xsi:type="StatisticalParametersDef" key=...>
   <StateDefinition xsi:type="MissingDataIndicatorsDef" key=...>
   <StateDefinition xsi:type="CategoricalStatesDef" key=...>
Description
  Character keyref=...
    <StateReference xsi:type="StatisticalParameterRef" keyref=...>
    <StateReference xsi:type="MissingDataIndicatorRef" keyref=...>
    <StateReference xsi:type="CategoricalStateRef" keyref=...>

------

Which is preferable?

------

Technical aside regarding the second version: I still don't fully
understand the implications of using the latter subtyping
("substitution group"?) version in regard to key/keyref constraints.
One thing is: is it possible to make the schema require xsi:type?
It seems that this is only optional, and that if it is not
present it is up to the consuming application to figure out which
subtype is actually used. However, whereas the CharacterDefinition
subtypes (e.g. StatisticalParametersDef) differ in their complex
content and can be recognized, the reference subtypes (e.g.
StatisticalParameterRef) do not differ EXCEPT in the keyref
constraint. My guess is that therefore the above does not work in
xml schema, but please comment if you have insight on this!
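
To illustrate what "up to the consuming application" means in
practice, here is a small Python sketch using the standard
xml.etree.ElementTree module. The attribute that XML Schema defines
for declaring the subtype in an instance is xsi:type, in the
XMLSchema-instance namespace. The sample document and the keyref
values are invented; this shows only how a consumer might read the
attribute and says nothing about whether a schema can enforce it.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the xsi prefix to the full namespace URI.
XSI_TYPE = "{http://www.w3.org/2001/XMLSchema-instance}type"

# Hypothetical instance fragment; the third reference omits xsi:type.
DOC = """
<Character xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" keyref="c1">
  <StateReference xsi:type="StatisticalParameterRef" keyref="mean"/>
  <StateReference xsi:type="MissingDataIndicatorRef" keyref="not-applicable"/>
  <StateReference keyref="s1"/>
</Character>
"""


def classify(xml_text):
    """Return (keyref, declared subtype or None) for each StateReference."""
    root = ET.fromstring(xml_text.strip())
    return [
        (ref.get("keyref"), ref.get(XSI_TYPE))
        for ref in root.iter("StateReference")
    ]


references = classify(DOC)
```

For the third reference the declared subtype is None, and since the
reference elements have identical content, the application has
nothing else to go on.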

Gregor
----------------------------------------------------------
Gregor Hagedorn (G.Hagedorn at bba.de)
Institute for Plant Virology, Microbiology, and Biosafety
Federal Research Center for Agriculture and Forestry (BBA)
Koenigin-Luise-Str. 19          Tel: +49-30-8304-2220
14195 Berlin, Germany           Fax: +49-30-8304-2203

Often wrong but never in doubt!