tdwg-content
Steve Shattuck's response to Gregor's Special States document raises some
important points. There is a clear divergence of opinion as to the role and
scope of SDD. It seems to me that the difference is that Steve wants SDD to
record pure descriptive facts, while Gregor wants it to capture a scientific
work-in-progress, including the judgements and process decisions of the
scientist. Hence, Gregor wants SDD to capture the types of statements that a
taxonomist may wish to make with respect to an evolving description (in this
case, with respect to uncertainty or missing data in the description), while
Steve is sticking with the pure known, that the data are missing so let's
leave it at that. Am I right, you two?
Steve holds that there is a potentially infinite universe of possibly
useful special states. In this view, any attempt to specify a few special
states is restrictive, so we need to generalize, Generalize, GENERALIZE.
Steve's suggestion is:
| A much better way to implement this functionality would be to store an
| "uncoded" flag with the description along with an (encoded or
| text-based) explanation ("unknown", "not interpretable", "too lazy to
| code this", "don't have proper specimens", "To Do" or what ever). This
| is both direct and allows the explanation to change in a simple and
| flexible way.
It seems to me that if the "explanation" is encoded, then this suggestion is
not a long way from Gregor's. If the "explanation" is text-based, then it
will be a mere comment that may be useful to the original author but will be
impossible to process by any other application. I agree that processing
issues need to be kept firmly under control in SDD but I don't think they
have no role - after all, we're capturing these data in order to process
them, not just archive them.
Under Gregor's model and Steve's "encoded explanation" model, we would need
to be quite sure that it is possible to capture the entire universe of
statements describing the uncertainty. This seems to me to be possible. For
instance, the following list of possibilities for an unencoded datum seems to
be exhaustive:
1) It's logically possible to code and I intend to code it but haven't gotten
around to it yet (unfinished business)
2) It's logically possible to code and I intend not to do it (character scoped
out)
3) It's logically impossible to code (inapplicable)
Surely there's no space between the logically possible and logically
impossible, or between the intend to do it and intend not to do it. (There
may of course be subcategories of these that we may choose to capture. And
there will of course be a role for free-form text)
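
As a rough illustration of the three-way split above, here is a minimal Python sketch. The names `UncodedReason` and `classify` are hypothetical and come from neither Gregor's document nor any SDD draft; the sketch only shows that two yes/no questions (is coding logically possible, does the author intend to code) leave no case outside the three categories.

```python
from enum import Enum


class UncodedReason(Enum):
    """Hypothetical labels for the three cases listed above."""
    UNFINISHED = "unfinished business"
    SCOPED_OUT = "character scoped out"
    INAPPLICABLE = "inapplicable"


def classify(logically_possible: bool, intend_to_code: bool) -> UncodedReason:
    """Map the two yes/no questions onto exactly one of the three cases."""
    if not logically_possible:
        # Intention is irrelevant when the character cannot be coded at all.
        return UncodedReason.INAPPLICABLE
    if intend_to_code:
        return UncodedReason.UNFINISHED
    return UncodedReason.SCOPED_OUT
```

Every combination of the two answers returns a value, which is the sense in which the list is exhaustive; subcategories and free-form text would sit alongside this value, not replace it.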
I agree with Steve (and Gregor) that Gregor's document is messy and we need
to clean up and tease out the concepts.
Cheers - k
Thanks everyone - got it - another example of the power of email discussion
lists!
----- Original Message -----
From: "Steve Shattuck" <Steve.Shattuck(a)CSIRO.AU>
To: <TDWG-SDD(a)USOBI.ORG>
Sent: Tuesday, March 18, 2003 3:43 PM
Subject: Re: Special states
Hi All,
I've had a look at the excellent "Special States" document
(http://160.45.63.11/Projects/TDWG-SDD/Docs/SDD_P_Data_SpecialStates.htm
l) put together by Gregor and a few comments are attached below.
These comments are in several sections (with a few notes at the bottom).
Have a look at the first and last sections and the long second section
if you are especially interested. This second section is pretty long
and rambles a bit, but covers a wide range of issues centred on Gregor's
document. Hopefully it will help move the SDD forward.
Thanks, Steve
******************************************************************
SDD Special States
At some level the details of the SDD structure are less important than
its content. This is because few (if any) consuming applications will
use SDD other than as a source of information, this information being
translated into application-specific formats for processing and
analysis. This requires mapping SDD items to local items and this can
be done for any given SDD structure.
Taking this approach, I would make only one suggestion: generalize,
Generalize, GENERALIZE! What I mean is that the current Special States
document attempts to define all of the possible Special States that are
to be part of SDD and even goes so far as to exclude some potential
Special States. I would strongly suggest that SDD have a general method
for handling these, with at most a list of suggested Special States
rather than trying to be too restrictive. I firmly believe that it will
be extremely difficult (and short sighted) to try to define these as
part of the SDD Standard. The list of useful Special States is much
longer than the few suggested in the current document, and an extension
is even suggested at the end of the document, a "ToDo" Special State.
Many more can be thought of with little effort.
This same difficulty (trying to be too specific) has arisen in related
activities within TDWG and has resulted in an unfortunate split among
participants. There are two camps within the specimen-based standards
group, one arguing for a complete description of specimens, the other
looking for a set of common items across most groups. The first group
has developed a "standard" with some 400 items while the second has
selected in the neighbourhood of 20 or so items. My impression is that
SDD is taking the first approach: study the problem long enough and the
entire universe of data items and their relationships can be documented.
I'm not confident this is possible. Biology is too diverse and needs
change too quickly to document all of life in any stable, meaningful and
useful way. We can provide a framework and make suggestions, but we
won't be successful if we try to dictate too narrowly.
So my suggestion is that we focus on general needs, such as support for
Special States, but that we don't try to define other than at most a few
basic data items (such as general Special States). What we provide is a
mechanism to support this functionality with the ability to extend it in
a controlled and flexible way.
****************************************************
OK, this is where it gets ugly. I have a few specific comments on the
current Special States document, some relating to general issues, some
to specific comments. These are less important than the message above
and largely represent a slightly different perspective on SDD from the
one given in the current document.
First, the scope of SDD. Currently SDD is trying to address three
separate and non-overlapping issues: (1) data, (2) business practices
and (3) application development. Data is the specific information items
under consideration, business practices are how we think of these items
and how we handle them, and applications are specific pieces of software
used to manage data items. The current Special States document mixes
these three areas, talking about data items, how people interpret them
and how they might be displayed by a computer program. The first area
is clearly within the scope of SDD, the second possibly but only in a
very general, non-specific way, and the third is clearly outside SDD
(and well it should be). Bob made essentially this same comment when he
said that he thought "project management problems were not part of SDD."
We need to focus on DATA and nothing more.
The second serious problem I have with Special States is that they
aren't "states" in the normal, taxonomic sense. Taxonomic "states" are
observations made about biological objects (be they taxa or specimens)
that are grouped together as "characters." We use "states" to describe
items. However, Special States aren't observable and aren't used to
describe items.
What are "Special States"? They are a mix of two separate sets of
information. They document the status of data coding (for example
"unknown", "not interpretable" and "unfinished work") and they document
application-specific business rules (for example "use character default
state" - get the data based on a flag set for a specific state of this
character). These are fundamentally different things and show the
danger of mixing data, business practices and application development.
Lack of appreciation for these differences has resulted in "use default"
being treated as a case of "data have not yet been entered" when in fact
a state has been clearly indicated (through a pointer): it's the default
state (however the data or application defines this) and is certainly
known. It's a user-interface shortcut used by an application to make
data entry quicker (in the opinion of an application developer!). The
end result is exactly the same as if the state had been entered directly
in the description (the taxon by character intersection, or what DELTA
calls the Attribute).
This thinking results in some very complex processing being required by
consumers of SDD documents. For example, it is suggested that if a
description is null, then the "special state" called "use default"
should be inserted and that this "state" be set to "not yet evaluated."
This means that to interpret an SDD document you not only have to
understand the individual data items but also know the rules for
processing them (to know that if data is missing you need to insert a
special value and then look this value up in another list to translate
it into something meaningful). There must be a cleaner way to do this!
The "unknown", "not interpretable" and similar "special states" are
another problem altogether. These "states" have nothing to do with
characters and everything to do with descriptions. This is why Mike
Dallwitz didn't include them as part of the Character List. I can
understand why you might want to treat these as "states" of characters
as this is the basic DELTA paradigm. However, there is no need to do
this and doing so only complicates things. For example, to tell if a
description has not been coded you'll need to get its state(s) and then
check their translations to see if any are "unknown", "not
interpretable" or any one of who knows how many other special
conditions.
A much better way to implement this functionality would be to store an
"uncoded" flag with the description along with an (encoded or
text-based) explanation ("unknown", "not interpretable", "too lazy to
code this", "don't have proper specimens", "To Do" or what ever). This
is both direct and allows the explanation to change in a simple and
flexible way.
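
One way to picture the "uncoded" flag is the Python sketch below. It is only an illustration under stated assumptions: the class name `DescriptionCell` and its field names are hypothetical and are not taken from SDD, DELTA or BioLink.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DescriptionCell:
    """One taxon-by-character intersection (hypothetical structure)."""
    taxon: str
    character: str
    states: List[str] = field(default_factory=list)
    uncoded: bool = False               # the flag suggested above
    explanation: Optional[str] = None   # encoded value or free text

    def __post_init__(self) -> None:
        # A cell cannot be flagged uncoded and carry states at the same time.
        if self.uncoded and self.states:
            raise ValueError("an uncoded cell must not list states")

    def is_coded(self) -> bool:
        """True only when states are present and the flag is not set."""
        return bool(self.states) and not self.uncoded
```

With this shape, checking whether a description has been coded is a single test on the cell, with no need to fetch states and look up their translations.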
These are the most serious and fundamental problems I have with the
current document. I would also add the following minor comments (some
of these expand on the above comments as well).
The meaning of the DELTA "U" in attributes: It's stated in the document
that this means "attempts to research the information were made but
failed" and that "the state U in DELTA is also used for cases where data
are present, but the author is unable to interpret them in current
terminology." The DELTA documentation states simply that "A missing
attribute is equivalent to an attribute with pseudo-value U" and that
"If the state of the character is unknown, then the character is
omitted, or the state value coded as U." I can't find anything in the
DELTA documentation that gives a reason for the use of U or suggests
that "attempts to research the information were made but failed" or
anything similar. It would seem that DELTA treats nulls and U as the
same, nothing more.
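
The equivalence quoted from the DELTA documentation can be sketched as follows. This is not DELTA parsing code: the dictionary layout (character number mapped to a coded value) and the function names are assumptions made for illustration.

```python
from typing import Dict

UNKNOWN = "U"  # DELTA pseudo-value for an unknown state


def attribute_value(attributes: Dict[int, str], character: int) -> str:
    """Return the coded value, treating a missing attribute as U."""
    return attributes.get(character, UNKNOWN)


def is_unknown(attributes: Dict[int, str], character: int) -> bool:
    """A missing attribute and an explicit U are indistinguishable here."""
    return attribute_value(attributes, character) == UNKNOWN
```

Nothing in this reading records why the value is unknown, which is the point being made about the DELTA documentation.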
Remove all sections dealing with "performance", "user-interfaces" and
referring to "applications." These deal with specific software
developments and should not influence the SDD standard.
Great care should be taken when developing "business rules" for
interpreting an SDD document. For example see the discussion starting
with "One frequent situation is that characters are added to the
terminology. Possible solutions to achieve a synchronization between
descriptions and a separate terminology in this case are:". It may be
important for SDD to document this situation but HOW it is dealt with is
up to individual applications and has little to do with the data itself
(which is the focus of SDD). The same problem occurs with the statement
"The omission of character coding in a description should be used to
express the "use default" state." This mixes data with processing. The
lack of data is simply that - it implies nothing more and NO assumptions
should be made about it. If the default state should be used then the
author needs to state this.
What does "terminology" mean? It seems to be the character/character
state list but this is unclear. It might be good to develop a glossary
of terms so we all know what we're talking about.
We also need to define "schema." My understanding is that a "schema" is
a model of the structure of the data and not the data itself. If you
add data you don't change the schema (assuming that this addition is
permitted by the schema - it's like adding rows to a database table -
the schema doesn't change, only the data). But at several points the
document says "... the case that the terminology is changing ("schema
evolution") is ..." or similar. This sounds like changes to the data
change the schema. Is this the general view of schema or am I on the
wrong track?
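
The database-table analogy can be checked directly. The sketch below uses Python's built-in `sqlite3` module with a made-up `characters` table; it only illustrates this reading of "schema" and says nothing about how SDD itself is structured.

```python
import sqlite3


def table_schema(conn: sqlite3.Connection, table: str) -> str:
    """Return the CREATE TABLE statement SQLite stores for a table."""
    row = conn.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table,),
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE characters (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")

schema_before = table_schema(conn, "characters")
conn.execute("INSERT INTO characters (name) VALUES (?)", ("leaf shape",))
schema_after = table_schema(conn, "characters")

# The data changed (one more row); the stored schema did not.
row_count = conn.execute("SELECT COUNT(*) FROM characters").fetchone()[0]
```

Under this reading, adding a character to the terminology is like the `INSERT`: new data that the schema already permits, not a change to the schema.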
The section discussing "Data have not yet been entered", "Data cannot be
entered" and "Data could have been entered but a deliberate decision was
made not to enter them" seems to mix unrelated cases and is a mess. The
options listed are (followed by my interpretation of their meaning):
1) "Use character default state": Data is coded and is not missing (see
above).
2) "Unfinished work": Data exists but has yet to be captured.
3) "Not applicable": Data does not logically exist.
4) "Unknown": Data exists but has yet to be captured.
5) "Not interpretable": Data exists but has yet to be captured.
6) "Out of scope": Data exists but has yet to be captured.
7) "Do not need to score": Data is coded and is not missing.
The mix involves data that is being pointed to through a process (No. 1
and 7), data that is impossible to collect (No. 3) and data that can be
collected but hasn't (for a variety of reasons) (No. 2, 4, 5 and 6).
Keep it simple by recording if data is absent and if it is, the reason.
Don't confuse it with application-specific short cuts ("Use default",
"Inherit/Compile from Parent/Children") or personal decisions made by an
author ("it's too hard to code", "this character is unimportant here",
"appropriate specimens are unavailable", etc). These are distinct
cases.
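
The grouping described above can be written down as a small lookup table. The seven keys are the option names quoted from the document; the three group labels and the helper `members` are hypothetical wording introduced here for illustration.

```python
from typing import Dict, List

POINTED_TO = "coded, reached through a process"
IMPOSSIBLE = "impossible to collect"
NOT_CAPTURED = "collectable but not yet captured"

# Option name -> group, following the interpretation given in the list above.
GROUP_OF: Dict[str, str] = {
    "Use character default state": POINTED_TO,   # No. 1
    "Unfinished work": NOT_CAPTURED,             # No. 2
    "Not applicable": IMPOSSIBLE,                # No. 3
    "Unknown": NOT_CAPTURED,                     # No. 4
    "Not interpretable": NOT_CAPTURED,           # No. 5
    "Out of scope": NOT_CAPTURED,                # No. 6
    "Do not need to score": POINTED_TO,          # No. 7
}


def members(group: str) -> List[str]:
    """List the option names that fall in one group, sorted by name."""
    return sorted(name for name, g in GROUP_OF.items() if g == group)


def is_missing(option: str) -> bool:
    """Only the last two groups describe data that is actually absent."""
    return GROUP_OF[option] != POINTED_TO
```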
There are a number of cases where attempts are made to develop lists of
conditions or situations. For example, it is asked "Should a general
'cannot score' (for any kind of reason) be differentiated into" and then
three situations are listed. Again, generalise this into "uncoded" with
a text-based explanation ("observation method failed", "incomplete
specimen" or any one of a thousand other reasons). All the information
is captured and the system is rich enough to handle unforeseen
situations. This approach also fits perfectly with the previous
example.
There is a note that "a general method is planned (@but not yet
formalized!@) in SDD" and that "SDD is considering introducing computed
characters." We've been at this for at least 2 years now and there are
things that are "planned" or are "being considered" for SDD that we
don't know about?!?!? Or is there an overall planning document that I'm
not aware of? WE are SDD, SDD can't be "considering" things in
isolation.
The entire discussion of inheritance and compilation (or whatever we
call it) needs to be thought of in a fresh light. Again, the document
confuses data with process. For example, it's stated that "it is
desirable that the assumption [of inheritance/compilation] rather than
the inherited data are recorded". The author of the dataset MUST
present real data, not give a process to get to that data. Yes, the
author can tell us how they collected the data ("inherited from direct
parent", "compiled using Gouldes Statistic from all coded children", "by
reference to the default state for this character") but they also must
give us the data, not make us go look for it. The process of
inheritance/compilation is too complex to leave to chance and trying to
define exactly what it means is too error prone. I would suggest that
inheritance/compilation is an application-based activity, not a
fundamental aspect of the data. Again, give us the data and separately
tell us how it was derived, don't make us derive it ourselves.
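
A minimal sketch of "give us the data and separately tell us how it was derived" might look like the following. The names `CodedValue` and `materialize` are hypothetical, and the sketch assumes the author names explicitly which characters are inherited, since an absent value is not taken to imply anything.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass
class CodedValue:
    states: List[str]
    derivation: Optional[str] = None  # None means directly observed


def materialize(
    child: Dict[str, CodedValue],
    parent: Dict[str, CodedValue],
    inherit: Iterable[str],
) -> Dict[str, CodedValue]:
    """Copy the named characters from the parent into a new description.

    The result carries the real states plus a note on how they were
    obtained, so a consumer never has to repeat the inheritance step.
    """
    resolved = dict(child)
    for character in inherit:
        source = parent[character]
        resolved[character] = CodedValue(
            states=list(source.states),
            derivation="inherited from direct parent",
        )
    return resolved
```

The producing application does the inheritance once, when the document is prepared; a consumer that ignores the `derivation` note still sees a complete description.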
This exact same problem exists with "computed data." I don't see how or
why an SDD document can contain the value "CouldNotCompute" because the
"generator couldn't compute this". The "generator" is a piece of
software, not a data representation (which is what SDD is).
"CouldNotCompute" is an error message returned from an application and
it should never end up in an SDD document. Yet again, if you want to
explain how the data was generated, fine, do it, but don't make us
calculate it, we don't know how.
The discussion about "Not supported as special states, but supported
through modifiers" is good but, as above, describes descriptions and not
characters. That's why inheritance/compilation is included here.
The statement "Special states express knowledge about why data for a
given character are missing in a description and thus make a statement
about the entire character" is interesting. It says that special states
are about a character in a description. The point I would make is that
they are primarily about the description, not the character. They say
nothing about the character's use for other items, and only relate to
this character in this taxon, and this is through the "description." So
let's focus this document on the description and only discuss the
character when it's appropriate. This is supported by the statement
that "DELTA does not define the special states in the "character list"
directive: they are implicitly present in each character." They are not
in the character list because they have little to do with characters.
They are in the DELTA attributes because they have everything to do with
descriptions. I'm sorry to say that the next statement shows the danger
of letting past developments influence current work (and I'm as guilty
as any): "When a new character is created, DeltaAccess automatically
creates the full set of special states." In BioLink we attach this
information to the description: you do it when you're coding the data,
not developing the character list. I don't know which way is better and
as noted above, it probably doesn't matter as each application will
translate the information into a local format anyway. What we DO need
to do is make sure the SDD format is expressive, flexible and as
application-neutral as possible (and ideally simple).
Most of the discussion under "Relations between declarative character
dependency and special states" seems to deal with developing
applications rather than data representation and may well be outside
SDD. Sure, document that these dependencies exist and use them to
produce high-quality and internally consistent data, but do this at
application-level, not SDD-level.
The same applies to "Responsibility for validation." SDD should be
about representing data in a standardized way, not about enforcing
taxonomic business rules or data quality standards. As SDD will be
represented in XML then this XML document must be well-formed and pass a
check with a DTD or XML-Schema checker. But SDD shouldn't be
responsible for checking the content of individual data items, that's
the job of authors and their applications.
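
The structural half of that check is easy to sketch with Python's standard library. The element names in the usage example are made up, and only well-formedness is tested here; checking against a DTD or XML Schema needs a validating parser, which the standard `xml.etree.ElementTree` module does not provide.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as XML, False otherwise.

    This says nothing about whether the content is taxonomically sensible;
    that remains the job of authors and their applications.
    """
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True
```

For example, `is_well_formed("<Description><State>red</State></Description>")` is true, while the same text with the closing `</State>` tag left out is rejected.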
********************************************************
So, to summarize:
SDD is about data and only data. These data include characters and
descriptions and (optionally) how these descriptions were developed and
their current status.
SDD is not about process. Process information can be included but it is
optional and can be safely ignored without data loss. [I shouldn't have
to process a "use default data" statement to access a complete
description, the appropriate state(s) should be inserted into the
description when the SDD document is prepared with a note on how it was
derived.]
SDD is not about applications. Discussions concerning user-interfaces
or processing methods should not be included. [How an application
manages and represents these data is completely independent of how SDD
represents the data.]
********************************
Finally, I want to reconfirm that Gregor has done a fantastic job with
getting us this far and I support his efforts. My comments are more
about a different perspective rather than any serious flaw in logic.
Most of what is in the document is well presented and relevant to
managing taxonomic descriptions, I'm just not sure it is directly
related to the goals of SDD.
And sorry for going on for so long. I'll sit down now.
Steve
Steve Shattuck
CSIRO Entomology
Steve.shattuck(a)csiro.au
The problem is the last 'l' of the URL is wrapping to the next line and
is being dropped from the URL. Try this one?
http://160.45.63.11/Projects/TDWG-SDD/Docs/SDD_P_Data_SpecialStates.html
Or maybe cut and paste this one into your browser:
160.45.63.11/Projects/TDWG-SDD/Docs/SDD_P_Data_SpecialStates.html
Hope this helps.
Steve Shattuck
CSIRO Entomology
Steve.shattuck(a)csiro.au
Kevin,
It's not just you. The URL Steve provided was not quite correct - the
extension should have been ".html", not ".htm".
So try this:
http://160.45.63.11/Projects/TDWG-SDD/Docs/SDD_P_Data_SpecialStates.html
Eric Zurcher
CSIRO Livestock Industries
Canberra, Australia
Eric.Zurcher(a)csiro.au
> -----Original Message-----
> From: Kevin Thiele [mailto:kevin.thiele@BIGPOND.COM]
> Sent: Tuesday, 18 March 2003 5:53 PM
> To: TDWG-SDD(a)USOBI.ORG
> Subject: Re: Special states
>
>
> Can anybody else not access the Special States page as below,
> or is it just
> me?
>
> ----- Original Message -----
> From: "Steve Shattuck" <Steve.Shattuck(a)CSIRO.AU>
> To: <TDWG-SDD(a)USOBI.ORG>
> Sent: Tuesday, March 18, 2003 3:43 PM
> Subject: Re: Special states
>
>
> Hi All,
>
> I've had a look at the excellent "Special States" document
> (http://160.45.63.11/Projects/TDWG-SDD/Docs/SDD_P_Data_Special
> States.htm
> l) put together by Gregor and a few comments are attached below.
Can anybody else not access the Special States page as below, or is it just
me?
----- Original Message -----
From: "Steve Shattuck" <Steve.Shattuck(a)CSIRO.AU>
To: <TDWG-SDD(a)USOBI.ORG>
Sent: Tuesday, March 18, 2003 3:43 PM
Subject: Re: Special states
Hi All,
I've had a look at the excellent "Special States" document
(http://160.45.63.11/Projects/TDWG-SDD/Docs/SDD_P_Data_SpecialStates.htm
l) put together by Gregor and a few comments are attached below.
These comments are in several sections (with a few notes at the bottom).
Have a look at the first and last sections and the long second section
if you are especially interested. This second section is pretty long
and rambles a bit, but covers a wide range of issues centred on Gregor's
document. Hopefully it will help move the SDD forward.
Thanks, Steve
******************************************************************
SDD Special States
At some level the details of the SDD structure are less important than
its content. This is because few (if any) consuming applications will
use SDD other than as a source of information, this information being
translated into application-specific formats for processing and
analysis. This requires mapping SDD items to local items and this can
be done for any given SDD structure.
Taking this approach, I would make only one suggestion: generalize,
Generalize, GENERALIZE! What I mean is that the current Special States
document attempts to define all of the possible Special States that are
to be part of SDD and even goes so far as to exclude some potential
Special States. I would strongly suggest that SDD have a general method
for handling these, with at most a list of suggested Special States
rather than trying to be too restrictive. I firmly believe that it will
be extremely difficult (and short sighted) to try to define these as
part of the SDD Standard. The list of useful Special States is much
longer than the few suggested in the current document, and an extension
is even suggested at the end of the document, a "ToDo" Special State.
Many more can be thought of with little effort.
This same difficulty (trying to be too specific) has arisen in related
activities within TDWG and has resulted in an unfortunate split among
participants. There are two camps within the specimen-based standards
group, one arguing for a complete description of specimens, the other
looking for a set of common items across most groups. The first group
has developed a "standard" with some 400 items while the second has
selected in the neighbourhood of 20 or so items. My impression is that
SDD is taking the first approach: study the problem long enough and the
entire universe of data items and their relationships can be documented.
I'm not confident this is possible. Biology is too diverse and needs
change too quickly to document all of life in any stable, meaningful and
useful way. We can provide a framework and make suggestions, but we
won't be successful if we try to dictate too narrowly.
So my suggestion is that we focus on general needs, such as support for
Special States, but that we don't try to define other than at most a few
basic data items (such as general Special States). What we provide is a
mechanism to support this functionality with the ability to extend it in
a controlled and flexible way.
****************************************************
OK, this is where it gets ugly. I have a few specific comments on the
current Special States document, some relating to general issues, some
to specific comments. These are less important than the message above
and largely represent a slightly different perspective on SDD from the
one given in the current document.
First, the scope of SDD. Currently SDD is trying to address three
separate and non-overlapping issues: (1) data, (2) business practices
and (3) application development. Data is the specific information items
under consideration, business practices are how we think of these items
and how we handle them, and applications are specific pieces of software
used to manage data items. The current Special States document mixes
these three areas, talking about data items, how people interpret them
and how they might be displayed by a computer program. The first area
is clearly within the scope of SDD, the second possibly but only in a
very general, non-specific way, and the third is clearly outside SDD
(and well it should be). Bob made essentially this same comment when he
said that he thought "project management problems were not part of SDD."
We need to focus on DATA and nothing more.
The second serious problem I have with Special States is that they
aren't "states" in the normal, taxonomic sense. Taxonomic "states" are
observations made about biological objects (be they taxa or specimens)
that are grouped together as "characters." We use "states" to describe
items. However, Special States aren't observable and aren't used to
describe items.
What are "Special States"? They are a mix of two separate sets of
information. They document the status of data coding (for example
"unknown", "not interpretable" and "unfinished work") and they document
application-specific business rules (for example "use character default
state" - get the data based on a flag set for a specific state of this
character). These are fundamentally different things and show the
danger of mixing data, business practices and application development.
Lack of appreciation for these differences has resulted in "use default"
being treated as a case of "data have not yet been entered" when in fact
a state has been clearly indicated (through a pointer): it's the default
state (however the data or application defines this) and is certainly
known. It's a user-interface shortcut used by an application to make
data entry quicker (in the opinion of an application developer!). The
end result is exactly the same as if the state had been entered directly
in the description (the taxon by character intersection, or what DELTA
calls the Attribute).
This thinking results in some very complex processing being required by
consumers of SDD documents. For example, it is suggested that if a
description is null, then the "special state" called "use default"
should be inserted and that this "state" be set to "not yet evaluated."
This means that to interpret an SDD document you not only have to
understand the individual data items but also know the rules for
processing them (to know that if data is missing you need to insert a
special value and then look this value up in another list to translate
it into something meaningful). There must be a cleaner way to do this!
The "unknown", "not interpretable" and similar "special states" are
another problem altogether. These "states" have nothing to do with
characters and everything to do with descriptions. This is why Mike
Dallwitz didn't include them as part of the Character List. I can
understand why you might want to treat these as "states" of characters
as this is the basic DELTA paradigm. However, there is no need to do
this and doing so only complicates things. For example, to tell if a
description has not been coded you'll need to get its state(s) and then
check their translations to see if any are "unknown", "not
interpretable" or any one of who knows how many other special
conditions.
A much better way to implement this functionality would be to store an
"uncoded" flag with the description along with an (encoded or
text-based) explanation ("unknown", "not interpretable", "too lazy to
code this", "don't have proper specimens", "To Do" or whatever). This
is both direct and allows the explanation to change in a simple and
flexible way.
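As a rough illustration of this idea (a sketch only: the class name, field
names and example values below are invented for this note and are not part
of SDD or DELTA), a description entry could carry its own "uncoded" flag
and explanation:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DescriptionEntry:
    """One taxon-by-character intersection (what DELTA calls an attribute).

    All names here are hypothetical, chosen only for illustration.
    """
    taxon: str
    character: str
    states: List[str] = field(default_factory=list)  # observed states, if any
    uncoded: bool = False          # True when the author flags the entry as not coded
    reason: Optional[str] = None   # free-text or encoded explanation

    def __post_init__(self) -> None:
        # An entry is either coded (has states) or flagged as uncoded, not both.
        if self.uncoded and self.states:
            raise ValueError("an uncoded entry must not list states")
        if not self.uncoded and self.reason is not None:
            raise ValueError("a reason is only meaningful for uncoded entries")


def is_coded(entry: DescriptionEntry) -> bool:
    """Answer 'has this been coded?' without translating any state labels."""
    return not entry.uncoded and bool(entry.states)
```

With this layout a consumer checks one flag rather than looking up state
translations, and a new reason ("To Do", "don't have proper specimens")
needs no change to the character list.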
These are the most serious and fundamental problems I have with the
current document. I would also add the following minor comments (some
of these expand on the above comments as well).
The meaning of the DELTA "U" in attributes: It's stated in the document
that this means "attempts to research the information were made but
failed" and that "the state U in DELTA is also used for cases where data
are present, but the author is unable to interpret them in current
terminology." The DELTA documentation states simply that "A missing
attribute is equivalent to an attribute with pseudo-value U" and that
"If the state of the character is unknown, then the character is
omitted, or the state value coded as U." I can't find anything in the
DELTA documentation that gives a reason for the use of U or suggests
that "attempts to research the information were made but failed" or
anything similar. It would seem that DELTA treats nulls and U as the
same, nothing more.
Remove all sections dealing with "performance", "user-interfaces" and
referring to "applications." These deal with specific software
developments and should not influence the SDD standard.
Great care should be taken when developing "business rules" for
interpreting an SDD document. For example see the discussion starting
with "One frequent situation is that characters are added to the
terminology. Possible solutions to achieve a synchronization between
descriptions and a separate terminology in this case are:". It may be
important for SDD to document this situation but HOW it is dealt with is
up to individual applications and has little to do with the data itself
(which is the focus of SDD). The same problem occurs with the statement
"The omission of character coding in a description should be used to
express the "use default" state." This mixes data with processing. The
lack of data is simply that - it implies nothing more and NO assumptions
should be made about it. If the default state should be used then the
author needs to state this.
What does "terminology" mean? It seems to be the character/character
state list but this is unclear. It might be good to develop a glossary
of terms so we all know what we're talking about.
We also need to define "schema." My understanding is that a "schema" is
a model of the structure of the data and not the data itself. If you
add data you don't change the schema (assuming that this addition is
permitted by the schema - it's like adding rows to a database table -
the schema doesn't change, only the data). But at several points the
document says "... the case that the terminology is changing ("schema
evolution") is ..." or similar. This sounds like changes to the data
change the schema. Is this the general view of schema or am I on the
wrong track?
The section discussing "Data have not yet been entered", "Data cannot be
entered" and "Data could have been entered but a deliberate decision was
made not to enter them" seems to mix unrelated cases and is a mess. The
options listed are (followed by my interpretation of their meaning):
1) "Use character default state": Data is coded and is not missing (see
above).
2) "Unfinished work": Data exists but has yet to be captured.
3) "Not applicable": Data does not logically exist.
4) "Unknown": Data exists but has yet to be captured.
5) "Not interpretable": Data exists but has yet to be captured.
6) "Out of scope": Data exists but has yet to be captured.
7) "Do not need to score": Data is coded and is not missing.
The mix involves data that is being pointed to through a process (No. 1
and 7), data that is impossible to collect (No. 3) and data that can be
collected but hasn't (for a variety of reasons) (No. 2, 4, 5 and 6).
Keep it simple by recording if data is absent and if it is, the reason.
Don't confuse it with application-specific short cuts ("Use default",
"Inherit/Compile from Parent/Children") or personal decisions made by an
author ("it's too hard to code", "this character is unimportant here",
"appropriate specimens are unavailable", etc). These are distinct
cases.
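The grouping above can be written down as a small lookup (a sketch; the
enum and function names are my own and only the seven labels come from the
document under discussion):

```python
from enum import Enum


class Case(Enum):
    POINTER = "data is coded, but indicated through a process"
    IMPOSSIBLE = "data does not logically exist"
    NOT_CAPTURED = "data exists but has yet to be captured"


# The seven options, classified as in the numbered list above.
CLASSIFICATION = {
    "Use character default state": Case.POINTER,   # No. 1
    "Unfinished work": Case.NOT_CAPTURED,          # No. 2
    "Not applicable": Case.IMPOSSIBLE,             # No. 3
    "Unknown": Case.NOT_CAPTURED,                  # No. 4
    "Not interpretable": Case.NOT_CAPTURED,        # No. 5
    "Out of scope": Case.NOT_CAPTURED,             # No. 6
    "Do not need to score": Case.POINTER,          # No. 7
}


def data_is_absent(label: str) -> bool:
    """True when the label describes genuinely absent data, not a shortcut."""
    return CLASSIFICATION[label] is not Case.POINTER
```

Only the entries for which data_is_absent() is true belong under "absent,
with a reason"; the other two are shortcuts to data that is already known.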
There are a number of cases where attempts are made to develop lists of
conditions or situations. For example, it is asked "Should a general
'cannot score' (for any kind of reason) be differentiated into" and then
three situations are listed. Again, generalise this into "uncoded" with
a text-based explanation ("observation method failed", "incomplete
specimen" or any one of a thousand other reasons). All the information
is captured and the system is rich enough to handle unforeseen
situations. This approach also fits perfectly with the previous
example.
There is a note that "a general method is planned (@but not yet
formalized!@) in SDD" and that "SDD is considering introducing computed
characters." We've been at this for at least 2 years now and there are
things that are "planned" or are "being considered" for SDD that we
don't know about?!?!? Or is there an overall planning document that I'm
not aware of? WE are SDD, SDD can't be "considering" things in
isolation.
The entire discussion of inheritance and compilation (or whatever we
call it) needs to be thought of in a fresh light. Again, the document
confuses data with process. For example, it's stated that "it is
desirable that the assumption [of inheritance/compilation] rather than
the inherited data are recorded". The author of the dataset MUST
present real data, not give a process to get to that data. Yes, the
author can tell us how they collected the data ("inherited from direct
parent", "compiled using Gouldes Statistic from all coded children", "by
reference to the default state for this character") but they also must
give us the data, not make us go look for it. The process of
inheritance/compilation is too complex to leave to chance and trying to
define exactly what it means is too error prone. I would suggest that
inheritance/compilation is an application-based activity, not a
fundamental aspect of the data. Again, give us the data and separately
tell us how it was derived, don't make us derive it ourselves.
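To make "give us the data and separately tell us how it was derived"
concrete, here is a minimal sketch of an export step. It assumes a
hypothetical authoring tool that stores a USE_DEFAULT shortcut internally;
none of these names come from SDD, and inheritance or compilation could be
handled the same way:

```python
from typing import Dict, List

# Hypothetical marker an authoring application might keep internally.
USE_DEFAULT = object()


def export_description(entries: Dict[str, object],
                       defaults: Dict[str, List[str]]) -> Dict[str, dict]:
    """Replace application shortcuts with real states plus a derivation note.

    Characters that are simply missing from `entries` stay missing:
    no assumption is made about them.
    """
    exported = {}
    for character, value in entries.items():
        if value is USE_DEFAULT:
            exported[character] = {
                "states": list(defaults[character]),
                "derivation": "by reference to the default state for this character",
            }
        else:
            exported[character] = {
                "states": list(value),
                "derivation": "coded directly",
            }
    return exported
```

The consumer of the exported document reads states directly and can ignore
the derivation note without losing any data.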
This exact same problem exists with "computed data." I don't see how or
why an SDD document can contain the value "CouldNotCompute" because the
"generator couldn't compute this". The "generator" is a piece of
software, not a data representation (which is what SDD is).
"CouldNotCompute" is an error message returned from an application and
it should never end up in an SDD document. Yet again, if you want to
explain how the data was generated, fine, do it, but don't make us
calculate it, we don't know how.
The discussion about "Not supported as special states, but supported
through modifiers" is good but, as above, describes descriptions and not
characters. That's why inheritance/compilation is included here.
The statement "Special states express knowledge about why data for a
given character are missing in a description and thus make a statement
about the entire character" is interesting. It says that special states
are about a character in a description. The point I would make is that
they are primarily about the description, not the character. They say
nothing about the character's use for other items, and only relate to
this character in this taxon, and this is through the "description." So
let's focus this document on the description and only discuss the
character when it's appropriate. This is supported by the statement
that "DELTA does not define the special states in the "character list"
directive: they are implicitly present in each character." They are not
in the character list because they have little to do with characters.
They are in the DELTA attributes because they have everything to do with
descriptions. I'm sorry to say that the next statement shows the danger
of letting past developments influence current work (and I'm as guilty
as any): "When a new character is created, DeltaAccess automatically
creates the full set of special states." In BioLink we attach this
information to the description: you do it when you're coding the data,
not developing the character list. I don't know which way is better and
as noted above, it probably doesn't matter as each application will
translate the information into a local format anyway. What we DO need
to do is make sure the SDD format is expressive, flexible and as
application-neutral as possible (and ideally simple).
Most of the discussion under "Relations between declarative character
dependency and special states" seems to deal with developing
applications rather than data representation and may well be outside
SDD. Sure, document that these dependencies exist and use them to
produce high-quality and internally consistent data, but do this at
application-level, not SDD-level.
The same applies to "Responsibility for validation." SDD should be
about representing data in a standardized way, not about enforcing
taxonomic business rules or data quality standards. Since SDD will be
represented in XML, the XML document must be well-formed and must
validate against a DTD or XML Schema. But SDD shouldn't be
responsible for checking the content of individual data items, that's
the job of authors and their applications.
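The structural half of that check is cheap to do. As a sketch using only
Python's standard library (the function name is my own; validation against
a DTD or XML Schema needs a validating parser and is not shown here):

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as XML; says nothing about its content."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True
```

Note that this says nothing about whether the individual data items make
taxonomic sense, which is exactly the division of labour argued for above.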
********************************************************
So, to summarize:
SDD is about data and only data. These data include characters and
descriptions and (optionally) how these descriptions were developed and
their current status.
SDD is not about process. Process information can be included but it is
optional and can be safely ignored without data loss. [I shouldn't have
to process a "use default data" statement to access a complete
description, the appropriate state(s) should be inserted into the
description when the SDD document is prepared with a note on how it was
derived.]
SDD is not about applications. Discussions concerning user-interfaces
or processing methods should not be included. [How an application
manages and represents these data is completely independent of how SDD
represents the data.]
********************************
Finally, I want to reconfirm that Gregor has done a fantastic job with
getting us this far and I support his efforts. My comments are more
about a different perspective rather than any serious flaw in logic.
Most of what is in the document is well presented and relevant to
managing taxonomic descriptions, I'm just not sure it is directly
related to the goals of SDD.
And sorry for going on for so long. I'll sit down now.
Steve
Steve Shattuck
CSIRO Entomology
Steve.shattuck(a)csiro.au
Just add an 'l' to the end of the URL.
Kevin Thiele wrote:
>Can anybody else not access the Special States page as below, or is it just
>me?
>
>----- Original Message -----
>From: "Steve Shattuck" <Steve.Shattuck(a)CSIRO.AU>
>To: <TDWG-SDD(a)USOBI.ORG>
>Sent: Tuesday, March 18, 2003 3:43 PM
>Subject: Re: Special states
>
>
You need to manually append the "l" (so that the URL ends in ".html",
instead of ".htm").
The "l" was forced onto the next line in Steve's original message.
Aloha,
Rich
Richard L. Pyle
Natural Sciences Database Coordinator, Bishop Museum
1525 Bernice St., Honolulu, HI 96817
Ph: (808)848-4115, Fax: (808)847-8252
email: deepreef(a)bishopmuseum.org
http://www.bishopmuseum.org/bishop/HBS/pylerichard.html
"The opinions expressed are those of the sender, and not necessarily those
of Bishop Museum."
> -----Original Message-----
> From: TDWG - Structure of Descriptive Data [mailto:TDWG-SDD@USOBI.ORG]On
> Behalf Of Kevin Thiele
> Sent: Monday, March 17, 2003 8:53 PM
> To: TDWG-SDD(a)USOBI.ORG
> Subject: Re: Special states
>
>
> Can anybody else not access the Special States page as below, or
> is it just
> me?
>
> ----- Original Message -----
> From: "Steve Shattuck" <Steve.Shattuck(a)CSIRO.AU>
> To: <TDWG-SDD(a)USOBI.ORG>
> Sent: Tuesday, March 18, 2003 3:43 PM
> Subject: Re: Special states
>
>
> Hi All,
>
> I've had a look at the excellent "Special States" document
> (http://160.45.63.11/Projects/TDWG-SDD/Docs/SDD_P_Data_SpecialStates.htm
> l) put together by Gregor and a few comments are attached below.
>
> These comments are in several sections (with a few notes at the bottom).
> Have a look at the first and last sections and the long second section
> if you are expecially interested. This second section is pretty long
> and rambles a bit, but covers a wide range of issues centred on Gregor's
> document. Hopefully it will help move the SDD forward.
>
> Thanks, Steve
>
> ******************************************************************
>
> SDD Special States
>
> At some level the details of the SDD structure are less important than
> its content. This is because few (if any) consuming applications will
> use SDD other than as a source of information, this information being
> translated into application-specific formats for processing and
> analysis. This requires mapping SDD items to local items and this can
> be done for any given SDD structure.
>
> Taking this approach, I would make only one suggestion: generalize,
> Generalize, GENERALIZE! What I mean is that the current Special States
> document attempts to define all of the possible Special States that are
> to be part of SDD and even goes so far as to exclude some potential
> Special States. I would strongly suggest that SDD have a general method
> for handling these, with at most a list of suggested Special States
> rather than trying to be too restrictive. I firmly believe that it will
> be extremely difficult (and short sighted) to try to define these as
> part of the SDD Standard. The list of useful Special States is much
> longer than the few suggested in the current document, and an extension
> is even suggested at the end of the document, a "ToDo" Special State.
> Many more can be thought of with little effort.
>
> This same difficulty (trying to be too specific) has arisen in related
> activities within TDWG and has resulted in an unfortunate split among
> participants. There are two camps within the specimen-based standards
> group, one arguing for a complete description of specimens, the other
> looking for a set of common items across most groups. The first group
> has developed a "standard" with some 400 items while the second has
> selected in the neighbourhood of 20 or so items. My impression is that
> SDD is taking the first approach: study the problem long enough and the
> entire universe of data items and their relationships can be documented.
> I'm not confident this is possible. Biology is too diverse and needs
> change too quickly to document all of life in any stable, meaningful and
> useful way. We can provide a framework and make suggestions, but we
> won't be successful if we try to dictate too narrowly.
>
> So my suggestion is that we focus on general needs, such as support for
> Special States, but that we don't try to define other than at most a few
> basic data items (such as general Special States). What we provide is a
> mechanism to support this functionality with the ability to extend it in
> a controlled and flexible way.
>
>
> ****************************************************
>
> OK, this is where it gets ugly. I have a few specific comments on the
> current Special States document, some relating to general issues, some
> to specific comments. These are less important than the message above
> and largely represent a slightly different perspective on SDD from the
> one given in the current document.
>
> First, the scope of SDD. Currently SDD is trying to address three
> separate and non-overlapping issues: (1) data, (2) business practices
> and (3) application development. Data is the specific information items
> under consideration, business practices are how we think of these items
> and how we handle them, and applications are specific pieces of software
> used to manage data items. The current Special States document mixes
> these three areas, talking about data items, how people interpret them
> and how they might be displayed by a computer program. The first area
> is clearly within the scope of SDD, the second possibly but only in a
> very general, non-specific way, and the third is clearly outside SDD
> (and well it should be). Bob made essentially this same comment when he
> said that he thought "project management problems were not part of SDD."
> We need to focus on DATA and nothing more.
>
> The second serious problem I have with Special States is that they
> aren't "states" in the normal, taxonomic sense. Taxonomic "states" are
> observations made about biological objects (be they taxa or specimens)
> that are grouped together as "characters." We use "states" to describe
> items. However, Special States aren't observable and aren't used to
> describe items.
>
> What are "Special States"? They are a mix of two separate sets of
> information. They document the status of data coding (for example
> "unknown", "not interpretable" and "unfinished work") and they document
> application-specific business rules (for example "use character default
> state" - get the data based on a flag set for a specific state of this
> character). These are fundamentally different things and show the
> danger of mixing data, business practices and application development.
>
> Lack of appreciation for these differences has resulted in "use default"
> being treated as a case of "data have not yet been entered" when in fact
> a state has been clearly indicated (through a pointer): it's the default
> state (however the data or application defines this) and is certainly
> known. It's a user-interface shortcut used by an application to make
> data entry quicker (in the opinion of an application developer!). The
> end result is exactly the same as if the state had been entered directly
> in the description (the taxon by character intersection, or what DELTA
> calls the Attribute).
>
> This thinking results in some very complex processing being required by
> consumers of SDD documents. For example, it is suggested that if a
> description is null, then the "special state" called "use default"
> should be inserted and that this "state" be set to "not yet evaluated."
> This means that to interpret an SDD document you not only have to
> understand the individual data items but also know the rules for
> processing them (to know that if data is missing you need to insert a
> special value and then look this value up in another list to translate
> it into something meaningful). There must be a cleaner way to do this!
>
> The "unknown", "not interpretable" and similar "special states" are
> another problem altogether. These "states" have nothing to do with
> characters and everything to do with descriptions. This is why Mike
> Dallwitz didn't include them as part of the Character List. I can
> understand why you might want to treat these as "states" of characters
> as this is the basic DELTA paradigm. However, there is no need to do
> this and doing so only complicates things. For example, to tell if a
> description has not been coded you'll need to get its state(s) and then
> check their translations to see if any are "unknown", "not
> interpretable" or any one of who knows how many other special
> conditions.
>
> A much better way to implement this functionality would be to store an
> "uncoded" flag with the description along with an (encoded or
> text-based) explanation ("unknown", "not interpretable", "too lazy to
> code this", "don't have proper specimens", "To Do" or whatever). This
> is both direct and allows the explanation to change in a simple and
> flexible way.
>
> These are the most serious and fundamental problems I have with the
> current document. I would also add the following minor comments (some
> of these expand on the above comments as well).
>
> The meaning of the DELTA "U" in attributes: It's stated in the document
> that this means "attempts to research the information were made but
> failed" and that "the state U in DELTA is also used for cases where data
> are present, but the author is unable to interpret them in current
> terminology." The DELTA documentation states simply that "A missing
> attribute is equivalent to an attribute with pseudo-value U" and that
> "If the state of the character is unknown, then the character is
> omitted, or the state value coded as U." I can't find anything in the
> DELTA documentation that gives a reason for the use of U or suggests
> that "attempts to research the information were made but failed" or
> anything similar. It would seem that DELTA treats nulls and U as the
> same, nothing more.
>
> Remove all sections dealing with "performance", "user-interfaces" and
> referring to "applications." These deal with specific software
> developments and should not influence the SDD standard.
>
> Great care should be taken when developing "business rules" for
> interpreting an SDD document. For example see the discussion starting
> with "One frequent situation is that characters are added to the
> terminology. Possible solutions to achieve a synchronization between
> descriptions and a separate terminology in this case are:". It may be
> important for SDD to document this situation but HOW it is dealt with is
> up to individual applications and has little to do with the data itself
> (which is the focus of SDD). The same problem occurs with the statement
> "The omission of character coding in a description should be used to
> express the "use default" state." This mixes data with processing. The
> lack of data is simply that - it implies nothing more and NO assumptions
> should be made about it. If the default state should be used then the
> author needs to state this.
>
> What does "terminology" mean? It seems to be the character/character
> state list but this is unclear. It might be good to develop a glossary
> of terms so we all know what we're talking about.
>
> We also need to define "schema." My understanding is that a "schema" is
> a model of the structure of the data and not the data itself. If you
> add data you don't change the schema (assuming that this addition is
> permitted by the schema - it's like adding rows to a database table -
> the schema doesn't change, only the data). But at several points the
> document says "... the case that the terminology is changing ("schema
> evolution") is ..." or similar. This sounds like changes to the data
> change the schema. Is this the general view of schema or am I on the
> wrong track?
>
> The section discussing "Data have not yet been entered", "Data cannot be
> entered" and "Data could have been entered but a deliberate decision was
> made not to enter them" seems to mix unrelated cases and is a mess. The
> options listed are (followed by my interpretation of their meaning):
> 1) "Use character default state": Data is coded and is not missing (see
> above).
> 2) "Unfinished work": Data exists but has yet to be captured.
> 3) "Not applicable": Data does not logically exist.
> 4) "Unknown": Data exists but has yet to be captured.
> 5) "Not interpretable": Data exists but has yet to be captured.
> 6) "Out of scope": Data exists but has yet to be captured.
> 7) "Do not need to score": Data is coded and is not missing.
> The mix involves data that is being pointed to through a process (No. 1
> and 7), data that is impossible to collect (No. 3) and data that can be
> collected but hasn't (for a variety of reasons) (No. 2, 4, 5 and 6).
> Keep it simple by recording if data is absent and if it is, the reason.
> Don't confuse it with application-specific short cuts ("Use default",
> "Inherit/Compile from Parent/Children") or personal decisions made by an
> author ("it's too hard to code", "this character is unimportant here",
> "appropriate specimens are unavailable", etc). These are distinct
> cases.
>
> There are a number of cases where attempts are made to develop lists of
> conditions or situations. For example, it is asked "Should a general
> 'cannot score' (for any kind of reason) be differentiated into" and then
> three situations are listed. Again, generalise this into "uncoded" with
> a text-based explanation ("observation method failed", "incomplete
> specimen" or any one of a thousand other reasons). All the information
> is captured and the system is rich enough to handle unforeseen
> situations. This approach also fits perfectly with the previous
> example.
>
> There is a note that "a general method is planned (@but not yet
> formalized!@) in SDD" and that "SDD is considering introducing computed
> characters." We've been at this for at least 2 years now and there are
> things that are "planned" or are "being considered" for SDD that we
> don't know about?!?!? Or is there an overall planning document that I'm
> not aware of? WE are SDD, SDD can't be "considering" things in
> isolation.
>
> The entire discussion of inheritance and compilation (or whatever we
> call it) needs to be thought of in a fresh light. Again, the document
> confuses data with process. For example, it's stated that "it is
> desirable that the assumption [of inheritance/compilation] rather than
> the inherited data are recorded". The author of the dataset MUST
> present real data, not give a process to get to that data. Yes, the
> author can tell us how they collected the data ("inherited from direct
> parent", "compiled using Gouldes Statistic from all coded children", "by
> reference to the default state for this character") but they also must
> give us the data, not make us go look for it. The process of
> inheritance/compilation is too complex to leave to chance and trying to
> define exactly what it means is too error prone. I would suggest that
> inheritance/compilation is an application-based activity, not a
> fundamental aspect of the data. Again, give us the data and separately
> tell us how it was derived, don't make us derive it ourselves.
>
> This exact same problem exists with "computed data." I don't see how or
> why an SDD document can contain the value "CouldNotCompute" because the
> "generator couldn't compute this". The "generator" is a piece of
> software, not a data representation (which is what SDD is).
> "CouldNotCompute" is an error message returned from an application and
> it should never end up in an SDD document. Yet again, if you want to
> explain how the data was generated, fine, do it, but don't make us
> calculate it, we don't know how.
>
> The discussion about "Not supported as special states, but supported
> through modifiers" is good but, as above, describes descriptions and not
> characters. That's why inheritance/compilation is included here.
>
> The statement "Special states express knowledge about why data for a
> given character are missing in a description and thus make a statement
> about the entire character" is interesting. It says that special states
> are about a character in a description. The point I would make is that
> they are primarily about the description, not the character. They say
> nothing about the character's use for other items, and only relate to
> this character in this taxon, and this is through the "description." So
> let's focus this document on the description and only discuss the
> character when it's appropriate. This is supported by the statement
> that "DELTA does not define the special states in the "character list"
> directive: they are implicitly present in each character." They are not
> in the character list because they have little to do with characters.
> They are in the DELTA attributes because they have everything to do with
> descriptions. I'm sorry to say that the next statement shows the danger
> of letting past developments influence current work (and I'm as guilty
> as any): "When a new character is created, DeltaAccess automatically
> creates the full set of special states." In BioLink we attach this
> information to the description: you do it when you're coding the data,
> not developing the character list. I don't know which way is better and
> as noted above, it probably doesn't matter as each application will
> translate the information into a local format anyway. What we DO need
> to do is make sure the SDD format is expressive, flexible and as
> application-neutral as possible (and ideally simple).
>
> Most of the discussion under "Relations between declarative character
> dependency and special states" seems to deal with developing
> applications rather than data representation and may well be outside
> SDD. Sure, document that these dependencies exist and use them to
> produce high-quality and internally consistent data, but do this at
> application-level, not SDD-level.
>
> The same applies to "Responsibility for validation." SDD should be
> about representing data in a standardized way, not about enforcing
> taxonomic business rules or data quality standards. As SDD will be
> represented in XML then this XML document must be well-formed and pass a
> check with a DTD or XML-Schema checker. But SDD shouldn't be
> responsible for checking the content of individual data items, that's
> the job of authors and their applications.
>
>
> ********************************************************
>
> So, to summarize:
>
> SDD is about data and only data. These data include characters and
> descriptions and (optionally) how these descriptions were developed and
> their current status.
>
> SDD is not about process. Process information can be included but it is
> optional and can be safely ignored without data loss. [I shouldn't have
> to process a "use default data" statement to access a complete
> description, the appropriate state(s) should be inserted into the
> description when the SDD document is prepared with a note on how it was
> derived.]
>
> SDD is not about applications. Discussions concerning user-interfaces
> or processing methods should not be included. [How an application
> manages and represents these data is completely independent of how SDD
> represents the data.]
>
>
> ********************************
>
> Finally, I want to reconfirm that Gregor has done a fantastic job with
> getting us this far and I support his efforts. My comments are more
> about a different perspective rather than any serious flaw in logic.
> Most of what is in the document is well presented and relevant to
> managing taxonomic descriptions, I'm just not sure it is directly
> related to the goals of SDD.
>
> And sorry for going on for so long. I'll sit down now.
>
> Steve
>
> Steve Shattuck
> CSIRO Entomology
> Steve.shattuck(a)csiro.au
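
A minimal sketch of the suggestion in Steve's message above: keep real states in the description, store an "uncoded" flag with a free-text reason, and record how a value was derived as an optional note that a consumer can ignore. The class name `Attribute`, its field names and the helper `uncoded_characters` are assumptions made for illustration only. They are not part of SDD, DELTA or BioLink.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Attribute:
    """One taxon-by-character intersection (hypothetical model, not an SDD type)."""
    character: str
    states: List[str] = field(default_factory=list)  # real, observable states only
    uncoded: bool = False                             # explicit flag, never inferred
    reason: Optional[str] = None                      # free text: "unknown", "To Do", ...
    derivation: Optional[str] = None                  # optional process note


def uncoded_characters(description: List[Attribute]) -> Dict[str, str]:
    """Map each character flagged as uncoded to its recorded reason."""
    return {
        a.character: a.reason or "no reason given"
        for a in description
        if a.uncoded
    }


description = [
    # Data is present; the derivation note can be dropped without data loss.
    Attribute("leaf shape", states=["ovate"],
              derivation="inherited from direct parent"),
    # Data is absent and the author says why, in their own words.
    Attribute("petal colour", uncoded=True,
              reason="don't have proper specimens"),
    # Data is absent and no explanation was recorded.
    Attribute("seed length", uncoded=True),
]

print(uncoded_characters(description))
```

With this shape a consumer finds uncoded cells by reading one flag, without translating special state values, and the list of reasons can grow without any change to the structure.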
Before finishing the Paris minutes, I have just finished the last
remaining bit from Brazil. On the last day we discussed special
states, which was not really in the minutes released before Paris. I
have written it as a separate document and have tried to work through
some of the issues and scenarios we touched upon in Brazil. These
include character scoping, default or implicit states, inheritance
from taxonomic hierarchy, and a bit on character dependency. The
result is a bit lengthy, but it did help _me_ a lot. I hope that the
document can be pruned down when some parts can be moved elsewhere.
I would love to get your comments and feedback. Either send them to the
list, or if you have small corrections to help me you can write them
into the document (Bob already did a first review helping to remove
the worst logic and language errors!) and send it back to me.
It is available at:
http://160.45.63.11/Projects/TDWG-SDD/Docs/SDD_P_Data_SpecialStates.html
Thanks
Gregor
----------------------------------------------------------
Gregor Hagedorn (G.Hagedorn(a)bba.de)
Institute for Plant Virology, Microbiology, and Biosafety
Federal Research Center for Agriculture and Forestry (BBA)
Koenigin-Luise-Str. 19 Tel: +49-30-8304-2220
14195 Berlin, Germany Fax: +49-30-8304-2203
Often wrong but never in doubt!
Re: Correct term for collating/summarizing/consolidating/inferring from below
by unknown@example.com 13 Mar '03
(Sorry for the delay in contributing this, I've been away the past few
days.)
In BioLink, we have the following options/commands:
  Compile
    with the options to:
      Compile Once
      Always Compile From Children
      Refresh Compile
  Inherit
    with the options to:
      Inherit Once
      Always Inherit From Parent
      Refresh Inherit
We do this independently for each cell or attribute (that is, for each
character for each taxon/item, or for each character by taxon/item
intersection). This way you can build a full description by compiling
some characters from children while inheriting other characters from the
parent.
We allow you to either regenerate these data automatically (by selecting
the "Always Compile/Inherit from Children/Parent" command) or to create
a static copy of the data so you can fine-tune it by manually inserting
comments and clarifications (with the danger that changes to
children/parents will not be reflected).
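
The per-cell behaviour described above could be sketched roughly as follows. This is not BioLink code: the names `Taxon`, `Mode`, `inherit_once` and `compile_once` are hypothetical, and "compile" is assumed here to mean the union of the children's states, which is only one of several possible definitions.

```python
from enum import Enum
from typing import Dict, List, Optional, Set


class Mode(Enum):
    STATIC = "static"                  # use the copy stored in this cell
    ALWAYS_INHERIT = "always inherit"  # re-read from the parent on every access
    ALWAYS_COMPILE = "always compile"  # re-derive from the children on every access


class Taxon:
    def __init__(self, name: str, parent: Optional["Taxon"] = None) -> None:
        self.name = name
        self.parent = parent
        self.children: List["Taxon"] = []
        self.stored: Dict[str, Set[str]] = {}  # character -> states held in this cell
        self.modes: Dict[str, Mode] = {}       # character -> mode, chosen per cell
        if parent is not None:
            parent.children.append(self)

    def _compile(self, character: str) -> Set[str]:
        # Assumption: compiling means taking the union of the children's states.
        compiled: Set[str] = set()
        for child in self.children:
            compiled |= child.states(character)
        return compiled

    def states(self, character: str) -> Set[str]:
        mode = self.modes.get(character, Mode.STATIC)
        if mode is Mode.ALWAYS_INHERIT and self.parent is not None:
            return self.parent.states(character)
        if mode is Mode.ALWAYS_COMPILE:
            return self._compile(character)
        return set(self.stored.get(character, set()))

    def inherit_once(self, character: str) -> None:
        """Take a static copy from the parent; calling it again is a refresh."""
        if self.parent is None:
            raise ValueError(f"{self.name} has no parent to inherit from")
        self.stored[character] = self.parent.states(character)
        self.modes[character] = Mode.STATIC

    def compile_once(self, character: str) -> None:
        """Take a static copy from the children; calling it again is a refresh."""
        self.stored[character] = self._compile(character)
        self.modes[character] = Mode.STATIC


genus = Taxon("Genus")
sp_a = Taxon("Species A", parent=genus)
sp_b = Taxon("Species B", parent=genus)

genus.stored["habitat"] = {"forest"}
sp_a.stored["colour"] = {"red"}
sp_b.stored["colour"] = {"black"}

genus.modes["colour"] = Mode.ALWAYS_COMPILE   # compiled from children on each read
sp_a.modes["habitat"] = Mode.ALWAYS_INHERIT   # follows the parent on each read
sp_b.inherit_once("habitat")                  # static copy, can go stale

genus.stored["habitat"].add("heath")          # later change to the parent
```

After the last line, Species A sees the new habitat while Species B keeps its static copy until `inherit_once` is run again, which is the trade-off noted above. The sketch does not guard against a parent compiling and a child inheriting the same character at the same time, which would recurse without end.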
Our hope is that by including the phrases "from children" and "from
parent" in the commands that we can make the actions of these commands
clear without having to resort to looking these terms up in the help
files or, worse, a dictionary.
Steve Shattuck
CSIRO Entomology
steve.shattuck(a)csiro.au