Types of data

Wed Nov 24 09:50:55 CET 1999

I almost entirely agree with Peter.

Most existing formats deal not with data per se, but rather with
representations of data.  0's and 1's in phylogenetic analysis are an obvious
example.  While either value could represent a particular molecular base pair
substitution, it need not, and it will almost certainly only rarely reflect an
actual measurement of a particular specimen or series of specimens, even though
such information would be required, in some measure or estimation, to actually
reach the "higher-order inference".

I too would argue that what we seek is a system that distinguishes between
"observations", measured in some sense (either directly using some device, or
perhaps "mapped" using known or widely used conventions), and higher-level
abstractions that often pass as qualitative "data" (e.g. leaf "ovate" or state
"relatively derived" or "1").

Because science requires assumptions be made, it is clear that in many areas,
such representations are routinely assumed, with much or relatively little
justification, as the issues require.  Consequently, our "matrix" would be
extremely sparse should one insist that only "measured variables" be specified
as "character data" to reach the abstractions needed to conduct some kinds of
investigations (e.g. pooling museum records for a given species whose taxonomy
is established on the basis of morphology).  DELTA and NEXUS serve well in a
variety of contexts.  However, one would also like to be able to request the
"intersection" of a variety of "weakly associated" or
"semi-structured" information that might facilitate determining what the
distribution and taxonomic conclusions might be for select sub-populations
measured for a number of potentially widely different
"features/properties/potential synonyms".
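
As a very rough sketch of what requesting such an "intersection" might look
like, assume each source contributes loosely structured records and that a
synonym table exists to reconcile feature names.  The record layout, the
SYNONYMS table, the specimen identifiers and the function names below are all
hypothetical, invented purely for illustration.

```python
# Hypothetical synonym table: maps source-specific feature names to one shared name.
SYNONYMS = {
    "leaf_len": "leaf length",
    "lamina length": "leaf length",
    "corolla color": "petal colour",
}


def normalise(record):
    """Return the record's features keyed by shared names; unknown names are kept as-is."""
    return {SYNONYMS.get(name, name): value
            for name, value in record["features"].items()}


def specimens_with_all(records, wanted):
    """List specimens whose pooled records cover every feature in `wanted`."""
    pooled = {}
    for record in records:
        pooled.setdefault(record["specimen"], {}).update(normalise(record))
    return sorted(s for s, feats in pooled.items()
                  if set(wanted).issubset(feats))


# Made-up records from two hypothetical sources that use different feature names.
records = [
    {"source": "museum A", "specimen": "S1", "features": {"leaf_len": 42.0}},
    {"source": "museum B", "specimen": "S1", "features": {"corolla color": "white"}},
    {"source": "museum B", "specimen": "S2", "features": {"lamina length": 37.5}},
]

hits = specimens_with_all(records, ["leaf length", "petal colour"])
print(hits)
```

Only specimen S1 is covered for both features once the synonyms are
reconciled.  The hard part, of course, is agreeing on the synonym table, not
the lookup.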

To use the XML analogy, perhaps we need a tag language that can
"wrap" "representations of data", as well as other tags that indicate the
circumstances under which "measured variables" were actually taken (e.g.
measurements taken from landmarks that may subtly differ among investigators;
device used; preparation methods, etc.).  In any case we need a means to
distinguish between these two [maybe more?] "fundamental [?] types" of "data",
while at the same time formulating a searching/description language rich
enough to characterize reasonably precisely the context in which the
representations were made, as well as the objects themselves.
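
One way to picture the distinction, offered only as a minimal sketch: a single
record type that carries either a measurement context or a coding convention.
The class names, field names and example values below are assumptions made up
for illustration; they are not taken from DELTA, NEXUS or any existing format.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class Context:
    """Circumstances under which a measurement was taken (hypothetical fields)."""
    device: str
    preparation: str
    landmarks: str


@dataclass
class Datum:
    specimen: str
    feature: str
    value: Union[float, str]
    context: Optional[Context] = None   # present for measured variables
    convention: Optional[str] = None    # present for coded representations

    @property
    def kind(self) -> str:
        # A datum with a measurement context is treated as an observation;
        # anything else is treated as a representation.
        return "observation" if self.context is not None else "representation"


# Made-up example data for one specimen.
data = [
    Datum("S1", "leaf length (mm)", 42.0,
          context=Context("digital caliper", "pressed, dried",
                          "base of lamina to apex")),
    Datum("S1", "leaf shape", "ovate", convention="published shape chart"),
    Datum("S1", "character 7", "1", convention="coded matrix state"),
]

observations = [d for d in data if d.kind == "observation"]
print([d.feature for d in observations])
```

Here only the caliper measurement counts as an "observation"; "ovate" and "1"
are kept, but flagged as representations made under a named convention.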

If I disagree with Peter, it is that for at least a set of morphological
features that can be imaged, we are rapidly approaching a day when it will be
possible to take thousands of measurements on each of thousands of objects
quickly.  For these, arriving at automated, well-defined descriptions of shape
will no longer be an issue (except perhaps which method/measure is most useful
for a particular or general purpose).  Rather, the issue is how we can set up a
system of describing such measured data that allows us to evaluate it against
taxonomic and morphological conclusions reached in the past using other
methods, not to mention compare it against other molecular, developmental,
physiological, and ecological data collected using a wide variety of
techniques, both in the past and in the future.  Also, comparison of
methods for taking such data will become increasingly important as the devices
(hardware and software) become ever more sophisticated and our language for
interpreting their output becomes increasingly precise.

Perhaps a useful approach might be to evaluate various "data types" separately
to establish the appropriate set of "context" tags (the essence of DELTA,
LUCID, NEXUS and related approaches), while also seeking to better understand
the nature of the "conceptual wrappers" that will be needed to associate (tag)
different contexts.  At least such an approach might permit comparison of the
controlled vocabularies of alternative tagging methods where there is content
and concept overlap.  Such an approach might also permit us to assess just what
kinds of associations we need to be able to make, and hence what kind of
language we need to construct to permit "multi-dimensional" extensions.
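
To make the comparison of controlled vocabularies a little more concrete, here
is a small sketch that assumes each tagging scheme's terms have already been
mapped to shared concepts.  The two schemes, their terms and the concept
labels are invented for illustration and do not reproduce the actual
vocabularies of DELTA, LUCID or NEXUS.

```python
# Hypothetical term -> concept mappings for two tagging schemes.
scheme_a = {
    "ovate": "shape:egg-shaped",
    "lanceolate": "shape:lance-shaped",
    "glabrous": "surface:hairless",
}
scheme_b = {
    "egg-shaped": "shape:egg-shaped",
    "hairless": "surface:hairless",
    "smooth": "surface:hairless",
    "winged": "stem:winged",
}


def concept_overlap(vocab_a, vocab_b):
    """Map each shared concept to the terms each scheme uses for it."""
    shared = set(vocab_a.values()) & set(vocab_b.values())
    return {
        concept: (
            sorted(t for t, c in vocab_a.items() if c == concept),
            sorted(t for t, c in vocab_b.items() if c == concept),
        )
        for concept in sorted(shared)
    }


overlap = concept_overlap(scheme_a, scheme_b)
# Concepts that only the first scheme can express.
only_a = sorted(set(scheme_a.values()) - set(scheme_b.values()))

for concept, (terms_a, terms_b) in overlap.items():
    print(concept, terms_a, terms_b)
print("only in scheme A:", only_a)
```

The overlap shows where two schemes can be translated into one another, and
the leftovers show where a "conceptual wrapper" would have to carry something
one scheme cannot say.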

"P. F. Stevens" wrote:

> Of course, the IAPT shapes are not "real", they are conventions (the actual
> circumscription of the various shapes was decided at a meeting back in the
> '50s, I think).  ...