(RQT) Structure, Matrix, and Text

Fri Dec 17 11:40:57 CET 1999

Although some time has passed, I would like to take up something
that Kevin raised under "(XML) XML?".

> The problem to my mind is that in current formats, e.g. DELTA and
> LucID, much information is implied by context. Thus, in
> 1010
> 0101
> The taxa and character state numbers (identities) are implied by the
> position of the data bit in the matrix. In XML this information is
> verbosely explicit.

> The following question: what could we do with such data as XML that we couldn't
> do with the data as a simple structured file as above?

> Is direct processing of XML data any easier than direct processing of the
> data in a simpler format? Perhaps there will be off-the-shelf parsing tools,
> but how much of a benefit will this be?

Besides the points already mentioned, I see two benefits:

1. A matrix is inherently order-dependent. Reordering characters
or taxa necessarily means rearranging the data themselves. In a more
verbose representation (be it XML or not), these actions can be
decoupled. This is not yet much of an advantage as long as we mainly
use isolated programs that interact only through import and export.
However, I believe that in the future we will use programs working on
networked data. This could mean that I locally modify only the
character definition, perhaps translate it to German, define a
character and taxon subset for my use, and change the character
hierarchy or ordering, but still cooperate with other groups using the
same description data on the net. That requires, of course, that the
descriptions use, at least partially, standardized character
definitions and that globally unique identifiers have been assigned to
characters.

The advantage is not clear when thinking of DELTA, where the
character ID is local and also doubles as the character ordering
definition. These are understandable design limitations, which have
the advantage of producing simple and compact files. In the future, I
believe, we will want to be more flexible than this. In DeltaAccess,
I followed the DELTA example and coupled ID and ordering, which
currently causes me quite some trouble...
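
A minimal sketch of what this decoupling could look like, in Python.
Everything named here is an assumption made for illustration: the
"urn:example:" identifiers, the sample taxa and states, and the
render_matrix helper are all hypothetical and not taken from DELTA,
LucID, or DeltaAccess. The data are keyed by identifier, and the
ordering lives in separate lists, so each user can choose a subset and
an order without touching the shared descriptions.

```python
# Hypothetical globally unique identifiers (the "urn:example:" scheme is invented).
LEAF_SHAPE = "urn:example:char:leaf-shape"
FLOWER_COLOUR = "urn:example:char:flower-colour"
TAXON_A = "urn:example:taxon:a"
TAXON_B = "urn:example:taxon:b"

# Shared description data, keyed by identifier rather than by position.
descriptions = {
    TAXON_A: {LEAF_SHAPE: "ovate", FLOWER_COLOUR: "red"},
    TAXON_B: {LEAF_SHAPE: "lanceolate", FLOWER_COLOUR: "white"},
}


def render_matrix(data, taxon_order, char_order, missing="?"):
    """Build a positional matrix for one view; the shared data stay unchanged."""
    return [
        [data[taxon].get(char, missing) for char in char_order]
        for taxon in taxon_order
    ]


# The view used by one group: all taxa, shape before colour.
shared_view = render_matrix(descriptions, [TAXON_A, TAXON_B], [LEAF_SHAPE, FLOWER_COLOUR])

# My local view: a taxon subset and a different character order.
local_view = render_matrix(descriptions, [TAXON_B], [FLOWER_COLOUR, LEAF_SHAPE])
```

Both views are derived from the same descriptions; only the two order
lists differ. In a positional matrix the same change would mean
rewriting every row.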

2. Only with explicit and verbose tagging can we achieve forward AND
backward compatibility. Suppose LucID wanted to support additional
modifiers, perhaps changing the 2 frequency states into a more
specific and complex system, or wanted to allow state-specific
reportable text and internal annotation (which are extremely
difficult to support in matrix notation anyway). Then:
- the parser of the new application version would have to be changed
quite a bit, perhaps rewritten
- any older software would no longer be able to read these data.

With a well-designed, verbose, fully tagged format, neither of these
restrictions need apply, I believe.
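
To illustrate, here is a small Python sketch using the standard
xml.etree.ElementTree module. The element and attribute names
(description, state, char, value, frequency, note) are invented for
this example and do not come from any existing format. The "old"
reader simply skips what it does not know, and the "new" reader
supplies defaults for what is missing.

```python
import xml.etree.ElementTree as ET

# A file written by an older application (hypothetical tag names).
OLD_DOC = """<description taxon="t1">
  <state char="c1" value="2"/>
</description>"""

# A file written by a newer application that adds a frequency and a note.
NEW_DOC = """<description taxon="t1">
  <state char="c1" value="2" frequency="0.8">
    <note>seen mainly in young plants</note>
  </state>
</description>"""


def read_states_v1(xml_text):
    """Old reader: knows only 'char' and 'value', ignores anything else."""
    root = ET.fromstring(xml_text)
    return {s.get("char"): s.get("value") for s in root.iter("state")}


def read_states_v2(xml_text):
    """New reader: also reads 'frequency' and <note>, both None when absent."""
    root = ET.fromstring(xml_text)
    result = {}
    for s in root.iter("state"):
        result[s.get("char")] = {
            "value": s.get("value"),
            "frequency": s.get("frequency"),
            "note": s.findtext("note"),
        }
    return result


old_reads_new = read_states_v1(NEW_DOC)   # forward compatibility
new_reads_old = read_states_v2(OLD_DOC)   # backward compatibility
```

This only works if the format is designed so that additions are
optional; a change that alters the meaning of an existing tag would
still break older readers.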

> Is the following true?: once upon a time, computers could represent but not
> efficiently analyse or process textual data, hence documents were stored as
> text but "data" were stored as matrices etc. Now, XML has blurred the
> boundary between these types of information ("textual" and "data") and we're
> exploring the implications of that blurring. But are there now no
> differences, and no further need for a matrix?

I believe we need structure, and we need support for quality control.
The distinction between text and data is indeed somewhat blurred, but
text must be structured in ways that can be processed analytically.
Free comments anywhere, where it is not clear to which data item a
comment belongs, will not help. Also, in free text the same concept
will often be expressed with different words. This is a common
problem, known to anybody who has tried to capture conventional
free-text descriptions in a database. For example, even in DELTA,
frequency statements in free-text "comments" may express the same
frequency by several more or less synonymous wordings (are "mostly"
and "usually" the same or not?). None of these frequency wordings is
really defined, and they are therefore not accessible to analysis.

DeltaAccess tries to use textual frequency modifiers, where the
number of possible modifiers is restricted in the character
definition, and where each frequency modifier is defined by the
exact lower and upper bounds of the frequency range it represents.
This is analytically accessible.
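
As a rough sketch of the idea in Python: the modifier labels and the
numeric ranges below are invented for illustration and are not the
values used in DeltaAccess, and the names FrequencyModifier, MODIFIERS
and may_agree are hypothetical. Once each wording carries a defined
range, the question whether two statements can describe the same
frequency becomes a simple comparison.

```python
from typing import NamedTuple


class FrequencyModifier(NamedTuple):
    label: str
    lower: float  # lower bound of the frequency range, 0.0 to 1.0
    upper: float  # upper bound of the frequency range, 0.0 to 1.0


# A restricted vocabulary; the ranges are assumptions made for this example.
MODIFIERS = {
    m.label: m
    for m in [
        FrequencyModifier("rarely", 0.0, 0.2),
        FrequencyModifier("sometimes", 0.2, 0.6),
        FrequencyModifier("usually", 0.6, 0.95),
        FrequencyModifier("always", 0.95, 1.0),
    ]
}


def may_agree(label_a, label_b):
    """True if the two modifiers' ranges overlap, so both could hold at once."""
    a, b = MODIFIERS[label_a], MODIFIERS[label_b]
    return a.lower <= b.upper and b.lower <= a.upper


def parse_modifier(word):
    """Accept only defined wordings; undefined ones such as 'mostly' are rejected."""
    if word not in MODIFIERS:
        raise ValueError(f"undefined frequency modifier: {word!r}")
    return MODIFIERS[word]
```

Rejecting undefined wordings at data entry is what provides the
quality control: "mostly" must either be given a range of its own or
be replaced by a defined modifier.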

Gregor
----------------------------------------------------------
Gregor Hagedorn                 G.Hagedorn at bba.de
Institute for Plant Virology, Microbiology, and Biosafety
Federal Research Center for Agriculture and Forestry (BBA)
Koenigin-Luise-Str. 19          Tel: +49-30-8304-2220
14195 Berlin, Germany           Fax: +49-30-8304-2203

Often wrong but never in doubt!