Although some time has passed, I would like to take up something that was posed by Kevin under "(XML) XML?".
The problem to my mind is that in current formats, e.g. DELTA and LucID, much information is implied by context. Thus, in the matrix 1010 0101, the taxon and character state numbers (identities) are implied by the position of the data bit in the matrix. In XML this information is verbosely explicit.
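To make the contrast concrete, here is a minimal sketch in Python that expands the small matrix above into explicit XML. It assumes that each row is a taxon and each column a character state; the element and attribute names (descriptions, taxon, state, id) are invented for illustration and are not taken from DELTA, LucID, or any proposed standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical 2-taxon x 4-state matrix, as in the "1010 0101" example.
matrix = ["1010", "0101"]

def matrix_to_xml(rows):
    """Make the positionally implied taxon and state numbers explicit."""
    root = ET.Element("descriptions")
    for taxon_no, row in enumerate(rows, start=1):
        taxon = ET.SubElement(root, "taxon", id=str(taxon_no))
        for state_no, bit in enumerate(row, start=1):
            if bit == "1":
                # Only states that are present get an element of their own.
                ET.SubElement(taxon, "state", id=str(state_no))
    return root

root = matrix_to_xml(matrix)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

The eight matrix characters become a string several times as long, which is the verbosity in question; in exchange, every value carries its own taxon and state identity with it.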
The following question: what could we do with such data as XML that we couldn't do with the data as a simple structured file as above?
Is direct processing of XML data any easier than direct processing of the data in a simpler format? Perhaps there will be off-the-shelf parsing tools, but how much of a benefit will this be?
Among other things already mentioned, I see two benefits:
1. A matrix is inherently order-dependent. Reordering characters or taxa is necessarily tied to the contents of the data. In a more verbose presentation (be it XML or not), these actions can be decoupled. This is not currently an advantage as long as we mainly use isolated programs with import/export interaction alone. However, I believe in the future we will use programs working on networked data. This could mean that I locally modify only the character definition, perhaps translate it to German, define a character and taxon subset for my use, and change character hierarchy or ordering, but still cooperate with other groups using the same description data on the net. That requires, of course, that the descriptions use, at least partially, standardized character definitions and that globally unique identifiers have been assigned to characters.
The advantage is not clear when thinking of DELTA, where the character ID is local, and also identical with the character ordering definition. These are understandable design limitations, which have the advantage of producing simple and compact files. In the future, I believe, we will want to be more flexible than this. In DeltaAccess, I followed the DELTA example and coupled ID and ordering, which currently brings me quite some trouble...
2. Only with explicit and verbose tagging can we achieve forward AND backward compatibility. If LucID wanted to support additional modifiers, perhaps change the 2 frequency states into a more specific and complex system, or if it wanted to allow state-specific reportable text and internal annotation (which are extremely difficult to support in matrix notation anyway) - the parser of a new application version would have to be changed quite a bit, perhaps rewritten - any older software would no longer be able to read these data.
With a well designed verbose, fully tagged format, both these restrictions need not apply, I believe.
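The compatibility argument in point 2 can be sketched as follows. This is only an illustration under assumed names: the description, state, and annotation elements and the character, id, and modifier attributes are hypothetical and do not describe any existing LucID or DELTA format. The old reader simply skips tags and attributes it does not know, and the new reader treats the added attribute as optional.

```python
import xml.etree.ElementTree as ET

# Hypothetical "new version" file: the modifier attribute and the
# <annotation> element are additions unknown to the older reader.
NEW_DOC = """
<description taxon="T1">
  <state character="C1" id="2" modifier="rarely"/>
  <state character="C2" id="1">
    <annotation>internal note on this state</annotation>
  </state>
</description>
"""

# Hypothetical "old version" file without any of the additions.
OLD_DOC = '<description taxon="T1"><state character="C1" id="2"/></description>'

def read_states_v1(xml_text):
    """Old reader: collects (character, state) pairs, ignores everything else."""
    root = ET.fromstring(xml_text)
    return [(s.get("character"), s.get("id")) for s in root.iter("state")]

def read_states_v2(xml_text):
    """New reader: also returns the modifier, or None where it is absent."""
    root = ET.fromstring(xml_text)
    return [(s.get("character"), s.get("id"), s.get("modifier"))
            for s in root.iter("state")]

print(read_states_v1(NEW_DOC))  # old software reading new data
print(read_states_v2(OLD_DOC))  # new software reading old data
```

With a positional matrix, neither direction would work without agreeing in advance on where the extra information sits.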
Is the following true?: once upon a time, computers could represent but not efficiently analyse or process textual data, hence documents were stored as text but "data" were stored as matrices etc. Now, XML has blurred the boundary between these types of information ("textual" and "data") and we're exploring the implications of that blurring. But are there now no differences, and no further need for a matrix?
I believe we need structure, and we need support for quality control. The distinction between text and data is indeed somewhat blurred, but text must be structured in ways that can be processed analytically. Free comments anywhere, where it is not clear to which data item a comment belongs, will not help. Also, in free text the same concept will often be expressed with different words. This is a common problem known to anybody who has tried to capture free conventional descriptions in a database. For example, even in DELTA, frequency statements in free text "comments" may express the same frequency by multiple more or less synonymous wordings (are "mostly" and "usually" the same or not?). None of these frequency wordings is really defined, and they are therefore not accessible to analysis.
DeltaAccess tries to use textual frequency modifiers, where the number of possible modifiers is restricted in the character definition, and where each frequency modifier is defined as to the exact upper and lower frequency range it represents. This is analytically accessible.
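As a rough sketch of what "analytically accessible" can mean here: once each permitted modifier has a defined range, a program can reject undefined wordings and compare two statements. The labels and numeric ranges below are invented for illustration and are not the actual DeltaAccess definitions; the function names are hypothetical as well.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FrequencyModifier:
    label: str
    lower: float  # lowest frequency covered, as a fraction from 0.0 to 1.0
    upper: float  # highest frequency covered, as a fraction from 0.0 to 1.0

# Illustrative values only; a real character definition would supply its own.
MODIFIERS = {
    m.label: m
    for m in (
        FrequencyModifier("rarely", 0.0, 0.25),
        FrequencyModifier("sometimes", 0.25, 0.75),
        FrequencyModifier("usually", 0.75, 1.0),
    )
}

def parse_modifier(label):
    """Accept only wordings from the restricted list instead of guessing."""
    try:
        return MODIFIERS[label]
    except KeyError:
        raise ValueError(f"undefined frequency modifier: {label!r}") from None

def may_overlap(a, b):
    """True if two modifiers could describe the same observed frequency."""
    return a.lower <= b.upper and b.lower <= a.upper

rarely = parse_modifier("rarely")
usually = parse_modifier("usually")
print(may_overlap(rarely, usually))
```

In this sketch a free-text wording such as "mostly" raises an error at data entry, so the question of whether it equals "usually" has to be settled once in the definition rather than guessed during analysis.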
Gregor
----------------------------------------------------------
Gregor Hagedorn                 G.Hagedorn@bba.de
Institute for Plant Virology, Microbiology, and Biosafety
Federal Research Center for Agriculture and Forestry (BBA)
Koenigin-Luise-Str. 19          Tel: +49-30-8304-2220
14195 Berlin, Germany           Fax: +49-30-8304-2203
Often wrong but never in doubt!