(XML) XML?

Thu Dec 2 09:45:24 CET 1999

Will someone list for us exactly the benefits and disbenefits that will flow
from using XML for a data interchange standard?

Gregor wrote:

>>These issues should be seen quite separately, and that we need both
>>
>>1. A markup language for text documents. This includes:
>>  - existing descriptions, captured by OCR or other means, where the
>>    markup would be added manually (or automated/data mining?)
>>  - computer generated natural language descriptions that are
>>    published as electronic documents. A new CSIRO package could have
>>    a ToNat command that automatically adds the necessary, hidden
>>    markup code.
>>2. A data language for new observations, including repeated
>>   measurements or repeated observations of categorical data, e.g.
>>   shape of multiple leaves in a single specimen. Further, many
>>   structures for knowledge management (data revision, annotation,
>>   quality control and assessment) need to be implemented here.
>>
>>The issues overlap, and it would be beneficial to use as much common
>>syntax as possible, but fundamentally I believe them to be quite
>>different.

To which Jim Croft replied:

>Can you explain in more detail why they should be fundamentally different?
>Aren't the differences just a matter of degree, different points
>on a continuum of data as it were?  And aren't the basic principles
>applicable across the whole?

We need to explore this a lot more.

I'm talking here about the data language, not the markup language, as it
seems to me there are practical, even if not theoretical, differences.

I can well understand the benefit from marking up a document (a text
description) in XML. But when it comes to straight data, I'm less sure.

By straight data, I mean data that can be represented most concisely as a
matrix, for instance:

Characters:
Leaves
 ovate
 elliptic
Flowers
 blue
 yellow
Taxa:
 taxon1
 taxon2
Data (taxa x states):
0101
1010

(...or a DELTA type disaggregated matrix)

With this type of data, the XML markup becomes considerably more verbose
than the data. Is this a problem?
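To make the difference concrete, here is a minimal sketch that writes the small example above both ways and counts characters. The element and attribute names (dataset, taxon, score, state, value) are invented for illustration; they are not XDELTA or any agreed format. The character and taxon labels are left out of the matrix count, since a matrix format lists them only once.

```python
import xml.etree.ElementTree as ET

# The example data from above, with the states flattened into one list.
taxa = ["taxon1", "taxon2"]
states = ["Leaves ovate", "Leaves elliptic", "Flowers blue", "Flowers yellow"]
matrix = ["0101", "1010"]


def matrix_to_xml(taxa, states, matrix):
    """Spell out every score with its taxon and state named explicitly.

    The element and attribute names are hypothetical.
    """
    root = ET.Element("dataset")
    for taxon, row in zip(taxa, matrix):
        taxon_el = ET.SubElement(root, "taxon", name=taxon)
        for state, bit in zip(states, row):
            ET.SubElement(taxon_el, "score", state=state, value=bit)
    return ET.tostring(root, encoding="unicode")


compact = "\n".join(matrix)
verbose = matrix_to_xml(taxa, states, matrix)
print(len(compact), "characters as a matrix")
print(len(verbose), "characters as XML")
```

Each score costs one character in the matrix, but a whole element with two attributes in the XML, so the gap widens as the matrix grows.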

Over the past couple of days I've partially implemented an export function
to produce Leigh's XDELTA documents (as a simple example of a possible XML
format) for the data in Lucid keys. I have a key to families of flowering
plants of Australia (240 taxa, 166 characters, 600 states). The data I'm
using are simple - basically a score matrix, a list of taxa and a list of
characters. The file sizes in three formats for these data are:

LucID    166 KB
DELTA    240 KB
XML      c. 3 MB

And this is only the most basic XML!

Now, perhaps in a few years we'll be carrying terabytes of data in our back
pockets, but I would have thought that file size is important, particularly
if I'm using this as an interchange format and want to email the data to
someone. (It's true that the XML file compresses very well into a zip file.)
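The compression remark is easy to check in principle. The sketch below builds a synthetic XML file with the same dimensions as the key (240 taxa, 600 states) and runs it through zlib, the same DEFLATE method that zip files use. The scores are random and the tag names are invented, so the sizes it prints say nothing definite about the real Lucid export; it only shows why repetitive markup shrinks so well.

```python
import random
import zlib


def synthetic_xml(n_taxa, n_states, seed=0):
    """Build a made-up XML score file; tag names and scores are hypothetical."""
    rng = random.Random(seed)
    lines = ["<dataset>"]
    for t in range(n_taxa):
        lines.append(f'  <taxon id="t{t}">')
        for s in range(n_states):
            bit = rng.randint(0, 1)
            lines.append(f'    <score state="s{s}" value="{bit}"/>')
        lines.append("  </taxon>")
    lines.append("</dataset>")
    return "\n".join(lines).encode("utf-8")


raw = synthetic_xml(n_taxa=240, n_states=600)
packed = zlib.compress(raw, 9)  # 9 = strongest compression level

print(f"uncompressed: {len(raw)} bytes")
print(f"compressed:   {len(packed)} bytes")
```

The tags repeat on every line, so the compressor stores them almost for free; what remains is mostly the scores themselves.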

This raises the following question: what could we do with such data as XML
that we couldn't do with the same data in a simple structured file like the
one above?

Leigh wrote:
>Secondly I might not be generating anything visible at all. I can
>well imagine an application that will take an XML document and
>from the data within produce (say) a taxonomic tree or trees of that
>data. Here I wouldn't use a stylesheet, I'd simply process the
>data directly.

Is direct processing of XML data any easier than direct processing of the
data in a simpler format? Perhaps there will be off-the-shelf parsing tools,
but how much of a benefit will this be?
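One way to weigh this is to write both readers. The sketch below reads the same two-taxon example from a plain matrix and from a hypothetical XML rendering (the tag and attribute names are invented, not XDELTA), using the ElementTree parser from the Python standard library as the off-the-shelf tool.

```python
import xml.etree.ElementTree as ET

TAXA = ["taxon1", "taxon2"]
MATRIX_TEXT = "0101\n1010"

# Hypothetical XML rendering of the same scores.
XML_TEXT = """<dataset>
  <taxon name="taxon1">
    <score state="1" value="0"/>
    <score state="2" value="1"/>
    <score state="3" value="0"/>
    <score state="4" value="1"/>
  </taxon>
  <taxon name="taxon2">
    <score state="1" value="1"/>
    <score state="2" value="0"/>
    <score state="3" value="1"/>
    <score state="4" value="0"/>
  </taxon>
</dataset>"""


def scores_from_matrix(text, taxa):
    """Read the matrix; taxon and state identity come from position."""
    scores = {}
    for taxon, row in zip(taxa, text.split()):
        for state_number, bit in enumerate(row, start=1):
            scores[(taxon, state_number)] = int(bit)
    return scores


def scores_from_xml(text):
    """Read the XML; taxon and state identity come from attributes."""
    scores = {}
    root = ET.fromstring(text)
    for taxon_el in root.iter("taxon"):
        for score_el in taxon_el.iter("score"):
            key = (taxon_el.get("name"), int(score_el.get("state")))
            scores[key] = int(score_el.get("value"))
    return scores


print(scores_from_matrix(MATRIX_TEXT, TAXA) == scores_from_xml(XML_TEXT))
```

For data this regular the two readers are about the same length. The parser would presumably earn its keep on the harder cases: nested structures, optional fields, escaping and character encodings, none of which this example exercises.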

The problem to my mind is that in current formats, e.g. DELTA and LucID,
much information is implied by context. Thus, in

1010
0101

the taxon and character-state identities are implied by the position of each
bit in the matrix. In XML this information is verbosely explicit.
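What "implied by position" demands of a reader can be sketched as follows: rows are matched to the taxon list in order, and columns to the states of each character in order. The character and taxon lists are borrowed from the earlier example; pairing them with this particular matrix is an assumption made for illustration.

```python
# Character and taxon lists from the earlier example (assumed to apply here).
characters = [
    ("Leaves", ["ovate", "elliptic"]),
    ("Flowers", ["blue", "yellow"]),
]
taxa = ["taxon1", "taxon2"]
matrix = ["1010", "0101"]


def column_labels(characters):
    """Flatten the character list so that column i names one (character, state)."""
    return [(name, state) for name, states in characters for state in states]


def describe(taxa, characters, matrix):
    """Turn positional bits into explicit (character, state) pairs per taxon."""
    labels = column_labels(characters)
    described = {}
    for taxon, row in zip(taxa, matrix):
        described[taxon] = [labels[i] for i, bit in enumerate(row) if bit == "1"]
    return described


for taxon, pairs in describe(taxa, characters, matrix).items():
    print(taxon, pairs)
```

The matrix is only meaningful alongside the two lists and the ordering convention; change the order of either list and every score silently changes its meaning. That is the price of the compactness.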

Is the following true? Once upon a time, computers could represent but not
efficiently analyse or process textual data, hence documents were stored as
text but "data" were stored as matrices etc. Now XML has blurred the
boundary between these types of information ("textual" and "data") and we're
exploring the implications of that blurring. But are there now no
differences at all, and no further need for a matrix?

A final point. It seems to me that the discussion so far has been dominated
by computer nuts (no offence intended, we need you!), with relatively little
input from the community of taxonomists whose needs this standard is
supposed to serve. Most of these will be happy to use a well-crafted tool,
but won't want to know the intricacies of XML, and will be put off if things
are not quite straightforward. This is a danger. We may end up with a
wonderful, sophisticated, state-of-the-art, cutting-edge standard that
nobody uses.

Cheers - k