Will someone list for us exactly the benefits and disbenefits that will flow from using XML for a data interchange standard?
Gregor wrote:
I believe these issues should be seen quite separately, and that we need both:
- A markup language for text documents. This includes:
  - existing descriptions, captured by OCR or other means, where the markup would be added manually (or perhaps by automated means/data mining?)
  - computer-generated natural-language descriptions that are published as electronic documents. A new CSIRO package could have a ToNat command that automatically adds the necessary hidden markup code.
- A data language for new observations, including repeated measurements or repeated observations of categorical data, e.g. the shape of multiple leaves on a single specimen. Further, many structures for knowledge management (data revision, annotation, quality control and assessment) need to be implemented here.
The issues overlap, and it would be beneficial to use as much common syntax as possible, but fundamentally I believe them to be quite different.
To which Jim Croft replied:
Can you explain in more detail why they should be fundamentally different? Aren't the differences just a matter of degree, different points on a continuum of data as it were? And aren't the basic principles applicable across the whole?
We need to explore this a lot more.
I'm talking here about the data language, not the markup language, as it seems to me there are practical, even if not theoretical, differences.
I can well understand the benefit from marking up a document (a text description) in XML. But when it comes to straight data, I'm less sure.
By straight data, I mean data that can be represented most concisely as a matrix, for instance:
Characters: Leaves (ovate / elliptic); Flowers (blue / yellow)
Taxa: taxon1, taxon2
Data (taxa x states):
0101
1010
(... or a DELTA-type disaggregated matrix)
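To make the comparison concrete, here's a rough Python sketch that encodes the little matrix above as XML and compares the sizes. The element and attribute names are made up for illustration only, not any proposed standard:

    # A minimal sketch: the same score matrix, positional vs XML.
    # Element/attribute names are hypothetical, not a proposed standard.
    import xml.etree.ElementTree as ET

    taxa = ["taxon1", "taxon2"]
    states = ["Leaves ovate", "Leaves elliptic", "Flowers blue", "Flowers yellow"]
    matrix = ["0101", "1010"]  # rows = taxa, columns = states

    root = ET.Element("dataset")
    for taxon, row in zip(taxa, matrix):
        t = ET.SubElement(root, "taxon", name=taxon)
        for state, score in zip(states, row):
            ET.SubElement(t, "score", state=state, value=score)

    print(len("".join(matrix)), "bytes of raw scores, versus",
          len(ET.tostring(root)), "bytes of XML")

Even on two taxa and four states, the markup swamps the data.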
With this type of data, the XML markup becomes considerably more verbose than the data. Is this a problem?
Over the past couple of days I've partially implemented an export function to produce Leigh's XDELTA documents (as a simple example of a possible XML format) for the data in Lucid keys. I have a key to families of flowering plants of Australia (240 taxa, 166 characters, 600 states). The data I'm using are simple - basically a score matrix, a list of taxa and a list of characters. The file sizes in three formats for these data are:
LucID: 166 kb
DELTA: 240 kb
XML:   c. 3 Mb
And this is only the most basic XML!
Now, perhaps in a few years we'll be carrying terabytes of data in our back pockets, but I would have thought that file size still matters, particularly if I'm using this as an interchange format and want to email the data to someone (the XML file compresses very well into a zip file, it's true).
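The compression is easy to demonstrate: most of a verbose XML file is the same tag and attribute names repeated over and over, which is exactly what a deflate-style compressor thrives on. A standalone sketch:

    # Sketch: repeated markup compresses dramatically.
    import zlib

    xml_bytes = b'<score state="Leaves ovate" value="0"/>' * 600
    packed = zlib.compress(xml_bytes, 9)  # level 9 = maximum compression
    print(len(xml_bytes), "bytes of XML ->", len(packed), "bytes zipped")

Of course, if the data have to be zipped before anyone can sensibly exchange them, that itself says something about the raw format.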
Which raises the following question:
what could we do with such data as XML that we couldn't do with the data as a simple structured file like the one above?
Leigh wrote:
Secondly, I might not be generating anything visible at all. I can well imagine an application that will take an XML document and, from the data within, produce (say) a taxonomic tree or trees of that data. Here I wouldn't use a stylesheet, I'd simply process the data directly.
Is direct processing of XML data any easier than direct processing of the data in a simpler format? Perhaps there will be off-the-shelf parsing tools, but how much of a benefit will this be?
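For what it's worth, "off-the-shelf parsing" in practice might look something like this (again using my hypothetical markup from the sketch above, not any agreed standard):

    # Sketch: direct processing of hypothetical XML with a stock parser.
    import xml.etree.ElementTree as ET

    doc = """<dataset>
      <taxon name="taxon1"><score state="Leaves ovate" value="0"/></taxon>
      <taxon name="taxon2"><score state="Leaves ovate" value="1"/></taxon>
    </dataset>"""

    # Every record comes back labelled and self-describing.
    for taxon in ET.fromstring(doc).iter("taxon"):
        for score in taxon.iter("score"):
            print(taxon.get("name"), score.get("state"), score.get("value"))

So yes, the parsing is free; the question is whether that convenience outweighs the verbosity.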
The problem to my mind is that in current formats, e.g. DELTA and LucID, much information is implied by context. Thus, in
1010
0101
the taxon and character-state identities are implied by the position of each data bit in the matrix. In XML, this information is made verbosely explicit.
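To spell that out: decoding the positional matrix is trivial, but only if the external taxon and character lists travel with it; the meaning is recovered from position, not carried in the data. A sketch:

    # Sketch: in a positional matrix the identities live outside the data.
    taxa = ["taxon1", "taxon2"]
    states = ["Leaves ovate", "Leaves elliptic", "Flowers blue", "Flowers yellow"]
    matrix = ["1010", "0101"]

    for taxon, row in zip(taxa, matrix):
        for state, bit in zip(states, row):
            if bit == "1":
                print(taxon, "has", state)  # meaning recovered from position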
Is the following true? Once upon a time, computers could represent, but not efficiently analyse or process, textual data; hence documents were stored as text while "data" were stored as matrices and the like. Now XML has blurred the boundary between these two kinds of information ("textual" and "data"), and we're exploring the implications of that blurring. But are there now no differences at all, and no further need for a matrix?
A final point. It seems to me that the discussion so far has been dominated by computer nuts (no offence intended, we need you!), with relatively little input from the community of taxonomists whose needs this standard is supposed to serve. Most of these will be happy to use a well-crafted tool, but won't want to know the intricacies of XML, and will be put off if things are not quite straightforward. This is a danger. We may end up with a wonderful, sophisticated, state-of-the-art, cutting-edge standard that nobody uses.
Cheers - k