It's How the Data will be Used that Counts

Jim Croft jrc at ANBG.GOV.AU
Wed Dec 5 07:47:19 CET 2001

Kevin wrote:
> I seem to be in a minority of one here again, but I'll continue to argue my
> case for a bit longer.

why not... we have been doing this for several years now, a few more
email iterations won't hurt... :)

> 1. Exactly, it's easier to go from data to +/- natural language, which is
> precisely why we need to try hard to facilitate the reverse.

Which is not all that difficult... to mark up a single description
manually is a relatively trivial task... but to process a lot
automatically is another matter... and to do it in a way that the whole
community accepts is another matter entirely...  :)

> 2. If we can effectively embed fully parsable data in a natural-language
> paragraph, why not?

because it creates mixed content (legal, but evil) where the data is
parsable, but not structured. If we cannot represent our data universe
in a database in an elegant and easily understood way we may not
necessarily have failed, but we will have fallen short of the target by
a large margin.
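
To make the "parsable, but not structured" point concrete, here is a
minimal Python sketch. The element and attribute names (description,
char, state, frequency) are invented purely for illustration and are not
part of DELTA, Lucid or any proposed standard. It shows a qualifier such
as 'rarely' falling into the loose text between tags in mixed content,
whereas a structured record has to give it an explicit home.

```python
import xml.etree.ElementTree as ET

# Hypothetical mixed-content markup: data embedded in a prose sentence.
mixed = ET.fromstring(
    '<description>Leaves <char name="leaf_shape">ovate</char>, rarely '
    '<char name="leaf_shape">lanceolate</char>, to '
    '<char name="leaf_length_mm">40</char> mm long.</description>'
)

# Hypothetical structured equivalent: every fact has its own slot.
structured = ET.fromstring(
    '<taxon>'
    '<char name="leaf_shape" state="ovate"/>'
    '<char name="leaf_shape" state="lanceolate" frequency="rarely"/>'
    '<char name="leaf_length_mm" max="40"/>'
    '</taxon>'
)


def tagged_values(root):
    """Return (character, value) pairs that a parser can reach via tags."""
    return [(c.get("name"), c.text) for c in root.iter("char")]


def loose_text(root):
    """Return the text fragments that sit outside any <char> element."""
    pieces = [root.text or ""] + [c.tail or "" for c in root.iter("char")]
    return [p.strip() for p in pieces if p.strip()]


if __name__ == "__main__":
    print(tagged_values(mixed))  # the marked-up values only
    print(loose_text(mixed))     # 'rarely' and the units end up here
    print(structured.findall("char")[1].get("frequency"))
```

In the mixed version an application sees two equally weighted leaf shapes
unless it also interprets the surrounding prose; in the structured version
the frequency is data like everything else.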

In fact, DELTA already does this sort of thing by allowing <freeform
comments> to be liberally appended and prepended all over the place. While
this makes for quasi-readable descriptions, authors often embed
interesting character data in the comments, making it unusable by other
applications, even within the DELTA suite.  The 'rarely' and
'misinterpreted' scoring options in Lucid might be seen as an attempt
to capture some of this information.

> 3. If a structured data document based on our standard is a subset of a
> marked-up description based on the standard, then creating a standard that
> can support the latter gives us the best of both worlds. If it can be done,
> why not?

I agree...  this is a laudable aim... but I must admit to never thinking
of a structured document as a subset of a marked up description -
oftentimes the reverse may be the case... Isn't it better to think of
both structured documents and marked up descriptions as being subsets
of the standard we are trying to create?

> Personally I think that creating an XML representation of structured data
> would be a doddle.

But as we have seen, getting everyone to agree that one person's way
of doing it is the one true and proper path to descriptive enlightenment
is no easy task...

> Creating a fully parsable but lossless XML representation
> of a natural language description (which hence can also handle the degraded
> case of structured data) - now that would really be something to write home
> about!

Well, dreaming about it at least... I think we are dealing with two
different, and I fear irreconcilable things here...  descriptions by
their very nature are lossy - they are abstract representations of the
gestalt of a sample of a taxon, often with an arbitrary word limit,
attempting to portray in a familiar format what an author thinks a taxon
looks like.  Structured documents such as DELTA and Lucid at least have
the potential to store everything that is remotely interesting about
every taxon in the set and often come close to achieving it in reality.
So what if the resulting descriptions do not have the poetic beauty of
a Shakespearian sonnet; at least the information will be there and
retrievable... In the case of biological description, beauty is not
necessarily truth, or at least the whole truth...

> Anyone else out there +/- agree with me, or should I give up now?

Don't do that... if you do, you will never have anything to write home
about...


