It's How the Data will be Used that Counts

Kevin Thiele kevin.thiele at BIGPOND.COM
Tue Dec 4 17:50:22 CET 2001

----- Original Message -----
From: "Steve Shattuck" <Steve.Shattuck at CSIRO.AU>
Sent: Tuesday, December 04, 2001 11:31 AM
Subject: It's How the Data will be Used that Counts

| Kevin's representation is too focused on text descriptions.  A more
| representation might be:
| <character name="leaf">
|   <state>present</state>
|   <character name="leaf margin">
|     <state>serrate</state>
|     <character name = "tooth orientation">
|       <state>forward-pointing</state>
|     </character>
|   </character>
| </character>

I would argue that this is a less complete representation, because you've
abstracted the original data further than I have.

I won't use the "markup" from my last email for "Leaf margins serrate with
forward-pointing teeth", as this was designed to exemplify one limited
problem (how to expand a description if characters are nested) and wasn't
actually marked up in the way I proposed for challenge case 1. Two possible
ways I'd do this example are:

"Leaf margins serrate with forward-pointing teeth"

{Using the rule "<state>s cannot have <character>s as siblings"}

 <Feature name = "Presence" value = "Present">
 <Feature Name="marginal toothing">margins
 </Feature> with
 <Feature Name = "tooth orientation">
  <State>forward-pointing</State> teeth

{Relaxing the rule so that <state>s can have <character>s as siblings}

 <Feature name = "Presence" value = "Present">
 <Feature Name="marginal toothing">margins
  <Value>serrate</Value> with
  <Feature Name = "tooth orientation">
   <State>forward-pointing</State> teeth

Your proposal is:

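<character name="leaf">
  <state>present</state>
  <character name="leaf margin">
    <state>serrate</state>
    <character name = "tooth orientation">
      <state>forward-pointing</state>
    </character>
  </character>
</character>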

The score:

1. Can we parse from these the data atoms "leaf = present", "leaf margin =
serrate" and "tooth orientation = forward-pointing"?

Kevin's = Yes  Steve's = Yes

2. Can we easily retrieve from these the original natural language string?

Kevin's = Yes  Steve's = No

On this scoring I'm one up. Then again, yours would be slightly easier to
parse than mine, so we're probably equal. What's most important here? Dunno.
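
The two scoring questions can be tried out mechanically. What follows is only
a rough sketch in Python, under loud assumptions: the MIXED string is a
hypothetical, well-formed completion of the mixed-content style (the nesting,
the closing tags and the lower-case attribute names are assumed for
illustration, not the markup as originally proposed), and the helper names
atoms() and plain_text() are invented here. DATA_ONLY is the data-centric
representation quoted at the top of this message.

```python
import xml.etree.ElementTree as ET

# ASSUMPTION: hypothetical well-formed completion of the mixed-content style.
# Free text sits between the tags; the implied state is carried as an attribute.
MIXED = (
    '<Feature name="leaf" value="present">Leaf '
    '<Feature name="leaf margin">margins <State>serrate</State> with '
    '<Feature name="tooth orientation"><State>forward-pointing</State>'
    ' teeth</Feature></Feature></Feature>'
)

# The data-centric representation, as quoted at the top of the message.
DATA_ONLY = """
<character name="leaf">
  <state>present</state>
  <character name="leaf margin">
    <state>serrate</state>
    <character name = "tooth orientation">
      <state>forward-pointing</state>
    </character>
  </character>
</character>
"""


def atoms(root):
    """Question 1: list (feature, state) pairs from the mixed-content form."""
    pairs = []
    for feature in root.iter("Feature"):
        state = feature.find("State")  # direct child only
        value = feature.get("value") if state is None else state.text
        pairs.append((feature.get("name"), value))
    return pairs


def plain_text(root):
    """Question 2: concatenate all text nodes and normalise whitespace."""
    return " ".join("".join(root.itertext()).split())


mixed_root = ET.fromstring(MIXED)
data_root = ET.fromstring(DATA_ONLY)

print(atoms(mixed_root))
print(plain_text(mixed_root))
print(plain_text(data_root))
```

Under these assumptions the mixed-content form gives back both the three data
atoms and the sentence "Leaf margins serrate with forward-pointing teeth",
while the data-only form gives back just the state words, so the sentence
would have to be regenerated by rule.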

Further, it seems to me that yours is a subset of mine: a Schema that
allowed mine would also allow yours, but not vice versa.

| In my original, DELTA-centric model I used a <description> tag to try and
| capture the text description information separate from the <state>
| information.  My thinking was that these two
| requirements/approaches/viewpoints are too distinct to cram together
| falling into the same trap as the current DELTA Standard (which is a
| least-common denominator approach).

Yes, we could tag the bits of free-form text. But is there any need? They
will (by definition) be ignored by all processors except natural-language
ones. Since the NL is fully retrievable from my model, why not
leave them untagged? In your model, they would need to be tagged since the
model does not represent a natural description - it represents abstracted
data from which a description can be +/- created.

| The problem here is that the phrase "Leaf margins serrate with
| forward-pointing teeth" concerns 3 characters (leaf, margin and teeth) and
| 1 implied and 2 expressed states (present, serrate and forward pointing),
| the characters being dependent (and therefore the context containing
| significant information - we know that 'teeth' have something to do with
| 'serrate' which has something to do with 'leaf margins' - the leaves being
| present because we're describing them).  There's a lot of logic involved
| in parsing this.  I can't think of a simple way of representing all this
| complex information without separating it at some level.  Kevin's
| represents the text description and mine the underlying data, but neither
| works well for the other.

I agree - there's still too much complex logic even in the very simple types
of examples we're using so far. We need somehow to step back further to even
more basic examples to tease these issues out.

Cheers - k
