Morphological Data Representation

Mon Nov 26 12:31:40 CET 2001

First, I see Kevin has proposed a process slightly different to the one I
started on Friday.  I would suggest that we focus on one of these rather
than running both at the same time.  I don't think it matters which one we
follow as they both have pros and cons.  My idea was to start simple and
build as we go while Kevin's is to scope the project first then fill in the
details.  Any strong views on the process?

Guillaume commented that he's "always surprised seeing people recommend
using proprietary stuff" in response to my suggestion of using Microsoft's
XML Notepad (which, by the way, is actually free).  The point I was trying
to make was that XML can get complicated very quickly and using an XML viewer
(of any sort) is better than using a text editor.  Nothing more.

He also commented that my example "was closely related to delta (as xdelta),
whereas everyone seems to agree on building something from scratch".  I
fully agree, but I would suggest that DELTA has it basically right and that
whatever we develop will look suspiciously like DELTA.  For years
taxonomists have talked about things called "characters" with "states" and
have used these to build "descriptions" (= DELTA attributes).  DELTA is
firmly founded in taxonomic practices and since this is driving this process
I would be surprised if they diverge too far.  In other words there's a
reason DELTA has been as broadly accepted as it has and we shouldn't ignore
this.

Guillaume's final comment, about the use of XML elements and attributes, is
important but I still think it can wait.  There often isn't a clear
distinction between information that "has real-world meaning" and that which
is "modelling artefacts."  ["One person's data is another person's
metadata."]  The fact that "default XSLT transformation enforces this by
outputting elements contents and ignoring attributes" is too
application-specific.  Many people process XML using DOM tools and they
shouldn't be constrained just because XSLT does it another way.
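
As a rough illustration of the DOM point, here is a minimal Python sketch
using the standard xml.dom.minidom module.  The element and attribute names
("character", "name", "state") are hypothetical and not part of any proposal;
the only point is that a DOM tool reads attribute values and element content
with equal ease.

```python
from xml.dom import minidom

# Hypothetical fragment: one fact held in an attribute, one in element content.
DOC = '<character name="clear spots in wings"><state>present</state></character>'


def read_both(xml_text):
    """Return (attribute value, element text) from the fragment."""
    root = minidom.parseString(xml_text).documentElement
    from_attribute = root.getAttribute("name")
    # The first <state> element has a single text-node child holding its content.
    from_element = root.getElementsByTagName("state")[0].firstChild.data
    return from_attribute, from_element


print(read_both(DOC))
```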

Leigh's comments are good and worth a detailed look.  His
model/representation/syntax (or whatever you want to call it) of the same
data I used (see http://www.bath.ac.uk/~ccslrd/delta/lep.xml) is exactly the
kind of thing I had in mind.  Does his representation make more sense than
the one I proposed?  What are the strengths/weaknesses of our approaches?
Does one allow us to get to where we want to be?  Again, I don't think the
exact syntax is important at this stage.  For example, both models describe
the meaning of the numbers present in item descriptions for numeric
characters, Leigh as "<value start="0.5" end="0.55" />" and myself as
"<CodedState StateID="C3S1" Value="0.5" /> <CodedState StateID="C3S2"
Value="0.55" />" with the meaning stored with the state rather than the item
description.  At the syntax level these are very different but at the
modelling level they are the same - the same information is being managed
(and both differ from the current DELTA standard in this regard).  We will
need to work on the syntax but let's get the model agreed first, then worry
about specific syntax.
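
To make the "different syntax, same model" point concrete, here is a minimal
Python sketch that reads both fragments into the same (start, end) pair.  It
assumes the attribute values are quoted so the fragments parse, wraps each in
a hypothetical <item> element, and uses a made-up STATE_ROLES table to stand
in for "the meaning stored with the state"; none of these names come from
either proposal.

```python
import xml.etree.ElementTree as ET

# Leigh's style: the meaning (start/end) is carried by the attribute names.
LEIGH = '<item><value start="0.5" end="0.55" /></item>'

# My style: the item holds coded states and the meaning of each state is
# looked up elsewhere.  The <item> wrapper is hypothetical.
MINE = ('<item><CodedState StateID="C3S1" Value="0.5" />'
        '<CodedState StateID="C3S2" Value="0.55" /></item>')

# Assumed for illustration: which role each state plays in the range.
STATE_ROLES = {"C3S1": "start", "C3S2": "end"}


def range_from_leigh(xml_text):
    """Read the numeric range directly from the start/end attributes."""
    value = ET.fromstring(xml_text).find("value")
    return (float(value.get("start")), float(value.get("end")))


def range_from_mine(xml_text, roles):
    """Read the numeric range by resolving each state's role first."""
    found = {}
    for state in ET.fromstring(xml_text).findall("CodedState"):
        found[roles[state.get("StateID")]] = float(state.get("Value"))
    return (found["start"], found["end"])


# Both syntaxes yield the same model-level information.
print(range_from_leigh(LEIGH) == range_from_mine(MINE, STATE_ROLES))
```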

>However, we must agree on model extend: will it concerns only concept
>description (aka: characters) or also case description (aka: items) ? IMHO,
>only the first one can be generalized, or we'll have to validate the case
>description twice: against a generic model and against its concept.
>
>For using an example, if i have a description of the characters of
>Pociloporidae familiy, and a description of the items of Pociloporidae
>family, i'll have to make sure characters are really characters (validating
>against generic character model), to make sure items are really items
>(validating against generic items model), and make sure Pociloporidae items
>are really Pociloporida (validating items against characters). I would
>prefer to have only to validate characters against a generic character
>model, and validate items against characters, meaning using a character
>description as a suitable model for items description.

I believe this misses the goals of this project in a number of important
ways and we should avoid going down this path.  It seems to mix the (i)
processing of the data with (ii) the data representation with (iii)
taxonomic work practices.  I'm very uncomfortable going there.

Finally, Peter's concerns are important for the next step, expanding the
proposed representation to include information not currently managed.  One
of the strong recommendations from SDD Round 1 was to manage raw data.  This
needs to be housed under the summarized data (in this case, the actual
measurements under the ratios).  We WILL need to do this eventually.
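
One way to picture "raw data housed under the summarized data" is the
following Python sketch.  The element names, attribute names and numbers are
all invented for illustration and are not a proposed syntax; the idea is only
that the measurements sit beneath the ratio they produce, so the summary can
be checked against them.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout: raw measurements nested under the ratio they yield.
# All names and numbers here are made up for illustration.
SUMMARY = ('<ratio value="0.5">'
           '<measurement role="numerator" value="1.0" />'
           '<measurement role="denominator" value="2.0" />'
           '</ratio>')


def ratio_matches_raw(xml_text, tolerance=1e-9):
    """Check that the stored ratio agrees with the raw measurements under it."""
    ratio = ET.fromstring(xml_text)
    raw = {m.get("role"): float(m.get("value"))
           for m in ratio.findall("measurement")}
    derived = raw["numerator"] / raw["denominator"]
    return abs(derived - float(ratio.get("value"))) <= tolerance


print(ratio_matches_raw(SUMMARY))
```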

Peter also pointed out possibly our largest challenge.  He noted that
"having clear spots in wings is not very precise if the data is to be
comparable beyond the group in question - which I suppose is part of the
goal."  At every TDWG meeting I've been to we
decide that we can't build standards for specific character values and yet
at every TDWG meeting I've been to we try to build standards for specific
character values.  We need to build mechanisms to allow sharing of character
lists across projects IF THOSE PROJECTS WANT TO USE THIS FEATURE.  If
projects don't want to share character lists, for whatever reason, then
they won't, no matter how important we think it is to do so.

<soapbox on>
I think this focus on "standard character lists" is very much a "plant
thing."  In animals it would never occur to me to use "clear spots in wings"
for anything other than the local context for which it was established.  No
one would suggest that "clear spots in wings" in bees has anything to do
with "clear spots in wings" in butterflies and trying to use a single
character coding for this would receive limited support at best.  I think
the problem is that it's common to talk about "identifying a plant" but very
rare to talk about "identifying an animal."  There are no "faunas" that are
equivalent to "floras."  This very fundamental difference between plants and
animals and the way people view them has a huge impact on this very
development - it's one of the reasons that the botanical community has
accepted DELTA much more strongly than the zoological community.  While a
"global flora" is a completely reasonable goal, a "global fauna" isn't even
a faint blip on distant radar.  Zoologists work in relative isolation
compared to botanists and have very different work practices and needs.
Meeting the needs of both of these communities will be a significant
challenge, one I'm not sure we can meet in a single set of tools.
<soapbox off>

Thanks, Steve

Steve Shattuck
CSIRO Entomology
steve.shattuck at csiro.au