Morphological Data Representation

Sun Nov 25 21:28:58 CET 2001

Steve Shattuck writes:
 > Date:         Mon, 26 Nov 2001 12:31:40 +1100
 > From: Steve Shattuck <Steve.Shattuck at csiro.au>
 > To: TDWG-SDD at usobi.org
 > Subject:      Re: Morphological Data Representation
 >
 > First, I see Kevin has proposed a process slightly different to the one I
 > started on Friday.  I would suggest that we focus on one of these rather
 > than running both at the same time.  I don't think it matters which one we
 > follow as they both have pros and cons.  My idea was to start simple and
 > build as we go while Kevin's is to scope the project first then fill in the
 > details.  Any strong views on the process?

The "data challenges" process was what was agreed in Sydney. Also, it
corresponds closely to processes generally regarded as successful in
software design, namely Use Case analysis, though it focuses on the
use and not the user (in the sense of both human and software).

 >
 > Guillaume commented that he's "always surprised seeing people recommend
 > using proprietary stuff" in response to my suggestion of using Microsoft's
 > XML Notepad (which, by the way, is actually free).  The point I was trying
 > to make was that XML can get complicated very quickly and using an XML viewer
 > (of any sort) is better than using a text editor.  Nothing more.

Also, MSIE is a reasonable XML viewer, albeit not an editor.

 >
 > He also commented that my example "was closely related to delta (as xdelta),
 > whereas everyone seems to agree on building something from scratch".  I
 > fully agree, but I would suggest that DELTA has it basically right and that
 > whatever we develop will look suspiciously like DELTA.  For years
 > taxonomists have talked about things called "characters" with "states" and
 > have used these to build "descriptions" (= DELTA attributes).  DELTA is
 > firmly founded in taxonomic practices and since this is driving this process
 > I would be surprised if they diverge too far.  In other words there's a
 > reason DELTA has been as broadly accepted as it has and we shouldn't ignore
 > this.

In Sydney it was agreed that DELTA experience and functionality should
not be ignored, i.e. we should not start from scratch. [I don't happen
to agree with that because of the large community outside TDWG that is
already heavily involved in descriptive data if only informally. But
it is what was decided, and it does leverage the experience of most of
SDD. The trick will be to not arrive at something that only models DELTA]

 >
 > Guillaume's final comment, about the use of XML elements and attributes, is
 > important but I still think it can wait.  There often isn't a clear
 > distinction between information that "has real-world meaning" and that which
 > is "modelling artefacts."  ["One person's data is another person's
 > metadata."]  The fact that "default XSLT transformation enforces this by
 > outputting elements contents and ignoring attributes" is too application
 > specific.  Many people process XML using DOM tools and they shouldn't be
 > constrained just because XSLT does it another way.

Not only can it wait, but these things are possibly technical enough
that they shouldn't initially be on this mailing list. We have nearly
finished installing vBulletin forum software and will offer to operate
a forum, separate from this list, for discussion of the technical bits.
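
For anyone who wants the element/attribute point above in concrete
terms, here is a minimal sketch. It uses Python's ElementTree as a
stand-in for a DOM-style tool; the fragment and its names are made up
for illustration, and the "text only" step merely imitates what XSLT's
built-in rules do when no template matches.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: one value held as element content, one as an attribute.
doc = ET.fromstring('<character id="C3"><name>wing length</name></character>')

# Roughly what XSLT's built-in rules emit when no template matches:
# the text content of elements, with attributes left out.
text_only = "".join(doc.itertext())

# A tree-style API reaches attributes just as easily as element content.
char_id = doc.get("id")
name = doc.find("name").text

print(text_only)      # element content only
print(char_id, name)  # attribute and element content together
```

So the element/attribute choice only constrains people who rely on the
default transformation; tree-based processing sees both.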

 >
 > Leigh's comments are good and worth a detailed look.  His
 > model/representation/syntax (or whatever you want to call it) of the same
 > data I used (see http://www.bath.ac.uk/~ccslrd/delta/lep.xml) is exactly the
 > kind of thing I had in mind.  Does his representation make more sense than
 > the one I proposed?  What are the strengths/weaknesses of our approaches?
 > Does one allow us to get to where we want to be?  Again, I don't think the
 > exact syntax is important at this stage.  For example, both models describe
 > the meaning of the numbers present in item descriptions for numeric
 > characters, Leigh as "<value start="0.5" end="0.55" />" and myself as
 > "<CodedState StateID=C3S1 Value=0.5 /> <CodedState StateID=C3S2 Value=0.55
 > />" with the meaning stored with the state rather than the item description.
 > At the syntax-level these are very different but at the modelling-level they
 > are the same - the same information is being managed (and both differ from
 > the current DELTA-standard in this regard).  We will need to work on the
 > syntax but let's get the model agreed to first, then worry about specific
 > syntax.
 >
 >
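
To make the syntax-versus-model point concrete, here is a small sketch
that reads both of the quoted fragments into one and the same
structure. Several things are my assumptions and not part of either
proposal: the <item> wrapper, the NumericRange type, the quotes added
around the CodedState attribute values (needed for well-formed XML),
and the reading of the two coded states as the low and high ends of a
range.

```python
import xml.etree.ElementTree as ET
from typing import NamedTuple


class NumericRange(NamedTuple):
    """Syntax-independent model of one numeric character value (hypothetical)."""
    low: float
    high: float


# Range-style syntax, as quoted above; the <item> wrapper is an assumption.
RANGE_XML = '<item><value start="0.5" end="0.55" /></item>'

# State-style syntax, with attribute values quoted so that it parses.
STATE_XML = (
    '<item>'
    '<CodedState StateID="C3S1" Value="0.5" />'
    '<CodedState StateID="C3S2" Value="0.55" />'
    '</item>'
)


def from_range_syntax(xml_text: str) -> NumericRange:
    """Read the range from the start/end attributes of a single element."""
    value = ET.fromstring(xml_text).find("value")
    return NumericRange(float(value.get("start")), float(value.get("end")))


def from_state_syntax(xml_text: str) -> NumericRange:
    """Read the range from coded states, assuming they mark its two ends."""
    values = [float(s.get("Value"))
              for s in ET.fromstring(xml_text).iter("CodedState")]
    return NumericRange(min(values), max(values))


# Both syntaxes yield the same model object.
print(from_range_syntax(RANGE_XML) == from_state_syntax(STATE_XML))
```

If the model is agreed first, each syntax only needs its own small
reader like the two above.
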
 > >However, we must agree on the model's extent: will it concern only concept
 > >description (aka: characters) or also case description (aka: items)? IMHO,
 > >only the first one can be generalized, or we'll have to validate the case
 > >description twice: against a generic model and against its concept.
 > >

To the extent these are separable, there would probably be less
dispute about character representation. Probably this means that lots
of case description models have to fit on the same character model.

 > >To use an example, if I have a description of the characters of the
 > >Pociloporidae family, and a description of the items of the Pociloporidae
 > >family, I'll have to make sure characters are really characters (validating
 > >against a generic character model), make sure items are really items
 > >(validating against a generic items model), and make sure Pociloporidae items
 > >are really Pociloporidae (validating items against characters). I would
 > >prefer to have only to validate characters against a generic character
 > >model, and validate items against characters, meaning using a character
 > >description as a suitable model for item descriptions.

As a non-biologist, this sounds right to me.
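
A rough sketch of what "validate items against characters" could mean
in practice, with the character list itself acting as the model for
item descriptions. The data layout, the IDs and the validate_item
function are all hypothetical, invented only to illustrate the idea.

```python
from typing import Dict, List, Set

# Hypothetical character list: character ID -> set of allowed state IDs.
CharacterList = Dict[str, Set[str]]
# Hypothetical item description: character ID -> recorded state IDs.
Item = Dict[str, List[str]]


def validate_item(item: Item, characters: CharacterList) -> List[str]:
    """Return a list of problems; an empty list means the item is valid."""
    problems = []
    for char_id, states in item.items():
        if char_id not in characters:
            problems.append(f"unknown character {char_id}")
            continue
        for state in states:
            if state not in characters[char_id]:
                problems.append(f"character {char_id} has no state {state}")
    return problems


characters = {"C1": {"S1", "S2"}, "C2": {"S1", "S2", "S3"}}
good_item = {"C1": ["S1"], "C2": ["S2", "S3"]}
bad_item = {"C1": ["S9"], "C7": ["S1"]}

print(validate_item(good_item, characters))
print(validate_item(bad_item, characters))
```

Here there is no separate generic items model at all: an item is
acceptable exactly when everything it records is defined by the
character list it refers to.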

 >
 > I believe this misses the goals of this project in a number of important
 > ways and we should avoid going down this path.  It seems to mix the (i)
 > processing of the data with (ii) the data representation with (iii)
 > taxonomic work practices.  I'm very uncomfortable going there.

 >
 >
 > Finally, Peter's concerns are important for the next step, expanding the
 > proposed representation to include information not currently managed.  One
 > of the strong recommendations from SDD Round 1 was to manage raw data.  This
 > needs to be housed under the summarized data (in this case, the actual
 > measurements under the ratios).  We WILL need to do this eventually.
 >
 > Peter also pointed out possibly our largest challenge.  He noted that "having
 > clear spots in wings is not very precise if the data is to be comparable
 > beyond the group in question - which I suppose is part of the goal."  At
 > every TDWG meeting I've been to we decide that we can't build standards for
 > specific character values and yet at every TDWG meeting I've been to we try
 > to build standards for specific character values.  We need to build
 > mechanisms to allow sharing of character lists across projects IF THOSE
 > PROJECTS WANT TO USE THIS FEATURE.  If projects don't want to share character
 > lists, for whatever reason, then they won't, no matter how important we think
 > it is to do so.

I agree with this. Whether sharing character lists matters may be a
function of the purpose of the list. For example, paper field guides
to a given group of taxa often have a far greater commonality of
description of characters than they do of characters. And they often
come equipped with a character metadata section that explains how to
use the characters.

 >
 > <soapbox on>
 > I think this focus on "standard character lists" is very much a "plant
 > thing."  In animals it would never occur to me to use "clear spots in wings"
 > for anything other than the local context for which it was established.  No
 > one would suggest that "clear spots in wings" in bees has anything to do
 > with "clear spots in wings" in butterflies and trying to use a single
 > character coding for this would receive limited support at best.  I think
 > the problem is that it's common to talk about "identifying a plant" but very
 > rare to talk about "identifying an animal."  There are no "faunas" that are
 > equivalent to "floras."  This very fundamental difference between plants and
 > animals and the way people view them has a huge impact on this very
 > development - it's one of the reasons that the botanical community has
 > accepted DELTA much more strongly than the zoological community.  While a
 > "global flora" is a completely reasonable goal, a "global fauna" isn't even
 > a faint blip on distant radar.  Zoologists work in relative isolation
 > compared to botanists and have very different work practices and needs.
 > Meeting the needs of both of these communities will be a significant
 > challenge, one I'm not sure we can meet in a single set of tools.
 > <soapbox off>
 >
 > Thanks, Steve
 >
 > Steve Shattuck
 > CSIRO Entomology
 > steve.shattuck at csiro.au
 >