First, I see Kevin has proposed a process slightly different from the one I started on Friday. I would suggest that we focus on one of these rather than running both at the same time. I don't think it matters which one we follow, as both have pros and cons. My idea was to start simple and build as we go, while Kevin's is to scope the project first and then fill in the details. Any strong views on the process?
Guillaume commented that he's "always surprised seeing people recommend using proprietary stuff" in response to my suggestion of using Microsoft's XML Notepad (which, by the way, is actually free). The point I was trying to make was that XML can get complicated very quickly and that using an XML viewer (of any sort) is better than using a text editor. Nothing more.
He also commented that my example "was closely related to delta (as xdelta), whereas everyone seems to agree on building something from scratch". I fully agree, but I would suggest that DELTA has it basically right and that whatever we develop will look suspiciously like DELTA. For years taxonomists have talked about things called "characters" with "states" and have used these to build "descriptions" (= DELTA attributes). DELTA is firmly founded in taxonomic practices and since this is driving this process I would be surprised if they diverge too far. In other words there's a reason DELTA has been as broadly accepted as it has and we shouldn't ignore this.
Guillaume's final comment, about the use of XML elements and attributes, is important but I still think it can wait. There often isn't a clear distinction between information that "has real-world meaning" and that which is "modelling artefacts." ["One person's data is another person's metadata."] The fact that "default XSLT transformation enforces this by outputting elements contents and ignoring attributes" is too application-specific to settle the question. Many people process XML using DOM tools and they shouldn't be constrained just because XSLT does it another way.
Leigh's comments are good and worth a detailed look. His model/representation/syntax (or whatever you want to call it) of the same data I used (see http://www.bath.ac.uk/~ccslrd/delta/lep.xml) is exactly the kind of thing I had in mind. Does his representation make more sense than the one I proposed? What are the strengths/weaknesses of our approaches? Does one allow us to get to where we want to be? Again, I don't think the exact syntax is important at this stage. For example, both models describe the meaning of the numbers present in item descriptions for numeric characters, Leigh as "<value start="0.5" end="0.55" />" and myself as "<CodedState StateID="C3S1" Value="0.5" /> <CodedState StateID="C3S2" Value="0.55" />" with the meaning stored with the state rather than the item description. At the syntax level these are very different, but at the modelling level they are the same: the same information is being managed (and both differ from the current DELTA standard in this regard). We will need to work on the syntax, but let's get the model agreed to first, then worry about specific syntax.
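To make the "different syntax, same model" point concrete, here is a minimal sketch in Python that reads both fragments and reduces them to the same range. It is only an illustration, not a proposed tool. The wrapping <Item> element and the STATE_ROLES table (which says that C3S1 is the lower bound and C3S2 the upper bound) are my assumptions for the example; in my representation that meaning would live with the state definitions.

```python
import xml.etree.ElementTree as ET

# Leigh's style: the range sits directly on the value element.
LEIGH = '<value start="0.5" end="0.55" />'

# My style: two coded states. <Item> is a hypothetical wrapper added only
# so the fragment has a single root element.
MINE = (
    '<Item>'
    '<CodedState StateID="C3S1" Value="0.5" />'
    '<CodedState StateID="C3S2" Value="0.55" />'
    '</Item>'
)

# Assumed meaning of each state; stored with the states, not with the item.
STATE_ROLES = {"C3S1": "start", "C3S2": "end"}


def range_from_leigh(xml_text):
    """Read the range from start/end attributes."""
    el = ET.fromstring(xml_text)
    return {"start": float(el.get("start")), "end": float(el.get("end"))}


def range_from_coded_states(xml_text, roles):
    """Read the range by looking up what each coded state means."""
    root = ET.fromstring(xml_text)
    return {
        roles[cs.get("StateID")]: float(cs.get("Value"))
        for cs in root.iter("CodedState")
    }


# Both syntaxes carry the same information.
print(range_from_leigh(LEIGH) == range_from_coded_states(MINE, STATE_ROLES))
```

Once the data is read, nothing downstream needs to know which syntax it came from, which is why I would settle the model before arguing about the syntax.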
However, we must agree on the extent of the model: will it concern only concept description (aka characters) or also case description (aka items)? IMHO, only the first one can be generalized, or we'll have to validate the case description twice: against a generic model and against its concept.
As an example, if I have a description of the characters of the Pociloporidae family, and a description of the items of the Pociloporidae family, I'll have to make sure characters are really characters (validating against a generic character model), make sure items are really items (validating against a generic item model), and make sure Pociloporidae items are really Pociloporidae (validating items against characters). I would prefer to have only to validate characters against a generic character model, and validate items against characters, meaning using a character description as a suitable model for item descriptions.
I believe this misses the goals of this project in a number of important ways and we should avoid going down this path. It seems to mix the (i) processing of the data with (ii) the data representation with (iii) taxonomic work practices. I'm very uncomfortable going there.
Finally, Peter's concerns are important for the next step, expanding the proposed representation to include information not currently managed. One of the strong recommendations from SDD Round 1 was to manage raw data. This needs to be housed under the summarized data (in this case, the actual measurements under the ratios). We WILL need to do this eventually.
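As a rough illustration of what "raw data housed under the summarized data" could look like, here is a small Python sketch. All of the names (RawMeasurement, SummarizedRatio, the specimen labels) and the numbers are made up for the example; this is an assumption about one possible shape, not a proposal for the actual representation.

```python
from dataclasses import dataclass, field


@dataclass
class RawMeasurement:
    """One specimen's actual measurements (hypothetical fields)."""
    specimen: str
    length: float
    width: float

    @property
    def ratio(self):
        return self.length / self.width


@dataclass
class SummarizedRatio:
    """A ratio character whose summary is derived from the raw data under it."""
    character: str
    raw: list = field(default_factory=list)

    def summary(self):
        """Return the (min, max) ratio across all raw measurements."""
        ratios = [m.ratio for m in self.raw]
        return min(ratios), max(ratios)


# Made-up measurements, for illustration only.
head = SummarizedRatio(
    "length/width ratio",
    [
        RawMeasurement("specimen-1", 1.0, 2.0),
        RawMeasurement("specimen-2", 1.1, 2.0),
    ],
)
low, high = head.summary()
print(low, high)
```

The point is only that the summary can always be regenerated from the measurements, whereas the measurements cannot be recovered from the summary.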
Peter also pointed out what is possibly our largest challenge. He noted that "having clear spots in wings is not very precise if the data is to be comparable beyond the group in question - which I suppose is part of the goal." At every TDWG meeting I've been to we decide that we can't build standards for specific character values, and yet at every TDWG meeting I've been to we try to build standards for specific character values. We need to build mechanisms to allow sharing of character lists across projects IF THOSE PROJECTS WANT TO USE THIS FEATURE. If projects don't want to share character lists, for whatever reason, then they won't, no matter how important we think it is to do so.
<soapbox on> I think this focus on "standard character lists" is very much a "plant thing." In animals it would never occur to me to use "clear spots in wings" for anything other than the local context for which it was established. No one would suggest that "clear spots in wings" in bees has anything to do with "clear spots in wings" in butterflies, and trying to use a single character coding for this would receive limited support at best. I think the problem is that it's common to talk about "identifying a plant" but very rare to talk about "identifying an animal." There are no "faunas" that are equivalent to "floras." This very fundamental difference between plants and animals and the way people view them has a huge impact on this very development - it's one of the reasons that the botanical community has accepted DELTA much more strongly than the zoological community. While a "global flora" is a completely reasonable goal, a "global fauna" isn't even a faint blip on distant radar. Zoologists work in relative isolation compared to botanists and have very different work practices and needs. Meeting the needs of both of these communities will be a significant challenge, one I'm not sure we can meet in a single set of tools. <soapbox off>
Thanks, Steve
Steve Shattuck CSIRO Entomology steve.shattuck@csiro.au