Morphological Data Representation
Below and attached is a first attempt at representing simple, common DELTA-type data in an XML-based structure. I've used a selected set of characters and items from the Butterfly sample data on the DELTA web site. The DELTA-formatted data looks like this:
==== DELTA-Standard CHARS File ====
*SHOW: Lepidoptera demonstration characters. Revised 28-AUG-91.
*CHARACTER LIST
#1. main colour of inner part of front wing/ 1. white/ 2. cream/ 3. grey/ 4. brown/ 5. black/ 6. yellow/ 7. orange/ 8. blue/ 9. green/
#2. wings <transparency>/ 1. with transparent areas/ 2. without transparent areas/
#3. length of front wing/ mm/
#4. antennae <length>/ times length of front wing/
==== DELTA-Standard ITEMS File ====
*SHOW: Lepidoptera demonstration items. Revised 18-OCT-94.
*ITEM DESCRIPTIONS
# Antheraea/ 1,4 2,2 3,43-50 4,0.15-0.2
# Ethmia/ 1,2-4 2,2 3,11-14 4,0.6-0.65
# Graphium/ 1,1-2/9 2,2 3,29-33 4,0.45-0.5
# Hecatesia/ 1,4 2,1<small, translucent window>/2 3,11-14 4,0.8-0.9
==== DELTA-Standard SPECS File ====
*SHOW: Lepidoptera demonstration specifications. Revised 28-AUG-91.
*NUMBER OF CHARACTERS 4
*MAXIMUM NUMBER OF STATES 9
*MAXIMUM NUMBER OF ITEMS 4
*CHARACTER TYPES 3,RN 4,RN
*NUMBERS OF STATES 1,9
==== For these files, the DELTA-generated natural language would look something like this:
Antheraea Main colour of inner part of front wing brown. Wings without transparent areas. Length of front wing 43-50 mm. Antennae 0.15-0.2 times length of front wing.
Ethmia Main colour of inner part of front wing cream to brown. Wings without transparent areas. Length of front wing 11-14 mm. Antennae 0.6-0.65 times length of front wing.
Graphium Main colour of inner part of front wing white to cream, or green. Wings without transparent areas. Length of front wing 29-33 mm. Antennae 0.45-0.5 times length of front wing.
Hecatesia Main colour of inner part of front wing brown. Wings with transparent areas (small, translucent window), or without transparent areas. Length of front wing 11-14 mm. Antennae 0.8-0.9 times length of front wing.
==== Hand-generated natural language would be essentially the same except for the last item, where it might look more like this:
Hecatesia Main colour of inner part of front wing brown. Wings with or without transparent areas (when present, forming a small window). Length of front wing 11-14 mm. Antennae 0.8-0.9 times length of front wing.
============================================
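As a rough illustration of how the attribute strings in the ITEMS file map onto the generated descriptions, here is a minimal Python sketch. It is not the DELTA software's own algorithm: the CHARACTERS table is transcribed by hand from the CHARS file above (with the <...> comments dropped), the function names are made up for this example, and it only handles what appears in these four items ("-" for "to", "/" for "or", and a <...> comment attached to a state). It assumes comments never contain "-" or "/".

```python
import re

# Character definitions transcribed by hand from the CHARS file above.
CHARACTERS = {
    1: {"feature": "main colour of inner part of front wing",
        "states": ["white", "cream", "grey", "brown", "black",
                   "yellow", "orange", "blue", "green"]},
    2: {"feature": "wings",
        "states": ["with transparent areas", "without transparent areas"]},
    3: {"feature": "length of front wing", "units": "mm"},
    4: {"feature": "antennae", "units": "times length of front wing"},
}

# "char,value" pairs; a value may include <comments> that contain spaces.
ATTRIBUTE = re.compile(r"(\d+),((?:<[^>]*>|[^\s<])+)")
COMMENT = re.compile(r"<([^>]*)>")


def parse_item(line):
    """Split '# Name/ attributes' into (name, {char_no: value_string})."""
    name, _, attributes = line.lstrip("# ").partition("/")
    return name.strip(), {int(n): v for n, v in ATTRIBUTE.findall(attributes)}


def describe(char_no, value):
    """Render one attribute as a phrase (no capital letter or full stop)."""
    char = CHARACTERS[char_no]
    if "units" in char:  # numeric character: value is a range such as 43-50
        return f"{char['feature']} {value} {char['units']}"
    alternatives = []
    for alt in value.split("/"):  # "/" separates alternatives ("or")
        comment = COMMENT.search(alt)
        codes = COMMENT.sub("", alt).split("-")  # "-" marks a range ("to")
        text = " to ".join(char["states"][int(c) - 1] for c in codes)
        if comment:
            text += f" ({comment.group(1)})"
        alternatives.append(text)
    return f"{char['feature']} " + ", or ".join(alternatives)


def natural_language(line):
    """Build the full description for one line of the ITEMS file."""
    name, attributes = parse_item(line)
    sentences = []
    for char_no, value in attributes.items():
        text = describe(char_no, value)
        sentences.append(text[0].upper() + text[1:] + ".")
    return f"{name} " + " ".join(sentences)


print(natural_language(
    "# Hecatesia/ 1,4 2,1<small, translucent window>/2 3,11-14 4,0.8-0.9"))
```

Run on the four item lines above, this should reproduce the DELTA-style descriptions; the hand-written Hecatesia phrasing is exactly the kind of output it cannot produce.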
I've translated this into the attached XML file. (Even this fairly simple example is moderately large, so I would recommend using an XML viewer such as Microsoft's XML Notepad when working with it; see http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnxml/html/xmlpaddownload.asp.)
The basic structure of the attached file is:
<Morphology>
  <Characters>
    <Character> - ANY NUMBER
      <Type>
      <ID>
      <Short_Descriptor>
      <Long_Descriptor>
      <State_Descriptor>
      <Order>
      <State> - ANY NUMBER
        <Value>
        <ID>
        <Order>
  <Items>
    <Item> - ANY NUMBER
      <Type>
      <ID>
      <Descriptor>
      <CodedCharacter> - ANY NUMBER
        <CharacterID>
        <Description>
        <CodedState> - ANY NUMBER
          <StateID>
          <Value>
Note that I've treated everything as elements and haven't used attributes. Simplicity is the only reason for this and some elements would be better as attributes; these can be converted when the dust settles.
I've tried to generalise as much as possible and use only two main elements: <Character> and <Item>. Each character is assigned a <Type> that says what kind it is (ordered multistate, unordered multistate, real, integer, etc.). Similarly, the item <Type> can be specified as taxon or specimen (or potentially something else).
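To make the outline more concrete, the following Python sketch builds one numeric character and one coded item using the element names listed above. The attached file is not reproduced here, so the element contents are assumptions made for illustration: the type names "real" and "taxon", the ID and Order numbering, and the use of "minimum" and "maximum" as state values for a numeric character. ET.indent needs Python 3.9 or later.

```python
import xml.etree.ElementTree as ET


def add(parent, tag, text=None):
    """Append a child element, optionally with text, and return it."""
    child = ET.SubElement(parent, tag)
    if text is not None:
        child.text = str(text)
    return child


morphology = ET.Element("Morphology")

# One numeric character: "length of front wing", measured in mm.
characters = add(morphology, "Characters")
character = add(characters, "Character")
add(character, "Type", "real")  # assumed type name
add(character, "ID", 3)
add(character, "Short_Descriptor", "length of front wing")
add(character, "Long_Descriptor", "length of front wing")
add(character, "State_Descriptor", "mm")  # units of the numeric character
add(character, "Order", 3)
# For a numeric character the states say what each number means (assumed values).
for state_id, meaning in [(1, "minimum"), (2, "maximum")]:
    state = add(character, "State")
    add(state, "Value", meaning)
    add(state, "ID", state_id)
    add(state, "Order", state_id)

# One item coded for that character: Antheraea, front wing 43-50 mm.
items = add(morphology, "Items")
item = add(items, "Item")
add(item, "Type", "taxon")  # assumed type name
add(item, "ID", 1)
add(item, "Descriptor", "Antheraea")
coded = add(item, "CodedCharacter")
add(coded, "CharacterID", 3)
add(coded, "Description", "Length of front wing 43-50 mm.")
for state_id, value in [(1, 43), (2, 50)]:
    coded_state = add(coded, "CodedState")
    add(coded_state, "StateID", state_id)  # points at minimum / maximum
    add(coded_state, "Value", value)

ET.indent(morphology)  # pretty-print; Python 3.9+
print(ET.tostring(morphology, encoding="unicode"))
```

Keeping everything as child elements mirrors the elements-only choice described above; moving ID or Type into attributes later would only change the add calls.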
A couple of points are probably worth making:
The <Short_Descriptor> and <Long_Descriptor> are used to support DELTA comments. This probably needs to be generalised further to support any number of alternate phrasings.
<State_Descriptor> is used for the units of numeric characters and isn't needed (?) for other character types - it's an attempt at keeping <Character> general.
The <Description> element in <CodedCharacter> is used to house natural language representations. Codes for the states (when needed) are placed in square brackets and translated during generation. As noted above, this may need to be generalised to support any number of phrasings.
In <CodedState>, <Value> is used to hold numeric values, with the <State> of the <Character> defining what the number means (minimum, maximum, etc.) rather than its placement in the attribute string, as in the DELTA standard. This element won't (?) be needed for multistate characters.
I think/hope the remainder is fairly clear.
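The square-bracket codes mentioned for <Description> could be expanded during generation roughly as follows. This is only a sketch: it assumes each code is a state ID written as a number in brackets, such as [2], which the notes above do not spell out, and the STATES table and expand function are hypothetical names for this example.

```python
import re

# State ID -> state text for character 1 (main colour), from the CHARS file.
STATES = {1: "white", 2: "cream", 3: "grey", 4: "brown", 5: "black",
          6: "yellow", 7: "orange", 8: "blue", 9: "green"}


def expand(description, states):
    """Replace each [n] in a description with the text of state n."""
    return re.sub(r"\[(\d+)\]",
                  lambda match: states[int(match.group(1))],
                  description)


print(expand("Main colour of inner part of front wing [2] to [4].", STATES))
```

Anything outside the brackets passes through untouched, which is what lets a hand-written phrasing sit alongside the coded states.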
** The Next Step **
I would suggest the following path from here:
1) Make sure the above representation makes sense for the data given.
2) Expand the above data to support LucID-specific requirements (without adding additional complexity).
Once this is finished we can:
Add additional DELTA features (dependencies, default values, etc.)
Add more complex data sets and examples
Add new features on our assorted "wish lists"
I look forward to comments and forward progress!
Thanks, Steve
Steve Shattuck CSIRO Entomology biolink@ento.csiro.au