Re: Morphological Data Representation
----- Original Message -----
From: "Steve Shattuck" Steve.Shattuck@CSIRO.AU
To: TDWG-SDD@USOBI.ORG
Sent: Friday, November 23, 2001 2:33 PM
Subject: Morphological Data Representation
| Below and attached are a first attempt at representing simple, common
| DELTA-type data in an XML-based structure. I've used a selected set of
| characters and items from the Butterfly sample data on the DELTA web site.
| The DELTA-formatted data looks like this:
|
|
| ==== DELTA-Standard CHARS File ====
|
| *SHOW: Lepidoptera demonstration characters. Revised 28-AUG-91.
|
| *CHARACTER LIST
|
| #1. main colour of inner part of front wing/
|    1. white/
|    2. cream/
|    3. grey/
|    4. brown/
|    5. black/
|    6. yellow/
|    7. orange/
|    8. blue/
|    9. green/
|
| #2. wings <transparency>/
|    1. with transparent areas/
|    2. without transparent areas/
|
| #3. length of front wing/
|    mm/
|
| #4. antennae <length>/
|    times length of front wing/
|
|
| ==== DELTA-Standard ITEMS File ====
|
| *SHOW: Lepidoptera demonstration items. Revised 18-OCT-94.
|
| *ITEM DESCRIPTIONS
|
| # Antheraea/
| 1,4 2,2 3,43-50 4,0.15-0.2
|
| # Ethmia/
| 1,2-4 2,2 3,11-14 4,0.6-0.65
|
| # Graphium/
| 1,1-2/9 2,2 3,29-33 4,0.45-0.5
|
| # Hecatesia/
| 1,4 2,1<small, translucent window>/2 3,11-14 4,0.8-0.9
|
|
| ==== DELTA-Standard SPECS File ====
|
| *SHOW: Lepidoptera demonstration specifications. Revised 28-AUG-91.
|
| *NUMBER OF CHARACTERS 4
| *MAXIMUM NUMBER OF STATES 9
| *MAXIMUM NUMBER OF ITEMS 4
|
| *CHARACTER TYPES 3,RN 4,RN
|
| *NUMBERS OF STATES 1,9
|
|
|
|
| ==== For these files, the DELTA-generated natural language would look
| something like this:
|
| Antheraea
| Main colour of inner part of front wing brown. Wings without transparent
| areas. Length of front wing 43-50 mm. Antennae 0.15-0.2 times length of
| front wing.
|
| Ethmia
| Main colour of inner part of front wing cream to brown. Wings without
| transparent areas. Length of front wing 11-14 mm. Antennae 0.6-0.65 times
| length of front wing.
|
| Graphium
| Main colour of inner part of front wing white to cream, or green. Wings
| without transparent areas. Length of front wing 29-33 mm. Antennae 0.45-0.5
| times length of front wing.
|
| Hecatesia
| Main colour of inner part of front wing brown. Wings with transparent areas
| (small, translucent window), or without transparent areas. Length of front
| wing 11-14 mm. Antennae 0.8-0.9 times length of front wing.
|
|
|
|
| ==== Hand-generated natural language would be essentially the same except
| for the last item, where it might look more like this:
|
| Hecatesia
| Main colour of inner part of front wing brown. Wings with or without
| transparent areas (when present, forming a small window). Length of front
| wing 11-14 mm. Antennae 0.8-0.9 times length of front wing.
|
|
| ============================================
|
| I've translated this into the XML file that is attached. (Even this fairly
| simple example is moderately large and I would recommend using an XML viewer
| such as Microsoft's XML Notepad when working with it - see
| http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnxml/html/xmlpaddownload.asp.)
|
| The basic structure of the attached file is:
|
| <Morphology>
|   <Characters>
|     <Character> - ANY NUMBER
|       <Type>
|       <ID>
|       <Short_Descriptor>
|       <Long_Descriptor>
|       <State_Descriptor>
|       <Order>
|       <State> - ANY NUMBER
|         <Value>
|         <ID>
|         <Order>
|   <Items>
|     <Item> - ANY NUMBER
|       <Type>
|       <ID>
|       <Descriptor>
|       <CodedCharacter> - ANY NUMBER
|         <CharacterID>
|         <Description>
|         <CodedState> - ANY NUMBER
|           <StateID>
|           <Value>
|
|
| Note that I've treated everything as elements and haven't used attributes.
| Simplicity is the only reason for this and some elements would be better as
| attributes; these can be converted when the dust settles.
|
| I've tried to generalise as much as possible and use only two main elements:
| <character> and <item>. Each character is assigned a <type> that tells what
| it is (ordered multistate, unordered multistate, real, integer, etc).
| Similarly the item <type> can be specified as taxon or specimen (or
| potentially something else).
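
As a rough illustration of the element-only structure outlined above, the following Python sketch builds one character (#2, wing transparency) and one item (Hecatesia) with the standard-library `xml.etree.ElementTree` module. The element names come from the outline; everything else is an assumption made for illustration and is not taken from the attached file: the helper names `add_child` and `build_example`, the ID values, the text used for `<Type>`, and the way the DELTA comment is split between `<Short_Descriptor>` and `<Long_Descriptor>`.

```python
import xml.etree.ElementTree as ET


def add_child(parent, tag, text=None):
    """Append a child element and optionally set its text (hypothetical helper)."""
    child = ET.SubElement(parent, tag)
    if text is not None:
        child.text = str(text)
    return child


def build_example():
    """Build a tiny <Morphology> tree with one character and one item."""
    root = ET.Element("Morphology")

    # One character: DELTA character #2, "wings <transparency>".
    characters = add_child(root, "Characters")
    character = add_child(characters, "Character")
    add_child(character, "Type", "unordered multistate")  # assumed type label
    add_child(character, "ID", 2)                         # assumed ID scheme
    add_child(character, "Short_Descriptor", "wings")
    add_child(character, "Long_Descriptor", "wings <transparency>")
    add_child(character, "Order", 2)
    state_values = ["with transparent areas", "without transparent areas"]
    for number, value in enumerate(state_values, start=1):
        state = add_child(character, "State")
        add_child(state, "Value", value)
        add_child(state, "ID", number)
        add_child(state, "Order", number)

    # One item: Hecatesia, coded "2,1/2" in the DELTA items file.
    items = add_child(root, "Items")
    item = add_child(items, "Item")
    add_child(item, "Type", "taxon")
    add_child(item, "ID", 4)                              # assumed ID scheme
    add_child(item, "Descriptor", "Hecatesia")
    coded = add_child(item, "CodedCharacter")
    add_child(coded, "CharacterID", 2)
    for state_id in (1, 2):
        coded_state = add_child(coded, "CodedState")
        add_child(coded_state, "StateID", state_id)
    return root


if __name__ == "__main__":
    print(ET.tostring(build_example(), encoding="unicode"))
```

Because the outline uses elements rather than attributes, each `<CodedState>` refers back to its `<State>` only through the shared ID text, so a consumer would have to resolve those references itself.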
|
| A couple of points are probably worth making:
|
| The <Short_Descriptor> and <Long_Descriptor> are used to support DELTA
| comments. This probably needs to be generalised further to support any
| number of alternate phrasings.
|
| <State_Descriptor> is used for the units of numeric characters and isn't
| needed (?) for other character types - it's an attempt at keeping
| <character> general.
|
| The <Description> element in <CodedCharacter> is used to house natural
| language representations. Codes for the states (when needed) are placed in
| square brackets; these are translated during generation. As noted above,
| this may need to be generalised to support any number of phrasings.
|
| In <CodedState>, <Value> is used to hold numeric values, the <character>
| <state> being used to define what the number means (minimum, maximum, etc.),
| rather than using placement in the attribute string as in the DELTA
| standard. This element won't (?) be needed for multistate characters.
|
| I think/hope the remainder is fairly clear.
|
|
| ** The Next Step **
|
| I would suggest the following path from here:
|
| 1) Make sure the above representation makes sense for the data given.
|
| 2) Expand the above data to support LucID-specific requirements (without
| adding additional complexity).
|
| Once this is finished we can:
|
| Add additional DELTA features (dependencies, default values, etc.)
|
| Add more complex data sets and examples
|
| Add new features on our assorted "wish lists"
|
|
| I look forward to comments and forward progress!
|
|
| Thanks, Steve
|
| Steve Shattuck
| CSIRO Entomology
| biolink@ento.csiro.au
|
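
The point above about `<Value>` in `<CodedState>` (the state says what a number means, instead of its position in the DELTA attribute string) can be made concrete with a small Python sketch. It turns a simple DELTA numeric attribute such as `3,43-50` into explicitly labelled values. The function name `parse_numeric_attribute` and the labels `minimum`, `maximum` and `value` are assumptions for illustration only; the sketch handles just the plain `n,a` and `n,a-b` forms used in the sample items and ignores comments, negative numbers and the other DELTA extensions.

```python
def parse_numeric_attribute(attribute):
    """Split a simple DELTA numeric attribute into labelled values.

    Returns (character_id, [(label, value_text), ...]). The labels stand in
    for <StateID> and the value text for <Value>; both names are assumptions.
    """
    char_id, separator, value_part = attribute.partition(",")
    if not separator or not value_part:
        raise ValueError(f"not a 'character,value' attribute: {attribute!r}")

    bounds = value_part.split("-")
    for bound in bounds:
        float(bound)  # raises ValueError if a bound is not numeric

    if len(bounds) == 1:
        labelled = [("value", bounds[0])]
    elif len(bounds) == 2:
        labelled = [("minimum", bounds[0]), ("maximum", bounds[1])]
    else:
        raise ValueError(f"unsupported range form: {attribute!r}")
    return int(char_id), labelled


if __name__ == "__main__":
    # Numeric attributes of Antheraea from the sample ITEMS file.
    for attribute in ("3,43-50", "4,0.15-0.2"):
        print(parse_numeric_attribute(attribute))
```

Keeping the values as text avoids changing how they were written in the source data (for example `43` rather than `43.0`); a real converter would need to decide how to type them.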
participants (1)
-
Kevin Thiele