TDWG subgroup: Structure of Descriptive Data, subgroup session at the TDWG meeting in Frankfurt, 12. Nov. 2000 ## Version 1 ##
== Participants ==
Stan Blum, Jim Croft, Gregor Hagedorn, Nicholas Lander, Bob Morris, Jörg Ochsmann, Richard Pankhurst, Jean-Marc Vanel, Mark Watson, Greg Whitbread, and others
== Discussion topic: What is this group about? ==
Bob Morris: The group is not interested in resource development. Nicholas Lander: Nor in standard character lists. There was a TDWG subgroup on Standard Characters for plants (Richard Pankhurst was convener), which did not succeed in drawing up a standard character list. Richard Pankhurst thinks we should put this off for a further 5 years, until Delta-like technology is widely used and interest is sufficient.
== Discussion topic: SDD future and presentation ==
The convener, Gregor Hagedorn, apologized for his lack of initiative during the year, having been overwhelmed with work in the starting phase of the GLOPP project. He is willing to continue as convener, but also willing to step aside if somebody else wants to fill the position and bring more initiative into the group.
It was agreed that there should be an SDD website pointing to information resources concerning the activities of the group. Nicholas Lander will ask Ben Richardson at Perth. Alternatively, if Gregor Hagedorn continues as convener, he can host the site on the www.DiversityCampus.net server of the GLOPP project.
Most participants agreed that a true SDD workshop should be organized, lasting at least two full days, to clarify the issues brought up during the email discussions and to reach real understanding. It was agreed that the email discussions are extremely good and useful, but that it is difficult to fully assess where a consensus was reached and where the discussion simply tapered off. Also, certain concepts are easier to discuss in person, using graphical presentations.
Such a workshop was considered desirable in the near future, at the latest immediately before or after the next TDWG meeting. It should be organized in conjunction with some other event, to minimize travel costs in both time and money. Another possible date would be before or after a workshop planned by the TDWG subgroup on Accession Data for the spring of 2001.
The more technical discussions regarding implementation choices in XML should be clearly marked on the discussion list with the string "XML". Perhaps a sub-subgroup might be formed to resolve questions in this area?
== Discussion topic: Resource discovery ==
Bob Morris: We need to distinguish between questions that hinge on resource cost and questions that hinge on biological problems. Resource cost: does it pay to query a given information source to find out whether it holds the information? It should be easy to know in advance whether a server may hold the information I am looking for. New important development reported by Bob Morris: Universal Description, Discovery and Integration (UDDI). Nothing about resource cost has been discussed so far, but the group may address this issue. Some information about this should be in the header of a new XML file standard. It was agreed that this question is secondary to the interest
Information that should be present in the header of the files:
Taxonomic scope: What are the taxa I hold data on? Either a list, or a query definition (query -> "sorry, I hold no data")
Do I have descriptions? (e.g. unparsed natural language)
Do I have structured descriptions? Which kind, e.g. DELTA/NEXUS/XML? Or: if so, point me to where the structure of the description is defined
Do I have media (images)?
Do I have keys or other means of identification?
Do I have interactive keys?
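The header questions listed above could be captured in a small record that a client inspects before querying a server. The following is only a sketch: the class name `ResourceHeader`, all field names, and the example taxa are hypothetical and were not proposed at the session.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ResourceHeader:
    """Hypothetical discovery header; every field name is illustrative."""
    taxa: List[str] = field(default_factory=list)                # taxonomic scope as a plain list
    has_descriptions: bool = False                               # e.g. unparsed natural language
    structured_formats: List[str] = field(default_factory=list)  # e.g. ["DELTA", "NEXUS", "XML"]
    structure_definition: Optional[str] = None                   # pointer to where the structure is defined
    has_media: bool = False                                      # images
    has_keys: bool = False                                       # keys or other means of identification
    has_interactive_keys: bool = False

    def may_hold(self, taxon: str) -> bool:
        """Cheap pre-check: is it worth querying this source for the taxon?"""
        return taxon in self.taxa


# Example header for an imaginary server (assumed data).
header = ResourceHeader(
    taxa=["Hedera", "Quercus"],
    has_descriptions=True,
    structured_formats=["DELTA"],
    has_media=True,
)
```

A client would call `header.may_hold("Hedera")` before issuing a full query, which addresses the resource cost question without contacting the data itself. A query definition instead of a plain taxon list would need a richer representation than this sketch offers.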
== Discussion topic: Who is interested in a new standard for descriptive data? ==
We distinguished between 3 user/provider interests:
1) The pure user of electronic field guides, applying the descriptions e.g. to identification work.
2) The pure data provider, who holds large amounts of descriptive information that is historical and has the status of an unchanging publication.
3) The data developing scientist, who creates and analyses/uses his or her own data. In contrast to case 2, any data here must be referenced to individual sources, and any information is open to being contested as false.
Case 2 occurs e.g. in projects where legacy information such as huge conventional flora or fauna works is digitized and is to be accessed in a structured way. In cases 1 and 2 the data can be transformed in ways that may lead to some loss of information (e.g. concerning the source of individual assertions: "leaves are glandulous when young: observed by Author1, 1980"). Case 3, however, requires a rigid structure to allow knowledge management and data validation. Case 3 is the case assumed in programs like the CSIRO Delta programs, Pankey/Pandora, or DeltaAccess, which see themselves as working tools for scientists.
== Discussion topic: XML markup as proposed by Kevin Thiele ==
Bob Morris raised the question whether Kevin Thiele's proposal would be enough to start with, as a base to which additional structural information can be added. Greg Whitbread thinks yes, but character types should be added.
Gregor Hagedorn remarked that, while he had no objection to Kevin's principal argument about the advantages of providing a very simple system that could be used to mark up existing information, it was difficult to say whether the system would be compatible with a more structured approach as long as the structured approach is still very vague. However, it may be wise to move ahead with Kevin's approach, at the risk of having to redefine it later.
It was agreed that it is desirable to have a common standard which defines levels of structure. Gregor Hagedorn remarked that software may require at least a certain structural level, e.g. a structured database may require coded markup of character and feature, referring to a full character schema. Levels should be clearly labeled, so that an application can easily detect whether there is sufficient structure for its needs. However, levels should be known not only to the software but also to users, so that scientists can communicate: "I can give you this, is this OK for you?" Problems like those with versions of the DELTA standard (the format changes, but the version is not recognizable for importing software), or in the graphics area the TIFF problem (the standard defines an envelope, which may contain any kind of information, including proprietary formats unreadable for other software), should be avoided. Possible levels could be:
Level 0: The description is marked up as a block referring to a certain taxon. No markup of structures, methodology, or features.
Level 1: Level 0 plus markup (not necessarily complete) of structures (leaf, flower, ...) or method (naked eye, hand lens, light microscope, scanning electron microscope)
Level 2: Level 1 plus markup of characters (i.e. structure/methodology/feature), but not character states
Level 3: Level 2 plus markup of states
Level 4: Level 3, fully coded markup referring to separate character definition/schema
Are more levels needed? Or a more orthogonal scheme, with complete/incomplete markup noted separately?
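Because each proposed level includes the one below it, an application could detect whether a document carries sufficient structure with a single comparison. The sketch below assumes the five levels exactly as listed. The names `MarkupLevel` and `is_sufficient` are hypothetical, and the more orthogonal scheme raised as a question is not modeled.

```python
from enum import IntEnum


class MarkupLevel(IntEnum):
    """Cumulative markup levels as listed in the minutes (names are illustrative)."""
    TAXON_BLOCK = 0    # description marked up as a block for one taxon
    STRUCTURES = 1     # plus (possibly incomplete) structures or methods
    CHARACTERS = 2     # plus characters, but not character states
    STATES = 3         # plus character states
    CODED_SCHEMA = 4   # fully coded, refers to a separate character schema


def is_sufficient(document_level: MarkupLevel, required_level: MarkupLevel) -> bool:
    """Levels are cumulative, so an ordinary integer comparison is enough."""
    return document_level >= required_level
```

For example, a structured database that needs `MarkupLevel.CODED_SCHEMA` would reject a document labeled `MarkupLevel.STATES`, while a simple text viewer requiring only `MarkupLevel.TAXON_BLOCK` would accept any level.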
Gregor Hagedorn brought up the question whether the simpler, character schema-free forms of XML markup are able to cope with queries and reports in multiple languages. It seems to him that the words used in such markup carry only their English meaning, without any definition being available elsewhere. This contrasts with the DELTA method of defining characters and states and using codes that can easily be output in multiple languages concurrently.
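The advantage of coded characters and states for multilingual output can be illustrated with a toy schema. This is not DELTA syntax: the codes `C1`, `S1`, `S2`, the function `render`, and the German labels are all made up for this sketch.

```python
# Hypothetical character schema: codes are language-neutral, labels are per language.
CHARACTERS = {
    "C1": {"en": "leaf shape", "de": "Blattform"},
}
STATES = {
    ("C1", "S1"): {"en": "lobed", "de": "gelappt"},
    ("C1", "S2"): {"en": "almost round", "de": "fast rund"},
}


def render(character: str, state: str, lang: str) -> str:
    """Render one coded statement in the requested language."""
    return f"{CHARACTERS[character][lang]}: {STATES[(character, state)][lang]}"
```

A description stores only the pair `("C1", "S1")`. Calling `render("C1", "S1", "en")` and `render("C1", "S1", "de")` then produces the English and German wording from the same data, which schema-free markup using English words as tags cannot do without a separate definition.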
Nicholas Lander: There is a file format standard in chemistry, the "Star file" format (a 20-year-old but dynamic CODATA standard, now also in XML), which supplies means to define core character lists with supplements.
== Discussion topic: Discussion of future of DELTA format ==
Nicholas Lander: We need a more rigorous system. It was agreed to put effort into the original idea of developing a system that goes beyond DELTA and NEXUS, but encompasses the functionality of NEXUS and DELTA/New DELTA, and adds the additional requirements identified by LucID or DeltaAccess. Richard Pankhurst warned about the generalized system fallacy: any system that tries to fulfill too many requirements will become very complex and inherently difficult to analyze and maintain. There is a danger of creating a monster structure that nobody will actually use.
@@ Question to participants: something was discussed about the Free Delta system, but I missed that in my notes. Can anybody fill in here?
== Discussion topic: Standard character lists ==
Several members stressed the need for standard character lists in the future. It was proposed (who said that??) to start with core schemata that can be expanded as time goes on. Gregor Hagedorn proposed that standards should not necessarily be seen as concentric rings (e.g. a single standard with successive versions or levels), but perhaps rather as modular blocks that can stand side by side. For example, it may be wise to have standards for different methodologies (field observation, light microscope, SEM characters, chemical compounds) developed and maintained by different standards bodies. A given description could choose from the standard modules as necessary for the observations or studies made.
Gregor Hagedorn: Standard character schemata should be developed like scientific publications, so that they can be developed and improved by scientists over the course of several years. Standardization efforts should begin only after the contours of competing schemata have become clear.
== Discussion topic: Character vs. Structure / feature ==
The following discussion happened after a break made necessary by the collision of Bob Morris's presentation with the discussion. Many participants were absent during this part.
Jean-Marc Vanel presented his views about structural analysis of character data. It was first agreed that it is preferable to use the general term "structure" rather than the term "organ" proposed by Jean-Marc Vanel. Vanel proposed that structures can be hierarchical or primary/secondary. We found, however, that structures do not necessarily have a clear hierarchy. If the same type of hairs exists on both the stem and the leaf, it is not sufficient to place hairs outside the group containing both stem and leaf. It is possible that characters and structures have overlapping hierarchies that cannot be resolved into a simple tree. For example, the same hairs may occur on many different structures, which cannot be grouped hierarchically. A Ref-ID mechanism is necessary to document the relations between structures, substructures, and properties. Vanel: XML is like a Christmas tree: a basic tree with decoration connecting the branches.
Richard Pankhurst uses relational adjectives (= linguistic term): part-of relationship and kind-of relationship. Examples: A leaflet is a part of a leaf. A basal leaf is a kind of leaf; a glume is a kind of leaf, but in a more special sense.
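The part-of and kind-of relationships, together with the Ref-ID idea from the preceding discussion, can be sketched as a set of ID-based relations rather than a single tree. All identifiers (`s1` to `s5`), the relation names `part_of` and `kind_of`, and the helper `related` are hypothetical. Treating hairs as `part_of` both stem and leaf is an assumption made for this sketch.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class Structure:
    ref_id: str  # the ID other records refer to
    name: str


structures: Dict[str, Structure] = {
    s.ref_id: s
    for s in [
        Structure("s1", "stem"),
        Structure("s2", "leaf"),
        Structure("s3", "leaflet"),
        Structure("s4", "basal leaf"),
        Structure("s5", "hair"),
    ]
}

# (subject, relation, object) triples that refer to structures by ID.
relations: Set[Tuple[str, str, str]] = {
    ("s3", "part_of", "s2"),  # a leaflet is a part of a leaf
    ("s4", "kind_of", "s2"),  # a basal leaf is a kind of leaf
    ("s5", "part_of", "s1"),  # the same hairs occur on the stem ...
    ("s5", "part_of", "s2"),  # ... and on the leaf: two parents, so not a tree
}


def related(ref_id: str, relation: str) -> List[str]:
    """Sorted names of all structures that ref_id points to via the relation."""
    return sorted(
        structures[obj].name
        for subj, rel, obj in relations
        if subj == ref_id and rel == relation
    )
```

Here `related("s5", "part_of")` lists both stem and leaf, which a strictly hierarchical nesting of elements could not express without duplicating the hair description.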
Features have qualifiers. Richard Pankhurst further discussed restrictions of context: the plant is young/old; ivy leaves: when young lobed, when old almost round. Certain conditions are called epitopic: childbirth is possible only in females, and only in females that are pregnant and of appropriate age.
Gregor Hagedorn stressed the importance of methods or methodology in addition to the structure/basic property analysis performed by Diederich et al. The same character may have different states (or values/results) for different methodologies (e.g. surface rough in SEM, but smooth with hand lens). Methodology can further be split into observation method (type of apparatus used) and conditions, i.e. environmental or experimental method (soil, climate, culture media, substrate, etc.). In some cases a character can implicitly be observed only with a certain methodology.
----------------------------------------------------------
Gregor Hagedorn (G.Hagedorn@bba.de)
Institute for Plant Virology, Microbiology, and Biosafety
Federal Research Center for Agriculture and Forestry (BBA)
Koenigin-Luise-Str. 19, 14195 Berlin, Germany
Tel: +49-30-8304-2220
Fax: +49-30-8304-2203