TDWG Frankfurt: SDD subgroup meeting report

27 Nov 2000

TDWG subgroup: Structure of Descriptive Data, subgroup session at the TDWG
meeting in Frankfurt, 12 Nov 2000
## Version 1 ##

== Participants ==
Stan Blum, Jim Croft, Gregor Hagedorn, Nicholas Lander, Bob Morris, Jörg Ochsmann,
Richard Pankhurst, Jean-Marc Vanel, Mark Watson, Greg Whitbread, and others

== Discussion topic: What is this group about? ==
Bob Morris: The group is not interested in resource development. Nicholas Lander: Nor in
standard character lists. There was a TDWG subgroup on Standard Characters for plants
(Richard Pankhurst was convener), which did not succeed in drawing up a standard character
list. Richard Pankhurst thinks we should put this off for a further 5 years, until Delta-like
technology is widely used and interest is sufficient.

== Discussion topic: SDD future and presentation ==
Convener: Gregor Hagedorn apologized for his lack of initiative during the year, having been
overwhelmed with work in the starting phase of the GLOPP project. He is willing to continue
as convener, but also willing to step aside if somebody else wants to fill the position and bring
more initiative into the group.

It was agreed that there should be an SDD website pointing to information resources
concerning the activities of the group. Nicholas Lander will ask Ben Richardson in Perth;
alternatively, if Gregor Hagedorn continues as convener, he can host the site on the
www.DiversityCampus.net server of the GLOPP project.

Most participants agreed that a true SDD workshop should be organized, lasting at least two
full days, to clarify the issues brought up during the email discussions and to reach real
understanding. It was agreed that the email discussions are extremely good and useful, but that
it is difficult to fully assess where a consensus was reached and where the discussion simply
tapered off. Also, certain concepts are easier to discuss in person, using graphical
presentations.

Such a workshop was considered desirable in the near future, at the latest immediately before
or after the next TDWG meeting. It should be organized in conjunction with some other event,
to minimize the cost of traveling in terms of time and money. Another possible time would be
before or after a workshop planned by the TDWG subgroup on Accession Data for the spring of
2001.

The more technical discussions regarding implementation choices in XML should be clearly
marked on the discussion list with the string "XML". Perhaps a sub-subgroup might be formed
to resolve questions in this area?

== Discussion topic: Resource discovery ==
Bob Morris: We need to distinguish between questions that hinge on resource cost and
questions that hinge on biological problems. Resource cost: does it pay to query a given
information source to find out whether it holds the information? It should be easy to know in
advance whether a server may hold the information I am looking for. New important
development reported by Bob Morris: Universal Description, Discovery and Integration (UDDI).
Nothing about resource cost has been discussed so far, but the group may address this issue.
Some information about this should be in the header of a new XML file standard. It was agreed
that this question is secondary to the interest

Information that should be present in the header of the files:
Taxonomic scope: Which taxa do I hold data on? Either a list or a query definition (query ->
"sorry, I hold no data")
Do I have descriptions? (e.g. unparsed natural language)
  Do I have structured descriptions? Which kind, e.g. DELTA/NEXUS/XML? Or, if so, point
  me to where the structure of the description is defined.
Do I have media (images)?
Do I have keys or other means of identification?
  Do I have interactive keys?
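
The header questions above could be captured in a small record that a client inspects before
querying a server in detail. The following Python sketch is only an illustration: the class
ResourceHeader, its field names, and the helper may_hold are assumptions made here, not part
of any format agreed at the meeting.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ResourceHeader:
    """Hypothetical header record; all field names are illustrative."""
    taxa: List[str] = field(default_factory=list)  # taxonomic scope as a plain list
    has_descriptions: bool = False                 # e.g. unparsed natural language
    structured_format: Optional[str] = None        # e.g. "DELTA", "NEXUS", "XML"
    schema_location: Optional[str] = None          # where the structure is defined
    has_media: bool = False
    has_keys: bool = False
    has_interactive_keys: bool = False


def may_hold(header: ResourceHeader, taxon: str, need_structured: bool = False) -> bool:
    """Cheap pre-check: is it worth asking this source about `taxon`?"""
    if taxon not in header.taxa:
        return False  # "sorry, I hold no data"
    if need_structured and header.structured_format is None:
        return False
    return True


example = ResourceHeader(taxa=["Hedera"], has_descriptions=True,
                         structured_format="DELTA", has_media=True)
print(may_hold(example, "Hedera", need_structured=True))
print(may_hold(example, "Quercus"))
```

A query definition instead of a plain taxon list would need a richer scope field than the
simple list assumed here.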

== Discussion topic: Who is interested in a new standard for descriptive data? ==
We distinguished between 3 user/provider interests:
1) The pure user of electronic field guides, applying the descriptions e.g. to identification work.
2) The pure data provider, who holds large amounts of descriptive information that is historical
and has the status of an unchanging publication.
3) The data-developing scientist, who creates and analyses/uses his or her own data. In contrast
to case 2, any data here must be referenced to individual sources, and any piece of information
may be contested as false.

Case 2 occurs e.g. in projects where legacy information such as huge conventional flora or
fauna works is digitized and is to be accessed in a structured way. In cases 1 and 2 the data
can be transformed in ways that may lead to some loss of information (e.g. concerning the
source of individual assertions: "leaves are glandulous when young: observed by Author1,
1980"). Case 3, however, requires a rigid structure to allow knowledge management
and data validation. Case 3 is the case assumed by programs like the CSIRO Delta
programs, Pankey/Pandora, or DeltaAccess, which see themselves as working tools for
scientists.

== Discussion topic: XML markup as proposed by Kevin Thiele ==
Bob Morris raised the question whether Kevin Thiele's proposal would be enough to start with,
as a base to which additional structural information can be added. Greg Whitbread thinks yes,
but character types should be added.

Gregor Hagedorn remarked that, while he has no objection to Kevin's principal argument
about the advantages of providing a very simple system that could be used to mark up existing
information, it is difficult to say whether that system would be compatible with a more
structured approach as long as the structured approach is still very vague. However, it may be
wise to take Kevin's approach forward, at the risk of having to redefine it later.

It was agreed that it is desirable to have a common standard which defines levels of structure.
Gregor Hagedorn remarked that software may require at least a certain structural level, e.g.
a structured database may require coded markup of character and feature, referring to a full
character schema. Levels should be clearly labeled, so that an application can easily detect
whether there is sufficient structure for its needs. However, levels should be known not only to
the software but also to users, so that scientists can communicate: "I can give you this, is this
OK for you?" Problems like those with versions of the DELTA standard (the format changes, but
the version is not recognizable for importing software), or in the graphics area the TIFF problem
(the standard defines an envelope, which may contain any kind of information, including
proprietary formats unreadable for other software), should be avoided. Possible levels could be:
...
...
Level 0: The description is marked up as a block referring to a certain taxon. No markup of
structures, methodology, or features.
Level 1: Level 0 plus markup (not necessarily complete) of structures (leaf, flower, ...) or
method (naked eye, hand lens, light microscope, scanning electron microscope).
Level 2: Level 1 plus markup of characters (i.e. structure/methodology/feature), but not
character states.
Level 3: Level 2 plus markup of states.
Level 4: Level 3, fully coded markup referring to a separate character definition/schema.
Are more levels needed? Or a more orthogonal scheme, with complete/incomplete markup noted
separately?
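
Because each proposed level includes the one before it, an application could test for
sufficient structure with a simple ordering comparison. The sketch below assumes exactly the
five levels listed above; the identifier names and the helper is_sufficient are invented for
illustration.

```python
from enum import IntEnum


class MarkupLevel(IntEnum):
    """Levels 0-4 as proposed above; the identifiers are illustrative."""
    TAXON_BLOCK = 0   # description marked up as a block for a taxon
    STRUCTURES = 1    # plus structures or methods (possibly incomplete)
    CHARACTERS = 2    # plus characters, but not states
    STATES = 3        # plus character states
    CODED = 4         # fully coded, referring to a separate schema


def is_sufficient(provided: MarkupLevel, required: MarkupLevel) -> bool:
    """Each level includes the previous one, so comparing the order is enough."""
    return provided >= required


print(is_sufficient(MarkupLevel.STATES, MarkupLevel.CHARACTERS))
print(is_sufficient(MarkupLevel.STRUCTURES, MarkupLevel.CODED))
```

A more orthogonal scheme would need a separate completeness flag next to the level, which
this single ordered scale cannot express.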

Gregor Hagedorn brought up the question whether the simpler, character-schema-free forms
of XML markup are able to cope with queries and reports in multiple languages. It seems to
him that the words used in the markup carry only their English meaning, without any
definition being available elsewhere. This contrasts with the DELTA method of defining
characters and states and using codes, which can easily be output in multiple languages
concurrently.
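
The advantage of coded characters for multilingual output can be sketched as follows. The
codes, labels, and function name are hypothetical and do not follow actual DELTA syntax; the
point is only that a description stored as codes can be rendered in any language the schema
covers.

```python
# Hypothetical character schema: codes map to labels in several languages.
SCHEMA = {
    "C1": {"en": "leaf shape", "de": "Blattform"},
    "C1.1": {"en": "lobed", "de": "gelappt"},
    "C1.2": {"en": "almost round", "de": "fast rund"},
}

# A coded description stores only (character code, state code) pairs.
coded_description = [("C1", "C1.1")]


def render(description, lang):
    """Return the description as text in the requested language."""
    return "; ".join(
        f"{SCHEMA[char][lang]}: {SCHEMA[state][lang]}"
        for char, state in description
    )


print(render(coded_description, "en"))  # leaf shape: lobed
print(render(coded_description, "de"))  # Blattform: gelappt
```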

Nicholas Lander: In chemistry there is a file format standard, the "Star file" format (a
20-year-old but still evolving CODATA standard, now also available in XML), which supplies
a means of defining core character lists with supplements.

== Discussion topic: Discussion of future of DELTA format ==
Nicholas Lander: We need a more rigorous system. It was agreed to put effort into the
original idea of developing a system that goes beyond DELTA and NEXUS but encompasses
the functionality of NEXUS and DELTA/New DELTA, and adds the additional requirements
identified by LucID or DeltaAccess. Richard Pankhurst warned about the generalized-system
fallacy: any system that tries to fulfill too many requirements will become very complex and
inherently difficult to analyze and maintain. There is a danger of creating a monster structure
that nobody will actually use.

@@ Question to participants: something was discussed about the Free Delta system, but I
missed that in my notes. Can anybody fill in here?

== Discussion topic: Standard character lists ==
Several members stressed the need for standard character lists in the future. It was proposed
(who said that??) to start with core schemata that can be expanded as time goes on. Gregor
Hagedorn proposed that standards should not necessarily be seen as concentric rings (e.g. a
single standard with successive versions or levels), but perhaps rather as modular blocks that
can stand side by side. For example, it may be wise to have standards for different
methodologies (field observation, light microscope, SEM characters, chemical compounds)
developed and maintained by different standards bodies. A given description could then choose
from the standard modules as necessary for the observations or studies made.

Gregor Hagedorn: Standard character schemata should be developed like scientific publications,
so that they can be developed and improved by scientists over the course of several years. Only
after the contours of competing schemata have become clear should standardization efforts
begin.

== Discussion topic: Character vs. Structure / feature ==
The following discussion took place after a break that was necessary because Bob Morris's
presentation collided with the discussion. Many participants were absent during this part.
Jean-Marc Vanel presented his views on the structural analysis of character data. It was first
agreed that it is preferable to use the general term "structure" rather than the term "organ"
proposed by Jean-Marc Vanel. Vanel proposed that structures can be hierarchical or
primary/secondary. We found, however, that structures do not necessarily have a clear
hierarchy. If the same type of hairs exists on both the stem and the leaf, it is not sufficient
to place hairs outside the group containing both stem and leaf. It is possible that characters
and structures have overlapping hierarchies that cannot be resolved into a simple tree. For
example, the same hairs may occur on many different structures, which cannot be grouped
hierarchically. A Ref-ID mechanism is necessary to document the relations between structures,
substructures, and properties. Vanel: XML is like a Christmas tree: a basic tree with
decoration connecting the branches.
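
A Ref-ID mechanism of the kind mentioned above might look like the following sketch, in
which hairs are declared once and point to every structure they occur on. The element and
attribute names (description, structure, occursOn, id, ref) are invented for illustration and
are not taken from any proposal discussed at the meeting.

```python
import xml.etree.ElementTree as ET

# Element and attribute names are invented for illustration only.
DOC = """
<description>
  <structure id="s1" name="stem"/>
  <structure id="s2" name="leaf"/>
  <structure id="s3" name="hairs">
    <occursOn ref="s1"/>
    <occursOn ref="s2"/>
  </structure>
</description>
"""

root = ET.fromstring(DOC)
by_id = {s.get("id"): s for s in root.iter("structure")}


def occurs_on(structure_id):
    """Resolve the ref attributes of a structure to the names they point at."""
    node = by_id[structure_id]
    return [by_id[link.get("ref")].get("name") for link in node.findall("occursOn")]


print(occurs_on("s3"))  # ['stem', 'leaf']
```

The element tree itself stays flat here; the cross-references carry the overlapping
relations that a strict hierarchy cannot represent.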

Richard Pankhurst uses relational adjectives (a linguistic term): part-of relationships and
kind-of relationships. Examples: a leaflet is a part of a leaf. A basal leaf is a kind of leaf;
a glume is a kind of leaf, but in a more special sense.
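
The two relationship types can be recorded as simple triples, as in this minimal sketch. The
triple layout and the function name are assumptions for illustration, not Richard
Pankhurst's implementation.

```python
# Relations as (subject, relation, object) triples; the vocabulary is illustrative.
RELATIONS = {
    ("leaflet", "part-of", "leaf"),
    ("basal leaf", "kind-of", "leaf"),
    ("glume", "kind-of", "leaf"),
}


def related(relation, target):
    """Return, sorted, every term that stands in `relation` to `target`."""
    return sorted(s for s, r, o in RELATIONS if r == relation and o == target)


print(related("kind-of", "leaf"))  # ['basal leaf', 'glume']
print(related("part-of", "leaf"))  # ['leaflet']
```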

Features have qualifiers. Richard Pankhurst further discussed restrictions of context: whether
the plant is young or old, e.g. ivy leaves are lobed when young and almost round when old.
Certain conditions are called "epitopic": childbirth is possible only in females, and only in
females that are pregnant and of appropriate age.

Gregor Hagedorn stressed the importance of methods or methodology in addition to the
structure/basic property analysis performed by Diederich et al. The same character may have
different states (or values/results) under different methodologies (e.g. surface rough in SEM,
but smooth with a hand lens). Methodology can further be split into observation method (type
of apparatus used) and condition, i.e. environmental or experimental method (soil, climate,
culture media, substrate, etc.). In some cases a character can, implicitly, be observed only
with a certain methodology.
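
One way to keep method-dependent states apart is to store the method with every
observation, as in this sketch. The record layout and all names are assumptions made for
illustration.

```python
from typing import Dict, NamedTuple, Optional


class Observation(NamedTuple):
    """One recorded state; field names are illustrative assumptions."""
    character: str
    method: str                      # observation method (apparatus used)
    state: str
    condition: Optional[str] = None  # environmental or experimental condition


observations = [
    Observation("surface", "SEM", "rough"),
    Observation("surface", "hand lens", "smooth"),
]


def states_by_method(character: str) -> Dict[str, str]:
    """Map each observation method to the state recorded for `character`."""
    return {o.method: o.state for o in observations if o.character == character}


print(states_by_method("surface"))  # {'SEM': 'rough', 'hand lens': 'smooth'}
```
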
----------------------------------------------------------
Gregor Hagedorn (G.Hagedorn@bba.de)
Institute for Plant Virology, Microbiology, and Biosafety
Federal Research Center for Agriculture and Forestry (BBA)
Koenigin-Luise-Str. 19          Tel: +49-30-8304-2220
14195 Berlin, Germany           Fax: +49-30-8304-2203

Often wrong but never in doubt!