- tdwg-content - lists.tdwg.org

TDWG Frankfurt: SDD subgroup meeting report
by Gregor Hagedorn 27 Nov '00


TDWG subgroup: Structure of Descriptive Data, subgroup session at the TDWG meeting in Frankfurt, 12 Nov. 2000

## Version 1 ##

== Participants ==

Stan Blum, Jim Croft, Gregor Hagedorn, Nicholas Lander, Bob Morris, Jörg Ochsmann, Richard Pankhurst, Jean-Marc Vanel, Mark Watson, Greg Whitbread, and others

== Discussion topic: What is this group about? ==

Bob Morris: The group is not interested in resource development.

Nicholas Lander: Nor is it about Standard Character lists. There was a TDWG subgroup on Standard Characters for plants (Richard Pankhurst was convener), which did not succeed in drawing up a standard character list. Richard Pankhurst thinks we should put it off for a further 5 years, until Delta-like technology is widely used and interest is sufficient.

== Discussion topic: SDD future and presentation ==

Convener: Gregor Hagedorn apologized for his lack of initiative during the year, having been overwhelmed with work in the starting phase of the GLOPP project. He is willing to continue as convener, but also willing to step aside if somebody else wants to fill the position and bring more initiative into the group.

It was agreed that there should be an SDD website pointing to information resources concerning the activities of the group. Nicholas Lander will ask Ben Richardson at Perth; alternatively, if Gregor Hagedorn continues as convener, he can host the site on the www.DiversityCampus.net server of the GLOPP project.

Most participants agreed that a true SDD workshop should be organized, lasting at least 2 full days, to clarify the issues brought up during the email discussions and reach real understanding. It was agreed that the email discussions are extremely good and useful, but that it is difficult to fully assess where a consensus was reached and where the discussion simply tapered off. Also, certain concepts are easier to discuss in person, using graphical presentations.
Such a workshop was considered desirable in the near future, at the latest immediately before or after the next TDWG meeting. It should be organized in conjunction with some other event, to minimize travel costs in terms of time and money. Another possible time would be before or after a workshop planned by the TDWG subgroup on Accession Data for the spring of 2001.

The more technical discussions regarding implementation choices in XML should be clearly set apart on the discussion list by the string XML. Perhaps a sub-subgroup might be formed to resolve questions in this area?

== Discussion topic: Resource discovery ==

Bob Morris: We need to distinguish between questions that hinge on resource cost and questions that hinge on biological problems. Resource cost: does it pay to query a given information source to find out whether it holds the information? It should be easy to know in advance whether a server may hold the information I am looking for. New important development reported by Bob Morris: Uniform data description interface (UDDI).

Nothing about resource cost has been discussed so far, but the group may address this issue. Some information about this should be in the header of a new XML file standard. It was agreed that this question is secondary to the interest

Information that should be present in the header of the files:
  Taxonomic scope: What are the taxa I hold data on, either list, or query definition (query -> sorry, I hold no data)
  Do I have descriptions? (e.g. unparsed natural language)
  Do I have structured descriptions? Which kind, e.g. DELTA/NEXUS/XML? Or: if so, point me to where the structure of the description is defined
  Do I have media (images)?
  Do I have keys or other means of identification?
  Do I have interactive keys?

== Discussion topic: Who is interested in a new standard for descriptive data? ==

We distinguished between 3 user/provider interests:
1) The pure user of electronic field guides, applying the descriptions e.g. to identification work,
2) The pure data provider, who holds large amounts of descriptive information that is historical and has the status of an unchanging publication,
3) The data-developing scientist, who creates and analyses/uses his or her data. In contrast to case 2, any data here must be referenced to individual sources, and any information is open to being contested as false.

Case 2 occurs e.g. in projects where legacy information such as large conventional flora or fauna works is digitized and is to be accessed in a structured way. In cases 1 and 2 the data can be transformed in ways that may lead to some loss of information (e.g. concerning the source of individual assertions: "leaves are glandulous when young: observed by Author1, 1980"). Case 3, however, needs rigid structure to allow knowledge management and data validation. Case 3 is the case assumed in programs like the CSIRO Delta programs, Pankey/Pandora, or DeltaAccess, which see themselves as working tools for scientists.

== Discussion topic: XML markup as proposed by Kevin Thiele ==

Bob Morris raised the question whether Kevin Thiele's proposal would be enough to start with, as a base to which additional structural information can be added. Greg Whitbread thinks yes, but character types should be added. Gregor Hagedorn remarked that, while he has no objection to Kevin's principal argument about the advantages of providing a very simple system that could be used to mark up existing information, it is difficult to say whether that system would be compatible with a more structured approach as long as the structured approach is still very vague. However, it may be wise to take Kevin's approach forward, at the risk of having to redefine it later.

It was agreed that it is desirable to have a common standard which defines levels of structure. Gregor Hagedorn remarked that software may require at least a certain structural level, e.g. a structured database may require coded markup of character and feature, referring to a full character schema. Levels should be clearly labeled, so that an application can easily detect whether there is sufficient structure for its needs. However, levels should be known not only to the software but also to users, so that scientists can communicate: "I can give you this, is this ok for you?" Problems like those with versions of the DELTA standard (the format changes, but the version is not recognizable to importing software), or in the graphics area the TIFF problem (the standard defines an envelope, which may contain any kind of information, including proprietary formats unreadable by other software) should be avoided.

Possible levels could be:
>> Level 0: The description is marked up as a block referring to a certain taxon. No markup of structures, methodology, or features.
>> Level 1: Level 0 plus markup (not necessarily complete) of structures (leaf, flower, ...) or method (naked eye, hand lens, light microscope, scanning electron microscope)
>> Level 2: Level 1 plus markup of characters (i.e. structure/methodology/feature), but not character states
>> Level 3: Level 2 plus markup of states
>> Level 4: Level 3, fully coded markup referring to a separate character definition/schema

Are more levels needed? Or a more orthogonal scheme, with complete/incomplete markup noted separately?

Gregor Hagedorn brought up the question whether the simpler, character-schema-free forms of XML markup are able to cope with queries and reports in multiple languages. It seems to him that the words of the language stand only for their English meaning, without any definition being available elsewhere. This contrasts with the DELTA method of defining characters and states and using codes that can easily be output in multiple languages concurrently.
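
The labeled levels listed above could be exposed to importing software as a simple ordered value. The following is only a minimal sketch under stated assumptions: the names `MarkupLevel` and `can_import` are hypothetical and come from no existing standard, and it assumes the levels are strictly cumulative as in the Level 0 to Level 4 list.

```python
from enum import IntEnum


class MarkupLevel(IntEnum):
    """Hypothetical encoding of the structural levels listed in the report."""
    BLOCK = 0        # description marked up only as a block for a taxon
    STRUCTURES = 1   # Level 0 plus markup of structures or methods
    CHARACTERS = 2   # Level 1 plus markup of characters, without states
    STATES = 3       # Level 2 plus markup of states
    CODED = 4        # Level 3, fully coded against a separate character schema


def can_import(declared: MarkupLevel, required: MarkupLevel) -> bool:
    """Return True if a file's declared level meets an application's minimum."""
    return declared >= required


if __name__ == "__main__":
    file_level = MarkupLevel.CHARACTERS
    # A simple text browser might only need Level 0.
    print(can_import(file_level, MarkupLevel.BLOCK))
    # A structured database might need fully coded markup.
    print(can_import(file_level, MarkupLevel.CODED))
```

A more orthogonal scheme, as asked about above, would need more than one value per file (for example a level plus a separate complete/incomplete flag), which this sketch does not attempt.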
Nicholas Lander: There is a file format standard in chemistry, the "Star file" format (a 20-year-old but dynamic CODATA standard, now also in XML), which supplies means to define core character lists with supplements.

== Discussion topic: Discussion of future of DELTA format ==

Nicholas Lander: We need a more rigorous system.

It was agreed to put effort into the original idea of developing a system that goes beyond DELTA and NEXUS, but encompasses the functionality of NEXUS and DELTA/New DELTA, and adds the additional requirements identified by LucID or DeltaAccess.

Richard Pankhurst warned about the generalized system fallacy: any system that tries to fulfill too many requirements will become very complex and inherently difficult to analyze and maintain. There is a danger of creating a monster structure nobody will actually use.

@@ Question to participants: something was discussed about the Free Delta system, but I missed that in my notes. Can anybody fill in here?

== Discussion topic: Standard character lists ==

Several members stressed the need for standard character lists in the future. It was (who said that??) proposed to start with core schemata that can be expanded as time goes on.

Gregor Hagedorn proposed that standards should not necessarily be seen as concentric rings (e.g. a single standard with successive versions or levels), but perhaps rather as modular blocks that can stand side by side. For example, it may be wise to have standards for different methodologies (field observation, light microscope, SEM characters, chemical compounds) developed and maintained by different standards bodies. A given description could choose from the standard modules as necessary for the observations or studies made.

Gregor Hagedorn: Standard character schemata should be developed like scientific publications, so that they can be developed and improved by scientists over the course of several years. Only after the contours of competing schemata have become clear should standardization efforts begin.
== Discussion topic: Character vs. Structure / feature ==

The following discussion happened after a break made necessary by the collision of Bob Morris's presentation with the discussion. Many participants were absent during this part.

Jean-Marc Vanel presented his views on the structural analysis of character data. It was first agreed that it is preferable to use the general term structure rather than the term organ proposed by Jean-Marc Vanel. Vanel proposed that structures can be hierarchical or primary/secondary. We found, however, that structures do not necessarily have a clear hierarchy. If the same type of hairs exists on both the stem and the leaf, it is not sufficient to place hairs outside the group containing both stem and leaf. It is possible that characters and structures have overlapping hierarchies that cannot be resolved into a simple tree. For example, the same hairs may occur on many different structures which cannot be grouped hierarchically. A Ref-ID mechanism is necessary to document the relations between structures, substructures, and properties. Vanel: XML is like a Christmas tree: a basic tree with decoration connecting the branches.

Richard Pankhurst uses relational adjectives (= linguistic term): part-of relationship and kind-of relationship. Examples: A leaflet is a part of a leaf. A basal leaf is a kind of leaf; a glume is a kind of leaf, but in a more special sense. Features have qualifiers. Richard Pankhurst further discussed restrictions of context: the plant is young/old; ivy leaves: lobed when young, almost round when old. Certain conditions are called epitopic: childbirth is possible only in females, and only in females that are pregnant and of appropriate age.

Gregor Hagedorn stressed the importance of methods or methodology in addition to the structure/basic property analysis performed by Diederich et al. The same character may have different states (or values/results) for different methodologies (e.g. surface rough in SEM, but smooth with hand lens). Methodology can further be split into observation method (type of apparatus used) and condition, i.e. environmental or experimental method (soil, climate, culture media, substrate, etc.). In some cases a character can implicitly be observed only with a certain methodology.

----------------------------------------------------------
Gregor Hagedorn (G.Hagedorn(a)bba.de)
Institute for Plant Virology, Microbiology, and Biosafety
Federal Research Center for Agriculture and Forestry (BBA)
Koenigin-Luise-Str. 19          Tel: +49-30-8304-2220
14195 Berlin, Germany           Fax: +49-30-8304-2203
Often wrong but never in doubt!


Re: Report; correction
by Robert A. (Bob) Morris 27 Nov '00


Although I am the perpetrator of the mistake, UDDI does not stand for Universal Data Description Interface, but rather for Universal Description, Discovery and Integration. See www.uddi.org. UDDI is not about describing data, but about how you can find out where and how to invoke services, e.g. databases, on the web. It was originally designed for business-to-business service discovery; we are looking at it as a registry facility for biodiversity data sources.

Bob


Report, please
by Kevin Thiele 21 Nov '00


Will someone who went to Frankfurt please report on the meeting and discussions re the SDD - Bob or Jim? - at your leisure


Re: Report, please
by Gregor Hagedorn 21 Nov '00


Hi Kevin,

sorry, I am preparing a report, I will try to post it soon!

Gregor

> Will someone who went to Frankfurt please report on the meeting and
> discussions re the SDD - Bob or Jim? - at your leisure
>

----------------------------------------------------------
Gregor Hagedorn (G.Hagedorn(a)bba.de)
Institute for Plant Virology, Microbiology, and Biosafety
Federal Research Center for Agriculture and Forestry (BBA)
Koenigin-Luise-Str. 19          Tel: +49-30-8304-2220
14195 Berlin, Germany           Fax: +49-30-8304-2203
Often wrong but never in doubt!


Re: Delta-like descriptions for Thiele 0.3 draft
by Kevin Thiele 07 Nov '00


----- Original Message -----
From: Robert A. (Bob) Morris <ram(a)CS.UMB.EDU>
To: <TDWG-SDD(a)USOBI.ORG>
Sent: Tuesday, November 07, 2000 5:15 AM
Subject: Delta-like descriptions for Thiele 0.3 draft

| Ah, Jun Wan reminds me that uniqueness of ID attribute values is
| imposed by XML itself. So in the Delta-like descriptions, the choices
| I see are to generate unique ID's that encode the context or to use a
| different attribute than ID, perhaps something like TDWGID. In the
| former case one might have something like:
|   ID="F.1" to mean that this is an id for feature 1
|   ID="F.V.1.2" to mean that this is value 2 for feature 1
| In XSLT, these are parseable with the XPath string functions, so that
| nothing more than XSLT is needed to associate an IDREF with the object
| having the associated ID.

This is a nuisance. TDWGID I think is not a likely candidate.

Using your first option, one could use simply:
  ID="1" (id for feature 1)
  ID="1.2" (id for value 2 of feature 1)

This may then perhaps even be handy. Then again, by constraining like this it would quickly become a nightmare to add or re-arrange features or values.

Is this the first case where we're being imposed upon by XML and we start to wear the cost of its advantages? Are there any other ways out?

Cheers - k
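
To make the context-encoded IDs quoted above concrete, here is a minimal Python sketch of how an application might split them into their parts. The function and type names (`parse_id`, `ParsedId`) are hypothetical; only the `F.1` and `F.V.1.2` patterns come from the message. One point in favour of the prefixed form: in a DTD-validated document, values of attributes declared with type ID must be XML names, which cannot begin with a digit, so `F.1` would validate where a bare `1` or `1.2` would not.

```python
from typing import NamedTuple, Optional


class ParsedId(NamedTuple):
    kind: str              # "feature" or "value"
    feature: int           # feature number
    value: Optional[int]   # value number, or None for a feature id


def parse_id(raw: str) -> ParsedId:
    """Split a context-encoded id such as 'F.1' or 'F.V.1.2' into its parts."""
    parts = raw.split(".")
    if len(parts) == 2 and parts[0] == "F":
        return ParsedId("feature", int(parts[1]), None)
    if len(parts) == 4 and parts[:2] == ["F", "V"]:
        return ParsedId("value", int(parts[2]), int(parts[3]))
    raise ValueError(f"unrecognised id: {raw!r}")


if __name__ == "__main__":
    print(parse_id("F.1"))
    print(parse_id("F.V.1.2"))
```

The maintenance worry raised above still applies: because the feature number is baked into every value id, renumbering a feature means rewriting all ids derived from it.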


Re: Delta-like descriptions for Thiele 0.3 draft
by Kevin Thiele 07 Nov '00


Bob,

| We think that Kevin's example 9 the <DESCRIPTION> is really meant to be
|
| <DESCRIPTION ID = "1">
|   <FEATURE IDREF="1"> <VALUE IDREF="3">
|   <FEATURE IDREF="2"> <VALUE IDREF="1">
| </DESCRIPTION>
|
| rather than
|
| <DESCRIPTION ID = "1">
|   <FEATURE ID="1"> <VALUE ID="3">
|   <FEATURE ID="2"> <VALUE ID="1">
| </DESCRIPTION>

It should probably be:

<DESCRIPTION IDREF = "1">
  <FEATURE IDREF="1"> <VALUE IDREF="3">
  <FEATURE IDREF="2"> <VALUE IDREF="1">
</DESCRIPTION>

Since the description is also referring to a defined taxon ID.

| II. Global feature values (This may be a Delta question).
|
| In 0.3's Examples 8 and 9 there do not seem to be any global feature
| values. Instead, all possible values are completely local to
| features. For example, can there be a list of feature values
| containing both "present" and "absent" to which any feature can refer,
| whether or not those values are in that feature's local list of
| values? Or rather, must every feature that wishes to allow those
| values name them in its list of possible values?

I think all values should be specifically named. It would be silly to have a global "present", then have a feature "Flower colour" that allows this feature value.

Cheers - k
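
As an illustration of what "referring to a defined ID" means for a consumer, the sketch below resolves IDREF attributes in a description against ID attributes in the feature definitions, using Python's standard `xml.etree.ElementTree`. It is a sketch under several assumptions rather than the 0.3 draft itself: the `DATASET` wrapper, the `NAME` attribute, the taxon id `T.1`, and the nesting of `VALUE` inside `FEATURE` are all hypothetical, and the tags are closed so that the fragment is well-formed. Context-encoded ids are used because all ID values must be unique within one document.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed document; names not in the thread are assumptions.
DOC = """
<DATASET>
  <FEATURE ID="F.1" NAME="Flower colour">
    <VALUE ID="F.V.1.1" NAME="red"/>
    <VALUE ID="F.V.1.3" NAME="white"/>
  </FEATURE>
  <FEATURE ID="F.2" NAME="Leaf hairs">
    <VALUE ID="F.V.2.1" NAME="present"/>
  </FEATURE>
  <DESCRIPTION IDREF="T.1">
    <FEATURE IDREF="F.1"><VALUE IDREF="F.V.1.3"/></FEATURE>
    <FEATURE IDREF="F.2"><VALUE IDREF="F.V.2.1"/></FEATURE>
  </DESCRIPTION>
</DATASET>
"""


def resolve(xml_text):
    """Return (feature name, value name) pairs for every description."""
    root = ET.fromstring(xml_text)
    # Index every element that defines an ID.
    by_id = {el.get("ID"): el for el in root.iter() if el.get("ID")}
    pairs = []
    for desc in root.iter("DESCRIPTION"):
        for feat_ref in desc.findall("FEATURE"):
            feature = by_id[feat_ref.get("IDREF")]
            for val_ref in feat_ref.findall("VALUE"):
                value = by_id[val_ref.get("IDREF")]
                pairs.append((feature.get("NAME"), value.get("NAME")))
    return pairs


if __name__ == "__main__":
    for feature_name, value_name in resolve(DOC):
        print(feature_name, "=", value_name)
```

`ElementTree` does not validate, so the uniqueness of ids here is a convention of the sketch; a validating parser with a DTD would enforce it.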


Re: TDWG-SDD XML proposals of Kevin Thiele
by unknown＠example.com 07 Nov '00


Thanks for the gentle reminder of where we are ultimately headed.

I guess I'm expressing some disappointment relating to the "bicycle vs. space shuttle" problem discussed earlier. The BioLink team have put together quite a nice little bike and are now zooming around town having a great time. The SDD discussion seems to still be lying in a grassy field (or on the railroad tracks?) gazing at the stars and dreaming of space ships.

We're not even close to putting together a single data model that works with even a single application, forget about sharing multiple data models across multiple applications. Yes, that will come and I understand the power of it, but let's get a system that can represent DELTA/Intkey and LucID data now: the data we currently hold and the applications we currently use. By all means we should keep dreaming and planning and exploring. And when new applications appear and want to share or extend our data, then we share and extend it.

It's the classic problem of deciding when to put down the pencil and paper and pick up the hammer and saw. We've been at this for a year now; it's time to get dirty.

Thanks,
Steve Shattuck

P.S. - I'll give the DTD generator a try as soon as the current BioLink XML import finishes.


Re: TDWG-SDD XML proposals of Kevin Thiele
by unknown＠example.com 07 Nov '00


> > <taxonomy>
> >   <rank="species" value="alfari"/>
> >   <rank="genus" value="Azteca"/>
> >   <author value="Emery"/>

>What is referred to as "species" value="alfari"/ is actually the specific
>epithet. The "species" value in this instance would actually be "Azteca
>alfari" You're probably familiar with this protocol of referring to the
>scientific name of an organism which implies citing the genus name & the
>specific epithet which together comprise the species names - as per the
>Linnean system of binomial nomenclature. Should I have the 'wrong end of the
>stick' of the discourse at this stage then please advise me.

Well, the above used to be true but not any more. The latest version of the Zoological Code drops all reference to "epithet" and now calls it the "specific name" (Article 5 and Glossary under "Name" - epithet doesn't even appear in the Index). More to the point, to be absolutely proper (under the current ICZN) the above would look like this:

<taxonomy>
  <rank="species" specific_name="alfari"/>
  <rank="genus" generic_name="Azteca"/>
  <author value="Emery"/>

But this would be a particularly unclever way of doing it. If you wanted to change it I would suggest de-generalising it a bit to:

<taxonomy>
  <specific_name="alfari"/>
  <generic_name="Azteca"/>
  <author value="Emery"/>

The rank element doesn't really contribute much since you'll need to parse the "value" attribute to figure out what's going on. Might as well just get the text of the "specific_name" element directly rather than searching for the "rank" element that equals "species" and then getting the text of its "value" attribute. Yes, this change makes it more specific and less general (and is a bad thing in strict IT terms), but why make life harder than it has to be?
This way of doing it might make it harder to develop a Schema definition, but I would argue strongly that the data model should come after we know what we want to do, not before (<soap_box_on> and that's partly why this discussion has drawn out for sooooo long - the BioLink team built a complete XML representation of the BioLink database, some 400 fields in 50 tables, in 3 weeks and are now using it to move data between BioLink databases and between Platypus and BioLink - and, surprise surprise, we don't have a DTD or Schema because you don't actually need one to make this stuff work <soap_box_off/>).

If you need to add a subgenus to the name just add a new element:

<taxonomy>
  <specific_name="alfari"/>
  <subgeneric_name="Alfaridris"/>
  <generic_name="Azteca"/>
  <author value="Emery"/>

I don't think this would cause any more confusion during processing or parsing than:

<taxonomy>
  <rank="species" value="alfari"/>
  <rank="subgenus" value="Alfaridris"/>
  <rank="genus" value="Azteca"/>
  <author value="Emery"/>

If the software can't handle the first then it's unlikely to be able to do much with the second either. Yes, the Schema definition would be broken by the first method and not the second, but this is a secondary consideration to someone who wants to actually process this bit of pseudo-XML. If I don't know how to handle the element called "subgeneric_name" then I won't know how to handle a rank with a value of "subgenus" either. The point is that the application and the data are tightly integrated (more than we would like them to be) and if the two get out of synch things won't go smoothly.

Yours in confusion,
Steve Shattuck
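
For readers who want to try the two styles compared in this message, here is a small Python sketch that reads the specific name from each. The pseudo-XML in the message is not well-formed as written (an element such as `<rank="species" .../>` carries a value with no attribute name, and `<taxonomy>` is never closed), so the sketch assumes attribute names `name` and `value` and adds the closing tag; those choices are illustrative assumptions, not part of any proposal in the thread.

```python
import xml.etree.ElementTree as ET

# Generic style: one <rank> element per rank (attribute names are assumed).
GENERIC = """<taxonomy>
  <rank name="species" value="alfari"/>
  <rank name="genus" value="Azteca"/>
  <author value="Emery"/>
</taxonomy>"""

# Specific style: one element type per rank (attribute name is assumed).
SPECIFIC = """<taxonomy>
  <specific_name value="alfari"/>
  <generic_name value="Azteca"/>
  <author value="Emery"/>
</taxonomy>"""


def species_from_generic(xml_text):
    """Search the <rank> elements for the one whose name is 'species'."""
    root = ET.fromstring(xml_text)
    for rank in root.findall("rank"):
        if rank.get("name") == "species":
            return rank.get("value")
    return None


def species_from_specific(xml_text):
    """Read the <specific_name> element directly."""
    element = ET.fromstring(xml_text).find("specific_name")
    return element.get("value") if element is not None else None


if __name__ == "__main__":
    print(species_from_generic(GENERIC))
    print(species_from_specific(SPECIFIC))
```

Both functions return `None` when the rank is missing, which mirrors the point made above: software that does not know about a new rank fails in much the same way under either layout.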


Re: Delta-like descriptions for Thiele 0.3 draft
by don kirkup 07 Nov '00


from Kevin Thiele;

> | Ah, Jun Wan reminds me that uniqueness of ID attribute values is
> | imposed by XML itself. So in the Delta-like descriptions, the choices
> | I see are to generate unique ID's that encode the context or to use a
> | different attribute than ID, perhaps something like TDWGID. In the
> | former case one might have something like:
> |   ID="F.1" to mean that this is an id for feature 1
> |   ID="F.V.1.2" to mean that this is value 2 for feature 1
> | In XSLT, these are parseable with the XPath string functions, so that
> | nothing more than XSLT is needed to associate an IDREF with the object
> | having the associated ID.
>
> This is a nuisance. TDWGID I think is not a likely candidate.
>
> Using your first option, one could use simply:
>   ID="1" (id for feature 1)
>   ID="1.2" (id for value 2 of feature 1)
>
> This may then perhaps even be handy. Then again, by constraining like this
> it would quickly become a nightmare to add or re-arrange features
> or values.
>
> Is this the first case where we're being imposed upon by XML and
> we start to
> wear the cost of its advantages? Are there any other ways out?

Rather than inserting ID-type attributes within the XML, you could instead define <xsl:key> elements in an XSL stylesheet and then use the XSLT key() function to access the values. The key() function works in much the same way as the XPath id() function, but a key (unlike an id) doesn't need to be unique.

don
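
The difference between a key and an id can be shown outside XSLT as well. The Python sketch below is only an analogy, not XSLT: it builds the kind of index that an `<xsl:key>` declaration (with its `name`, `match` and `use` attributes) describes, as a dictionary from key value to a list of elements, so that one key value may select several nodes. The element and attribute names (`DATASET`, `NUM`, `NAME`) and the helper `build_key` are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical data: value numbers repeat across features, so they could
# not serve as XML ids, but they work as non-unique keys.
DOC = """<DATASET>
  <FEATURE NUM="1" NAME="Flower colour">
    <VALUE NUM="1" NAME="red"/>
    <VALUE NUM="2" NAME="white"/>
  </FEATURE>
  <FEATURE NUM="2" NAME="Leaf hairs">
    <VALUE NUM="1" NAME="present"/>
    <VALUE NUM="2" NAME="absent"/>
  </FEATURE>
</DATASET>"""


def build_key(root, tag, attr):
    """Index all elements with the given tag by one attribute (may repeat)."""
    index = defaultdict(list)
    for element in root.iter(tag):
        index[element.get(attr)].append(element)
    return index


root = ET.fromstring(DOC)
values_by_num = build_key(root, "VALUE", "NUM")
# Looking up "2" returns every VALUE numbered 2, in document order.
names = [value.get("NAME") for value in values_by_num["2"]]

if __name__ == "__main__":
    print(names)
```

To pick out one value of one feature, the lookup would still need the feature as context, for example by keying on a (feature number, value number) pair.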


Re: Delta-like descriptions for Thiele 0.3 draft
by Robert A. (Bob) Morris 07 Nov '00


Kevin Thiele writes:
 > Date: Tue, 7 Nov 2000 22:14:33 +1100

 > Bob Morris wrote:
 >
 > | II. Global feature values (This may be a Delta question).
 > |
 > | In 0.3's Examples 8 and 9 there do not seem to be any global feature
 > | values. Instead, all possible values are completely local to
 > | features. For example, can there be a list of feature values
 > | containing both "present" and "absent" to which any feature can refer,
 > | whether or not those values are in that feature's local list of
 > | values? Or rather, must every feature that wishes to allow those
 > | values name them in its list of possible values?
 >

Kevin replied
 > I think all values should be specifically named. It would be silly to have a
 > global "present", then have a feature "Flower colour" that allows this
 > feature value.
 >

Bob writes:

Having global named values doesn't imply that an application *must* use them, only that it *may* use them. Also, this is one of those places where having a schema (= DTD or X Schema) can help, because it is possible to define the schema in such a way that 'Flower colour present' would be prohibited. In this case a validating XML parser would signal exactly where the "error" lies, in case you care. But even better, an application that was built to the schema but did not enforce the schema would just ignore the silly statement.
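
The idea that a schema can offer global values while still prohibiting 'Flower colour present' can be sketched in a few lines of Python. This is a stand-in for a real DTD or schema, under stated assumptions: the table layout, the `allow_global` flag and the `validate` function are hypothetical and only illustrate the rule that each feature opts in to the global list.

```python
# Values any feature MAY refer to, if its definition opts in.
GLOBAL_VALUES = {"present", "absent"}

# Hypothetical feature definitions: local values plus an opt-in flag.
FEATURES = {
    "Flower colour": {"local": {"red", "white", "blue"}, "allow_global": False},
    "Leaf hairs": {"local": set(), "allow_global": True},
}


def validate(description):
    """Return a list of error messages for (feature, value) pairs."""
    errors = []
    for feature, value in description:
        spec = FEATURES.get(feature)
        if spec is None:
            errors.append(f"unknown feature: {feature}")
            continue
        allowed = set(spec["local"])
        if spec["allow_global"]:
            allowed |= GLOBAL_VALUES
        if value not in allowed:
            errors.append(f"{feature}: value {value!r} not allowed")
    return errors


if __name__ == "__main__":
    print(validate([("Leaf hairs", "present")]))
    print(validate([("Flower colour", "present")]))
```

As in the message, the error report names the offending statement; an application that chose not to enforce the rule could simply skip such pairs.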
