- tdwg-content - lists.tdwg.org

Re: (RQT) Character Dependency rules
by Noel Cross 21 Dec '99


On Mon, 20 Dec 1999, Gregor Hagedorn wrote:

> 3. (and finally): Does anybody have good examples where character
> _state_ dependencies would be necessary? This has been proposed for
> New Delta. I have no real objections, except that I have not found a
> good example where I would want to have it.

You've probably considered these already, but for the sake of discussion:

Example 1. Say you want to allow a more or less accurate state for a "time of year" character. In that case the month would be dependent on the year, and the day would be dependent on the month.

Example 2. You want your user to be able to choose more or less accurate geographical locations.

Whether these examples would ever really occur is a question, but I'm not sure we need to proscribe the use of state dependencies.

-Noel
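The nesting in example 1 (day depends on month, month depends on year) can be sketched roughly as follows. This is only an illustration in Python, not part of DELTA or of any proposal in this thread; the class name `PartialDate` and its methods are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical ordering of the nested states: each level may only be
# scored once every coarser level before it has been scored.
LEVELS = ("year", "month", "day")


@dataclass
class PartialDate:
    year: Optional[int] = None
    month: Optional[int] = None
    day: Optional[int] = None

    def validate(self) -> None:
        """Raise ValueError if a finer state is scored without its parent."""
        seen_gap = False
        for level in LEVELS:
            value = getattr(self, level)
            if value is None:
                seen_gap = True
            elif seen_gap:
                raise ValueError(f"{level} is scored but a coarser state is missing")

    def precision(self) -> str:
        """Return the finest level that has been scored, or 'unknown'."""
        self.validate()
        finest = "unknown"
        for level in LEVELS:
            if getattr(self, level) is not None:
                finest = level
        return finest
```

Under these assumptions `PartialDate(1999, 12)` is a valid month-level statement, while `PartialDate(day=21)` is rejected because the day has nothing to depend on.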


Re: (RQT) character/state/comment
by Noel Cross 21 Dec '99


On Mon, 20 Dec 1999, Gregor Hagedorn wrote:

> If I understand this correctly, it assumes that there would be a
> general "property" entity. An instance of this entity would then be
> the property "shape", which has a list of generally applicable
> values. Some of these values would apply to leaves as well as to
> butterfly wings, other may be more specific.

Actually, what I'd like to see as a requirement is more modest than this. I simply propose that there should be the possibility of explicitly recording a property name, which need not have any global significance. The property could be expressed in many ways, but for the sake of an example I suggest that we might possibly have a DELTA character that looked like this:

    #2. body <@property degree of convexity>/
        1. strongly flattened/
        2. slightly flattened to moderately convex/
        3. strongly convex/

-Noel
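To show that such an annotation would be easy to pick out mechanically, here is a minimal Python sketch that reads the example character above. The `<@property ...>` syntax is only the suggestion made in this message, not existing DELTA; the function name `parse_character` and the returned dictionary layout are assumptions for illustration, and the sketch does not handle "/" inside comments.

```python
import re

# Matches the proposed annotation, e.g. "<@property degree of convexity>".
PROPERTY_TAG = re.compile(r"<@property\s+([^>]+)>")


def parse_character(text: str) -> dict:
    """Parse one '#n. feature/ 1. state/ 2. state/' character definition."""
    # Every component ends with "/"; drop the empty piece after the last one.
    parts = [p.strip() for p in text.split("/") if p.strip()]
    head, state_parts = parts[0], parts[1:]

    match = re.match(r"#(\d+)\.\s*(.*)", head)
    if match is None:
        raise ValueError(f"not a character definition: {head!r}")
    number, feature = int(match.group(1)), match.group(2)

    # Pull the property name out of the feature text, if one is present.
    prop = PROPERTY_TAG.search(feature)
    property_name = prop.group(1).strip() if prop else None
    feature = PROPERTY_TAG.sub("", feature).strip()

    states = {}
    for part in state_parts:
        state_match = re.match(r"(\d+)\.\s*(.*)", part)
        if state_match is None:
            raise ValueError(f"not a state definition: {part!r}")
        states[int(state_match.group(1))] = state_match.group(2)

    return {"number": number, "feature": feature,
            "property": property_name, "states": states}
```

A character without the annotation simply yields `property` as `None`, so the property name stays optional, as proposed.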


Re: (RQT) character/state/comment
by Jim Croft 21 Dec '99


>Who thinks global state lists for something like shape (already
>separated into 2D-shape, 3D-shape), texture, smell, color, are
>possible and useful?

Long answer:

The key word here is 'global'. Comprehensive state lists are obviously possible and useful, as people create and use them all the time within their discipline. Many of the terms are shared across disciplines, reflecting common usage or common Greek/Latin derivation. But a mathematician's or engineer's concept of what is acceptable as round, cylindrical, elliptical, linear, pentagonal, etc. is very different to a biologist's (although they would probably understand what we mean).

But I think the pressures against achieving global state lists are historical, political and sociological more than technical. Biologists tend not to work across taxonomic disciplines and carry the lexicons backwards and forwards, and major and minor differences in descriptive practice and terminology evolve. Botanists, entomologists and ichthyologists all describe shapes of flat structures (scales, leaves, petals, wings, fins, etc.) and I suspect there are differences in how they do this (but, not working across disciplines, I will never be sure), and none of these disciplines is going to want to give up its traditional practices - (why does the word 'biocode' keep coming into my mind here?). In any case, does it really matter that different disciplines use different words for the same thing?

It may be possible to achieve consistency within a discipline, but even there I have doubts - who is going to adjudicate on what is the correct terminology for the circular/round/orbicular thing, and, if it is not one thing, how do all the things that are sort of like it differ? And having adjudicated, who is going to enforce the decision, and interfere with the freedom of expression, the right to vote for the multinational company of your choice, the right to bear arms and so on? As I said, political, and I do not think we want to go there.
A gut feeling is that if we go down the universal lexicon/terminology route, a lot of biologists will go round in ever decreasing concentric circles, eventually vanishing up their own centroid, never to be seen again.

Each discipline, and discipline within discipline, can and will develop and use terminology/lexicons/state lists that biologists and information managers have to work with. That is the reality that we are given - I periodically try to change reality to make information management easier, but inevitably get beaten to pulp rather than thanked.

What we can aim for is a data structure that enables disciplines and biologists to use terminology they are comfortable with, storing and documenting it in an unambiguous form that does not inhibit flexibility. DELTA and like programs allow users to do this now: by changing one word, I can change 'frond' to 'leaf', which it is, everywhere (but I like the word 'frond', so I won't)... Thus data can be converted and integrated later, when the public realizes the true value of my lifetime's work on the monograph and combined interactive key to the marine pteridophytes and benthic butterflies of continental Antarctica.

Short answer: no.

Most important is a sound information model and corresponding data structure. If we get it right, the data will clean itself up. :)

jim
__________________________________________________________________________
Jim Croft ~ jrc(a)anbg.gov.au ~ http://www.anbg.gov.au/people/croft.jim.html
ph 02-6246-5500 ~ fx 02-6246-5248 ~ GPO Box 1600 Canberra ACT 2601


(RQT) Character Dependency rules
by Gregor Hagedorn 20 Dec '99


> DELTA already allows dependencies between characters - such
> that if a specific state has been selected for a character, other
> characters are ignored (i.e. no wings, ignore anything to do
> with wings). Is there any other dependency relationships that
> might be required, or additional information about such a
> relationship?

Leigh brought this up; I would like to start a new thread with this. What problems do we have with dependency relationships (or integrity rules, to put it another way)?

Most of us are familiar with the dependency as implemented in DELTA-compatible applications. Here one character is defined as controlling, and depending on which states are scored in a given item, other characters become applicable or inapplicable in that item.

Questions:

1. Do we need both applicable and inapplicable definitions? In DeltaAccess I assumed that applicable is the complement of inapplicable. If the complement of the applicable states is defined as inapplicable, the application behaves the same way. This is only partly true.

a) If new states are added in the definition, some care is necessary. I realized that in DeltaAccess, but thought it the lesser problem. However, it matters for the same reason that we may actually need a "not" statement in the item description (a statement "color of X is NOT pure white" remains true, regardless of how many shades of beige are added later on).

b) If multiple states are scored, the behavior needs to be defined. In the CSIRO programs, it seems that if two states are scored in a given item, a single one is sufficient to make another character applicable or inapplicable (which makes sense, but is lost if the complement is chosen).

I have not found a good generalization for this, but am loath to implement the rules in duplicate. Any ideas? This is certainly a question I am thinking about on the logical rather than the conceptual level, but answering it may help to define the requirements, i.e. whether the DELTA applicable/inapplicable model should be followed, or whether alternative expressions are possible.

2. I recently had a good discussion with Wouter Addink and Flip Boer in Holland. Among other things we considered whether a structural hierarchy (leaf stalk is part of leaf is part of plant) could replace dependency rules. Clearly, it can replace some rules: if there is no leaf, all leaf characters will be inapplicable. However, it cannot replace dependencies based on multiple states that are not presence/absence. Further, it may not always be true, because we tend to use confusing terminology: something at the leaf stalk base would belong to the leaf stalk, but absence of a leaf stalk (i.e. absence of a measurable length of leaf stalk) does not imply absence of stalk-based characters. Any further ideas on this?

3. (and finally): Does anybody have good examples where character _state_ dependencies would be necessary? This has been proposed for New Delta. I have no real objections, except that I have not found a good example where I would want to have it.

----------------------------------------------------------
Gregor Hagedorn                 G.Hagedorn(a)bba.de
Institute for Plant Virology, Microbiology, and Biosafety
Federal Research Center for Agriculture and Forestry (BBA)
Koenigin-Luise-Str. 19          Tel: +49-30-8304-2220
14195 Berlin, Germany           Fax: +49-30-8304-2203
Often wrong but never in doubt!
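The two difficulties raised under question 1 can be made concrete with a small Python sketch. It is one possible reading of the behaviour described above, not a description of how the CSIRO programs or DeltaAccess are actually implemented; the function names and the "wings" states are hypothetical.

```python
from typing import Set


def applicable_by_inapplicable_list(scored: Set[int], inapplicable: Set[int]) -> bool:
    """Rule stored as: dependent character is inapplicable if ANY scored
    state of the controlling character is in the inapplicable list."""
    return not (scored & inapplicable)


def applicable_by_applicable_list(scored: Set[int], applicable: Set[int]) -> bool:
    """Rule stored as: dependent character is applicable if ANY scored
    state of the controlling character is in the applicable list."""
    return bool(scored & applicable)


# Hypothetical controlling character "wings": 1 = absent, 2 = present.
ALL_STATES = {1, 2}
INAPPLICABLE = {1}                        # wing characters do not apply
APPLICABLE = ALL_STATES - INAPPLICABLE    # the complement, {2}

# With a single scored state the two formulations agree.
single = {2}
agree = (applicable_by_inapplicable_list(single, INAPPLICABLE)
         == applicable_by_applicable_list(single, APPLICABLE))

# (b) With two scored states (winged and wingless forms) they disagree.
mixed = {1, 2}
mixed_results = (applicable_by_inapplicable_list(mixed, INAPPLICABLE),
                 applicable_by_applicable_list(mixed, APPLICABLE))

# (a) A state 3 added later, with the stored lists left untouched, defaults
# to "applicable" under one formulation and "inapplicable" under the other.
new_state = {3}
new_results = (applicable_by_inapplicable_list(new_state, INAPPLICABLE),
               applicable_by_applicable_list(new_state, APPLICABLE))
```

In this sketch the complement is only equivalent while exactly one state is scored and the state list never changes, which is why storing just one of the two lists silently fixes the answer to both (a) and (b).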


Re: (RQT) character/state/comment
by Gregor Hagedorn 20 Dec '99


Leigh wrote:

> As a first cut I just transferred everything from DELTA comments into
> XDELTA comments, although it was quite obvious that the 'comment'
> component of DELTA was overloaded depending on context (a legacy
> problem I guess, and one which raises the prospect of 'extensibility'
> as a requirement of the new format).

The concept of "overloading" is very good; I think it helps to understand the issue. I believe that in the new standard we will want to have as little overloading (which means context-specific contents) as possible. In general, all data elements should be defined much more exactly than in DELTA. Many things in DELTA are really defined by result, not by purpose, which means that comments may have a different function in different applications (e.g. identification or natural language reporting).

> DELTA already allows dependencies between characters - such
> that if a specific state has been selected for a character, other
> characters are ignored (i.e. no wings, ignore anything to do
> with wings). Is there any other dependency relationships that
> might be required, or additional information about such a
> relationship?
>
> Restricting states to particular options depending on the property
> in question (e.g. leaf and/or wing shape) leads back to the
> prior discussion on accepted standards for character description.

If I understand this correctly, it assumes that there would be a general "property" entity. An instance of this entity would then be the property "shape", which has a list of generally applicable values. Some of these values would apply to leaves as well as to butterfly wings, others may be more specific.

I wonder whether terminology is actually well developed enough to really support this notion. The discussion is somewhat similar to the discussion on lexicons/globally applicable character definitions. My feeling is that nowhere has "shape" ever been defined in a way that would be applicable to all areas.
Defining dependency rules to create overlapping, or partly matching, sets of character states seems to me a troublesome process.

Further, one should realize that we need more than a name for such concepts. Two shapes in plants and butterfly wings that may be more or less identical in concept may be named identically in one language, but not in another.

I believe that properties are useful (and from the viewpoint of information modeling are preferable to use), but dangerous to use in reality, simply because we as biologists never developed any globally applicable terminology. Also, we have to make the concept clear to any user, e.g. that if she or he changes the name of a state, this change applies not only to his or her butterfly, but also to the plant a colleague is working on in the same database (a somewhat crude example, but more likely somewhere in a large institution for two people working on separate orders, or perhaps on ferns and conifers).

Who thinks global state lists for something like shape (already separated into 2D-shape, 3D-shape), texture, smell, color, are possible and useful?

Gregor


Re: (GEN) Lexicons
by Gregor Hagedorn 20 Dec '99


Originally, the definition of applied schemas ("character definitions", lexica, terminology) was outside the scope of this discussion group. While, judging from my own experience, I think it unlikely that we will make fast progress here, I completely agree that it is most important to work on it. I would therefore endorse Alex's proposal to create a new top-level subject:

-> Please do use a subject line (LEX) if you want to talk about applied terminology for a certain group (all plants, all insects, etc.)

If these discussion threads become too big, we should then open a new list for those interested in it.

Gregor


Compression of XML files
by Robert A. (Bob) Morris 18 Dec '99


People interested in XML file size will be interested in a new free program, XMill, from AT&T Research Labs and UPenn. XMill is said to compress files up to twice as much as gzip on proprietary formats when data in those formats is first rendered in XML and then compressed with XMill. This is essentially because XMill can use the tag information to reorganize the data for greater redundancy.

http://www.research.att.com/sw/tools/xmill/ points to the software and technical material about it.

One of the authors, Dan Suciu, is also a co-author of "Data on the Web: From Relations to Semistructured Data and XML", Morgan Kaufmann. This is a great book whose title is accurate. To read it, you have to be comfortable with graph theory, regular expressions, and finite state automata. If you are not, you can accept this as a summary: XML is more general than relational databases and a little less general than pure object-oriented databases.
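The idea of using tag information to reorganize data before compression can be illustrated with a rough Python sketch. This is not XMill's algorithm or file format, only the grouping step: text values are collected into one container per tag and each container is compressed separately with zlib. The function names are hypothetical, and the sketch keeps only the document order of tags, so it is not a reversible codec.

```python
import zlib
import xml.etree.ElementTree as ET
from collections import defaultdict


def grouped_streams(xml_text: str) -> dict:
    """Split a document into a structure stream plus one text container per tag."""
    root = ET.fromstring(xml_text)
    structure = []                  # tag names in document order
    containers = defaultdict(list)  # tag name -> text values in document order
    for elem in root.iter():
        structure.append(elem.tag)
        text = (elem.text or "").strip()
        if text:
            containers[elem.tag].append(text)
    streams = {"__structure__": "\n".join(structure)}
    for tag, values in containers.items():
        streams[tag] = "\n".join(values)
    return streams


def compress_streams(streams: dict) -> dict:
    """Compress every stream on its own with zlib."""
    return {name: zlib.compress(data.encode("utf-8"))
            for name, data in streams.items()}


if __name__ == "__main__":
    # Made-up sample data, purely for trying the functions out.
    doc = "<items>" + "".join(
        f"<item><taxon>Taxon {i}</taxon><length>{10 + i % 5}</length></item>"
        for i in range(200)
    ) + "</items>"
    whole = len(zlib.compress(doc.encode("utf-8")))
    grouped = sum(len(b) for b in compress_streams(grouped_streams(doc)).values())
    print("whole document:", whole, "bytes; grouped streams:", grouped, "bytes")
```

Whether grouping helps, and by how much, depends entirely on the data; the sketch makes no claim about the ratios reported for XMill.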


Re: (RQT) Structure, Matrix, and Text
by Jim Croft 18 Dec '99


>Free comments anywhere, where it is not clear to which data item a
>comment belongs, will not help. Also, in free text the same concept
>will often be expressed with different words. This is a common
>problem known by anybody who tried to capture free conventional
>descriptions in a database.

... and is probably intractable. One person can often see data in another's comments, while another may see someone else's data as hardly worthy of comment. In many descriptive works the pearls of information are subtle and implicit rather than explicit and in your face (but we will not open the biological description as an art versus a science debate).

Discussions about what is and is not data will go on forever, and the criteria will change from genus to genus, family to family, order to order, and biologist to biologist. What is needed is a common standard format to store all descriptive data and information in an understandable and unambiguous way for exchange between applications, so that nothing gets lost or made unavailable in transactions between applications and up and down the taxonomic hierarchy.

To ignore data and information is probably excusable, but to actually lose it is sinful and unpardonable. It is a shame that verbosity is the price we have to pay for this flexibility, but the technology can handle it.

I did not think we were considering the 'direct processing' of these tagged XML (or whatever) files; they were to be just the lingua franca of applications using descriptive data. A contemporary example would be the DELTA format: after Intkey, the DELTA Editor, MEKA or whatever have grabbed the data, they neither know nor care what a DELTA file looks like. What I think we are after is a format like this that all applications know about and can use without any external hacking and preprocessing.
There is a major difference between an effective archival and exchange format and an efficient arrangement of data for processing and analysis; I doubt if it will be possible to have both in the same package. My understanding was that this list was going to focus on the former, while application developers and closet hackers were going to do whatever they needed or wanted with the latter.

A recurring nightmare is the thought of pouring data backwards and forwards between applications, losing a little at each export/import like the fraying ends of an aging chromosome at each cell division, until eventually there is nothing left...

jim


(RQT) Structure, Matrix, and Text
by Gregor Hagedorn 17 Dec '99


Although some time has passed, I would like to take up something that was posed by Kevin under "(XML) XML?".

> The problem to my mind is that in current formats, e.g. DELTA and
> LucID, much information is implied by context. Thus, in
> 1010
> 0101
> The taxa and character state numbers (identities) are implied by the
> position of the data bit in the matrix. In XML this information is
> verbosely explicit.
> The following question: what could we do with such data as XML that we couldn't
> do with the data as a simple structured file as above?
> Is direct processing of XML data any easier than direct processing of the
> data in a simpler format? Perhaps there will be off-the-shelf parsing tools,
> but how much of a benefit will this be?

I see, among other things already mentioned, 2 benefits:

1. A matrix is inherently order dependent. Reordering characters or taxa is necessarily tied to the contents of the data. In a more verbose presentation (be it XML or not), these actions can be decoupled. This is not an advantage as long as we mainly use isolated programs with import/export interaction alone. However, I believe in the future we will use programs working on networked data. This could mean that I locally modify only the character definition, perhaps translate it to German, define a character and taxon subset for my use, and change character hierarchy or ordering, but still cooperate with other groups using the same description data on the net. That requires, of course, that the descriptions use, at least partially, standardized character definitions and that globally unique identifiers have been assigned to characters. The advantage is not clear when thinking of DELTA, where the character ID is local, and also identical with the character ordering definition. These are understandable design limitations, which have the advantage of producing simple and compact files. In the future, I believe, we will want to be more flexible than this.
In DeltaAccess, I followed the DELTA example and coupled ID and ordering, which brings me quite some trouble currently...

2. Only with explicit and verbose tagging can we achieve forward AND backward compatibility. If LucID wanted to support additional modifiers, perhaps change the 2 frequency states into a more specific and complex system, or if it wanted to allow state-specific reportable text and internal annotation (which are extremely difficult to support in matrix notation anyway), the parser of a new application version would have to be changed quite a bit, perhaps rewritten, and any older software would no longer be able to read these data. With a well designed verbose, fully tagged format, both these restrictions need not apply, I believe.

> Is the following true?: once upon a time, computers could represent but not
> efficiently analyse or process textual data, hence documents were stored as
> text but "data" were stored as matrices etc. Now, XML has blurred the
> boundary between these types of information ("textual" and "data") and we're
> exploring the implications of that blurring. But are there now no
> differences, and no further need for a matrix?

I believe we need structure, and we need support for quality control. The distinction between text and data is indeed blurred somewhat, but text must be structured in ways that can be processed analytically. Free comments anywhere, where it is not clear to which data item a comment belongs, will not help. Also, in free text the same concept will often be expressed with different words. This is a common problem known by anybody who tried to capture free conventional descriptions in a database.

For example, even in DELTA, frequency statements in free text "comments" may express the same frequency by multiple more or less synonymous wordings (are "mostly" and "usually" the same or not?). None of the frequency wordings is really defined, and they are therefore not accessible to analysis.
DeltaAccess tries to use textual frequency modifiers, where the number of possible modifiers is restricted in the character definition, and where each frequency modifier is defined as to the exact upper and lower frequency range it represents. This is analytically accessible.

Gregor
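A restricted list of frequency modifiers with explicit ranges could look roughly like the Python sketch below. It does not reproduce DeltaAccess: the labels, the numeric ranges, and the names `FrequencyModifier`, `lookup`, `same_range` and `compatible` are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrequencyModifier:
    label: str
    lower: float  # inclusive lower bound, as a fraction of observed cases
    upper: float  # inclusive upper bound


# Hypothetical modifier list; the numbers are invented for this example.
MODIFIERS = {
    m.label: m
    for m in (
        FrequencyModifier("rarely", 0.00, 0.10),
        FrequencyModifier("sometimes", 0.10, 0.50),
        FrequencyModifier("usually", 0.50, 0.90),
        FrequencyModifier("mostly", 0.50, 0.90),
        FrequencyModifier("always", 1.00, 1.00),
    )
}


def lookup(word: str) -> FrequencyModifier:
    """Return the defined modifier, rejecting any wording that is not listed."""
    try:
        return MODIFIERS[word.lower()]
    except KeyError:
        raise ValueError(f"undefined frequency modifier: {word!r}") from None


def same_range(a: str, b: str) -> bool:
    """True if two wordings are defined with identical frequency ranges."""
    ma, mb = lookup(a), lookup(b)
    return (ma.lower, ma.upper) == (mb.lower, mb.upper)


def compatible(word: str, observed_fraction: float) -> bool:
    """True if an observed frequency falls inside the modifier's range."""
    m = lookup(word)
    return m.lower <= observed_fraction <= m.upper
```

Once the ranges are part of the definition, the question whether "mostly" and "usually" mean the same thing becomes a lookup (`same_range("mostly", "usually")`) instead of a matter of interpretation, and undefined wordings such as "often" are rejected rather than stored.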


Re: (GEN) Lexicons
by Jim Croft 14 Dec '99


>I would like to suggest that if part of our group think it is worthwhile
>to attempt to construct a standard character list for one or more (or all)
>kingdoms (to which I am not opposed) that we have an additional first level
>topic identifier for subject lines in our postings (currently we use GEN,
>RQT and XML).

Whatever happened to Richard Pankhurst's TDWG-based attempt to come up with a minimum character set (or something) for just the (vascular? seed? flowering?) plants? I seem to remember there was a fair bit of wailing, gnashing of teeth and slitting of wrists over this exercise...

jim
