- tdwg-content - lists.tdwg.org

Re: SDD Specifications Document
by Jean-Marc Vanel 02 Mar '00


Kevin Thiele wrote:

> Jean-Marc's work does indeed need appraisal by the group in relation to our
> goals. My understanding from a brief perusal of his site is that he's trying
> to establish a framework for creating a set (one day he hopes for a global
> set) of descriptions of plants on the web,

Yes, by merging Flora descriptions with character-based data à la Delta.

> the data structured in such a way
> as to be queryable as well as readable.

Yes, it will make biological knowledge really accessible. For the query aspect, I am trying to structure the data so that it can be exported to several kinds of programs:
- Artificial Intelligence
- Delta and LucID
- databases, relational and OO

> Two likely differences between the present attempt and his may be
>
> 1. that we need to create something more flexible so that any botanist can
> capture any aspect of systematic data for any taxon - i.e. a Stipa can be
> described just as fully as a Taraxacum and a Nepenthes, with data ranging
> from morphology to anatomy, phytochemistry and gene sequences (the latter is
> the only easy one). Jean-Marc's thing seems to me to be of the fairly
> inflexible variety

This assertion is "present by misinterpretation". The set of properties and organs is of course not limited, although for convenience a standard set will be provided.

> 2. Jean-Marc's model is directed towards capturing descriptions. We need to
> also allow for identification data for the range of key programs available
> and in the future. Thus, things like character sets, dependencies, allowing
> for misinterpretations etc may perhaps have little meaning in Jean-Marc's
> domain, but are critical in ours.

You need not doubt my commitment to integrating those things into a general model. I am currently looking for funding or sponsorship, because for now I have to work on this after normal working hours.

Anyway, next Monday I will present a more detailed model and example, showing how data and metadata published on different sites can be used together.

Cheers
JMV

--
<person>
<first_name>Jean-Marc</first_name>
<name>Vanel</name>
<project>Worlwide Botanical Knowledge Base - making botany available on Internet <a href="http://wwbota.free.fr/" >site</a> </project>
<homePage>http://jmvanel.free.fr/</homePage>
<a href="mailto:jmvanel@free.fr">mail (eventually put "wwbota" in subject to route your mail in relevant folder)</a>
</person>


Re: SDD Specifications Document
by Leigh Dodds 02 Mar '00


Hi,

I got some time last night to start giving this some thought. I've got some questions:

1. Collation rules. These are currently unspecified. Any objections if I leave them out of an attempt to express these in XML? We can revisit and revise this portion as time goes on.

2. Collated Character source. I see these as essentially a drill-down mechanism that further identifies a 'bottom-level' taxon. Is it reasonable to include character dependencies 'upwards'? Am I right in believing that multiple character sets could drill down into the same source (perhaps produced by different organisations, researchers, techniques)? In this case, is there a principal source for going back upwards? It seems unfair to expect a treatment to keep track of all other treatments which point to it (I may be inferring too much here).

3. Why should characters only have properties at the lowest level? What is the 'lowest level' given that drill-down can occur? Could you outline the reasoning here?

4. It's possible to provide a Character list internally to the treatment or to reference an external one. Will it be a requirement that both can be used together (i.e. combining the internal and external lists)?

Cheers,

L.


Re: SDD Specifications Document
by Leigh Dodds 02 Mar '00


> At the very least, we should make sure that they're compatible, perhaps
> seeing Jean-Marc's as a special-purpose subset of the general.

That's my current interpretation. Jean-Marc's model seems to presuppose that the identification of the organism has already taken place, with that identification involving the allocation of properties. Or have I misinterpreted it?

Jean-Marc, perhaps you could provide some additional examples?

L.


Re: SDD Specifications Document
by Robert A. (Bob) Morris 02 Mar '00


Leigh Dodds writes:
> Date: Thu, 2 Mar 2000 17:54:10 -0000
> From: Leigh Dodds <ldodds(a)INGENTA.COM>
> To: TDWG-SDD(a)usobi.org
> Subject: Re: SDD Specifications Document
>
> [...]
>
> 2. Collated Character source. I see these as essentially
> a drill down mechanism that further identifies a 'bottom-level'
> taxon. Is it reasonable to include character dependencies 'upwards'?
> Am I right in believing that multiple character sets could drill
> down into the same source (perhaps produced by different
> organisations, researchers, techniques). In this case, is
> there a principal source for going back upwards? It seems
> unfair to expect a treatment to keep track of all other treatments
> which point to it (I may be inferring too much here).

Not sure if I am misreading what you mean here, but if not, then I whine thusly:

A /single/ character set can get to the same taxon in different ways. Don't come up with something that guarantees that the key graph must be a tree. Even dichotomous keys aren't actually always trees, even though when printed they often look so. They look so because humans routinely impute structure from format, which is one thing XML is supposed to help us avoid. Though you have to work to represent non-trees in XML...

In an arbitrary directed acyclic graph, if an application wants to know how it got from one node to another, it has to keep track of the path or else climb upwards along each parent looking for a determining character. The latter is an exponential strategy, but key graphs may well be too small to bother worrying about that.

                    small ----------------- speciesA
                   /                       /
                  /                       /
                 /                       /
 ---- wing spot ---                 red ---
                 \                 /
                  large --- wing color ---
                                          \
                                           blue --- speciesB

BTW, this hopefully not absurd example suggests how much easier it is to deal with sometimes (but not always) irrelevant characters using XML and its semi-structured generalizations than using table and matrix representations of character data.

Bob Morris
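
Editorial sketch: the point about key graphs being directed acyclic graphs rather than trees can be illustrated with a few lines of Python. This is only a minimal sketch under stated assumptions. The edge list is one reading of the small diagram in the message (speciesA reachable both via a small wing spot and via a large wing spot plus red wing color), and the names `EDGES`, `build_children`, `paths_to` and `identify` are hypothetical, not part of any SDD proposal.

```python
from collections import defaultdict

# One reading of the diagram: (character node, state, node reached).
# speciesA has two parents, so the key graph is a DAG, not a tree.
EDGES = [
    ("wing spot", "small", "speciesA"),
    ("wing spot", "large", "wing color"),
    ("wing color", "red", "speciesA"),
    ("wing color", "blue", "speciesB"),
]


def build_children(edges):
    """Index the edge list as node -> [(state, child), ...]."""
    children = defaultdict(list)
    for parent, state, child in edges:
        children[parent].append((state, child))
    return children


def paths_to(children, root, target):
    """Enumerate every root-to-target path as a list of (character, state)."""
    found = []

    def walk(node, steps):
        if node == target:
            found.append(list(steps))
            return
        for state, child in children.get(node, []):
            steps.append((node, state))
            walk(child, steps)
            steps.pop()

    walk(root, [])
    return found


def identify(children, root, answers):
    """Follow the user's answers and remember the path that was taken."""
    node, path = root, []
    while node in children:
        state = answers[node]
        path.append((node, state))
        node = dict(children[node])[state]
    return node, path
```

Recording the path during `identify` is the "keep track of the path" option from the message. `paths_to` shows that the same taxon has more than one determining route, and that wing color is never consulted when the wing spot is small.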


geometry first, MathML, CAD, etc
by Jean-Marc Vanel 01 Mar '00


Hello,

I have followed the debates about VRML/X3D for weeks, and it's time to speak.

The aim of our project is to make botanical data available on the Internet, including 3D images. We need a compact, non-proprietary, preferably XML-based, clean definition for complex 3D geometries.

It seems that a representation that is both compact and flexible should be based on mathematics. VRML's cones and cylinders are just special cases of intersections of volumes defined by inequalities of the form:

    f(x,y,z) >= 0

NURBS and Bézier patches are just special cases of surfaces defined by 3 functions:

    R2 ---> R3
    (u,v) ---> (X(u,v), Y(u,v), Z(u,v))

A solution is to use the content part of MathML. I have reviewed it: it has the desired capabilities, i.e. it allows functions and sets to be defined, and it is XML. Certainly only a subset of MathML is needed: n-dimensional geometry with n>3 is not relevant. On the other hand, some geometrical primitives could be added:
- convex hulls,
- recursive constructs like fractals and L-systems,
- transforms, deformations, parametrization, movement

My second point is about modular schemas versus monolithic schemas. X3D is a very "good" example of a monolithic DTD. NOTHING is taken from the XML world outside X3D. It seems that Virtual Reality involves several layers that can be used and designed independently:
- definition of solid objects (see above)
- colors and textures on solid objects
- behavior of solid objects with respect to each other (contact, glued or sliding, rotating, interpenetrable, etc.)
- behavior of solid objects with the User Interface
- a scene as a Composite design pattern of solid objects
- light sources
- scenarios (time-dependent aspect)
- sounds

Conclusion:

This need for a compact, non-proprietary, preferably XML-based, clean definition for complex 3D geometries is shared with other important domains:
- Computer Aided Design
- Architecture
- simulation in mechanics, physics, and biology

CAD is a very important field that currently has no non-proprietary XML language.

It seems that the proposed solution could bring an interesting synergy, able to speed up developments while leading to a better design. A common subset for CAD and Virtual Reality would also bring new possibilities for exchanging data. A well-designed model and XML syntax for virtual reality could also be used for cartoons and video games.

--
<person>
<first_name>Jean-Marc</first_name>
<name>Vanel</name>
<project>Worlwide Botanical Knowledge Base - making botany available on Internet <a href="http://wwbota.free.fr/" >site</a> </project>
<homePage>http://jmvanel.free.fr/</homePage>
<a href="mailto:jmvanel@free.fr">mail (eventually put "wwbota" in subject to route your mail in relevant folder)</a>
</person>
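
Editorial sketch: the two mathematical ideas in the message (solids as sets where f(x,y,z) >= 0, and surfaces as maps from (u,v) to a point in space) can be tried out in a few lines of Python. This is not MathML and not a proposal for the XML syntax. It is only an illustration, and every function name below is an assumption made for the example.

```python
import math


def infinite_cylinder_z(radius):
    """Solid cylinder around the z axis: f >= 0 inside or on the surface."""
    return lambda x, y, z: radius * radius - (x * x + y * y)


def half_space(a, b, c, d):
    """Half-space a*x + b*y + c*z + d >= 0."""
    return lambda x, y, z: a * x + b * y + c * z + d


def intersection(*volumes):
    """A point is in the intersection when every f is >= 0, i.e. min(f) >= 0."""
    return lambda x, y, z: min(f(x, y, z) for f in volumes)


def inside(volume, x, y, z):
    return volume(x, y, z) >= 0


# A VRML-style finite cylinder (radius 1, 0 <= z <= 2) as an intersection
# of three volumes, each defined by one inequality.
cylinder = intersection(
    infinite_cylinder_z(1.0),
    half_space(0, 0, 1, 0),    # z >= 0
    half_space(0, 0, -1, 2),   # z <= 2
)


def lateral_surface(u, v, radius=1.0):
    """Parametric surface (u, v) -> (X, Y, Z) for the side of the cylinder."""
    return (radius * math.cos(u), radius * math.sin(u), v)
```

For example, `inside(cylinder, 0, 0, 1)` is true while `inside(cylinder, 0, 0, 3)` is false, and any point returned by `lateral_surface` with `0 <= v <= 2` should lie on the boundary of the solid, up to floating-point rounding.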


Re: SDD Specifications Document
by Kevin Thiele 29 Feb '00


At 10:13 25/02/00 -0300, Mauro wrote:

>I was about to suggest that a graphic model, perhaps using the UML
>methodology, was in order, but Jean-Marc Vanel has already taken care of
>that. His model is interesting, and deserves attention.

I agree that a graphical representation of the model may be a good idea, but it's beyond me.

Jean-Marc's work does indeed need appraisal by the group in relation to our goals. My understanding from a brief perusal of his site is that he's trying to establish a framework for creating a set (one day he hopes for a global set) of descriptions of plants on the web, the data structured in such a way as to be queryable as well as readable.

So, has Jean-Marc already done all that this group set out to do and can we go home?

Two likely differences between the present attempt and his may be

1. that we need to create something more flexible so that any botanist can capture any aspect of systematic data for any taxon - i.e. a Stipa can be described just as fully as a Taraxacum and a Nepenthes, with data ranging from morphology to anatomy, phytochemistry and gene sequences (the latter is the only easy one). Jean-Marc's thing seems to me to be of the fairly inflexible variety (see the earlier lexicon debate) i.e. systematic description is reduced to filling in the blank spaces on the form provided:

Taxon name <fill in your taxon>
Leaves <phyllotaxy>; margins <indentation>; venation <type>.

I may be misrepresenting Jean-Marc's intentions here. Any comment?

2. Jean-Marc's model is directed towards capturing descriptions. We need to also allow for identification data for the range of key programs available now and in the future. Thus, things like character sets, dependencies, allowing for misinterpretations etc may perhaps have little meaning in Jean-Marc's domain, but are critical in ours.

Does anyone think we should discontinue our attempt and run with Jean-Marc's instead?

At the very least, we should make sure that they're compatible, perhaps seeing Jean-Marc's as a special-purpose subset of the general.

-k


XML Schema
by Robert A. (Bob) Morris 28 Feb '00


Those who care may already know that on Feb 25 W3C released a working draft of XML-Schema for public comment, with a last call for review expected in March.

Perhaps of more general interest is XML Schema Part 0: Primer at http://www.w3.org/TR/xmlschema-0/

From the abstract:

"XML Schema Part 0: Primer is a non-normative document intended to provide an easily readable description of the XML Schema facilities and is oriented towards quickly understanding how to create schemas using the XML Schema language."


Re: Progressive Revelation
by Mike Dallwitz 28 Feb '00


> From: Kevin Thiele <kevin.thiele(a)PI.CSIRO.AU>
> To: TDWG - Structure of Descriptive Data <TDWG-SDD(a)USOBI.ORG>

> Create a key to all grass species so you're working with a list of all
> taxa at species level including all the Poa species. The character list
> has two classes of characters - ones that are scored over all taxa (these
> will be the easily generalised characters) and ones that are scored for
> only a subset of taxa (the characters that are highly specific and/or not
> easily generalisable). When the key program starts it splashes up the
> generalised characters only. But if after answering some characters you
> end up with only Poas, the program finds and adds to the character list
> the Poa-specific characters. Characters are progressively revealed as you
> proceed through the key, with as much depth as necessary - e.g. you may
> come down to a species complex of alpine Poas and presto! some characters
> appear that are just the ticket to separate them.

Something like this, but less rigid, is achieved automatically by algorithms for finding 'best' characters. A 'best' algorithm typically has a penalty for characters which are unknown, inapplicable, or variable for some taxa, but it does not completely exclude them.

The 'best' algorithm used in our programs Intkey (interactive identification) and Key (generation of conventional keys) has a natural penalty for such characters, arising from the goal of minimizing the average length of an identification. The algorithm also has a parameter, Varywt, which can be used to add an arbitrary penalty for such characters. However, this tends to increase the average length of an identification, so its use in Intkey is not recommended.

A value of 0 for Varywt would have an effect similar to that proposed above by Kevin. If you try this in Intkey, note that a value of 0 is treated as 0.01, an ad hoc adjustment made specifically to _avoid_ the complete exclusion of the characters.

In a data set with a substantial number of missing or inapplicable values (e.g. the sample data supplied with the DELTA programs), it is easy to observe how low values of Varywt cause characters with comparatively low separating power (and which therefore result in longer identifications) to move towards the top of the 'best' list.

Varywt is primarily intended for use in Key, where the increased average length of an identification is offset by a reduction in the _printed_ length of the key. In Key, a value of 0 completely prevents the use of characters with intra-taxon variability.

--
Mike Dallwitz
CSIRO Entomology, GPO Box 1700, Canberra ACT 2601, Australia
Phone: +61 2 6246 4075
Fax: +61 2 6246 4000
Email: md(a)ento.csiro.au
Internet: biodiversity.uno.edu/delta/
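
Editorial sketch: the behaviour described above (a natural penalty for unknown or variable characters, plus an adjustable extra penalty that is never allowed to exclude a character outright) can be imitated with a toy ranking function. This is emphatically not the algorithm used in Intkey or Key, whose formula is not given in the message. The cost function, the `vary_weight` parameter (a loose stand-in for Varywt), and the tiny data set are all assumptions made up for the illustration. Only the clamping of 0 to 0.01 comes from the message.

```python
def expected_remaining(scores, states):
    """Mean number of taxa left after one answer, each state equally likely.

    A taxon scored None (unknown/inapplicable) or scored for several states
    survives more answers, which is the 'natural' penalty in this toy model.
    """
    total = 0
    for state in states:
        total += sum(
            1 for taxon_states in scores.values()
            if taxon_states is None or state in taxon_states
        )
    return total / len(states)


def variable_fraction(scores):
    """Fraction of taxa that are unknown or variable for this character."""
    flagged = sum(1 for ts in scores.values() if ts is None or len(ts) > 1)
    return flagged / len(scores)


def rank_characters(matrix, vary_weight=1.0):
    """Return character names, best (lowest cost) first."""
    weight = max(vary_weight, 0.01)  # 0 is treated as 0.01, never full exclusion
    ranked = []
    for name, scores in matrix.items():
        states = sorted({s for ts in scores.values() if ts for s in ts})
        if not states:
            continue  # nothing is scored for this character
        cost = expected_remaining(scores, states) / weight ** variable_fraction(scores)
        ranked.append((cost, name))
    return [name for _, name in sorted(ranked)]


# Invented scores for four placeholder taxa.
MATRIX = {
    "habit": {"A": {"annual"}, "B": {"annual"}, "C": {"annual"}, "D": {"perennial"}},
    "ligule": {"A": {"short"}, "B": {"medium"}, "C": {"long"}, "D": {"short", "long"}},
}
```

With `vary_weight=1.0` the variable but more discriminating "ligule" ranks first. With a low weight such as 0.1, the cleanly scored but weaker "habit" moves to the top, which mirrors the effect the message describes for low Varywt values. Even at 0 the variable character stays in the list.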


Progressive Revelation
by Robert A. (Bob) Morris 28 Feb '00


Since theology has arisen, how's this: object-oriented databases are better for this stuff than either hierarchical or relational ones, precisely for the reasons you outline below.

[By the way, hoping this is not a forbidden commercialism but at least reveals my conflict of interest: we operate the university program for eXcelon, Inc. (formerly Object Design) by which universities can get Object Store, eXcelon [a native XML store], and most other eXcelon products for a total of $650/year. See our web site at www.cs.umb.edu/~serl/odiedu. This is a pretty good way to get into OODBs if you are a university. The program applies in most of the world.]

Bob Morris

Kevin Thiele writes:
> Date: Mon, 28 Feb 2000 09:47:10 +1100
> From: Kevin Thiele <kevin.thiele(a)PI.CSIRO.AU>
> To: TDWG-SDD(a)usobi.org
> Subject: Progressive Revelation
>
> At 15:30 24/02/00 +1100, Eric Zurcher wrote:
>
> >6) I'm intrigued by the notion of a "Progressive Revelation model"
> >(footnote 5). It sounds terribly theological - or perhaps that's
> >Thiele-logical? (my apologies to Kevin, but I really can't resist bad puns).
>
> I'm often accused of teleology, but rarely of theology.
>
> Progressive Revelation is perhaps a new way of handling holes in data
> matrices for random-access keys. The background is this:
>
> The simplest data structure for a random-access key is a fully populated
> matrix i.e. all taxa are scored for all characters/states. Works well
> sometimes, especially if the taxa are highly comparable e.g. the species of
> a genus or the genera of a family.
>
> This structure is problematic sometimes though, for two reasons. Firstly and
> most simply, you may not have data for all taxa, and need to leave holes in
> the matrix. Solution is simple - fill the holes with ?s and allow for this
> in the key program. But it often also happens that some characters are
> simply inapplicable to some taxa, or (worse) are non-ambiguous for some taxa
> but ambiguous for others. For instance, stipules don't occur in monocots,
> stipule-like structures sometimes do but if you try scoring stipule
> characters as defined for dicots against monocots you run into all sorts of
> strife because of ambiguity of context. LucID can handle this to some extent
> using the "present by misinterpretation" score, but the problem is in the
> character definition, not the score.
>
>...
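
Editorial sketch: the Progressive Revelation idea (characters scored for only a subset of taxa are shown once the remaining taxa all fall inside that subset) fits in a few lines of Python. This is a minimal sketch under assumptions, not how LucID or any other key program is implemented. The function names, the placeholder taxa, and the scores are invented for the illustration.

```python
# character -> {taxon: set of states}. A taxon missing from a character's
# dictionary is a hole in the matrix: not scored for that character.
MATRIX = {
    "lemma awn": {
        "Poa sp. 1": {"absent"},
        "Poa sp. 2": {"absent"},
        "Stipa sp. 1": {"present"},
    },
    "poa-specific character": {
        "Poa sp. 1": {"state a"},
        "Poa sp. 2": {"state b"},
    },
}

ALL_TAXA = {"Poa sp. 1", "Poa sp. 2", "Stipa sp. 1"}


def visible_characters(matrix, remaining):
    """Show a character only when every remaining taxon is scored for it."""
    return sorted(
        name for name, scores in matrix.items()
        if remaining.issubset(scores)
    )


def answer(matrix, remaining, character, state):
    """Keep the remaining taxa whose scores for the character include state."""
    scores = matrix[character]
    return {taxon for taxon in remaining if state in scores.get(taxon, set())}
```

At the start only "lemma awn" is offered, because "Stipa sp. 1" has no score for the Poa-specific character. After answering "absent", only the two Poa entries remain and the Poa-specific character is revealed.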


Re: <XML> Abstract Data Model for Taxonomy
by Kevin Thiele 28 Feb '00


From Jean-Marc:

>Leigh Dodds wrote:
>
>> 1. Firstly is it possible to express items like
>> "Feature often present by mis-identification"
>
>Sorry, my English is too weak. Does this mean a character value that is
>often reported by confusion with another taxon?

Leigh is presumably referring to the "Feature present (by misinterpretation)" score (see LucID). This is a score applicable to a state for a taxon so that, for instance, if a taxon lacks petals but has petal-like pseudopetals, you can score that petals are absent (the truth) but also present (by misinterpretation). This pre-empts likely user errors in an interactive key, while allowing the treatment builder to maintain data integrity. If there is no such score, the only way of pre-empting the mistake is to deliberately introduce errors into the data.
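
Editorial sketch: one way to picture the "present (by misinterpretation)" score is as a third value that behaves like "present" when matching a user's answer in a key, but like "absent" when the data are read as a description. The Python below is only an illustration of that distinction. The `Score` enum and both functions are hypothetical names and do not reflect LucID's actual data format.

```python
from enum import Enum


class Score(Enum):
    ABSENT = "absent"
    PRESENT = "present"
    PRESENT_BY_MISINTERPRETATION = "present by misinterpretation"


def survives_answer(score, user_says_present):
    """Key-time test: is the taxon kept after the user's answer?

    A taxon scored 'present by misinterpretation' is kept either way:
    the feature is truly absent, but a user may well report it as present.
    """
    if user_says_present:
        return score in (Score.PRESENT, Score.PRESENT_BY_MISINTERPRETATION)
    return score in (Score.ABSENT, Score.PRESENT_BY_MISINTERPRETATION)


def truly_present(score):
    """Description-time test: only a plain 'present' is a true presence."""
    return score is Score.PRESENT
```

For the pseudopetal example, a taxon scored `Score.PRESENT_BY_MISINTERPRETATION` for petals stays in the key whether the user answers "present" or "absent", while `truly_present` still reports that petals are absent, so no error has to be introduced into the data.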
