- tdwg-content - lists.tdwg.org

Re: HTML in XML
by Leigh Dodds 20 Dec '01

> Tim wrote:
> ><species_comments>The nature of Carcinus maenas is blah blah
> >blah.</species_comments>
>
> This is horrible mixed content stuff, and I am told that this is ipso
> facto, evil, and we will burn in hell for eternity for considering such a
> construct...

Except for this kind of usage. Mixed content is perfect for marking
up free-form text. Bad for mapping data into data structures (in most
cases a database).

If I had XML data containing this structure, I'd simply store the
contents as the raw XML.

L.
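
One way to read "store the contents as the raw XML" is sketched below in Python, using only the standard library. This is an illustration under assumptions, not anything posted to the list: the `<species>` wrapper, the `<name>` element, the `<i>` markup and the `inner_xml` helper are all hypothetical. The structured field is pulled out as a plain value, while the mixed-content field is kept as an unparsed string.

```python
import xml.etree.ElementTree as ET


def inner_xml(elem):
    """Return the content of elem (text plus child markup) as one raw XML string."""
    parts = [elem.text or ""]
    for child in elem:
        # tostring() serializes the child element together with its tail text.
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


# Hypothetical record; the <i> markup is assumed for illustration only.
record = ET.fromstring(
    "<species><name>Carcinus maenas</name>"
    "<species_comments>The nature of <i>Carcinus maenas</i> is blah blah blah."
    "</species_comments></species>"
)

# Structured data maps to a plain value; mixed content is stored as raw XML.
row = {
    "name": record.findtext("name"),
    "comments_xml": inner_xml(record.find("species_comments")),
}
print(row["comments_xml"])
```

The stored string can be handed back to an XML-aware tool later, so nothing is lost by not decomposing it into database columns.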

Re: HTML in XML
by Leigh Dodds 20 Dec '01

> <species_comments>The nature of Carcinus maenas is blah blah
> blah.</species_comments>
>
> What I would like to do is have this attribute parsed using XSL and retain
> the formatting over species names. When the text is processed normally the
> HTML tags are dropped.

That's because you're hitting the default template rules, which
strip elements but retain their content.

> If it is kept as CDATA the tags are not dropped but
> also not processed.

If you mean you've got Carcinus maenas or something similar, then the
XSLT processor won't treat them as tags, but text.

>Any clues how to handle this?

You need a template rule that copies over the contents of
this element into the result tree unchanged.

Something like:

<xsl:template match="species_comments">
  <xsl:copy>
    <xsl:copy-of select="@*|node()"/>
  </xsl:copy>
</xsl:template>

This should copy the contents unchanged (including attributes)
into the output.

HtH,

Cheers,

L.
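
The difference between real markup and CDATA can be seen without an XSLT processor. The Python sketch below is only an analogy, and the `<i>` tags are assumed for illustration (the inline markup of the original example did not survive in the archive). It shows that a parser builds a child element for real markup, that flattening to text drops the tags but keeps the content (the effect the built-in rules have), and that inside CDATA the angle brackets are plain characters.

```python
import xml.etree.ElementTree as ET

# Hypothetical inputs; the <i> markup is an assumption for illustration.
marked_up = ET.fromstring(
    "<species_comments>The nature of <i>Carcinus maenas</i> is blah.</species_comments>"
)
as_cdata = ET.fromstring(
    "<species_comments><![CDATA[The nature of <i>Carcinus maenas</i> is blah.]]>"
    "</species_comments>"
)

# Real markup: the parser builds a child element for <i>.
child_tags = [child.tag for child in marked_up]

# Flattening to text keeps the content but drops the tags.
flattened = "".join(marked_up.itertext())

# CDATA: no child elements at all; "<i>" is just part of the text.
cdata_children = len(as_cdata)
cdata_text = as_cdata.text

print(child_tags, flattened, cdata_children, cdata_text, sep="\n")
```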

Re: Character hierarchies
by Robert A. (Bob) Morris 12 Dec '01

Jim Croft writes:
> Date: Sat, 1 Dec 2001 07:28:21 +1100
> From: Jim Croft <jrc(a)anbg.gov.au>
> To: TDWG-SDD(a)usobi.org
> Subject: Re: Character hierarchies
>
> >The other qualifiers (misinterps etc) aren't in there yet, but will come.
> >We'll have to be much more sophisticated than in my example, of course.
> >
> >Now how about "definitely known to be unknown"
>
> Very droll... However, Greg and I have discussed this in depth over a
                                                                 ^^^^^^^
> bottle of red and have come to the firm conclusion that without the
  ^^^^^^^^^^^^^

over "a" bottle of wine, or over "a quantity of bottles of red wine
more than zero and of indeterminate number"?

> attribute of "often misinterpreted as definitely known to be unknown", and
> the corresponding "rarely misinterpreted as definitely known to be
> unknown", accurate representation of author intent in biological
> descriptions will never be possible...
>
> So get to it - we expect to see this in the next software release, real
> soon now...
>
> jim
>
> ~ Jim Croft ~ jrc(a)anbg.gov.au ~ 02-62465500 ~ www.anbg.gov.au/jrc/ ~

Working Document
by Georgina MacKenzie 11 Dec '01

Kevin Thiele's working document for the SDD group is now available from the
TDWG web site at:

http://www.tdwg.org/tdwg2001/SDD_TDWG_2001.htm

Georgina MacKenzie,
Secretary, Newsletter Editor and Webmaster,
International Working Group on Taxonomic Databases (TDWG)

gmackenzie(a)york.biosis.org
Telephone: +44 1904 642816
Fax: +44 1904 612793
54 Micklegate, York, YO1 6WF, UK

*********************************************************************
This email is intended for the addressee(s) only. If you are not an
intended recipient please inform the sender. BIOSIS UK does not accept
responsibility for the contents of this message.
*********************************************************************

Re: Questions on Dependencies and identifiers
by Kevin Thiele 06 Dec '01

Jim wrote

| Thinking further about Kevin's two types of dependency, context and
| content (or whatever the terminology was), isn't the former an artifact
| of the character set we have decided to use and the latter an artifact
| of the taxa we are covering? The former is just not possible, the latter
| might be possible but is just not there. If this really is the case and
| one sort of dependency arises from the character state and another
| arises from the taxa being considered, then it is very likely that they
| should be modelled in slightly or completely different ways.

This resulted from c. 10 minutes' thinking in response to Steve's point. It
does seem to be true that there are these 2 types of dependencies, but we
need to think much more about (a) whether they need to be treated
differently and (b) whether there aren't even more types.

Again, shall we keep this thread going or set it aside to pick up later? I'm
inclined to the latter.

Cheers - k

Re: Questions on Dependencies and identifiers
by Kevin Thiele 06 Dec '01

Barry - these are just the sort of problems that Challenge 1 was designed to
throw up. In retrospect, though, it's perhaps too complex already. Maybe as
we consider it we can break it up into sub-challenges?

Dependencies are indeed a large issue that needs to be dealt with carefully.
Should we set them aside for the time being or do you want to deal with them
now?

When I put in that bit about mucronate within a notch my thought was that
there would be two states "obtuse" and "mucronate within an apical notch",
and obviously Jim's right that it's dependent on the context - the other
taxa in the treatment - whether to treat it like this or otherwise.

Your interpretation:

| Apex shape
|     obtuse
|     notched
| Apex projection
|     immucronate
|     mucronate within notch (dependent on 'Apex shape - notched')

seems to have a dependency between a state of one character and a state of
another character (I'll use the character/state terminology for the time
being here for simplicity). This is not the way it's usually done -
dependencies usually set up a relationship between an entire character and a
state of another character. E.g. the character "Leaf shape ovate/elliptic"
will be dependent on the state "present" of the character "Leaves
present/absent". In an interactive key the behaviour will be that if someone
chooses Leaves absent then the character Leaf Shape will disappear. Likewise
when building natural language descriptions, if a taxon has no leaves then
the leaf descriptive characters will all be skipped.

You seem to be doing something else here, and I'm not sure what behaviour
you were trying to get. Can you give us a fragment of your model for the
simplest case extractable from Challenge 1, without the complications of
numeric characters, dependencies etc.?

Yes, there's a real need for character lists and identifiers. Do we go with
e.g.

<Feature Name="Leaves">
  <Value>ovate</Value>
</Feature>

or

<Feature FeatureID="23" valueID="6">

The former certainly makes for better human-readability, but is this
important? I would have thought that a computer can make sense of either,
and we can support both. The parser would simply need to know that if the
attribute FeatureID is specified then use that and look up the idref to get
the name, else if the attribute Name is specified then use that.

Presumably our document structure will be the same whether we use Names or
Idrefs?

Cheers - k
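
The "use FeatureID if present, else Name" rule could look roughly like the Python sketch below. It is only an illustration under assumptions: the lookup tables `FEATURE_NAMES` and `VALUE_NAMES`, their contents (that 23 means "Leaves" and 6 means "ovate"), and the `resolve` function are hypothetical, standing in for whatever shared character list the idrefs would point into.

```python
import xml.etree.ElementTree as ET

# Hypothetical lookup tables standing in for a shared character list.
FEATURE_NAMES = {"23": "Leaves"}
VALUE_NAMES = {"6": "ovate"}


def resolve(feature):
    """Return (feature name, value) for either notation."""
    feature_id = feature.get("FeatureID")
    if feature_id is not None:
        # ID form: look both names up through the tables.
        return FEATURE_NAMES[feature_id], VALUE_NAMES[feature.get("valueID")]
    # Name form: the names are written out in the document itself.
    return feature.get("Name"), feature.findtext("Value")


by_name = ET.fromstring('<Feature Name="Leaves"><Value>ovate</Value></Feature>')
by_id = ET.fromstring('<Feature FeatureID="23" valueID="6"/>')

print(resolve(by_name))
print(resolve(by_id))
```

Both notations resolve to the same pair here, which is the sense in which a parser could support either without changing the rest of the document structure.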

Re: Questions on Dependencies and identifiers
by Jim Croft 06 Dec '01

Kerry wrote:

> First, I am stuck trying to adequately model the structure and dependencies
> of the leaf apex characters
> "apex obtuse or minutely mucronate within an apical notch". I am parsing
> this out as:
>   Apex shape
>     obtuse
>     notched
>   Apex projection
>     immucronate
>     mucronate within notch (dependent on 'Apex shape - notched')
> The model has an attribute for 'dependent on' under the character element.
> Does the proper solution of this problem require a 'dependent on' attribute
> as part of the //character/state element as well? Am I parsing the data
> correctly?

Maybe... but is there a correct answer to this in the absence of
context? Without knowing what the other taxa in the application are, you
could get away with:

Apex shape
    obtuse
    minutely mucronate in an apical notch

Or even, trivially:

Apex shape
    obtuse or minutely mucronate in an apical notch

It may be that there are only two character states in this group of taxa -
the bluntish ones with obtuse/mucronate tips and the sharp ones with
acute/attenuate tips. In this case building a line of dependency might not
only be unnecessary, but might it also be misleading?

But I do not think that was the real issue, was it? The challenge was, if
you had a dependency, how would you deal with it...

> The first problem led me to the second problem, which is: how do we
> identify the character or state another character depends on? This is not a
> problem when working with DELTA format data which has unique identifiers
> already, but text descriptions do not. Is an Xpath description adequate or
> do we need to be able to have a structure that allows us to uniquely
> identify what a character state describes?

Is there any harm in requiring a unique id for each node? Most descriptive
applications probably do this already to manage their data matrices. DELTA
lets you see them, which is considered maybe useful; Lucid doesn't, which is
considered healthy... :)

> I originally thought a unique identifier was needed to identify
> missing data not explicitly encoded in the description and to provide a
> pointer to characters or states that other characters or states depend
> on. Now that I see how complex this may be to implement, I'm wondering if
> another way would be better.

If our structure/Schema/DTD is right, it would be nice to think that it was
self-referencing and that we did not have to worry about internal ids and
pointers, wouldn't it...

Thinking further about Kevin's two types of dependency, context and
content (or whatever the terminology was), isn't the former an artifact
of the character set we have decided to use and the latter an artifact
of the taxa we are covering? The former is just not possible, the latter
might be possible but is just not there. If this really is the case and
one sort of dependency arises from the character state and another
arises from the taxa being considered, then it is very likely that they
should be modelled in slightly or completely different ways.

> I originally included an attribute called 'describes' under the
> //character/state element. I thought that this could be used for two
> purposes: (1) to add descriptive text to the character state and (2) to help
> build a unique identifier for a character state. The first use is no
> problem; this structure allows an author to easily add "high" to the value
> for the character state describing height for example. The structure is
> less useful for building an identifier for what the character state
> describes.

//character/state is not always going to be unique, is it? The tip
mucronate, obtuse, acute condition could exist in scales, bracts, leaves,
petals, sepals, anthers, etc. A more complete path is going to have to be
specified, which could get pretty nasty... but hey, computers are good at
nasty...

> I thought I could build a unique identifier by combining the
> character name and the 'describes' attribute. So for
>
> <Character>
>   <CharName>Leaf</CharName>
>   <State describes="PlaneShape" modifier="+/-">oblong</State>
> </Character>
>
> The identifier 'LeafPlaneShape' identifies what the state describes. This
> allows you to recognize that even when there are multiple states (ie.
> 'ovate', 'round', etc.) they are all describing one aspect of the leaf. If
> nested characters are recognized, the identifier must get longer.
> 'ApexShape' or 'ApexProjection' wouldn't work, it would have to be
> 'LeafApexShape' or 'LeafApexProjection'. If the nesting gets deep, this
> gets awkward pretty fast. It would work, though.

but not very elegantly...

Dependency is the act of a state/value controlling the applicability or
otherwise of a character/feature or, as Kevin has pointed out, a taxon
controlling a character/feature (in the identification process this gets a
bit circular, as you have to have an idea what a taxon is before you can
decide if a character is inapplicable, but such knowledge can be built into
the data in various ways).

Thus the model will have to accommodate that a certain character/feature is
unavailable if a certain state/value exists, perhaps in a totally different
branch of the character/feature tree - and the obligate reciprocal
relationship between controlling state/values and controlled
characters/features. Unique ids, even arbitrary sequential numeric ones a la
DELTA, might be the best (or an adequate) way to do this.

How about:

<feature id="123">
  <featureName>leaf</featureName>
  <character id="6" nullcharacter="3" nullstate="4">
    <characterName>leaf shape</characterName>
    <state modifier="+/-" value="present">oblong</state>
    <state value="rarely">ovate</state>
    <state value="rarely">obovate</state>
  </character>
</feature>

... just doodling... maybe this needs to be specified in a separate
characterlist up front, rather than in the guts of the descriptive data.

Actually we have not even decided that yet, have we? Should the list of
characters/states and the lists of taxa be described as separate blocks, or
should they be implied from the content and structure of the standard data
file? I prefer the former, and both DELTA and Lucid do this... but have we
decided that this is the way to go?

> Do others think that having an identifier is necessary and, if so,
> has anyone been able to come up with a better way to handle it? Is Xpath
> adequate to our needs?

I always have problems with Xpath - it is supposed to work, but I always
manage to get lost in the hierarchy... need more practice... :)

jim
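
The character-to-state dependency discussed in this thread, with unique ids used as the pointers, might be modelled along the lines of the Python sketch below. This is an illustration under assumptions, not a proposal from the list: the `Character` class, its `depends_on` field and the `applicable` function are hypothetical names. It reuses the leaf example given earlier in the thread, where "Leaf shape" only applies when "Leaves present/absent" is scored as present.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Character:
    id: int
    name: str
    # (id of the controlling character, state that must be scored), or None.
    depends_on: Optional[Tuple[int, str]] = None


def applicable(characters, scored):
    """Return the characters still applicable given the states scored so far."""
    result = []
    for ch in characters:
        if ch.depends_on is not None:
            controller, required = ch.depends_on
            chosen = scored.get(controller)
            # Hide the character once a conflicting state has been chosen.
            if chosen is not None and chosen != required:
                continue
        result.append(ch)
    return result


characters = [
    Character(1, "Leaves present/absent"),
    Character(2, "Leaf shape", depends_on=(1, "present")),
]

# In an interactive key, choosing "absent" makes "Leaf shape" disappear.
print([c.name for c in applicable(characters, {1: "absent"})])
```

Because the pointer is an id rather than a path, the controlling character can sit in a completely different branch of the character tree.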

Questions on Dependencies and identifiers
by Barringer, Kerry 05 Dec '01

I am still working through Challenge 1, modifying my data model so that it
can completely model the text descriptions. I am now stuck on two problems
that arose when I was trying to work through the practical implications of
the model.

First, I am stuck trying to adequately model the structure and dependencies
of the leaf apex characters
"apex obtuse or minutely mucronate within an apical notch". I am parsing
this out as:

  Apex shape
    obtuse
    notched
  Apex projection
    immucronate
    mucronate within notch (dependent on 'Apex shape - notched')

The model has an attribute for 'dependent on' under the character element.
Does the proper solution of this problem require a 'dependent on' attribute
as part of the //character/state element as well? Am I parsing the data
correctly?

The first problem led me to the second problem, which is: how do we
identify the character or state another character depends on? This is not a
problem when working with DELTA format data which has unique identifiers
already, but text descriptions do not. Is an Xpath description adequate or
do we need to be able to have a structure that allows us to uniquely
identify what a character state describes?

I originally thought a unique identifier was needed to identify
missing data not explicitly encoded in the description and to provide a
pointer to characters or states that other characters or states depend
on. Now that I see how complex this may be to implement, I'm wondering if
another way would be better.

I originally included an attribute called 'describes' under the
//character/state element. I thought that this could be used for two
purposes: (1) to add descriptive text to the character state and (2) to help
build a unique identifier for a character state. The first use is no
problem; this structure allows an author to easily add "high" to the value
for the character state describing height for example. The structure is
less useful for building an identifier for what the character state
describes.

I thought I could build a unique identifier by combining the
character name and the 'describes' attribute. So for

<Character>
  <CharName>Leaf</CharName>
  <State describes="PlaneShape" modifier="+/-">oblong</State>
</Character>

The identifier 'LeafPlaneShape' identifies what the state describes. This
allows you to recognize that even when there are multiple states (ie.
'ovate', 'round', etc.) they are all describing one aspect of the leaf. If
nested characters are recognized, the identifier must get longer.
'ApexShape' or 'ApexProjection' wouldn't work, it would have to be
'LeafApexShape' or 'LeafApexProjection'. If the nesting gets deep, this
gets awkward pretty fast. It would work, though.

Do others think that having an identifier is necessary and, if so,
has anyone been able to come up with a better way to handle it? Is Xpath
adequate to our needs?

Thanks,

Kerry

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Kerry Barringer (Curator of the Herbarium)
Herbarium                  718-623-7318 (office)
Brooklyn Botanic Garden    718-941-4774 (fax)
1000 Washington Avenue     718-623-7312 (herbarium)
Brooklyn, NY 11225-1099
U.S.A.
kbarringer(a)bbg.org
http://www.bbg.org/
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
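
The identifier scheme described above (character names along the nesting path, followed by the 'describes' value) can be sketched in a few lines of Python. The `build_identifier` function is a hypothetical helper, and the "Petal" path is an assumed second organ added only to show why the short form 'ApexShape' would collide.

```python
def build_identifier(path, describes):
    """Join the nested character names and the 'describes' value into one id."""
    return "".join(path) + describes


leaf_plane = build_identifier(["Leaf"], "PlaneShape")
leaf_apex = build_identifier(["Leaf", "Apex"], "Shape")
# Hypothetical second organ: without the full path both would be "ApexShape".
petal_apex = build_identifier(["Petal", "Apex"], "Shape")

print(leaf_plane, leaf_apex, petal_apex)
```

The identifiers stay unique only as long as the full path is included, so they grow with every level of nesting, which is the awkwardness the message points out.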

Re: It's How the Data will be Used that Counts
by Jim Croft 05 Dec '01

Una wrote:

> These <freeform comments> are natural-language paragraphs embedded
> in otherwise fully parsable data, which is the opposite of Kevin's
> case above. And much easier to handle, I think.

That is an interesting perspective, and probably right... but although
coming from different directions, is there a real difference in the final
result? Both end up as mixed content of structured (and parsable) character
data interspersed with free-form text...

> I put interesting data in the comments for an extremely important
> reason: I know I don't have enough data to represent the probable
> diversity in the taxon (the item), and I do not want to understate
> the diversity. Better to leave the character coded as unknown,
> but record the data. At some point in the future, the data can be
> moved from the comments to where it belongs.

Absolutely... and isn't this the holy grail we are seeking in this
descriptive data standard? I am quite sure that the model we come up with
will have a structured place for this freeform and comment stuff... to do
anything less will be to admit failure...

jim

Re: It's How the Data will be Used that Counts
by Jim Croft 05 Dec '01

In response to Steve, Kevin wrote:

> The terminology is +/- trivial at this stage, but I'll explain that I chose
> something different from character/state simply to break with tradition for
> a while. Traditionally, a character has states and that's it - a 2-level
> tree. In the example above one character (leaf) has as child another
> character (margin). This seems odd to many people thinking traditionally
> about characters/states. Let's agree that we'll use them interchangeably for
> now.

And I think we have all become bilingual in this regard... But sooner
rather than later I would like us to nail this terminology down, to free up
the synonyms for us elsewhere in our model as much as anything else...

Also, I am not yet convinced that unbounded nesting of characters is
necessarily the best way to go in terms of representing a hierarchy of
character data... but maybe it is...

At the end of the day, a state of a particular character (or a value of a
feature) is used in a key decision or choice, and the hierarchy is not all
that important other than for the order of presentation or logical grouping
of characters. In the description the hierarchy similarly provides the
logical order of the characters and their states/values.

Perhaps we could use features for the hierarchy and characters for the
ultimate branch. For example, features could contain features or characters
(but not both) and characters would have states which would have values
(present, absent, doubtful, rarely, in error, or whatever, or a
measurement/count).

Reaching agreement on this level of data description, and the terminology we
are going to use, would seem to be essential for clear and unambiguous
communication within the group. But if we can't agree, I can wait... because
we are going to have to do it sooner or later...

jim
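
The constraint suggested above, that a feature holds either sub-features or characters but never both, is easy to check mechanically. The Python sketch below is one possible reading, under assumptions: the element names `feature`, `character` and `state`, the `name` attribute and the `check_feature` function are all hypothetical, since the group had not settled on a schema.

```python
import xml.etree.ElementTree as ET


def check_feature(feature):
    """Return a list of problems found in one <feature> and its sub-features."""
    problems = []
    kinds = {child.tag for child in feature if child.tag in ("feature", "character")}
    if len(kinds) > 1:
        problems.append(f"feature '{feature.get('name')}' mixes features and characters")
    # Recurse so that the rule is enforced at every level of the hierarchy.
    for sub in feature.findall("feature"):
        problems.extend(check_feature(sub))
    return problems


# Hypothetical documents using assumed element and attribute names.
ok = ET.fromstring(
    '<feature name="leaf">'
    '<feature name="margin"><character name="margin type">'
    '<state value="present">entire</state></character></feature>'
    '</feature>'
)
mixed = ET.fromstring(
    '<feature name="leaf">'
    '<feature name="margin"/>'
    '<character name="leaf shape"/>'
    '</feature>'
)

print(check_feature(ok))
print(check_feature(mixed))
```

Under this reading the features carry the hierarchy and the characters are always leaves of it, so a key or description generator only ever has to look at one kind of child at each level.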
