- tdwg-content - lists.tdwg.org

Re: It's How the Data will be Used that Counts
by Jim Croft 05 Dec '01

> No, this is not equivalent to dependencies, although at first sight it seems
> similar. It's a tree-topology rule: the rule "<states>s can have
> <characters>s as siblings" allows certain topologies for the
> data-representation trees that would be precluded by the rule "<states>s
> cannot have <characters>s as siblings". As I said, I don't know whether the
> rule's a good or necessary one or not.

But it is important that we sort it out... Dependency can be expressed in ways other than allowing the ultimate state to be a parent of a character, can't it? This would seem to be very messy and recursive and would probably break the clean hierarchical representation of data we seemed to be aiming towards.

> Dependencies establish relationship rules between different parts of the
> tree. Sometimes these rules affect parent-sibling relationships as in the
> example, but sometimes not - that is, dependencies are more complex than can
> be handled by a topology rule.

I tend to agree with this...

> In the key, if a user chooses a state marked * the character "Trap
> Structures" will disappear because of the dependencies. Note that the
> dependency on tree and shrub is not a logical one but a contextual one - it
> just happens that no trees or shrubs are carnivorous, so the dependency's
> useful but not logically necessary.

Well, this is important, isn't it? I think we have just established that our model must accommodate two kinds of dependency: one based on logic and character tree topology, and the other based on content/context. In the former a character cannot be present because the parent feature/character/organ is absent; in the latter it could conceivably be present, but just isn't. Up till now applications have just allowed scoring dependencies without making this distinction.

> I don't think these dependency relationships can be expressed merely by the
> topology of the tree, as you suggest. We will need to deal with dependency
> rules as a separate issue.

I agree with this. We need to bear it in mind, but let's try and nail down a basic descriptive framework first...

jim

Re: It's How the Data will be Used that Counts
by Jim Croft 05 Dec '01

Kevin wrote:

> I seem to be in a minority of one here again, but I'll continue to argue my
> case for a bit longer.

why not... we have been doing this for several years now, a few more email iterations won't hurt... :)

> 1. Exactly, it's easier to go from data to +/- natural language, which is
> precisely why we need to try hard to facilitate the reverse.

Which is not all that difficult... to mark up a single description manually is a relatively trivial task... but to process a lot automatically is another matter... and to do it in a way that the whole community accepts is another matter entirely... :)

> 2. If we can effectively embed fully parsable data in a natural-language
> paragraph, why not?

because it creates mixed content (legal, but evil) where the data is parsable, but not structured. If we can not represent our data universe in a database in an elegant and easily understood way we may not necessarily have failed, but we will have fallen short of the target by a large margin.

In fact, DELTA already does this sort of thing by allowing liberal appending and prepending <freeform comments> all over the place. While this makes for quasireadable descriptions, authors often embed interesting character data in the comments making it unusable by other applications, even within the DELTA suite. The 'rarely' and 'misinterpreted' scoring options in Lucid might be seen as an attempt to capture some of this information.

> 3. If a structured data document based on our standard is a subset of a
> marked-up description based on the standard, then creating a standard that
> can support the latter gives us the best of both worlds. If it can be done,
> why not?

I agree... this is a laudable aim... but I must admit to never thinking of a structured document as a subset of a marked up description - oftentimes the reverse may be the case... Isn't it better to think of both structured documents and marked up descriptions as being subsets of the standard we are trying to create?

> Personally I think that creating an XML representation of structured data
> would be a doddle.

But as we have seen, getting everyone to agree that one person's way of doing it is the one true and proper path to descriptive enlightenment is no easy task...

> Creating a fully parsable but lossless XML representation
> of a natural language description (which hence can also handle the degraded
> case of structured data) - now that would really be something to write home
> about!

Well, dreaming about it at least... I think we are dealing with two different, and I fear irreconcilable things here... descriptions by their very nature are lossy - they are abstract representations of the gestalt of a sample of a taxon, often with an arbitrary word limit, attempting to portray in a familiar format what an author thinks a taxon looks like. Structured documents such as DELTA and Lucid at least have the potential to store everything that is remotely interesting about every taxon in the set and often come close to achieving it in reality. So what if the resulting descriptions do not have the poetic beauty of a Shakespearian sonnet; at least the information will be there and retrievable... In the case of biological description, beauty is not necessarily truth, or at least the whole truth...

> Anyone else out there +/- agree with me, or should I give up now?

Don't do that... if you do, you will never have anything to write home about...

jim

Re: It's How the Data will be Used that Counts
by Kevin Thiele 04 Dec '01

I seem to be in a minority of one here again, but I'll continue to argue my case for a bit longer.

----- Original Message -----
From: "Robert A. (Bob) Morris" <ram(a)CS.UMB.EDU>
To: <TDWG-SDD(a)USOBI.ORG>
Sent: Tuesday, December 04, 2001 2:11 PM
Subject: It's How the Data will be Used that Counts

| I think natural language parsing understanding is harder than natural
| language production from structure. So I think there is less work to
| go from data to description than the other way around.

1. Exactly, it's easier to go from data to +/- natural language, which is precisely why we need to try hard to facilitate the reverse.

2. If we can effectively embed fully parsable data in a natural-language paragraph, why not?

3. If a structured data document based on our standard is a subset of a marked-up description based on the standard, then creating a standard that can support the latter gives us the best of both worlds. If it can be done, why not?

Personally I think that creating an XML representation of structured data would be a doddle. Creating a fully parsable but lossless XML representation of a natural language description (which hence can also handle the degraded case of structured data) - now that would really be something to write home about!

Anyone else out there +/- agree with me, or should I give up now?

Cheers - k

Re: It's How the Data will be Used that Counts
by Kevin Thiele 04 Dec '01

----- Original Message -----
From: "Steve Shattuck" <Steve.Shattuck(a)CSIRO.AU>
To: <TDWG-SDD(a)USOBI.ORG>
Sent: Tuesday, December 04, 2001 11:31 AM
Subject: It's How the Data will be Used that Counts

| Kevin's representation is too focused on text descriptions. A more complete
| representation might be:
|
| <character name="leaf">
|   <state>present</state>
|   <character name="leaf margin">
|     <state>serrate</state>
|     <character name="tooth orientation">
|       <state>forward-pointing</state>
|     </character>
|   </character>
| </character>

I would argue that this is a less complete representation, because you've abstracted the original data further than I have.

I won't use the "markup" from my last email for "Leaf margins serrate with forward-pointing teeth", as this was designed to exemplify one limited problem (how to expand a description if characters are nested) and wasn't actually marked up in the way I proposed for challenge case 1. Two possible ways I'd do this example are:

"Leaf margins serrate with forward-pointing teeth"

{Using the rule "<states>s cannot have <characters>s as siblings"}

  <Feature><Name>Leaf</Name>
    <Feature name="Presence" value="Present"/>
    <Feature Name="marginal toothing">margins
      <Value>serrate</Value>
    </Feature>
    with
    <Feature Name="tooth orientation">
      <State>forward-pointing</State>
      teeth
    </Feature>
  </Feature>

{Relaxing the rule so that <states>s can have <characters>s as siblings}

  <Feature><Name>Leaf</Name>
    <Feature name="Presence" value="Present"/>
    <Feature Name="marginal toothing">margins
      <Value>serrate</Value>
      with
      <Feature Name="tooth orientation">
        <State>forward-pointing</State>
        teeth
      </Feature>
    </Feature>
  </Feature>

Your proposal is:

  <character name="leaf">
    <state>present</state>
    <character name="leaf margin">
      <state>serrate</state>
      <character name="tooth orientation">
        <state>forward-pointing</state>
      </character>
    </character>
  </character>

The score:

1. Can we parse from these the data atoms "leaf = present", "leaf margin = serrate" and "tooth orientation = forward-pointing"?
   Kevin's = Yes
   Steve's = Yes

2. Can we easily retrieve from these the original natural language string?
   Kevin's = Yes
   Steve's = No

On this scoring I'm one up. Then again, yours would be slightly easier to parse than mine, so we're probably equal. What's most important here? Dunno.

Further, it seems to me that yours is a subset of mine: a Schema that allowed mine would also allow yours, but not vice versa.

| In my original, DELTA-centric model I used a <description> tag to try and
| capture the text description information separate from the <state>
| information. My thinking was that these two
| requirements/approaches/viewpoints are too distinct to cram together without
| falling into the same trap as the current DELTA Standard (which is a
| least-common denominator approach).

Yes we could tag the bits of free-form text. But is there any need? They will (by definition) be ignored by all processors except for natural-language ones. Since the NL is fully retrievable from my model, why not leave them untagged? In your model, they would need to be tagged since the model does not represent a natural description - it represents abstracted data from which a description can be +/- created.

| The problem here is that the phrase "Leaf margins serrate with
| forward-pointing teeth" concerns 3 characters (leaf, margin and teeth) and 1
| implied and 2 expressed states (present, serrate and forward pointing) with
| the characters being dependent (and therefore the context containing
| significant information - we know that 'teeth' have something to do with
| 'serrate' which has something to do with 'leaf margins' - the leaves being
| present because we're describing them). There's a lot of logic involved in
| parsing this. I can't think of a simple way of representing all this
| complex information without separating it at some level. Kevin's suggestion
| represents the text description and mine the underlying data, but neither
| works well for the other.

I agree - there's still too much complex logic even in the very simple types of examples we're using so far. We need somehow to step back further to even more basic examples to tease these issues out.

Cheers - k
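As a rough check on the two scoring questions in this message, the Python sketch below parses the first markup variant with the standard library's xml.etree.ElementTree. It is only an illustration under stated assumptions: the Presence feature is written as an empty element so that the fragment is well-formed, and the names KEVIN_MARKUP, natural_language and data_atoms are hypothetical, not part of any proposed standard.

```python
import xml.etree.ElementTree as ET

# First variant from the message; Presence is ASSUMED to be an empty element.
KEVIN_MARKUP = """
<Feature><Name>Leaf</Name>
  <Feature name="Presence" value="Present"/>
  <Feature Name="marginal toothing">margins <Value>serrate</Value></Feature>
  with
  <Feature Name="tooth orientation"><State>forward-pointing</State> teeth</Feature>
</Feature>
"""


def natural_language(root):
    """Concatenate every text node in document order and collapse whitespace."""
    return " ".join("".join(root.itertext()).split())


def data_atoms(root):
    """Collect (feature, value) pairs; attribute names are matched case-insensitively."""
    subject = root.findtext("Name", default="").strip().lower()
    atoms = []
    for feature in root.iter("Feature"):
        attrs = {key.lower(): value for key, value in feature.attrib.items()}
        if "value" in attrs:
            # Attribute-scored feature such as Presence: report it against the subject.
            atoms.append((subject, attrs["value"].lower()))
            continue
        for child in feature:
            if child.tag in ("Value", "State"):
                atoms.append((attrs.get("name", subject), child.text.strip()))
    return atoms


root = ET.fromstring(KEVIN_MARKUP)
print(natural_language(root))
print(data_atoms(root))
```

Under these assumptions both questions come out "Yes" for the mixed-content form: the untagged words survive as text nodes, so the sentence can be rebuilt without any extra tagging, while the atoms come from the tagged parts only.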

Re: It's How the Data will be Used that Counts
by Kevin Thiele 04 Dec '01

----- Original Message -----
From: "Steve Shattuck" <Steve.Shattuck(a)CSIRO.AU>
To: <TDWG-SDD(a)USOBI.ORG>
Sent: Tuesday, December 04, 2001 11:31 AM
Subject: It's How the Data will be Used that Counts

| A couple of points:
|
| Kevin suggests a rule "<states>s cannot have <characters>s as siblings".
| But this is what DELTA calls Dependencies, it represents the state that
| controls a character. This would seem to be a good thing (and may be very
| important).

No, this is not equivalent to dependencies, although at first sight it seems similar. It's a tree-topology rule: the rule "<states>s can have <characters>s as siblings" allows certain topologies for the data-representation trees that would be precluded by the rule "<states>s cannot have <characters>s as siblings". As I said, I don't know whether the rule's a good or necessary one or not.

Dependencies establish relationship rules between different parts of the tree. Sometimes these rules affect parent-sibling relationships as in the example, but sometimes not - that is, dependencies are more complex than can be handled by a topology rule.

In dependencies, there may be several controlling states for a character, and these may or may not occur on nodes antecedent to the node in question. An example (from a key of mine):

the character:

  Trap structures (on carnivorous plants)
    submerged or underground bladder-traps
    pitcher-traps
    sticky hairs
    irritable leaf-blade segments

has controlling states in several other characters, viz:

  Nutritional strategy
    carnivorous
    parasitic*
    neither carnivorous nor parasitic*

  Habit
    tree*
    shrub*
    terrestrial herb
    aquatic herb

In the key, if a user chooses a state marked * the character "Trap Structures" will disappear because of the dependencies. Note that the dependency on tree and shrub is not a logical one but a contextual one - it just happens that no trees or shrubs are carnivorous, so the dependency's useful but not logically necessary.

I don't think these dependency relationships can be expressed merely by the topology of the tree, as you suggest. We will need to deal with dependency rules as a separate issue.

Cheers - k
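One way to picture dependency rules kept apart from the tree is a flat list of (controlling character, controlling state, dependent character) records. The Python sketch below is a minimal illustration using the starred states from the key example; the Dependency class, the applicable_characters function and the "kind" labels are hypothetical, and labelling the Nutritional strategy rules "logical" is an assumption (the message only states that the tree and shrub rules are contextual).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dependency:
    controlling_character: str
    controlling_state: str
    dependent_character: str
    kind: str  # "logical" or "contextual" (hypothetical labels)


# Starred states from the key example; each one makes "Trap structures" inapplicable.
DEPENDENCIES = [
    # ASSUMPTION: treated as logical; the message does not classify these two.
    Dependency("Nutritional strategy", "parasitic", "Trap structures", "logical"),
    Dependency("Nutritional strategy", "neither carnivorous nor parasitic",
               "Trap structures", "logical"),
    # Stated in the message to be contextual rather than logical.
    Dependency("Habit", "tree", "Trap structures", "contextual"),
    Dependency("Habit", "shrub", "Trap structures", "contextual"),
]


def applicable_characters(all_characters, chosen_states):
    """Return the characters still worth asking about, given the states chosen so far."""
    hidden = {
        dep.dependent_character
        for dep in DEPENDENCIES
        if chosen_states.get(dep.controlling_character) == dep.controlling_state
    }
    return [name for name in all_characters if name not in hidden]


CHARACTERS = ["Nutritional strategy", "Habit", "Trap structures"]
print(applicable_characters(CHARACTERS, {"Habit": "tree"}))
print(applicable_characters(CHARACTERS, {"Habit": "aquatic herb"}))
```

Because the rules refer to characters by name rather than by position, several controlling states in unrelated branches can govern one character, which a parent-child topology rule alone could not express.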

Re: It's How the Data will be Used that Counts
by Kevin Thiele 04 Dec '01

Thanks Steve - at last we have some alternatives we can sink our teeth into. Comments below.

----- Original Message -----
From: "Steve Shattuck" <Steve.Shattuck(a)CSIRO.AU>
To: <TDWG-SDD(a)USOBI.ORG>
Sent: Tuesday, December 04, 2001 11:31 AM
Subject: It's How the Data will be Used that Counts

| I've been giving Kevin's approach some thought and have the following
| comments:
|
| Kevin's original information flow model is too simplistic. A more realistic
| model would be something like:
|
| Text Descriptions |                         | Text Descriptions
| Phylogenetics     |---> Structured Data --->| Phylogenetics
| Specimens         |                         | Identification Tools
| (many)

Yes, of course. There are varied sources for the structured data. It still seems to me that capturing the non-text sources will probably be a subset of what's needed to capture the text sources. This is because textual descriptions are probably the least formally structured data we need to deal with as input (with the exception of original observations which, in some taxonomists' minds at least, are highly structureless but are readily structurable).

| The sources are much more varied and are often group-specific. For example,
| invertebrates have very few good quality text descriptions (most are old,
| are in a range of languages (English, French, German, etc), vary greatly in
| style, quality, etc. etc) and the majority of invertebrates are currently
| undescribed (having 80% new taxa during a revision is common).

Yes I agree, a botany bias is showing through here.

| Similarly, the outputs required vary greatly and in ways hard to predict.
| While text descriptions would seem to be a common requirement, they are in
| some ways "legacy" and may become less important in the future as
| applications (and users) become more sophisticated. We need to make sure we
| keep this range of uses in mind at all times.

Yes, but see comment above.

| Because of this I don't really think the details of the model
| matter too much, more that it is rich enough to represent all data of
| interest.

Exactly my point - it needs to be rich enough to capture and express a textual description, hopefully losslessly!

| ********
|
| I've also been thinking about Kevin's latest example:
|
| "Leaf margins serrate with forward-pointing teeth"
| <feature name="leaf">
|   <feature name="margin">
|     <feature name="teething shape">
|       <value>serrate</value>
|     </feature>
|     <feature name="teeth orientation">with
|       <value>forward-pointing</value>teeth
|     </feature>
|   </feature>
| </feature>
|
| First, it seems to me that "feature" is what taxonomists call "character"
| and "value" is "state". Being a traditionalist I'll switch back to this
| common terminology:

The terminology is +/- trivial at this stage, but I'll explain that I chose something different from character/state simply to break with tradition for a while. Traditionally, a character has states and that's it - a 2-level tree. In the example above one character (leaf) has as child another character (margin). This seems odd to many people thinking traditionally about characters/states. Let's agree that we'll use them interchangeably for now.

(Other points have been split into separate emails)

Cheers - k

Re: It's How the Data will be Used that Counts
by unknown＠example.com 04 Dec '01

Perfect agreement with Bob? Maybe he's starting to rub off on me!

>For this reason, it is probably unnecessary to represent the state
>`present' at all, provided the semantics could reasonably require that
>a feature which is absent is never described. Is that a reasonable
>requirement?

I think this is probably true. I can only come up with two potential problems:

(1) In this case the state "present" is implied but this may not always be the case when states are not specified. For example:

  <character name="distribution">
    <character name="country">
      <state>Australia</state>
    </character>
  </character>

A default state of "present" doesn't make much sense for the character "distribution". This may not be a serious problem but we should give it some thought.

(2) I think the bigger problem is the flip-side: if a character isn't in the description then it must be "absent". We can't do it this way.

>To me, the main thing that this kind of model implies is the need, in
>some cases, to provide a thesaurus, e.g. to provide advice that if a
>character (here `margin') has a subcharacter `teeth' then it may be
>described as `serrate'. Is that bad? Or would a purely textual
>description which just said "Leaf margins with forward-pointing teeth"
>be deemed wrong absent the word "serrate" ?

Adding a thesaurus would add way too much work and removing "serrate" would be too restrictive.

>I think natural language parsing understanding is harder than natural
>language production from structure. So I think there is less work to
>go from data to description than the other way around.

If true (and I think it is) then this would suggest a more DELTA-like structure and less of a text-markup structure for the representation.

Steve
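Point (2) can be made concrete with a small Python sketch: a lookup that reports a missing character as "not recorded" instead of inferring "absent". The <description> wrapper element, the status function and its return strings are hypothetical, introduced only for illustration.

```python
import xml.etree.ElementTree as ET

# The distribution example, wrapped in a HYPOTHETICAL <description> element.
DESCRIPTION = """
<description>
  <character name="distribution">
    <character name="country">
      <state>Australia</state>
    </character>
  </character>
</description>
"""


def status(root, character_name):
    """Report what a description says about a character without guessing at absence."""
    node = root.find(f".//character[@name='{character_name}']")
    if node is None:
        # Missing from the data: could be absent from the taxon or simply unscored.
        return "not recorded"
    state = node.findtext("state")  # direct child <state> only
    if state is None:
        return "described (no explicit state)"
    return state.strip()


root = ET.fromstring(DESCRIPTION)
for name in ("distribution", "country", "leaf"):
    print(name, "->", status(root, name))
```

Keeping "not recorded" distinct from "absent" leaves room for an explicit absent state to be scored where an author really means it.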

Re: It's How the Data will be Used that Counts
by Una Smith 04 Dec '01

Kevin Thiele wrote:

>> I seem to be in a minority of one here again, but I'll continue to argue my
>> case for a bit longer.

Go for it. Defensiveness is a waste of everyone's time.

>> 2. If we can effectively embed fully parsable data in a natural-language
>> paragraph, why not?

Jim Croft wrote:

>because it creates mixed content (legal, but evil) where the data is
>parsable, but not structured.
[...]
>In fact, DELTA already does this sort of thing by allowing liberal
>appending and prepending <freeform comments> all over the place. While
>this makes for quasireadable descriptions, authors often embed
>interesting character data in the comments making it unusable by other
>applications, even within the DELTA suite.
[...]

These <freeform comments> are natural-language paragraphs embedded in otherwise fully parsable data, which is the opposite of Kevin's case above. And much easier to handle, I think.

I put interesting data in the comments for an extremely important reason: I know I don't have enough data to represent the probable diversity in the taxon (the item), and I do not want to understate the diversity. Better to leave the character coded as unknown, but record the data. At some point in the future, the data can be moved from the comments to where it belongs.

--
Una Smith
Los Alamos National Laboratory, Mailstop K-710, Los Alamos, NM 87545

It's How the Data will be Used that Counts
by unknown＠example.com 04 Dec '01

I've been giving Kevin's approach some thought and have the following comments:

Kevin's original information flow model is too simplistic. A more realistic model would be something like:

Text Descriptions |                         | Text Descriptions
Phylogenetics     |---> Structured Data --->| Phylogenetics
Specimens         |                         | Identification Tools
(many)

The sources are much more varied and are often group-specific. For example, invertebrates have very few good quality text descriptions (most are old, are in a range of languages (English, French, German, etc), vary greatly in style, quality, etc. etc) and the majority of invertebrates are currently undescribed (having 80% new taxa during a revision is common).

Similarly, the outputs required vary greatly and in ways hard to predict. While text descriptions would seem to be a common requirement, they are in some ways "legacy" and may become less important in the future as applications (and users) become more sophisticated. We need to make sure we keep this range of uses in mind at all times.

The Structured Data is the format we're talking about here. Don't know what that will look like yet (but see below).

I think it's important to realise that applications will be used to move among these in almost all cases and only very rarely will people manipulate the data directly. It's also important to remember that XSLT is nothing more than an application. Some comments seem to imply that XSLT is part of XML, but it isn't. For example, from a BioLink perspective (and DELTA and probably LucID) the above model will need to be extended by adding:

Text Descriptions |                         | Text Descriptions
Phylogenetics     |---> Structured Data --->| Phylogenetics
Specimens         |            |            | Identification Tools
(many)                         |
                          |<-- | BioLink
                          |--> | DELTA
                               | LucID Builder

That is, applications will import Structured Data, manipulate it and spit it out again. Because of this I don't really think the details of the model matter too much, more that it is rich enough to represent all data of interest.

********

I've also been thinking about Kevin's latest example:

"Leaf margins serrate with forward-pointing teeth"

  <feature name="leaf">
    <feature name="margin">
      <feature name="teething shape">
        <value>serrate</value>
      </feature>
      <feature name="teeth orientation">with
        <value>forward-pointing</value>teeth
      </feature>
    </feature>
  </feature>

First, it seems to me that "feature" is what taxonomists call "character" and "value" is "state". Being a traditionalist I'll switch back to this common terminology:

"Leaf margins serrate with forward-pointing teeth"

  <character name="leaf">
    <character name="margin">
      <character name="teething shape">
        <state>serrate</state>
      </character>
      <character name="teeth orientation">
        <state>forward-pointing</state>
      </character>
    </character>
  </character>

A couple of points:

Kevin suggests a rule "<states>s cannot have <characters>s as siblings". But this is what DELTA calls Dependencies, it represents the state that controls a character. This would seem to be a good thing (and may be very important).

Kevin's representation is too focused on text descriptions. A more complete representation might be:

  <character name="leaf">
    <state>present</state>
    <character name="leaf margin">
      <state>serrate</state>
      <character name="tooth orientation">
        <state>forward-pointing</state>
      </character>
    </character>
  </character>

This allows us to directly extract:

  leaf = present
  leaf margin = serrate
  tooth orientation = forward-pointing

This will be important for both identification tools and phylogenetics. Trying to recover this information from Kevin's representation should be possible but will require a number of assumptions be made about the data. This representation also captures dependencies (although this is an advanced feature we shouldn't be talking about yet).

In my original, DELTA-centric model I used a <description> tag to try and capture the text description information separate from the <state> information. My thinking was that these two requirements/approaches/viewpoints are too distinct to cram together without falling into the same trap as the current DELTA Standard (which is a least-common denominator approach).

The problem here is that the phrase "Leaf margins serrate with forward-pointing teeth" concerns 3 characters (leaf, margin and teeth) and 1 implied and 2 expressed states (present, serrate and forward pointing) with the characters being dependent (and therefore the context containing significant information - we know that 'teeth' have something to do with 'serrate' which has something to do with 'leaf margins' - the leaves being present because we're describing them). There's a lot of logic involved in parsing this. I can't think of a simple way of representing all this complex information without separating it at some level. Kevin's suggestion represents the text description and mine the underlying data, but neither works well for the other.

Two steps forward, one step back. Sorry about that.

Thanks, Steve
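The "directly extract" step for the nested character/state form can be sketched in a few lines of Python with xml.etree.ElementTree. This is only an illustration of the idea; the function names character_states and nesting are hypothetical, and reading each nested character as controlled by its parent's state is one possible interpretation of "captures dependencies".

```python
import xml.etree.ElementTree as ET

STEVE_MARKUP = """
<character name="leaf">
  <state>present</state>
  <character name="leaf margin">
    <state>serrate</state>
    <character name="tooth orientation">
      <state>forward-pointing</state>
    </character>
  </character>
</character>
"""


def character_states(root):
    """List (character, state) pairs in document order."""
    pairs = []
    for character in root.iter("character"):
        state = character.findtext("state")  # looks at direct children only
        if state is not None:
            pairs.append((character.get("name"), state.strip()))
    return pairs


def nesting(node, parent=None):
    """List each nested character with the (character, state) it sits under."""
    found = []
    name = node.get("name")
    if parent is not None:
        found.append((name, parent))
    for child in node.findall("character"):
        found.extend(nesting(child, (name, node.findtext("state"))))
    return found


root = ET.fromstring(STEVE_MARKUP)
for name, state in character_states(root):
    print(f"{name} = {state}")
print(nesting(root))
```

No text survives outside the state elements in this form, so the pairs are easy to read off, but the original sentence cannot be rebuilt from them without a separate generation step.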

It's How the Data will be Used that Counts
by Robert A. (Bob) Morris 03 Dec '01

Wow, Steve, we are virtually in perfect agreement! A first! I have one small quibble---you actually raised it yourself though---and some suggestions.

Steve Shattuck writes:
> ...
> Kevin's representation is too focused on text descriptions. A more complete
> representation might be:
>
> <character name="leaf">
>   <state>present</state>
>   <character name="leaf margin">
>     <state>serrate</state>
>     <character name="tooth orientation">
>       <state>forward-pointing</state>
>     </character>
>   </character>
> </character>
>
> This allows us to directly extract:
>
> leaf = present
> leaf margin = serrate
> tooth orientation = forward-pointing
>
> The problem here is that the phrase "Leaf margins serrate with
> forward-pointing teeth" concerns 3 characters (leaf, margin and teeth) and 1
> implied and 2 expressed states (present, serrate and forward pointing) with
> the characters being dependent (and therefore the context containing
> significant information - we know that 'teeth' have something to do with
> 'serrate' which has something to do with 'leaf margins' - the leaves being
> present because we're describing them).

For this reason, it is probably unnecessary to represent the state `present' at all, provided the semantics could reasonably require that a feature which is absent is never described. Is that a reasonable requirement? For example, in such a case to extract from the XML all taxa which have a leaf character in their description is not any harder for lacking <state>present</state>. In fact it is a trifling bit easier. The desire for a `present' state perhaps comes from table-based character-by-state organization where it could be hard to distinguish whether the character is absent from the taxon or absent from the data. That distinction can be made moot here, perhaps. `presence' may be the only such state though.

> There's a lot of logic involved in
> parsing this. I can't think of a simple way of representing all this
> complex information without separating it at some level.

If we assume that there is a rigorous semantics to the effect that syntactically nested characters are always logically nested (and here we may need to return to "feature" if "character" is too dear to overload), is there a problem with this:

  <character name="leaf">
    <character name="margin">
      <character name="teeth">
        <character name="orientation">
          <state>forward-pointing</state>
        </character>
      </character>
    </character>
  </character>

To me, the main thing that this kind of model implies is the need, in some cases, to provide a thesaurus, e.g. to provide advice that if a character (here `margin') has a subcharacter `teeth' then it may be described as `serrate'. Is that bad? Or would a purely textual description which just said "Leaf margins with forward-pointing teeth" be deemed wrong absent the word "serrate" ?

> Kevin's suggestion
> represents the text description and mine the underlying data,

agreed

> but neither works well for the other.

I think natural language parsing understanding is harder than natural language production from structure. So I think there is less work to go from data to description than the other way around.

>
> Two steps forward, one step back. Sorry about that.

Nah, at most 1/2 step back. Is your middle name Zeno?

>
> Thanks, Steve
>

Bob Morris
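The claim that finding every taxon with a leaf character needs no explicit `present' state can be sketched in Python with xml.etree.ElementTree. Everything beyond the nested leaf example is hypothetical: the <descriptions> and <taxon> wrappers, the taxon names, the second taxon's stem data, and the function names taxa_with and scored_paths are made up for illustration.

```python
import xml.etree.ElementTree as ET

# HYPOTHETICAL wrapper and taxa; only the nested leaf example comes from the message.
DESCRIPTIONS = """
<descriptions>
  <taxon name="Taxon A">
    <character name="leaf">
      <character name="margin">
        <character name="teeth">
          <character name="orientation">
            <state>forward-pointing</state>
          </character>
        </character>
      </character>
    </character>
  </taxon>
  <taxon name="Taxon B">
    <character name="stem">
      <character name="surface"><state>hairy</state></character>
    </character>
  </taxon>
</descriptions>
"""


def taxa_with(root, character_name):
    """Names of taxa whose description mentions the character at any depth."""
    query = f".//character[@name='{character_name}']"
    return [t.get("name") for t in root.iter("taxon") if t.find(query) is not None]


def scored_paths(node, prefix=()):
    """Flatten nested characters into ("a/b/c", state) pairs."""
    pairs = []
    for child in node.findall("character"):
        path = prefix + (child.get("name"),)
        state = child.findtext("state")
        if state is not None:
            pairs.append(("/".join(path), state.strip()))
        pairs.extend(scored_paths(child, path))
    return pairs


root = ET.fromstring(DESCRIPTIONS)
print(taxa_with(root, "leaf"))
print(scored_paths(root.find("taxon")))
```

Here the mere existence of a character element stands in for presence, and the nesting path carries the context (leaf, margin, teeth) that a thesaurus could later map onto a word such as "serrate".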
