From kevin.thiele at BIGPOND.COM Fri Nov 30 12:57:23 2001
From: kevin.thiele at BIGPOND.COM (Kevin Thiele)
Date: Fri, 30 Nov 2001 12:57:23
Subject: Alternate Hierarchies
Message-ID:

I agree. I've added an issue to challenge case 6 to address this.
6. Remote and local taxon lists
  • remote referencing of hierarchies from master lists e.g. ITIS
  • local definition of hierarchies
  • efficient rearrangement of hierarchies with knowledge changes
  • alternate hierarchies
 

Cheers - k

----- Original Message -----
From: "Jim Croft" <jrc at ANBG.GOV.AU>
To: <TDWG-SDD at USOBI.ORG>
Sent: Friday, November 30, 2001 1:15 AM
Subject: Re: Document vs. database


| > In DELTA, Lucid etc we express alternate hierarchies in alternate documents.
| > Do we need to do better than this?
|
| Let's stop this line of thought right now...
|
| build one hierarchy of characters that works first, THEN consider how to
| manage alternatives...
|
| maybe...
|
| jim
|
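
Items b to d of challenge case 6 can be pictured with a small sketch. This is only an illustration under stated assumptions: the taxon names, the hierarchy labels `master` and `local`, and the helper `members` are all hypothetical and do not come from the list discussion. The idea shown is that the taxon list is kept flat, and each hierarchy is a separate child-to-parent mapping over it, so rearranging a hierarchy means changing one mapping entry rather than rewriting a document.

```python
# Hypothetical taxa; the names are placeholders, not real data.
TAXA = ["Genus A", "Genus B", "Genus C"]

# Each hierarchy is its own child -> parent mapping over the same flat list.
# "master" stands in for a remotely referenced list, "local" for a local view.
HIERARCHIES = {
    "master": {"Genus A": "Family X", "Genus B": "Family X", "Genus C": "Family Y"},
    "local": {"Genus A": "Family X", "Genus B": "Family Y", "Genus C": "Family Y"},
}


def members(hierarchy, parent):
    """List the taxa placed under `parent` in the named hierarchy."""
    placement = HIERARCHIES[hierarchy]
    return [taxon for taxon in TAXA if placement.get(taxon) == parent]


print(members("master", "Family X"))
print(members("local", "Family X"))
```

Whether alternate hierarchies should be handled at all before a single hierarchy works is exactly the point Jim raises above; the sketch only shows that the two can share one taxon list.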

From kevin.thiele at BIGPOND.COM Fri Nov 30 12:48:10 2001
From: kevin.thiele at BIGPOND.COM (Kevin Thiele)
Date: Fri, 30 Nov 2001 12:48:10
Subject: Character hierarchies
Message-ID:

----- Original Message -----
From: "Jim Croft"
To:
Sent: Friday, November 30, 2001 1:03 AM
Subject: Re: Document vs. database

| Also, in our new model we also want to show more than just presence
| or absence of a state. Don't we want to show if it is present/rarely
| present/present by misinterpretation/rarely present by
| misinterpretation, etc., and my favourites yet to be implemented as
| a character state attribute: definitely absent, absent by
| misinterpretation, unknown, unscored

Of course these will come in. In my example, see:

persistent (rarely deciduous)

The other qualifiers (misinterps etc) aren't in there yet, but will come. We'll have to be much more sophisticated than in my example, of course.

Now how about "definitely known to be unknown"

| > I'm simply suggesting allowing n levels:
| >
| >
| >
| >
| >
| >
| >
|
| That is where we want to get to, but echoing Bob's words, what we
| are after is the structure that allows us to get to that, giving
| people all the appearance of freedom to do what they like but not
| actually doing it and imposing a schematic straitjacket on the data.
| A schema for the data structure rather than the descriptive data itself.
| Like all freedoms, the descriptive data one too must be an illusion or
| we will miss out on all the creative potential that it offers as the
| free spirits among us do their own thing. But that is another
| discussion... :)

Sorry, I made a blue with the example. It should have been:

ovate elliptic blue red

| There has been a lot of talk about the feature/value paradigm and how
| this might be made to represent a biological description and even nested
| features in such descriptions...
|
| At the recent TDWG meeting Richard Pankhurst described something like a
| feature/character/value paradigm and at the time I made a note that this
| was probably worth considering in more detail, but so far have not had
| the time.
|
| It probably would mean something like:
|
| or some

But I can't see that "margin" is a thing different in kind from "leaf". Sure, the margin is part of a leaf but equally the leaf is part of the plant, so if you have feature|character|value then you will end up with "leaf" as a feature in one context and a character in another - messy, surely.

My example above just has Feature|value. The difference is that a feature is a thing but a value is not (both "leaf" and "margin" are things that can be pointed at, "ovate" and "blue" are not). In the XML tree representation above features are nodes and values are twigs (or leaves if you like - of course, a "twig" would be a node, not a twig (or a leaf) - oh dear, can someone invent a language that isn't a language!)

Ditto for Jim's second example:

|
|
|
|
|
|
|

Try interpolating another level in here (e.g. marginal teeth shape) and you get into strife. Whereas with

We can surely add as many levels as we like.

Or can we - here's a challenge - find the simplest way of expanding description 1: "Leaf margin serrate" into description 2: "Leaf margin serrate with forward-pointing teeth".

Attempt 1.

"Leaf margin serrate"

serrate

"Leaf margins serrate with forward-pointing teeth"

serrate with forward-pointing teeth

This requires moving from the node to a new interpolated node, so as to keep a rule that values cannot have features as siblings. An alternative:

Attempt 2.

"Leaf margins serrate with forward-pointing teeth"

serrate forward-pointing

breaks this rule. Attempt 2 seems preferable to me in terms of efficiency, but the rule seems somehow important.
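
The markup for the two attempts did not survive in the archive, so the following is only a guess at their shape, offered to make the sibling rule concrete. The element names (`Leaf`, `Margin`, `Shape`, `Teeth`, `Orientation`) and the function `mixed_nodes` are assumptions for illustration, not the tags actually posted. The sketch reads the rule as "no element may hold both a text value and child features", and reports which elements break it.

```python
import xml.etree.ElementTree as ET

# Assumed reconstruction of Attempt 1: the value "serrate" is pushed down
# into an interpolated <Shape> node so that it has no feature siblings.
ATTEMPT_1 = """
<Leaf>
  <Margin>
    <Shape>serrate</Shape>
    <Teeth><Orientation>forward-pointing</Orientation></Teeth>
  </Margin>
</Leaf>
"""

# Assumed reconstruction of Attempt 2: the value "serrate" stays directly
# under <Margin>, next to the <Teeth> feature.
ATTEMPT_2 = """
<Leaf>
  <Margin>serrate
    <Teeth>forward-pointing</Teeth>
  </Margin>
</Leaf>
"""


def mixed_nodes(xml_text):
    """Return the tags of elements holding both text values and child elements."""
    root = ET.fromstring(xml_text)
    offenders = []
    for elem in root.iter():
        has_children = len(elem) > 0
        # Text can sit before the first child (elem.text) or after any child (tail).
        has_text = bool((elem.text or "").strip()) or any(
            (child.tail or "").strip() for child in elem
        )
        if has_children and has_text:
            offenders.append(elem.tag)
    return offenders


print(mixed_nodes(ATTEMPT_1))
print(mixed_nodes(ATTEMPT_2))
```

Under these assumptions the first attempt passes and the second is flagged at `Margin`, which is the trade-off described above: the shorter markup is the one that mixes a value with a feature.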
Of course, restricting ourselves to a 2-level hierarchy a la Lucid and DELTA makes these problems go away, but only because we artificially collapse the levels down and lose the structure:

"Leaf margins serrate with forward-pointing teeth"

serrate forward-pointing

I'm getting lost here - can someone help? We've +/- discussed this before, but we need more thinking around real examples such as these.

Cheers - k

From kerrybarringer at BBG.ORG Fri Nov 30 11:08:16 2001
From: kerrybarringer at BBG.ORG (Barringer, Kerry)
Date: Fri, 30 Nov 2001 11:08:16
Subject: Process: Forum software?
Message-ID:

Bob,

If the group can keep up the current level of activity, a forum would be a great idea. They are really a much more flexible way to work together.

Kerry

From kerrybarringer at BBG.ORG Fri Nov 30 11:01:25 2001
From: kerrybarringer at BBG.ORG (Barringer, Kerry)
Date: Fri, 30 Nov 2001 11:01:25
Subject: Thought and a Question
Message-ID:

Thoughts on Leigh's Question

At this stage, I am not worried about the differences in the XML vocabularies used by different people. For at least a little while we should explore the alternatives and see what the different models and different encodings have to offer. I have found that other people's work has really helped me to see better solutions to some of the problems and sometimes helped me think of new ideas to try.

I think that after a while, the alternatives will become set and the group will need to have a means of selecting among the alternatives. The challenge cases should provide a way to evaluate the effectiveness of the different solutions. Then the group can come up with a proposal, invite comments, and work out the first version.

How to present the different solutions to the challenge cases is a problem that should be worked out now. If this was decided at the last meeting, let us know. Otherwise, it seems that many people have their own ideas of the proper way to present solutions.

My ideas are:

An XML data model is, for me, a DTD or a Schema. The model is an abstract representation of the data. This is what we should be producing. Personally, I prefer DTDs because I am more familiar with them. They are easier to understand because they are simple and relatively free of markup. As Bob Morris has pointed out though, they do lack some of the features found in XML Schemas. However, Schemas are more difficult to understand, especially for people who might know the systematics, but not the XML.

A solution to a challenge case should have the model (DTD or Schema), markup of the challenge case that validates to the model, and a short program that solves the challenge using the markup. The last is important because it is too easy to develop a theoretically neat model that has practical problems. Also, if the group is to pick from the features of different alternatives it would be better to see how they work.

These are just my thoughts and I would like to understand what others are thinking and what may have already been decided at the last meeting.

Kerry

From ldodds at INGENTA.COM Fri Nov 30 09:27:35 2001
From: ldodds at INGENTA.COM (Leigh Dodds)
Date: Fri, 30 Nov 2001 9:27:35
Subject: Thought and a Question
Message-ID:

I'll start with the question:

Am I right in assuming that the process is to propose and agree on a set of challenges, and then present different solutions for solving these? If so, how should those solutions be presented?

Kevin and Kerry have both posted XML documents that meet Challenge 1, and Steve Shattuck and I have also provided XML formats that can encode similar data. How will we judge these formats as better or worse? I can imagine producing a near infinite number of different XML vocabularies that can encode the same data (+/- attributes, different tag names, nesting, cross-referencing, etc, etc).

I'd like to suggest, again, that the solutions for the challenges should be presented as simple data models.
Syntax can come later (and may fall naturally out of the model).

--

Now the thought:

Kevin has pointed out that there is a large amount of taxonomic data which is simply free text, and it is a worthwhile goal to consider how to repurpose this information. However, I think that that is a separate work item to defining a format for capturing that data. I'd actually interpreted Challenge 1 to mean "represent the data contained in these natural language descriptions", and was ignoring (for the present) how that data would be extracted (manual markup of the text description or direct data entry into a suitable tool).

Extracting information from free text is still far from being an automated process. If there is still going to be a manual element involved for some time, it seems better to have a standard and supporting tools to capture _new_ data in a rigorous format, while considering the markup and processing of 'legacy' data as a separate issue.

To approach this slightly differently, the example markup that Kevin has suggested adding to free text descriptions would be interpreted according to a standard data model :)

--

I hope this isn't interpreted as an attempt to derail the good progress being made (nice to see some discussion again! :) but in the 'XML world' we're seeing problems surface because of incompatibilities between specifications that have arisen due to slightly different interpretations of the XML data model. This is because the model came after the syntax.

L.

--
Leigh Dodds, Research Group, Ingenta | "Pluralitas non est ponenda
http://weblogs.userland.com/eclectic | sine necessitate"
http://www.xml.com/pub/xmldeviant    | -- William of Ockham

From jrc at ANBG.GOV.AU Fri Nov 30 08:23:22 2001
From: jrc at ANBG.GOV.AU (Jim Croft)
Date: Fri, 30 Nov 2001 8:23:22
Subject: Document vs. database
In-Reply-To: from "Peter Stevens" at Nov 29, 2001 11:13:53 AM
Message-ID:

> Am I missing something, but why do we keep on talking about states?
> This is the last thing I think is interesting (double entendre here...) States
> are frosting, observations - measurements, photos - make up the cake itself.

Is this a real issue or one of semantics? When I write "leaf shape ovate", does it mean I have observed the leaf and its shape most closely matches that of ovate, or that I have observed the shape character leaf and consider it to be in a state of ovateness? The end result is probably the same...

But I think the issue is worth pondering... Perhaps Richard did not have it quite right and perhaps we should be recording zero or more observations which are made of a feature that has a certain character, one of the attributes (rather than the state) of which is present/absent/misinterpreted, etc.

Maybe something like:

etc...

or maybe it would be better to invert it and define a feature up front and make one or more observations about it:

There are numerous ways to model this sort of stuff, and as has been repeatedly pointed out there is no right way to do it, there are only ways. The challenge is to find one that is functional, comprehensive, flexible, extensible, pragmatic and elegant. And one that we can all agree on...

> Probably getting old.

as are we all... and we have to get this damn thing working before our use-by dates are up... :)

jim

From ram at CS.UMB.EDU Fri Nov 30 07:12:37 2001
From: ram at CS.UMB.EDU (Robert A. (Bob) Morris)
Date: Fri, 30 Nov 2001 7:12:37
Subject: Process: Forum software?
Message-ID:

DON'T REPLY TO THE LIST:

As I mentioned in Sydney, we have now established an instance of the vBulletin web-based forum software (see www.vbulletin.com) and are prepared to host a forum here for the current discussions if there is any desire to do so.

Forums provide better threading than email (it is not based solely on the Subject field) and also permit participants to consider things somewhat more at their own schedule and according to their own interests.
The board administrators can set how long postings remain at the forefront, so that it is easier to understand what is current. vBulletin allows you to be notified briefly by mail when something of interest appeared on the forum.

There are a few disadvantages to moving this discussion now: (a) Quite a bit has transpired and people might feel they need to restate things in the forum [that could be an advantage....]; (b) we are new to this software and might not administer it as seamlessly as a mail-based system.

So as not to clutter the list, I would urge people to /privately/ reply to me with their sympathy for or against the issue. I'll summarize and post the results and then people can discuss it, if warranted. Send mail to ram at cs.umb.edu. Your "Reply" facility will likely send to the list.

For a glimpse of how vBulletin forums operate, see those at

http://vbulletin.com/forum/index.php
http://forums.progresstalk.com/index.php
http://www.linuxquestions.org/questions/
http://www.ntcompatible.com/vb/index.php
http://www.webdevforums.com/

As far as I can tell, vBulletin is the software under the forum parts of sourceforge.net

Bob Morris

p.s. Forum software is aimed at community building, so vBulletin has lots of flashy features like smiley face annotations, personal avatars, polls about the quality of a thread, "buddy lists", etc. I don't expect that SDD forums would operate with that stuff in your face, especially as we gain administrative experience.

From jrc at ANBG.GOV.AU Fri Nov 30 01:15:31 2001
From: jrc at ANBG.GOV.AU (Jim Croft)
Date: Fri, 30 Nov 2001 1:15:31
Subject: Document vs. database
In-Reply-To: <005001c178a4$9e2cf280$62058690@presario> from "Kevin Thiele" at Nov 29, 2001 06:04:08 PM
Message-ID:

> In DELTA, Lucid etc we express alternate hierarchies in alternate documents.
> Do we need to do better than this?

Let's stop this line of thought right now...

build one hierarchy of characters that works first, THEN consider how to manage alternatives...

maybe...

jim

From jrc at ANBG.GOV.AU Fri Nov 30 01:03:10 2001
From: jrc at ANBG.GOV.AU (Jim Croft)
Date: Fri, 30 Nov 2001 1:03:10
Subject: Document vs. database
In-Reply-To: <005001c178a4$9e2cf280$62058690@presario> from "Kevin Thiele" at Nov 29, 2001 06:04:08 PM
Message-ID:

I have been watching the animated discussion on the revived SDD list and it is good to see... too busy to comment, but still interested, and useful things are being said by others... :)

> Existing data structures allow 2-level hierarchies e.g.
> #1. Leaf shape/
>    1. ovate/
>    2. elliptic/
> #2. Flower colour/
>    1. blue/
>    2. red/

Is that really two levels of character or is it only one with a state attribute?

Also, in our new model we also want to show more than just presence or absence of a state. Don't we want to show if it is present/rarely present/present by misinterpretation/rarely present by misinterpretation, etc., and my favourites yet to be implemented as a character state attribute: definitely absent, absent by misinterpretation, unknown, unscored

> I'm simply suggesting allowing n levels:
>
>
>
>
>
>
>

That is where we want to get to, but echoing Bob's words, what we are after is the structure that allows us to get to that, giving people all the appearance of freedom to do what they like but not actually doing it and imposing a schematic straitjacket on the data. A schema for the data structure rather than the descriptive data itself. Like all freedoms, the descriptive data one too must be an illusion or we will miss out on all the creative potential that it offers as the free spirits among us do their own thing. But that is another discussion... :)

There has been a lot of talk about the feature/value paradigm and how this might be made to represent a biological description and even nested features in such descriptions...
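
To make the comparison between the two-level list quoted above and n-level nesting concrete, here is a minimal sketch. It assumes a nested dictionary as a stand-in for the n-level feature tree; the names `TREE` and `flatten` are hypothetical, and the markup actually proposed on the list is not reproduced. The sketch collapses the nested features back into DELTA-like character/state pairs, which is the direction in which no information has to be invented.

```python
# Assumed n-level feature tree: inner dicts are features, lists hold state names.
TREE = {
    "Leaf": {"Shape": ["ovate", "elliptic"]},
    "Flower": {"Colour": ["blue", "red"]},
}


def flatten(tree, path=()):
    """Collapse nested features into two-level (character name, states) pairs."""
    pairs = []
    for name, child in tree.items():
        here = path + (name,)
        if isinstance(child, dict):
            # A feature with sub-features: keep descending.
            pairs.extend(flatten(child, here))
        else:
            # A list of states: join the feature path into one character name.
            label = " ".join(here).capitalize()
            pairs.append((label, list(child)))
    return pairs


for number, (character, states) in enumerate(flatten(TREE), start=1):
    print(f"#{number}. {character}/")
    for index, state in enumerate(states, start=1):
        print(f"   {index}. {state}/")
```

Going the other way, from "Leaf shape" back to a `Leaf` feature containing a `Shape` feature, needs a decision about where to split the name, which is one reason the nesting question is more than cosmetic.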

At the recent TDWG meeting Richard Pankhurst described something like a feature/character/value paradigm and at the time I made a note that this was probably worth considering in more detail, but so far have not had the time.

It probably would mean something like:

or some equivalent using XML entities rather than attributes.

Is there any merit in this approach above using a feature that is say "leaf margin" and another that is say "leaf tip" and yet another that is say "leaf base"? It would seem to give some structure to the character set data. On the surface it would also allow easy generation of more readable descriptions without excessive text processing:

Leaf outline ovate, margin serrate, tip acute, base attenuate, etc.

as opposed to:

Leaf outline ovate, leaf margin serrate, leaf tip acute, leaf base attenuate, leaf... etc.

Has anyone else considered this approach?

jim

From kevin.thiele at BIGPOND.COM Thu Nov 29 18:04:08 2001
From: kevin.thiele at BIGPOND.COM (Kevin Thiele)
Date: Thu, 29 Nov 2001 18:04:08
Subject: Document vs. database
Message-ID:

----- Original Message -----
From: "Eric Zurcher"
To:
Sent: Thursday, November 29, 2001 3:32 PM
Subject: Document vs. database

| At 02:15 PM 28-11-01 +1100, Kevin Thiele wrote:
| >Issues
| >
| >I've taken the route of marking up a textual description, using a minimum of
| >tags. It seems to me that a description comprises a series of features with
| >values. I've used mixed markup because I wanted to have the minimum tagging
| >and make maximum use of the text. That is, in cases like:
| >
| >shrub
| >
| >I've made use of the fact that "shrub" occurs in the text by enclosing it in
| > tags. In cases like:
| >
| >Leaves
| >
| >I'm using a tag rather than a Name attribute
| >
| >This may make processing considerably more difficult, but it seems to me
| >that there are some advantages. An alternative would be to first process the
| >description into a DELTA-like structure then mark up that.
| >I chose not to go that route because I still have the dream that one day
| >we'll have a method of semi-automatically marking up the content of natural
| >language into something like the above. The other advantage of this
| >structure I believe is that it maintains all the complexity and nuance of
| >the original natural language (which can be retrieved merely by removing
| >all the tags). I prefer this to the DELTA-process (take a natural-language
| >description -> atomise it -> reconstruct semi-natural-language by putting
| >the atoms back together again in some fashion).
|
| My position on this is rather different, and I would prefer to see
| characters defined a bit more rigorously.
|
| As an example, suppose that the description of one taxon is given as:
|
| shrub
|
| A second as:
|
| bush
|
| And a third as:
|
| small woody plant
|
| And perhaps a fourth as:
|
| shrub
|
| Are these equivalent? There's certainly no obvious way to associate them,
| aside from knowing that "bush" and "shrub" mean pretty much the same thing
| in some dialects of English. I think the goal of trying to extract meaning
| by inserting a few tags into existing textual descriptions relies too
| heavily on a level of artificial intelligence that we're still unlikely to
| see for a couple of decades.

Good point, and there will clearly be a need for some type of processing somewhere along the line to catch and allow validation of these situations. But this is a processing issue, not a structural one. There's nothing in the DELTA data structure to prevent

#1. Life form/
   1. shrub/
   2. bush/
   3. small woody plant/
#2. Growth form/
   1. shrub/
   2. smallish twiggy plant/

and I argue there should be nothing in the structure of the SDD standard to prevent it either. People simply need to be aware of the issue. But the SDD does, I agree, need to be structured in such a way that validation of these situations is made possible.
The first-cut structure that I proposed does not allow much eyeball validation of this type - is this a problem?

| But pondering on this made me think about why Kevin and I often disagree on
| some of these issues. If I may greatly over-simplify our viewpoints, his
| view is (I think) that a formal description ought to be a document from
| which data can be extracted; whereas I tend to view a formal description
| more as a database from which text can be generated.
|
| So what ARE we after: the free-flowing flexibility of a document, or the
| rigour and precision of a database? Fortunately, there is a fair bit of
| middle ground. It's worth noting that XML was developed as a markup
| language for documents, but that its primary usage to date has been as a
| sort of portable and relatively light-weight data container. Still I think
| that finding the right balance between flexibility and rigour is going to
| be a major challenge in this exercise.

My view I suppose is that part of the power of XML (something that has never been possible in the past) is that it allows precise retrieval of atomised data from a fairly free-form document. XML blurs the distinction between database and text. Before XML a text string like:

Rigid, spreading shrub to c. 1m high and wide; stems glabrous. Leaves soon deciduous.....

was highly intractable to computer processing. This intractability required us to manually and substantially restructure into e.g. a DELTA file before we could really do much with it.

But there are problems I think with the DELTA process vis-a-vis natural language. In the relatively common case that we begin with a natural language description, the process is:

natural language ----(1)-----> DELTA -------(2)--------> natural language (etc)

where (1) is human processing and (2) is largely computer (CONFOR) processing. Two related problems here.
Firstly, I often don't much like the output of CONFOR (some people say we humans should put up with the limitations computers force on us, but call me old-fashioned, I rebel at that {see also Asimov's Second Law of Robotics}). The second problem is that step 2 is unidirectional - if I edit the output natural language I break the process (the edits are volatile, next time I invoke CONFOR they get overwritten).

So I'm trying to explore the new possibilities opened up by XML to produce a minimally restructured but highly parsable document. I'm happy to admit that I may be off-beam here. I just reckon we shouldn't discount the possibility.

| >Features are nested:
| >
| >Leaves
| > Shape
| >
| >
| >Is this allowable? In XML-Spy I can create a Schema OK for this document.
| >It's also well-formed.
|
| It's certainly allowable in XML. Is it desirable in a description?
| Possibly, but I think it relates to my question from yesterday about how
| hierarchies of characters ought to be expressed. Nesting is a very good way
| to express a single hierarchy, but not for handling alternative
| hierarchies. Which do people want?

Existing data structures allow 2-level hierarchies e.g.

#1. Leaf shape/
   1. ovate/
   2. elliptic/
#2. Flower colour/
   1. blue/
   2. red/

I'm simply suggesting allowing n levels:

I can't decide if this is a trivial or fundamental difference.

| Nesting is a very good way
| to express a single hierarchy, but not for handling alternative
| hierarchies.

In DELTA, Lucid etc we express alternate hierarchies in alternate documents. Do we need to do better than this?

Cheers - k

From kevin.thiele at BIGPOND.COM Thu Nov 29 17:16:28 2001
From: kevin.thiele at BIGPOND.COM (Kevin Thiele)
Date: Thu, 29 Nov 2001 17:16:28
Subject: Taxonomic hierarchy in SDD
Message-ID:

I agree with all this Eric :-)

| While canalization is a risk, there certainly is also merit in trying to
| retain the best features of "prior art".
| The DELTA format has persisted for
| over 20 years, which is nearly an eternity in the IT field. It must have
| been doing one or two things right. There's no need to follow it too
| closely, but certainly we can learn quite a bit from DELTA about what works
| well and what doesn't.

No-one has suggested ignoring prior art, be it DELTA or any of the other programs. I think we have agreed several times to start with +/- a blank slate, learning as you say quite a bit from DELTA, Lucid etc about what works and what doesn't along the way.

| >2. Of all the descriptions in the world, 99.9999999% of them are not in
| >DELTA. Probably 99.999% of them are textual (natural language) descriptions.
| >Surely this should form the basis of our first challenge, methinks.
|
| I think that's actually a rather harsh assessment of DELTA's uptake. Let's
| say there are about 4x10^6 known species. We'd like descriptions for them
| all, along with descriptions of higher taxonomic levels (genera, families).
| So the magnitude of total number of descriptions is about 10^7. I'd guess
| that the number of taxa with DELTA descriptions is of the magnitude of
| 10^5. So that means roughly 1% of all taxa already have a DELTA description.

I didn't stop to consider the maths, for which my apologies. My point was simply that only a small minority of descriptions are in any type of standardised format at present. If your estimations are correct (and I can't judge that) then there are still 99 non-DELTA descriptions for every DELTA one, or any other.

| >3. Related to 2 above, many (though by no means all) DELTA datasets are
| >already abstractions from the source (a set of natural language
| >descriptions). We should start with the source.
|
| The source? Surely the ultimate source is observations made on individual
| specimens. If we wish to start with the source, the first thing that needs
| to be done is to make sure that we have a system which can adequately
| describe a specimen in hand.

Yes, you're right about the ultimate source, but there are chains of sources, of course. The reason why I'm interested in the vast legacy of textual descriptions as source material is that I think that effective semi-automated processing of these descriptions along the lines of Pankhurst and Taylor is not too far off (some processing is here already, of course). We should make this easier rather than harder, if at all possible.

Capturing the truly ultimate direct observations is something that any Builder program supporting the new standard will have to be good at, of course.

| (Sorry if I get too defensive about DELTA. I just can't help myself...)

Please don't think that I'm too dismissive of DELTA. I just think we should be assuming that we can do a fair bit better.

Cheers - k

From Eric.Zurcher at PI.CSIRO.AU Thu Nov 29 15:32:33 2001
From: Eric.Zurcher at PI.CSIRO.AU (Eric Zurcher)
Date: Thu, 29 Nov 2001 15:32:33
Subject: Document vs. database
In-Reply-To: <019601c177bb$58114100$f5058690@presario>
Message-ID:

At 02:15 PM 28-11-01 +1100, Kevin Thiele wrote:
>Issues
>
>I've taken the route of marking up a textual description, using a minimum of
>tags. It seems to me that a description comprises a series of features with
>values. I've used mixed markup because I wanted to have the minimum tagging
>and make maximum use of the text. That is, in cases like:
>
>shrub
>
>I've made use of the fact that "shrub" occurs in the text by enclosing it in
> tags. In cases like:
>
>Leaves
>
>I'm using a tag rather than a Name attribute
>
>This may make processing considerably more difficult, but it seems to me
>that there are some advantages. An alternative would be to first process the
>description into a DELTA-like structure then mark up that. I chose not to go
>that route because I still have the dream that one day we'll have a method
>of semi-automatically marking up the content of natural language into
>something like the above.
>The other advantage of this structure I believe is
>that it maintains all the complexity and nuance of the original natural
>language (which can be retrieved merely by removing all the tags). I prefer
>this to the DELTA-process (take a natural-language description -> atomise
>it -> reconstruct semi-natural-language by putting the atoms back together
>again in some fashion).

My position on this is rather different, and I would prefer to see characters defined a bit more rigorously.

As an example, suppose that the description of one taxon is given as:

shrub

A second as:

bush

And a third as:

small woody plant

And perhaps a fourth as:

shrub

Are these equivalent? There's certainly no obvious way to associate them, aside from knowing that "bush" and "shrub" mean pretty much the same thing in some dialects of English. I think the goal of trying to extract meaning by inserting a few tags into existing textual descriptions relies too heavily on a level of artificial intelligence that we're still unlikely to see for a couple of decades.

But pondering on this made me think about why Kevin and I often disagree on some of these issues. If I may greatly over-simplify our viewpoints, his view is (I think) that a formal description ought to be a document from which data can be extracted; whereas I tend to view a formal description more as a database from which text can be generated.

So what ARE we after: the free-flowing flexibility of a document, or the rigour and precision of a database? Fortunately, there is a fair bit of middle ground. It's worth noting that XML was developed as a markup language for documents, but that its primary usage to date has been as a sort of portable and relatively light-weight data container. Still I think that finding the right balance between flexibility and rigour is going to be a major challenge in this exercise.

>Features are nested:
>
>Leaves
> Shape
>
>
>Is this allowable? In XML-Spy I can create a Schema OK for this document.
>It's also well-formed.

It's certainly allowable in XML. Is it desirable in a description? Possibly, but I think it relates to my question from yesterday about how hierarchies of characters ought to be expressed. Nesting is a very good way to express a single hierarchy, but not for handling alternative hierarchies. Which do people want?

Eric Zurcher
CSIRO Livestock Industries
Canberra, ACT Australia
E-mail: Eric.Zurcher at pi.csiro.au

From Eric.Zurcher at PI.CSIRO.AU Thu Nov 29 15:06:15 2001
From: Eric.Zurcher at PI.CSIRO.AU (Eric Zurcher)
Date: Thu, 29 Nov 2001 15:06:15
Subject: Taxonomic hierarchy in SDD
In-Reply-To: <019801c177bb$62ad5180$f5058690@presario>
Message-ID:

At 12:38 PM 28-11-01 +1100, Kevin Thiele wrote:
>Concerning Steve's exemplar drawn from a DELTA butterfly treatment: I've
>added a few challenges to the challenge cases (number 9 in the attached)
>specifically designed to address representation of DELTA datasets in the new
>standard. Personally, I don't think we should start with a DELTA data set as
>our first challenge, for several reasons:
>
>1. We have agreed several times that it's important not to be canalized by
>DELTA or any other existing representation, but to keep the existing
>representations in mind while working. Starting with XDELTA seems to me to
>increase the risk of canalization

While canalization is a risk, there certainly is also merit in trying to retain the best features of "prior art". The DELTA format has persisted for over 20 years, which is nearly an eternity in the IT field. It must have been doing one or two things right. There's no need to follow it too closely, but certainly we can learn quite a bit from DELTA about what works well and what doesn't.

>2. Of all the descriptions in the world, 99.9999999% of them are not in
>DELTA. Probably 99.999% of them are textual (natural language) descriptions.
>Surely this should form the basis of our first challenge, methinks.
I think that's actually a rather harsh assessment of DELTA's uptake. Let's say there are about 4x10^6 known species. We'd like descriptions for them all, along with descriptions of higher taxonomic levels (genera, families). So the order of magnitude of the total number of descriptions is about 10^7. I'd guess that the number of taxa with DELTA descriptions is of the order of 10^5. So that means roughly 1% of all taxa already have a DELTA description.

>3. Related to 2 above, many (though by no means all) DELTA datasets are
>already abstractions from the source (a set of natural language
>descriptions). We should start with the source.

The source? Surely the ultimate source is observations made on individual specimens. If we wish to start with the source, the first thing that needs to be done is to make sure that we have a system which can adequately describe a specimen in hand.

(Sorry if I get too defensive about DELTA. I just can't help myself...)

Eric Zurcher
CSIRO Livestock Industries
Canberra, ACT Australia
E-mail: Eric.Zurcher at pi.csiro.au

From kerrybarringer at BBG.ORG Wed Nov 28 14:47:45 2001
From: kerrybarringer at BBG.ORG (Barringer, Kerry)
Date: Wed, 28 Nov 2001 14:47:45
Subject: Challenge Case 1
Message-ID:

Attached are the files for a partial solution to the first Challenge case.

I have deliberately left out the XSLT code that capitalizes the first letter in a sentence and eliminates the comma before the period. The stuff I have written is awkward and embarrassing and I will try to come up with a more elegant way that can be used for general formatting of the nodes.

This DTD corrects some of the typos in the DTD posted yesterday. The XML file should validate against this DTD. I commented out the DOCTYPE line in the XML because the Xalan parser is fussy about how the file location is cited and I wanted one file to test with all the parsers. The XSLT code runs correctly using Microsoft, Sablotron, Saxon, and Xalan parsers in XMLCooktop.
The resulting html file is also attached. I had to do this because the code does not run properly under MS Internet Explorer 5.5. This could be a namespace problem. I will look into it.

Unfortunately, I did this coding before I saw Kevin Thiele's Nov 27 posting. That posting contains some good ideas about markup that I want to incorporate in this DTD. I especially like his handling of 'shrub' as a character state, not a name. Like Kevin, I nest characters. There are many small differences in the two approaches. The most notable is the handling of numeric characters and ranges. My approach may be too simple for anything but natural language output.

As a side challenge, I would like to try to develop code that would produce a tabular description from this markup. The table should have blank cells for missing data. This will help determine if it is necessary to code missing characters as part of the description, or if we can get by with only coding the characters we know. I think the latter approach is the most practical. This will also help determine if nested characters can be used.

Kerry

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Kerry Barringer (Curator of the Herbarium)
Herbarium                   718-623-7318 (office)
Brooklyn Botanic Garden     718-941-4774 (fax)
1000 Washington Avenue      718-623-7312 (herbarium)
Brooklyn, NY 11225-1099 U.S.A.   kbarringer at bbg.org
http://www.bbg.org/
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

------_=_NextPart_000_01C17845.8CA00B60
Content-Type: application/octet-stream; name="DescrKB112801.dtd"
Content-Disposition: attachment; filename="DescrKB112801.dtd"

------_=_NextPart_000_01C17845.8CA00B60
Content-Type: application/octet-stream; name="TDWGSDDtext.xsl"
Content-Disposition: attachment; filename="TDWGSDDtext.xsl"

TDWG-SDD Challenge Case 1

------_=_NextPart_000_01C17845.8CA00B60 Content-Type: text/html; name="TDWGtest1.html" Content-Transfer-Encoding: quoted-printable Content-Disposition: attachment; filename="TDWGtest1.html" TDWG-SDD Challenge Case 1

TDWG-SDD

Challenge Case 1


1. Discaria pubescens

shrub rigid, spreading, to c. 1m long, to c. 1 cm wide, .stems glabrous, .leaves soon deciduous (particularly on older plants). , +/- oblong, (4-)6-10(-15) mm long, 2-3 mm wide, .stipules dark reddish-brown, c. 1 mm long, often shallowly joined around the node, .spines stout, 1.5-4 cm long, .

2. Discaria nitida

shrub slender, to 5 m, .stems glabrous, .leaves rarely deciduous, elliptic, (8-)10-20(-30) mm long, 3-7 mm wide, glabrous, shining, .spines not developed at each node, to c. 1 cm long, .

------_=_NextPart_000_01C17845.8CA00B60 Content-Type: application/octet-stream; name="TDWGtest1.xml" Content-Transfer-Encoding: quoted-printable Content-Disposition: attachment; filename="TDWGtest1.xml" Discaria pubescens shrub rigid spreading to c. 1m to c. 1 cm stems glabrous leaves soon deciduous particularly on older = plants +/- oblong (4-)6-10(-15) mm 2-3 mm apex obtuse minutely mucronate or within an apical notch glabrous a few hairs present near = tip or stipules dark reddish-brown c. 1 mm often shallowly joined around the = node spines stout 1.5-4 cm Discaria nitida shrub slender to 5 m stems glabrous leaves rarely deciduous persistent elliptic obovate to (8-)10-20(-30) mm 3-7 mm glabrous shining spines not developed at each = node to c. 1 cm From kevin.thiele at BIGPOND.COM Wed Nov 28 14:15:48 2001 From: kevin.thiele at BIGPOND.COM (Kevin Thiele) Date: Wed, 28 Nov 2001 14:15:48 Subject: Challenge 1 Message-ID: Here's my first go at representing Challenge 1. The challenge: Rigid, spreading shrub to c. 1m high and wide; stems glabrous. Leaves soon deciduous (particularly on older plants), +/- oblong, (4-)6-10(-15) mm long, 2-3 mm wide, apex obtuse or minutely mucronate within an apical notch, glabrous or a few hairs present near tip; stipules dark reddish-brown, c. 1 mm long, often shallowly joined around the node; spines stout, 1.5-4 cm long. Discaria nitida Slender shrub, to 5 m high; stems glabrous. Leaves persistent (rarely deciduous), elliptic to obovate, (8-)10-20(-30) mm long, 3-7 mm wide, glabrous, shining; spines not developed at each node, to c. 1 cm long. The representation: Rigid, spreading shrub to c. 1m high and wide; stems glabrous. Leavessoon deciduous(particularly on older plants), +/- oblong, (4-)6-10(-15) mm long, 2-3 mm wide, obtuse or minutely mucronate within an apical notch, surfaces glabrous or a few hairs present near tip;stipulesdark reddish-brown, c. 1 mm long, often shallowly joined around the node spinesstout, 1.5-4cm long. 
Slender shrub to 5mhig; stems glabrous. Leavespersistent(rarelydeciduous), elliptic to obovate, (8-)10-20(-30) mm long, 3-7 mm wide, glabrous shining; spines, not developing at each node, to c. 1cm long or, if you prefer Rigid, spreading shrub to c. 1m high and wide ; stems glabrous. Leaves soon deciduous(particularly on older plants), +/- oblong, (4-)6-10(-15) mm long, 2-3 mm wide, obtuse or minutely mucronate within an apical notch, surfaces glabrous or a few hairs present near tip; stipules dark reddish-brown, c. 1 mm long, often shallowly joined around the node spines stout, 1.5-4cm long . Slender shrub to 5mhigh ; stems glabrous. Leaves persistent(rarelydeciduous), elliptic to obovate, (8-)10-20(-30) mm long, 3-7 mm wide, glabrous shining; spines, not developing at each node, to c. 1cm long Issues I've taken the route of marking up a textual description, using a minimum of tags. It seems to me that a description comprises a series of features with values. I've used mixed markup because I wanted to have the mimimum tagging and make maximum use of the text. That is, in cases like: shrub I've made use of the fact that "shrub" occurs in the text by enclosing it in tags. In cases like: Leaves I'm using a tag rather than a Name attribute This may make processing considerably more difficult, but it seems to me that there are some advantages. An alternative would be to first process the description into a DELTA-like structure then mark up that. I chose not to go that route because I still have the dream that one day we'll have a method of semi-automatically marking up the content of natural language into something like the above. The other advantage of this structure I believe is that it maintains all the complexity and nuance of the original natural language (which can be retrieved merely by removing all the tags). 
I prefer this to the DELTA-process (take a natural-language description -> atomise it -> reconstruct semi-natural-language by putting the atoms back together again in some fashion). Features are nested: Leaves Shape Is this allowable? In XML-Spy I can create a Schema OK for this document. It's also well-formed. In this example, there is not a 1:1 correspondence between the descriptions - e.g. the description for D. pubescens has while that for D. nitida does not. This is anathema to most of us, but is a real world case. We'll have to deal with it. In this example there is no character list against which character names etc can be validated (to e.g. pick up the above problem). This is something we need to deal with. Be gentle with me. Your mission, should you choose to accept it, is to show that I'm crazy and do better than this. This email will self-destruct in 5 seconds Cheers - k ------=_NextPart_000_018D_01C17817.2E446620 Content-Type: text/xml; name="Discaria.xml" Content-Transfer-Encoding: quoted-printable Content-Disposition: attachment; filename="Discaria.xml" Rigid, spreading=20 shrub to c. 1m high and = wide ;=20 stems = glabrous.=20 Leaves soon deciduous(particularly = on older plants),=20 +/- oblong,=20 (4-)6-10(-15) mm = long,=20 2-3 = mm wide,=20 obtuse or minutely = mucronate within an apical notch,=20 surfaces = glabrous or a few hairs present near tip; stipules dark reddish-brown, c. 1 mm long,=20 often shallowly joined around the = node =09 =09 spines stout,=20 1.5-4cm<= /Units> long . Slender=20 shrub to 5mhigh ;=20 stems = glabrous.=20 Leaves persistent(rarelydeciduous),=20 elliptic to = obovate,=20 (8-)10-20(-30) mm = long, 3-7 = mm wide,=20 glabrous = shining;=20 spines, not developing at each node,=20 to c. 
1cm long

From rousse at CCR.JUSSIEU.FR Wed Nov 28 13:51:53 2001
From: rousse at CCR.JUSSIEU.FR (Guillaume Rousse)
Date: Wed, 28 Nov 2001 13:51:53
Subject: Morphological Data Representation
In-Reply-To:
Message-ID:

Thus spoke Steve Shattuck:
[..]
> Guillaume commented that he's "always surprised seeing people recommend
> using proprietary stuff" in response to my suggestion of using Microsoft's
> XML Notepad (which, by the way, is actually free). The point I was trying
> to make was that XML can get complicated very quickly and using an XML view
> (of any sort) is better than using a text editor. Nothing more.

'Free' in free software refers to freedom, not price (think free speech, not free beer). XML Notepad is freeware (currently), not free software. I think the distinction here is really important.

As for the utility of a dedicated XML editor vs a text editor, it is definitely a question of personal taste. It all depends on the editor, how at ease you are with it, etc. As a die-hard vim user, I have never found a specialized XML editor that offered as much functionality.

--
Guillaume Rousse
GPG key http://lis.snv.jussieu.fr/~rousse/gpgkey.html

From kevin.thiele at BIGPOND.COM Wed Nov 28 12:42:01 2001
From: kevin.thiele at BIGPOND.COM (Kevin Thiele)
Date: Wed, 28 Nov 2001 12:42:01
Subject: Morphological Data Representation
Message-ID:

----- Original Message -----
From: "Steve Shattuck"
To:
Sent: Friday, November 23, 2001 2:33 PM
Subject: Morphological Data Representation

| Below and attached are a first attempt at representing simple, common
| DELTA-type data in an XML-based structure. I've used a selected set of
| characters and items from the Butterfly sample data on the DELTA web site.
| The DELTA-formatted data looks like this:
|
|
| ==== DELTA-Standard CHARS File ====
|
| *SHOW: Lepidoptera demonstration characters. Revised 28-AUG-91.
|
| *CHARACTER LIST
|
| #1. main colour of inner part of front wing/
| 1. white/
| 2. cream/
| 3. grey/
| 4.
brown/ | 5. black/ | 6. yellow/ | 7. orange/ | 8. blue/ | 9. green/ | | #2. wings / | 1. with transparent areas/ | 2. without transparent areas/ | | #3. length of front wing/ | mm/ | | #4. antennae / | times length of front wing/ | | | ==== DELTA-Standard ITEMS File ==== | | *SHOW: Lepidoptera demonstration items. Revised 18-OCT-94. | | *ITEM DESCRIPTIONS | | # Antheraea/ | 1,4 2,2 3,43-50 4,0.15-0.2 | | # Ethmia/ | 1,2-4 2,2 3,11-14 4,0.6-0.65 | | # Graphium/ | 1,1-2/9 2,2 3,29-33 4,0.45-0.5 | | # Hecatesia/ | 1,4 2,1/2 3,11-14 4,0.8-0.9 | | | ==== DELTA-Standard SPECS File ==== | | *SHOW: Lepidoptera demonstration specifications. Revised 28-AUG-91. | | *NUMBER OF CHARACTERS 4 | *MAXIMUM NUMBER OF STATES 9 | *MAXIMUM NUMBER OF ITEMS 4 | | *CHARACTER TYPES 3,RN 4,RN | | *NUMBERS OF STATES 1,9 | | | | | ==== For these files, the DELTA-generated natural language would look | something like this: | | Antheraea | Main colour of inner part of front wing brown. Wings without transparent | areas. Length of front wing 43-50 mm. Antennae 0.15-0.2 times length of | front wing. | | Ethmia | Main colour of inner part of front wing cream to brown. Wings without | transparent areas. Length of front wing 11-14 mm. Antennae 0.6-0.65 times | length of front wing. | | Graphium | Main colour of inner part of front wing white to cream, or green. Wings | without transparent areas. Length of front wing 29-33 mm. Antennae 0.45-0.5 | times length of front wing. | | Hecatesia | Main colour of inner part of front wing brown. Wings with transparent areas | (small, translucent window), or without | transparent areas. Length of front wing 11-14 mm. Antennae 0.8-0.9 times | length of front wing. | | | | | ==== Hand-generated natural language would be essentially the same except | for the last item, where it might look more like this: | | Hecatesia | Main colour of inner part of front wing brown. Wings with or without | transparent areas (when present, forming a small window). 
Length of front | wing 11-14 mm. Antennae 0.8-0.9 times length of front wing. | | | ============================================ | | I've translated this into the XML file that is attached. (Even this fairly | simple example is moderately large and I would recommend using an XML viewer | such as Microsoft's XML Notepad when working with it - see | http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnxml/html/ | xmlpaddownload.asp.) | | The basic structure of the attached file is: | | | | - ANY NUMBER | | | | | | | - ANY NUMBER | | | | | - ANY NUMBER | | | | - ANY NUMBER | | | - ANY NUMBER | | | | | Note that I've treated everything as elements and haven't used attributes. | Simplicity is the only reason for this and some elements would be better as | attributes; these can be converted when the dust settles. | | I've tried to generalise as much as possible and use only two main elements: | and . Each character is assigned a that tells what | it is (ordered multistate, unordered multistate, real, integer, etc). | Similarly the item can be specified as taxon or specimen (or | potentially something else). | | A couple of points are probably worth making: | | The and are used to support DELTA | comments. This probably needs to be generalised further to support any | number of alternate phrasings. | | is used for the units of numeric characters and isn't | needed (?) for other character types - it's an attempt at keeping | general. | | The element in is used to house natural | language representations. Codes for the states (when needed) are placed in | square brackets, these being translated during generation. As noted above, | this may been to be generalised to support any number of phrasings. | | In , is used to hold numeric values, the | being used to define what the number means (minimum, maximum, etc., | rather than using placement in the attribute string as in the DELTA | standard). This element won't (?) be needed for multistate characters. 
| | I think/hope the remainder is fairly clear. | | | ** The Next Step ** | | I would suggest the following path from here: | | 1) Make sure the above representation makes sense for the data given. | | 2) Expand the above data to support LucID-specific requirements (without | adding additional complexity). | | Once this is finished we can: | | Add additional DELTA features (dependencies, default values, etc.) | | Add more complex data sets and examples | | Add new features on our assorted "wish lists" | | | I look forward to comments and forward progress! | | | Thanks, Steve | | Steve Shattuck | CSIRO Entomology | biolink at ento.csiro.au | | From kevin.thiele at BIGPOND.COM Wed Nov 28 12:38:51 2001 From: kevin.thiele at BIGPOND.COM (Kevin Thiele) Date: Wed, 28 Nov 2001 12:38:51 Subject: Taxonomic hierarchy in SDD Message-ID: Can I propose that we note but set aside the Taxonomic Hierarchy thread for the moment. In the challenge cases, Challenge 7 will begin to address this issue (in this challenge we will attempt to represent descriptions from a nested set of descriptions e.g. family (genus (species (specimens)))). When this comes up, we will need to address both hierarchy as a way of handling inheritance, and also alternate hierarchies (differing taxonomic opinions). But surely we need to work out how to represent a single description before we tackle a nested set. Concerning Steve's exemplar drawn from a DELTA butterfly treatment: I've added a few challenges to the challenge cases (number 9 in the attached) specifically designed to address representation of DELTA datasets in the new standard. Personally, I don't think we should start with a DELTA data set as our first challenge, for several reasons: 1. We have agreed several times that it's important not to be canalized by DELTA or any other existing representation, but to keep the existing representations in mind while working. Starting with XDELTA seems to me to increase the risk of canalization 2. 
Of all the descriptions in the world, 99.9999999% of them are not in DELTA. Probably 99.999% of them are textual (natural language) descriptions. Surely this should form the basis of our first challenge, methinks. 3. Related to 2 above, many (though by no means all) DELTA datasets are already abstractions from the source (a set of natural language descriptions). We should start with the source. This time I've attached the attachment. Sometime soon we'll put this up on the web instead of flying it around as an attachment. Cheers - k ------=_NextPart_000_0180_01C17809.A2844220 Content-Type: text/html; name="SDD TDWG 2001.htm" Content-Transfer-Encoding: quoted-printable Content-Disposition: attachment; filename="SDD TDWG 2001.htm" Basic description of two taxa

SDD Challenge Cases

The challenge cases below represent specific classes of problems that the SDD Standard needs to address. The exemplar for each case provides a chunk or fragment of descriptive data that minimally exemplifies the problem; the challenge is to validly represent the data under the evolving standard. The challenges are listed in the approximate order in which they will be worked through.

Challenge 1: Represent basic descriptions of two taxa from one (natural language) source

Exemplar:

Discaria pubescens

Rigid, spreading shrub to c. 1m high and wide; stems glabrous. Leaves soon deciduous (particularly on older plants), +/- oblong, (4-)6-10(-15) mm long, 2-3 mm wide, apex obtuse or minutely mucronate within an apical notch, glabrous or a few hairs present near tip; stipules dark reddish-brown, c. 1 mm long, often shallowly joined around the node; spines stout, 1.5-4 cm long.

Discaria nitida

Slender shrub, to 5 m high; stems glabrous. Leaves persistent (rarely deciduous), elliptic to obovate, (8-)10-20(-30) mm long, 3-7 mm wide, glabrous, shining; spines not developed at each node, to c. 1 cm long.
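
The measurements in the exemplar, such as "(4-)6-10(-15) mm", pack a typical range and optional extreme values into one string. As a rough illustration only, the Python sketch below splits such a string into its parts. The names `Range` and `parse_range` are invented for this example and are not part of any proposed standard, and approximate values such as "c. 1 mm" are deliberately not handled.

```python
import re
from typing import NamedTuple, Optional


class Range(NamedTuple):
    """Hypothetical container for one measurement; not part of any standard."""
    extreme_min: Optional[float]
    low: float
    high: float
    extreme_max: Optional[float]
    unit: str


_NUMBER = r"\d+(?:\.\d+)?"
_RANGE = re.compile(
    rf"^\s*(?:\((?P<emin>{_NUMBER})-\))?"   # optional "(4-)"
    rf"(?P<low>{_NUMBER})"                  # typical minimum, or single value
    rf"(?:-(?P<high>{_NUMBER}))?"           # optional "-10"
    rf"(?:\(-(?P<emax>{_NUMBER})\))?"       # optional "(-15)"
    r"\s*(?P<unit>[A-Za-z]+)\s*$"           # unit such as "mm" or "cm"
)


def parse_range(text: str) -> Range:
    """Split a string like '(4-)6-10(-15) mm' into its numeric parts."""
    match = _RANGE.match(text)
    if match is None:
        raise ValueError(f"not a recognised range: {text!r}")
    low = float(match["low"])
    # A single value such as "5 m" is treated as a range of zero width.
    high = float(match["high"]) if match["high"] else low
    emin = float(match["emin"]) if match["emin"] else None
    emax = float(match["emax"]) if match["emax"] else None
    return Range(emin, low, high, emax, match["unit"])


if __name__ == "__main__":
    for sample in ("(4-)6-10(-15) mm", "2-3 mm", "1.5-4 cm"):
        print(sample, "->", parse_range(sample))
```

Keeping the extremes separate from the typical range means a later step could decide whether to print them, ignore them, or use them in a key.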

Issues:

  • missing data (e.g. stipule characters for D. nitida)
  • modifiers (e.g. leaves of D. nitida rarely deciduous)
  • freeform comments (e.g. D. pubescens "particularly on older plants")
  • structured data vs marked-up text (e.g. let's say that stipule arrangement will not be a character - how to retain the data "stipules often shallowly joined around the node")
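
To make the first three issues concrete, here is a minimal Python sketch, offered only as one possible way of looking at them. It stores just the features that are actually stated for each taxon, keeps modifiers and free-form comments next to the value, and leaves a blank cell when two descriptions are lined up and one of them lacks a feature. The feature names ("leaves/persistence", "stipules/colour") and the names `Value`, `render` and `tabulate` are assumptions made for the example, not proposals for the standard.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Value:
    """One stated value, with an optional modifier and free-form comment."""
    text: str
    modifier: Optional[str] = None   # e.g. "rarely"
    comment: Optional[str] = None    # e.g. "particularly on older plants"


# A description maps a feature name to the values stated for it.
# Features that the source does not mention are simply absent.
Description = Dict[str, List[Value]]

DESCRIPTIONS: Dict[str, Description] = {
    "Discaria pubescens": {
        "leaves/persistence": [
            Value("soon deciduous", comment="particularly on older plants")
        ],
        "stipules/colour": [Value("dark reddish-brown")],
    },
    "Discaria nitida": {
        "leaves/persistence": [
            Value("persistent"),
            Value("deciduous", modifier="rarely"),
        ],
        # no stipule data in the source, so nothing is recorded here
    },
}


def render(value: Value) -> str:
    """Turn one value back into a short phrase."""
    text = f"{value.modifier} {value.text}" if value.modifier else value.text
    return f"{text} ({value.comment})" if value.comment else text


def tabulate(descriptions: Dict[str, Description]) -> List[List[str]]:
    """Build a feature-by-taxon table; missing data become empty cells."""
    features = sorted({f for d in descriptions.values() for f in d})
    rows = [["feature"] + list(descriptions)]
    for feature in features:
        row = [feature]
        for description in descriptions.values():
            row.append("; ".join(render(v) for v in description.get(feature, [])))
        rows.append(row)
    return rows


if __name__ == "__main__":
    for row in tabulate(DESCRIPTIONS):
        print(" | ".join(row))
```

In this arrangement the blank stipule cell for D. nitida falls out of the lookup, so missing characters need not be coded explicitly; whether that is acceptable for the standard is exactly what the challenge is meant to test.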

 

2. Basic description of one taxon, two sources, basic markup

  • referencing
  • ascription

 

3. One treatment, several contributing authors

  • flagging of data
  • ownership
  • maintenance
  • privileges

 

4. Define basic character list

  • structure of character lists
  • character types

 

5. Remote vs local character list

  • lexicons

 

6. Remote vs local taxon lists

  • master lists

 

7. Several descriptions at different levels (e.g. family (genus (species (specimens))))

  • nesting of taxon(/character?) lists
  • data collation
  • data inheritance
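
As a hedged illustration of what nesting, collation and inheritance might mean in practice, the Python sketch below keeps the hierarchy in its own mapping, separate from the scored data, so that a different hierarchy could be swapped in without touching the descriptions. The names `PARENT`, `SCORES`, `inherited` and `collate` are invented for the example, and scoring "habit" at genus level is an assumption made purely for illustration.

```python
from typing import Dict, Optional

# The hierarchy is held apart from the descriptive data (child -> parent).
PARENT: Dict[str, str] = {
    "Discaria pubescens": "Discaria",
    "Discaria nitida": "Discaria",
}

# Scored data per taxon; the genus-level score is illustrative only.
SCORES: Dict[str, Dict[str, str]] = {
    "Discaria": {"habit": "shrub"},
    "Discaria pubescens": {"stems": "glabrous", "leaf persistence": "soon deciduous"},
    "Discaria nitida": {"stems": "glabrous", "leaf persistence": "persistent"},
}


def inherited(taxon: str, feature: str,
              parent: Dict[str, str],
              scores: Dict[str, Dict[str, str]]) -> Optional[str]:
    """Inheritance: walk up the hierarchy until the feature is scored."""
    node: Optional[str] = taxon
    while node is not None:
        if feature in scores.get(node, {}):
            return scores[node][feature]
        node = parent.get(node)
    return None


def collate(taxon: str,
            parent: Dict[str, str],
            scores: Dict[str, Dict[str, str]]) -> Dict[str, str]:
    """Collation: keep only the values shared by every direct child."""
    children = [child for child, p in parent.items() if p == taxon]
    if not children:
        return {}
    shared = dict(scores.get(children[0], {}))
    for child in children[1:]:
        child_scores = scores.get(child, {})
        shared = {f: v for f, v in shared.items() if child_scores.get(f) == v}
    return shared


if __name__ == "__main__":
    print(inherited("Discaria nitida", "habit", PARENT, SCORES))
    print(collate("Discaria", PARENT, SCORES))
```

Passing a different `parent` mapping to the same functions is one way an alternative classification could be applied to unchanged descriptions.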

 

8. Universal treatment, several outputs (natural language, NEXUS, interactive key)

  • flagging/layering of data elements

 

9. Representation of legacy structured descriptive data

  • DELTA data sets
  • Lucid data sets
  • DELTA Access data sets
  • other formats

 

From Eric.Zurcher at PI.CSIRO.AU Wed Nov 28 09:54:06 2001
From: Eric.Zurcher at PI.CSIRO.AU (Eric Zurcher)
Date: Wed, 28 Nov 2001 9:54:06
Subject: Taxonomic hierarchy in SDD
In-Reply-To: <15363.3039.179472.115188@u11.cs.umb.edu>
Message-ID:

One additional point to consider is that hierarchies arise not only in the classification of items (taxa or specimens), but also in the classification of the characters used to describe those items. And along both the item and character "axes", there will often be multiple, alternative hierarchies which reflect different purposes or philosophies.

For example, in assembling an interactive key to plants, one might well want to group characters dealing with leaves separately from characters dealing with the inflorescence. And perhaps a trickier "hierarchy" of characters is one that might be used to describe the rules for "natural language" generation, aggregating characters into phrases, sentences and paragraphs.

Does anyone see a good generalized mechanism for handling the creation and maintenance of hierarchies (or other groupings)?

Here are a few thoughts off the top of my head: Because of the need for handling alternative hierarchies for different purposes, it would seem logical to keep the "classification" separate from the "core" data (going back to DELTA, this was one reason why "directives" files were used - it provided a mechanism for changing groupings without touching the core data). But this approach comes at a significant cost: if the "core" data is changed, it will probably require synchronized changes in the various hierarchy descriptions which reference that data. Maintaining consistency becomes a problem. Perhaps the way out is to make use of unique global identifiers (GUIDs), as Steve Shattuck has suggested.

At 10:43 PM 26-11-01 -0500, Bob Morris wrote:
>We may not be arguing here.
I don't dispute the need for this, only
>argue that there should not be a standard for taxonomy, but rather a
>standard for how to specify taxonomy. Or any other hierarchy for that
>matter.
>
>Doing this would allow you to get the hierarchy from /any/ suitable
>source, including but not limited to the data source itself. To me it
>seems that the least brittle thing for data inheritance is to use the
>same model as for datatype inheritance, i.e. a separate "schema" to
>which reference is made, as Tim Jones argued.

Eric Zurcher
CSIRO Livestock Industries
Canberra, ACT Australia
E-mail: Eric.Zurcher at pi.csiro.au

From ram at CS.UMB.EDU Tue Nov 27 18:31:06 2001
From: ram at CS.UMB.EDU (Robert A. (Bob) Morris)
Date: Tue, 27 Nov 2001 18:31:06
Subject: Taxonomic hierarchy in SDD
In-Reply-To: <5.1.0.14.0.20011128091733.00ab91d0@pop.pi.csiro.au>
Message-ID:

Here's another need for Use Cases. Oops. I mean data challenges.

I suspect it is not profitable to approach this problem without detailed examples. Those examples might illuminate whether the problems of hierarchy rearrangement can simply be handled by XSLT, as probably can the examples you give.

Bob Morris

Eric Zurcher writes:
> Date: Wed, 28 Nov 2001 09:54:06 +1100
> From: Eric Zurcher
> To: TDWG-SDD at usobi.org
> Subject: Re: Taxonomic hierarchy in SDD
>
> One additional point to consider is that hierarchies arise not only in the
> classification of items (taxa or specimens), but also in the classification
> of the characters used to describe those items. And along both the item and
> character "axes", there will often be multiple, alternative hierarchies
> which reflect different purposes or philosophies.
>
> For example, in assembling an interactive key to plants, one might well
> want to group characters dealing with leaves separately from characters
> dealing with the inflorescence.
And perhaps a trickier "hierarchy" of > characters is one that might be used to describe the rules for "natural > language" generation, aggregating characters into phrases, sentences and > paragraphs. > > Does anyone see a good generalized mechanism for handling the creation and > maintenance of hierarchies (or other groupings)? > > Here are a few thoughts off the top of my head: Because of the need for > handling alternative hierarchies for different purposes, it would seem > logical to keep the "classification" separate from the "core" data (going > back to DELTA, this was one reason why "directives" files were used - it > provided a mechanism for changing groupings without touching the core > data). But this approach comes at a significant cost: if the "core" data is > changed, it will probably require synchronized changes in the various > hierarchy descriptions which reference that data. Maintaining consistency > becomes a problem. Perhaps the way out is to make use of unique global > identifiers (GUIDs), as Steve Shattuck has suggested. > > At 10:43 PM 26-11-01 -0500, Bob Morris wrote: > >We may not be arguing here. I don't dispute the need for this, only > >argue that there should not be a standard for taxonomy, but rather a > >standard for how to specify taxonomy. Or any other hierarchy for that > >matter. > > > >Doing this would allow you to get the hiearchy from /any/ suitable > >source, including but not limited to the data source itself. To me it > >seems that the least brittle thing for data inheritance is to use the > >same model as for datatype inheritance, i.e. a separate "schema" to > >which reference is made, as Tim Jones argued. 
> >
> Eric Zurcher
> CSIRO Livestock Industries
> Canberra, ACT Australia
> E-mail: Eric.Zurcher at pi.csiro.au
>

From Steve at Tue Nov 27 16:51:05 2001
From: Steve at (Steve at )
Date: Tue, 27 Nov 2001 16:51:05
Subject: Taxonomic hierarchy in SDD
Message-ID:

Sounds fine, store the hierarchy but keep it separate from the items. No worries.

The idea of supporting multiple hierarchies for different inheritances is interesting and I can see its potential importance. I'll need to test it out here in "biology land" to see if it makes sense in practice. I believe that the classification is tied very closely to taxon concepts, and that descriptions are certainly tied to concepts. If you change the concept you almost always change the description, and so knowing the classification can be important in understanding what the author is up to.

Steve

From tim.jones at MARINE.CSIRO.AU Tue Nov 27 14:28:54 2001
From: tim.jones at MARINE.CSIRO.AU (Tim Jones)
Date: Tue, 27 Nov 2001 14:28:54
Subject: Taxonomic hierarchy in SDD
Message-ID:

I am not sure I follow here but I do agree it is definitely wise to understand the thoughts of a taxonomist ...

I am not an expert in this area by any means but it would seem to me that whether you choose to accept or reject someone's description and whether you want to accept their taxonomic classification are two different things (albeit closely related). Could it be that they should be modelled separately? That way it is possible to store a description without having to adopt the classification structure in the data. Or is it that these two concepts are just so tightly coupled that it is not needed?

Just a thought...
not of a taxonomist Cheers -----Original Message----- From: Steve Shattuck [mailto:Steve.Shattuck at CSIRO.AU] Sent: Tuesday, 27 November 2001 2:08 PM To: TDWG-SDD at USOBI.ORG Subject: Re: Taxonomic hierarchy in SDD If a dataset includes the descriptions of two families and 4 genera but doesn't tell you which genera belong to which family you will be forced to get this information from someplace else (e.g. ITIS). If the dataset is based on a different arrangement from ITIS you don't know this because the dataset didn't tell you because it's not part of the standard. In this case the data won't make sense because the family descriptions need to be a superset of the genera which belong to them. I think this is a pretty basic problem if we want to support hierarchical data (taxa at different taxonomic ranks). Again, if you don't want to follow the author of the dataset then you are free to ignore the suggested classification - but I would strongly suggest that you better know what the author is thinking and ignore all of her data if you don't agree with it, not accept the descriptions while rejecting the classification. Steve From Steve at Tue Nov 27 14:07:35 2001 From: Steve at (Steve at ) Date: Tue, 27 Nov 2001 14:07:35 Subject: Taxonomic hierarchy in SDD Message-ID: If a dataset includes the descriptions of two families and 4 genera but doesn't tell you which genera belong to which family you will be forced to get this information from someplace else (e.g. ITIS). If the dataset is based on a different arrangement from ITIS you don't know this because the dataset didn't tell you because it's not part of the standard. In this case the data won't make sense because the family descriptions need to be a superset of the genera which belong to them. I think this is a pretty basic problem if we want to support hierarchical data (taxa at different taxonomic ranks). 
Again, if you don't want to follow the author of the dataset then you are free to ignore the suggested classification - but I would strongly suggest that you better know what the author is thinking and ignore all of her data if you don't agree with it, not accept the descriptions while rejecting the classification. Steve From nickl at CALM.WA.GOV.AU Tue Nov 27 14:05:52 2001 From: nickl at CALM.WA.GOV.AU (Lander, Nicholas) Date: Tue, 27 Nov 2001 14:05:52 Subject: Taxonomic hierarchy in SDD Message-ID: This raises the old argument for scoring _data_ in our databases rather than information derived from those data. This is to say that your system should at least allow of scoring descriptive data at specimen level rather than at the conceptual level of species (or other taxa). Thus the descriptions of taxa could be derived "just in time", ie on the fly when required. Redetermination of a given suite of specimens would result in all the relevant descriptions, keys, and other products being, in effect, dynamic. This will be vital for projects which are institutional or international in scope. Nicholas Lander WA Herbarium (PERTH) -----Original Message----- From: Steve Shattuck [mailto:Steve.Shattuck at CSIRO.AU] Sent: Tuesday, November 27, 2001 1:51 PM To: TDWG-SDD at USOBI.ORG Subject: Re: Taxonomic hierarchy in SDD Sounds fine, store the hierarchy but keep it separate from the items. No worries. The idea of supporting multiple hierarchies for different inheritances is interesting and I can see its potential importance. I'll need to test it out here in "biology land" to see if it makes sense in practice. I believe that the classification is tied very closely to taxon concepts, and that descriptions are certainly tied to concepts. If you change the concept you almost always change the description, and so knowing the classification can be important in understanding what the author is up to. 
Steve

From tim.jones at MARINE.CSIRO.AU Tue Nov 27 13:46:22 2001
From: tim.jones at MARINE.CSIRO.AU (Tim Jones)
Date: Tue, 27 Nov 2001 13:46:22
Subject: Taxonomic hierarchy in SDD
Message-ID:

On the taxonomic hierarchy thread - I agree wholeheartedly that the hierarchy should not be part of the description. Could it rather be a recommended link to a classification? We often come across the situation where one person has a different view of the hierarchy to another. If the hierarchy is not part of the descriptive data but can be referred to from an alternative source, would it then be possible to use one (or another) source depending on your point of view? That way you could follow, say, the ITIS classification if it suits or, if the framework is published, any other hierarchy that supports the framework interface.

Tim

---------------------------------------------------------
Database Manager
Centre for Research on Introduced Marine Pests
GPO Box 1538, Hobart Tas 7000, Australia.
Phone : (03) 62325222 (switch), (03) 62325213 (direct)
Mobile: 0411 560057
Fax : (03) 62325485
E-mail: tim.jones at marine.csiro.au
---------------------------------------------------------

-----Original Message-----
From: Steve Shattuck [mailto:Steve.Shattuck at CSIRO.AU]
Sent: Tuesday, 27 November 2001 1:36 PM
To: TDWG-SDD at USOBI.ORG
Subject: Taxonomic hierarchy in SDD

The full taxonomic hierarchy of the included taxa/items certainly must be supported by the standard (and it will be addressed after we deal with simple characters, states and items). If the creator of the dataset doesn't think it's important then they can choose to leave it out; if the user of the data doesn't think it's important then they can ignore it. This will be especially important if we intend to support inheritance and compilation up and down the classification (as has been suggested by several of us).
Jumping the gun a bit, I would think these relationships would be stored either as a separate nested set of elements with ID's linking to specific items, or the items themselves would be nested with parent items containing their children. Steve Shattuck CSIRO Entomology biolink at ento.csiro.au From Steve at Tue Nov 27 13:35:31 2001 From: Steve at (Steve at ) Date: Tue, 27 Nov 2001 13:35:31 Subject: Taxonomic hierarchy in SDD Message-ID: The full taxonomic hierarchy of the included taxa/items certainly must be supported by the standard (and it will be addressed after we deal with simple characters, states and items). If the creator of the dataset doesn't think it's important then they can choose to leave it out; if the user of the data doesn't think it's important then they can ignore it. This will be especially important if we intend to support inheritance and compilation up and down the classification (as has been suggested by several of us). Jumping the gun a bit, I would think these relationships would be stored either as a separate nested set of elements with ID's linking to specific items, or the items themselves would be nested with parent items containing their children. Steve Shattuck CSIRO Entomology biolink at ento.csiro.au From peter.stevens at MOBOT.ORG Tue Nov 27 12:57:11 2001 From: peter.stevens at MOBOT.ORG (Peter Stevens) Date: Tue, 27 Nov 2001 12:57:11 Subject: Taxonomic hierarchy in SDD In-Reply-To: Message-ID: A non-text attachment was scrubbed... Name: not available Type: text/enriched Size: 4198 bytes Desc: not available Url : http://lists.tdwg.org/pipermail/tdwg-content/attachments/20011127/4accfc4e/attachment.bin From kerrybarringer at BBG.ORG Tue Nov 27 11:18:56 2001 From: kerrybarringer at BBG.ORG (Barringer, Kerry) Date: Tue, 27 Nov 2001 11:18:56 Subject: DTD for Description Message-ID: Attached is an attempt at a usuable DTD for taxonomic descriptions. It is modified from a Taxonomic Document Markup based largely on DELTA format data. 
I modified the original based on the mailing list contributions of Steve Shattuck, L. Dodds, and G. Rousse. The most visible difference is the use of attributes to hold information without real-world meaning. This is based on current modelling techniques and is explained by Rousse in his posting of Nov. 23 much more clearly than I could explain it. There are also many attributes derived from the DELTA format which may not be applicable.

I would like to omit the attribute for order from the standard. If each taxon/specimen, character, and state is uniquely identified, then ordering is more flexibly handled in XSL. This would allow those who want to order based on alphabet, a character state (as in field guides), inferred relationships, etc. to do as they please without changing the data.

Because of the nature of XML, each description will be part of a higher element, usually either a specimen or a taxon. It is these entities that are ordered, not the descriptions themselves. Within a description, the ordering and selection of characters is easiest with XSL or other processing. This allows the author (or compiler) of the document to arrange characters in a way that makes sense for the treatment.

The summary outline is also included in the DTD.

Description Heading Character CharName codedCharName textCharName State Connector Qualifier codedStateName textStateName Comment Comment Character

For those still looking for XML/XSL tools, I recommend XML Cooktop for Windows http://www.xmleverywhere.com as a good general editor. It handles DTDs and Schemas as text files for editing, but otherwise is good, and the XSL editing and testing is excellent.

I apologize for using DTDs instead of Schemas. It is the language I know and have been using for a while now. For me, they are a little clearer and easier to understand, even if they are not quite as flexible. If the group wants only Schemas, I will learn that next.
There is a good utility for converting DTDs to Schemas at http://puvogel.informatik.med.uni-giessen.de/dtd2xs/

In the next few days, I will send in some markup of challenge cases and XSL routines.

Kerry

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Kerry Barringer (Curator of the Herbarium)
Herbarium                        718-623-7318 (office)
Brooklyn Botanic Garden          718-941-4774 (fax)
1000 Washington Avenue           718-623-7312 (herbarium)
Brooklyn, NY 11225-1099 U.S.A.   kbarringer at bbg.org
http://www.bbg.org/
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Content-Disposition: attachment; filename="DescrKB112701.dtd"

<-- The two elements above should not be part of the standard. They are only here to allow the testing of real world examples -->
reliability CDATA #IMPLIED
weight CDATA #IMPLIED
dependentOn CDATA #IMPLIED
keyStates CDATA #IMPLIED
linkTo CDATA #IMPLIED
applicable CDATA #IMPLIED >
numericType CDATA #IMPLIED
defaultValue CDATA #IMPLIED
allowedValues CDATA #IMPLIED
source CDATA #IMPLIED
measurementSystem CDATA #IMPLIED >

From trainor at UIC.EDU Tue Nov 27 01:39:21 2001
From: trainor at UIC.EDU (Douglas Trainor)
Date: Tue, 27 Nov 2001 1:39:21
Subject: Taxonomic hierarchy in SDD
Message-ID:

Taxonomic disagreements aside, I have run into programming "gotchas" in the past because I did not know about generic homonyms and thus my code was ignorant of homonyms. I was surprised then to find both a corticioid fungus and a marine sponge with the same genus name (Corticium is in both Corticiaceae and Plakinidae). Also a surprise to find both a fungus and a plant with the same genus name (Virgaria). Then a taxonomic guru told me about the genus Erica for insects/plants. There are probably two dozen others...
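The homonym "gotcha" described above can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical record layout of (genus, family, group); the genus and family names come from the message, while the group labels and all function names are invented for the example.

```python
from collections import defaultdict

# Hypothetical records: (genus, family, informal group).
RECORDS = [
    ("Corticium", "Corticiaceae", "fungus"),
    ("Corticium", "Plakinidae", "sponge"),
]


def index_by_genus(records):
    """Naive index: a later homonym silently overwrites an earlier one."""
    return {genus: (family, group) for genus, family, group in records}


def index_by_genus_and_family(records):
    """Safer index: the key carries enough context to separate homonyms."""
    return {(genus, family): group for genus, family, group in records}


def find_homonyms(records):
    """Return genus names that occur under more than one family."""
    families = defaultdict(set)
    for genus, family, _group in records:
        families[genus].add(family)
    return {g: sorted(f) for g, f in families.items() if len(f) > 1}


naive = index_by_genus(RECORDS)
safe = index_by_genus_and_family(RECORDS)
homonyms = find_homonyms(RECORDS)
```

Keying on the genus name alone loses one of the two Corticium records. A compound key, or an opaque identifier per taxon, is one way to avoid that; which identifier a standard should use is a question the thread leaves open.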
douglas

Steve Shattuck wrote:
> If a dataset includes the descriptions of two families and 4 genera but
> doesn't tell you which genera belong to which family you will be forced to
> get this information from someplace else (e.g. ITIS). If the dataset is
> based on a different arrangement from ITIS you don't know this because the
> dataset didn't tell you because it's not part of the standard. In this case
> the data won't make sense because the family descriptions need to be a
> superset of the genera which belong to them. I think this is a pretty basic
> problem if we want to support hierarchical data (taxa at different taxonomic
> ranks).
>
> Again, if you don't want to follow the author of the dataset then you are
> free to ignore the suggested classification - but I would strongly suggest
> that you better know what the author is thinking and ignore all of her data
> if you don't agree with it, not accept the descriptions while rejecting the
> classification.
>
> Steve

From ram at CS.UMB.EDU Mon Nov 26 22:43:27 2001
From: ram at CS.UMB.EDU (Robert A. (Bob) Morris)
Date: Mon, 26 Nov 2001 22:43:27
Subject: Taxonomic hierarchy in SDD
In-Reply-To:
Message-ID:

We may not be arguing here. I don't dispute the need for this, only argue that there should not be a standard for taxonomy, but rather a standard for how to specify taxonomy. Or any other hierarchy for that matter. Doing this would allow you to get the hierarchy from /any/ suitable source, including but not limited to the data source itself. To me it seems that the least brittle thing for data inheritance is to use the same model as for datatype inheritance, i.e. a separate "schema" to which reference is made, as Tim Jones argued.
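The idea of keeping the hierarchy as a separate, referenced "schema" might look something like the sketch below. It assumes, purely for illustration, that descriptions refer to taxa by identifier and that a hierarchy is a child-to-parent mapping obtainable from any source; the identifiers, both mappings and the `lineage` function are hypothetical and do not reflect any agreed SDD structure.

```python
# Descriptions refer to taxa only by identifier; no hierarchy is embedded.
DESCRIPTIONS = {
    "genus-a": {"leaf shape": "ovate"},
    "family-x": {"habit": "tree"},
}

# Two hypothetical hierarchy sources, each mapping child id -> parent id.
AUTHOR_HIERARCHY = {"genus-a": "family-x"}    # shipped with the dataset
EXTERNAL_HIERARCHY = {"genus-a": "family-y"}  # e.g. taken from a master list


def lineage(taxon_id, hierarchy):
    """Walk child -> parent links from taxon_id to the top of the chosen hierarchy."""
    chain = [taxon_id]
    seen = {taxon_id}
    while chain[-1] in hierarchy:
        parent = hierarchy[chain[-1]]
        if parent in seen:
            raise ValueError(f"cycle in hierarchy at {parent!r}")
        chain.append(parent)
        seen.add(parent)
    return chain


author_view = lineage("genus-a", AUTHOR_HIERARCHY)
external_view = lineage("genus-a", EXTERNAL_HIERARCHY)
```

The descriptive data are untouched by the choice of hierarchy. Only the mapping passed to `lineage` changes, which is roughly the separation being argued for here.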
Bob Steve Shattuck writes: > Date: Tue, 27 Nov 2001 14:07:35 +1100 > From: Steve Shattuck > To: TDWG-SDD at usobi.org > Subject: Re: Taxonomic hierarchy in SDD > > If a dataset includes the descriptions of two families and 4 genera but > doesn't tell you which genera belong to which family you will be forced to > get this information from someplace else (e.g. ITIS). If the dataset is > based on a different arrangement from ITIS you don't know this because the > dataset didn't tell you because it's not part of the standard. In this case > the data won't make sense because the family descriptions need to be a > superset of the genera which belong to them. I think this is a pretty basic > problem if we want to support hierarchical data (taxa at different taxonomic > ranks). > > Again, if you don't want to follow the author of the dataset then you are > free to ignore the suggested classification - but I would strongly suggest > that you better know what the author is thinking and ignore all of her data > if you don't agree with it, not accept the descriptions while rejecting the > classification. > > Steve > From ram at CS.UMB.EDU Mon Nov 26 22:03:31 2001 From: ram at CS.UMB.EDU (Robert A. (Bob) Morris) Date: Mon, 26 Nov 2001 22:03:31 Subject: Taxonomic hierarchy in SDD In-Reply-To: Message-ID: Surely there will be other hierarchies than classical taxonomy through which providers wish to offer inheritance. Phylogenetic hierarchies come to mind, for example. If the problem is inheritance, then the standard should provide for that, not for a special case. Success would be measured by the ability of the inheritance mechanisms to provide for the special cases. 
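Bob's point that the standard should provide for inheritance in general, with taxonomy as one special case, can be illustrated with a small sketch. Everything here is assumed for the example: the child-to-parent mappings, the item identifiers, the characters and the override rule (nearer items win) are hypothetical, not something the list has agreed on.

```python
# Two hypothetical hierarchies over the same item, as child -> parent mappings.
TAXONOMIC = {"species-1": "genus-a", "genus-a": "family-x"}
PHYLOGENETIC = {"species-1": "clade-2"}

# Hypothetical descriptive data attached to items at any level.
DESCRIPTIONS = {
    "family-x": {"habit": "tree"},
    "genus-a": {"leaf shape": "ovate"},
    "clade-2": {"habit": "shrub"},
    "species-1": {"flower colour": "blue"},
}


def inherited_description(item_id, parents, descriptions):
    """Merge descriptions from the top of the hierarchy down to item_id.

    Values scored nearer to the item override inherited ones.
    """
    chain = []
    seen = set()
    current = item_id
    while current is not None and current not in seen:
        seen.add(current)
        chain.append(current)
        current = parents.get(current)
    merged = {}
    for node in reversed(chain):  # ancestors first, the item itself last
        merged.update(descriptions.get(node, {}))
    return merged


by_taxonomy = inherited_description("species-1", TAXONOMIC, DESCRIPTIONS)
by_phylogeny = inherited_description("species-1", PHYLOGENETIC, DESCRIPTIONS)
```

The same function serves both hierarchies because it knows nothing about ranks or clades, only about parent links.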
Bob Morris Steve Shattuck writes: > Date: Tue, 27 Nov 2001 13:35:31 +1100 > From: Steve Shattuck > To: TDWG-SDD at usobi.org > Subject: Taxonomic hierarchy in SDD > > The full taxonomic hierarchy of the included taxa/items certainly must be > supported by the standard (and it will be addressed after we deal with > simple characters, states and items). If the creator of the dataset doesn't > think it's important then they can choose to leave it out; if the user of > the data doesn't think it's important then they can ignore it. This will be > especially important if we intend to support inheritance and compilation up > and down the classification (as has been suggested by several of us). > > Jumping the gun a bit, I would think these relationships would be stored > either as a separate nested set of elements with ID's linking to specific > items, or the items themselves would be nested with parent items containing > their children. > > Steve Shattuck > CSIRO Entomology > biolink at ento.csiro.au > From ram at CS.UMB.EDU Mon Nov 26 18:13:09 2001 From: ram at CS.UMB.EDU (Robert A. (Bob) Morris) Date: Mon, 26 Nov 2001 18:13:09 Subject: Morphological Data Representation In-Reply-To: Message-ID: Certainly end users need this and many kinds of information. We believe in integrated applications that discover or know where to find the answers to questions like this. See our toy demo http://www.cs.umb.edu/efg/xml1/DEVELOP/itis/demo.htm (This requires MSIE, but it is perfectly possible---and probably better---to support this kind of thing at the server. I subscribe to the belief that simplicity is superior to complexity, especially since much of the infrastructure now available to XML-based applications means that much of the programming to support gathering information from disparate sources is provided in the toolkits. 
To pick a field guide at random (or rather, what's at hand): the National Audubon Field Guide to Insects and Spiders has almost no taxonomy above Order (it does mention that the insects are in Class Insecta and Phylum Arthropoda; no mention of Subphylum, Subclass, Superorder, Suborder or other arcana easily found from ITIS). In my experience, field guide books may have a little nod toward taxonomy above Family (or Order if they are meant to be broad) but make no attempt to place the descriptive data in such taxonomy. Electronic descriptive pages vary widely, but those that embed taxonomy in their databases risk errors, excessive and often redundant data, and fragility in the face of taxonomic revisions. I say that if we are talking about descriptive data, we should just stick to descriptive data. Bob Morris Peter Rauch writes: > Date: Mon, 26 Nov 2001 11:22:45 -0800 > From: Peter Rauch > To: TDWG-SDD at usobi.org > Subject: Re: Morphological Data Representation > > On Mon, 26 Nov 2001, Robert A. (Bob) Morris wrote: > > > My feeling is that taxonomic hierarchy is best got by web > > applications from services such as ITIS or other web > > services offering XML. There is a large community of > > descriptive data consumers, e.g. field naturalists, that > > find taxonomic hierarchy generally uninteresting. IMO, > > trying to integrate it with descriptive data actually > > addresses a small group of applications at a cost of added > > complexity. > > "There is a large community of descriptive data consumers, e.g. > field naturalists, that find taxonomic hierarchy generally > uninteresting." > > Bob, this comment needs a little bit more explanation / > discussion. > > Without arguing _where_ the field naturalist and other consumers > should get their taxonomic hierarchical information, I'd want to > argue that such information is (should be!) anything but > "uninteresting" to many (most?, all?) 
consumers, in their work > of understanding who is present in their study worlds and why. > > Having access to purported (evolutionary) relationships of their > study organisms focuses a special, valuable, most "interesting" > light on their studies. These users need to find these taxonomic > relationships handily, somewhere. In that context, your question > --Where?-- is probably a fair one to ask. > > Peter > From kevin.thiele at BIGPOND.COM Mon Nov 26 16:47:17 2001 From: kevin.thiele at BIGPOND.COM (Kevin Thiele) Date: Mon, 26 Nov 2001 16:47:17 Subject: Forgot the bloody attachment Message-ID: Forgot the bloody attachment - How many times does one have to put up a message before getting the message through? Oops. Steve Shattuck's pointed out, quite rightly, that we shouldn't fuss too much over the details of the challenge cases before actually taking the challenges. I think it's important to have a bit of an overview of the challenges for some distance ahead of us while working on one challenge in hand, rather than just taking steps one at a time, because it may be that solving one challenge may canalize us into a solution that makes a challenge down the road more difficult. So I suggest we keep a set of challenges in mind while working. But, of course, it'd be a disaster if we couldn't even agree on the challenges before starting to find the solutions! Cheers - k ----- Original Message ----- From: "Kevin Thiele" To: Sent: Sunday, November 25, 2001 9:27 PM Subject: Challenge Cases | Welcome back! | | I actually posted this message a week ago, but my email address has changed | slightly, and the list server rejected me. So here it is again. | | At the Sydney TDWG meeting we agreed that we would continue with the SDD | discussion on this list, but try to keep a tighter focus. The last active | period (about 12 months ago) was an important brainstorm session, but didn't | seem to be very effective at actually getting us to the goal of a workable | standard. 
|
| We decided this time to try working through some challenge cases - real or
| made-up instances of descriptive data that need to be accommodated in the
| standard. By agreeing first on a +/- complete set of challenges, then
| working through the challenge cases in order from simple to difficult, we
| should be able to reasonably bound our problem while keeping an overview of
| the territory while actually sinking our teeth into the nitty-gritty.
|
| Attached is my first attempt at a set of challenge cases, presented in +/-
| this form to the Sydney meeting. The first challenge case has an exemplar,
| the others have not yet. This document as it's worked up will be placed on
| the TDWG web site for working reference. As the standard evolves, this will
| also be put up on the site, possibly with progressive status indicators for
| parts of the standard (e.g. working, proposed, normative - may we one day
| get to normative)
|
| I suggest that we should first add to or modify the list of challenges.
| Propose a challenge (with exemplar) to add to the list. Once we are happy
| with an approximate list of challenges (keeping in mind that others will
| become clear as we proceed, so there's no need to agonise over this step),
| we'll start with challenge 1. We'll throw up the challenge, give a week or
| so for contributors to propose data structures that can meet the challenge,
| then compare and discuss alternate solutions.
|
| Gregor I think will shortly be posting a summary of the meeting discussions.
| We agreed, I believe, that the goal is to provide a standard that can
| adequately address the descriptive data requirements (ie be a superset) of
| all existing programs (e.g. Lucid, DELTA, DELTA Access, Biolink) but not be
| limited to existing programs. It should be able to function as an
| interchange standard, but should not be limited to that. Bob Morris agreed
| to provide shortly a discussion on interchange vs interoperability standards
| and conflicts that may arise in trying to allow for both goals. We agreed
| that XML will be the basis for the standard.
|
| May the force be with us!
|
| Cheers - k
From maurobio at ACD.UFRJ.BR Mon Nov 26 15:21:52 2001
From: maurobio at ACD.UFRJ.BR (Mauro J. Cavalcanti)
Date: Mon, 26 Nov 2001 15:21:52
Subject: Morphological Data Representation
Message-ID:

Steve Shattuck wrote:

> Leigh's comments are good and worth a detailed look. His
> model/representation/syntax (or whatever you want to call it) of the same
> data I used (see http://www.bath.ac.uk/~ccslrd/delta/lep.xml) is exactly the
> kind of thing I had in mind. Does his representation make more sense than
> the one I proposed? What are the strengths/weaknesses of our approaches.
> Does one allow us to get to where we want to be? Again, I don't think the

Shattuck's and Dodds's proposals indeed seem basically identical (with XDELTA deserving priority, since it was presented first; it also seems a little more detailed than Shattuck's proposal).

What I miss is an integration of the morphological data representation with the representation of the taxonomic hierarchy (that has never been properly dealt with by DELTA) in a single XML format for the purpose of storing and exchanging taxonomic datasets. Gilmour's "Taxonomic Markup Language" (http://www.albany.edu/~gilmr/pubxml/ ) could provide a good start in this direction.

I would appreciate very much any comments in this connection.

Regards,

--
+ - - - - - - - - - - - - Mauro J. Cavalcanti - - - - - - - - - - - - +
| Setor de Paleovertebrados, Departamento de Geologia e Paleontologia |
| Museu Nacional do Rio de Janeiro |
| Quinta da Boa Vista, 20940-040, Rio de Janeiro, RJ, BRASIL |
| E-mail: maurobio at acd.ufrj.br |
| Home Page: http://www.maurobio.cjb.net |
+ - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - +
"Life is complex. It consists of real and imaginary parts."

From ram at CS.UMB.EDU Mon Nov 26 13:58:29 2001 From: ram at CS.UMB.EDU (Robert A.
(Bob) Morris) Date: Mon, 26 Nov 2001 13:58:29 Subject: Morphological Data Representation In-Reply-To: <3C027A30.72CD4A2A@acd.ufrj.br> Message-ID: Mauro J. Cavalcanti writes: > Date: Mon, 26 Nov 2001 15:21:52 -0200 > From: "Mauro J. Cavalcanti" > To: TDWG-SDD at usobi.org > Subject: Re: Morphological Data Representation > > ... > > What I miss is an integration of the morphological data representation with the > representation of the taxonomic hierarchy (that has never been properly dealt > with by DELTA) in a single XML format for the purpose of storing and exchanging > taxonomic datasets. Gilmour's "Taxonomic Markup Language" > (http://www.albany.edu/~gilmr/pubxml/ ) could provide a good start in this > direction. > > I would appreciate very much any comments in this connection. My feeling is that taxonomic hierarchy is best got by web applications from services such as ITIS or other web services offering XML. There is a large community of descriptive data consumers, e.g. field naturalists, that find taxonomic hierarchy generally uninteresting. IMO, trying to integrate it with descriptive data actually addresses a small group of applications at a cost of added complexity. From Steve at Mon Nov 26 12:31:40 2001 From: Steve at (Steve at ) Date: Mon, 26 Nov 2001 12:31:40 Subject: Morphological Data Representation Message-ID: First, I see Kevin has proposed a process slightly different to the one I started on Friday. I would suggest that we focus on one of these rather than running both at the same time. I don't think it matter which one we follow as they both have pros and cons. My idea was to start simple and build as we go while Kevin's is to scope the project first then fill in the details. Any strong views on the process. Guillaume commented that he's "always surprised seeing people recommend using proprietary stuff" in response to my suggestion of using Microsoft's XML Notepad (which, by the way, is actually free). 
The point I was trying to make was that XML can get complicated very quickly and using an XML view (of any sort) is better than using a text editor. Nothing more. He also commented that my example "was closely related to delta (as xdelta), whereas everyone seems to agree on building something from scratch". I fully agree, but I would suggest that DELTA has it basically right and that whatever we develop will look suspiciously like DELTA. For years taxonomists have talked about things called "characters" with "states" and have used these to build "descriptions" (= DELTA attributes). DELTA is firmly founded in taxonomic practices and since this is driving this process I would be surprised if they diverge too far. In other words there's a reason DELTA has been as broadly accepted as it has and we shouldn't ignore this. Guillaume's final comment, about the use of XML elements and attributes, is important but I still think it can wait. There often isn't a clear distinction between information that "has real-world meaning" and that which is "modelling artefacts." ["One person's data is another person's metadata."] The fact that "default XSLT transformation enforces this by outputting elements contents and ignoring attributes" is too application specific. Many people process XML using DOM tools and they shouldn't be constrained just because XSLT does it another way. Leigh's comments are good and worth a detailed look. His model/representation/syntax (or whatever you want to call it) of the same data I used (see http://www.bath.ac.uk/~ccslrd/delta/lep.xml) is exactly the kind of thing I had in mind. Does his representation make more sense than the one I proposed? What are the strengths/weaknesses of our approaches. Does one allow us to get to where we want to be? Again, I don't think the exact syntax is important at this stage. 
For example, both models describe the meaning of the numbers present in item descriptions for numeric characters, Leigh as "" and myself as " " with the meaning stored with the state rather than the item description. At the syntax-level these are very different but at the modelling-level they are the same - the same information is being managed (and both differ from the current DELTA-standard in this regard). We will need to work on the syntax but let's get the model agreed to first, then worry about specific syntax. >However, we must agree on model extend: will it concerns only concept >description (aka: characters) or also case description (aka: items) ? IMHO, >only the first one can be generalized, or we'll have to validate the case >description twice: against a generic model and against its concept. > >For using an example, if i have a description of the characters of >Pociloporidae familiy, and a description of the items of Pociloporidae >family, i'll have to make sure characters are really characters (validating >against generic character model), to make sure items are really items >(validating against generic items model), and make sure Pociloporidae items >are really Pociloporida (validating items against characters). I would prefer >to have only to validate characters against a generic character model, and >validate items against characters, meaning using a character description as a >suitable model for items description. I believe this misses the goals of this project in a number of important ways and we should avoid going down this path. It seems to mix the (i) processing of the data with (ii) the data representation with (iii) taxonomic work practices. I'm very uncomfortable going there. Finally, Peter's concerns are important for the next step, expanding the proposed representation to include information not currently managed. One of the strong recommendations from SDD Round 1 was to manage raw data. 
This needs to be housed under the summarized data (in this case, the actual measurements under the ratios). We WILL need to do this eventually.

Peter also pointed out possibly our largest challenge. He noted that "having clear spots in wings is not very precise if the data is to be comparable beyond the group in question - which I suppose is part of the goal." At every TDWG meeting I've been to we decide that we can't build standards for specific character values and yet at every TDWG meeting I've been to we try to build standards for specific character values. We need to build mechanisms to allow sharing of character lists across projects IF THOSE PROJECTS WANT TO USE THIS FEATURE. If projects don't want to share character lists, for whatever reason, then they won't, no matter how important we think it is to do so.

I think this focus on "standard character lists" is very much a "plant thing." In animals it would never occur to me to use "clear spots in wings" for anything other than the local context for which it was established. No one would suggest that "clear spots in wings" in bees has anything to do with "clear spots in wings" in butterflies, and trying to use a single character coding for this would receive limited support at best. I think the problem is that it's common to talk about "identifying a plant" but very rare to talk about "identifying an animal." There are no "faunas" that are equivalent to "floras."

This very fundamental difference between plants and animals and the way people view them has a huge impact on this very development - it's one of the reasons that the botanical community has accepted DELTA much more strongly than the zoological community. While a "global flora" is a completely reasonable goal, a "global fauna" isn't even a faint blip on distant radar. Zoologists work in relative isolation compared to botanists and have very different work practices and needs.
Meeting the needs of both of these communities will be a significant challenge, one I'm not sure we can meet in a single set of tools. Thanks, Steve Steve Shattuck CSIRO Entomology steve.shattuck at csiro.au From peterr at SOCRATES.BERKELEY.EDU Mon Nov 26 11:22:45 2001 From: peterr at SOCRATES.BERKELEY.EDU (Peter Rauch) Date: Mon, 26 Nov 2001 11:22:45 Subject: Morphological Data Representation In-Reply-To: <15362.37077.649324.22976@u11.cs.umb.edu> Message-ID: On Mon, 26 Nov 2001, Robert A. (Bob) Morris wrote: > My feeling is that taxonomic hierarchy is best got by web > applications from services such as ITIS or other web > services offering XML. There is a large community of > descriptive data consumers, e.g. field naturalists, that > find taxonomic hierarchy generally uninteresting. IMO, > trying to integrate it with descriptive data actually > addresses a small group of applications at a cost of added > complexity. "There is a large community of descriptive data consumers, e.g. field naturalists, that find taxonomic hierarchy generally uninteresting." Bob, this comment needs a little bit more explanation / discussion. Without arguing _where_ the field naturalist and other consumers should get their taxonomic hierarchical information, I'd want to argue that such information is (should be!) anything but "uninteresting" to many (most?, all?) consumers, in their work of understanding who is present in their study worlds and why. Having access to purported (evolutionary) relationships of their study organisms focuses a special, valuable, most "interesting" light on their studies. These users need to find these taxonomic relationships handily, somewhere. In that context, your question --Where?-- is probably a fair one to ask. Peter From ram at CS.UMB.EDU Sun Nov 25 21:28:58 2001 From: ram at CS.UMB.EDU (Robert A. 
(Bob) Morris) Date: Sun, 25 Nov 2001 21:28:58 Subject: Morphological Data Representation In-Reply-To: Message-ID: Steve Shattuck writes: > Date: Mon, 26 Nov 2001 12:31:40 +1100 > From: Steve Shattuck > To: TDWG-SDD at usobi.org > Subject: Re: Morphological Data Representation > > First, I see Kevin has proposed a process slightly different to the one I > started on Friday. I would suggest that we focus on one of these rather > than running both at the same time. I don't think it matter which one we > follow as they both have pros and cons. My idea was to start simple and > build as we go while Kevin's is to scope the project first then fill in the > details. Any strong views on the process. The "data challenges" process was what was agreed in Sydney. Also, it corresponds closely to processes generally regarded as successful in software design, namely Use Case analysis, though it focuses on the use and not the user (in the sense of both human and software). > > Guillaume commented that he's "always surprised seeing people recommend > using proprietary stuff" in response to my suggestion of using Microsoft's > XML Notepad (which, by the way, is actually free). The point I was trying > to make was that XML can get complicated very quickly and using an XML view > (of any sort) is better than using a text editor. Nothing more. Also, MSIE is a reasonable XML viewer, albeit not an editor. > > He also commented that my example "was closely related to delta (as xdelta), > whereas everyone seems to agree on building something from scratch". I > fully agree, but I would suggest that DELTA has it basically right and that > whatever we develop will look suspiciously like DELTA. For years > taxonomists have talked about things called "characters" with "states" and > have used these to build "descriptions" (= DELTA attributes). DELTA is > firmly founded in taxonomic practices and since this is driving this process > I would be surprised if they diverge too far. 
In other words there's a > reason DELTA has been as broadly accepted as it has and we shouldn't ignore > this. In Sydney it was agreed that DELTA experience and functionality should not be ignored, i.e. we should not start from scratch. [I don't happen to agree with that because of the large community outside TDWG that is already heavily involved in descriptive data if only informally. But it is what was decided, and it does leverage the experience of most of SDD. The trick will be to not arrive at something that only models DELTA] > > Guillaume's final comment, about the use of XML elements and attributes, is > important but I still think it can wait. There often isn't a clear > distinction between information that "has real-world meaning" and that which > is "modelling artefacts." ["One person's data is another person's > metadata."] The fact that "default XSLT transformation enforces this by > outputting elements contents and ignoring attributes" is too application > specific. Many people process XML using DOM tools and they shouldn't be > constrained just because XSLT does it another way. Not only can it wait, but these things are possibly technical enough that they shouldn't be initially in this mailing list. We are nearly finished an installation of vBulletin forum software and will offer to operate a forum off this list for discussion of the technical bits. > > Leigh's comments are good and worth a detailed look. His > model/representation/syntax (or whatever you want to call it) of the same > data I used (see http://www.bath.ac.uk/~ccslrd/delta/lep.xml) is exactly the > kind of thing I had in mind. Does his representation make more sense than > the one I proposed? What are the strengths/weaknesses of our approaches. > Does one allow us to get to where we want to be? Again, I don't think the > exact syntax is important at this stage. 
For example, both models describe > the meaning of the numbers present in item descriptions for numeric > characters, Leigh as "" and myself as > " />" with the meaning stored with the state rather than the item description. > At the syntax-level these are very different but at the modelling-level they > are the same - the same information is being managed (and both differ from > the current DELTA-standard in this regard). We will need to work on the > syntax but let's get the model agreed to first, then worry about specific > syntax. > > > >However, we must agree on model extend: will it concerns only concept > >description (aka: characters) or also case description (aka: items) ? IMHO, > >only the first one can be generalized, or we'll have to validate the case > >description twice: against a generic model and against its concept. > > To the extent these are separable, there would probably be less dispute about character representation. Probably this means that lots of case description models have to fit on the same character model. > >For using an example, if i have a description of the characters of > >Pociloporidae familiy, and a description of the items of Pociloporidae > >family, i'll have to make sure characters are really characters (validating > >against generic character model), to make sure items are really items > >(validating against generic items model), and make sure Pociloporidae items > >are really Pociloporida (validating items against characters). I would > prefer > >to have only to validate characters against a generic character model, and > >validate items against characters, meaning using a character description as > a > >suitable model for items description. As a not biologist, this sounds right to me. > > I believe this misses the goals of this project in a number of important > ways and we should avoid going down this path. 
It seems to mix the (i) > processing of the data with (ii) the data representation with (iii) > taxonomic work practices. I'm very uncomfortable going there. > > > Finally, Peter's concerns are important for the next step, expanding the > proposed representation to include information not currently managed. One > of the strong recommendations from SDD Round 1 was to manage raw data. This > needs to be housed under the summarized data (in this case, the actual > measurements under the ratios). We WILL need to do this eventually. > > Peter also pointed out possibly our largest challenge. He noted that "having > clear spots in wings is not very > precise if the data is to be comparable beyond the group in question - which > I suppose is part of the goal." At every TDWG meeting I've been to we > decide that we can't build standards for specific character values and yet > at every TDWG meeting I've been to we try to build standards for specific > character values. We need to build mechanisms to allow sharing of character > lists across projects IF THOSE PROJECTS WANT TO USE THIS FEATURE. If > projects don't want to share character lists, for what ever reason, then > they won't not matter how important we think it is to do so. I agree with this. Whether sharing character lists matters may be a function of the purpose of the list. For example, paper field guides to a given group of taxa often have a far greater commonality of description of characters than they do of characters. And they often come equipped with a character metadata section that explains how to use the characters. > > > I think this focus on "standard character lists" is very much a "plant > thing." In animals it would never occur to me to use "clear spots in wings" > for anything other than the local context for which it was established. 
No > one would suggest that "clear spots in wings" in bees has anything to do > with "clear spots in wings" in butterflies and trying to use a single > character coding for this would receive limited support at best. I think > the problem is that it's common to talk about "identifying a plant" but very > rare to talk about "identifying an animal." There are no "faunas" that are > equivalent to "floras." This very fundamental difference between plants and > animals and the way people view them has a huge impact on this very > development - it's one of the reason's that the botanical community has > accepted DELTA much more strongly than the zoological community. While a > "global flora" is a completely reasonable goal, a "global fauna" isn't even > a faint blip on distant radar. Zoologists work in relative isolation > compared to botanists and have very different work practices and needs. > Meeting the needs of both of these communities will be a significant > challenge, one I'm not sure we can meet in a single set of tools. > > > Thanks, Steve > > Steve Shattuck > CSIRO Entomology > steve.shattuck at csiro.au > From kevin.thiele at BIGPOND.COM Sun Nov 25 21:27:19 2001 From: kevin.thiele at BIGPOND.COM (Kevin Thiele) Date: Sun, 25 Nov 2001 21:27:19 Subject: Challenge Cases Message-ID: Welcome back! I actually posted this message a week ago, but my email address has changed slightly, and the list server rejected me. So here it is again. At the Sydney TDWG meeting we agreed that we would continue with the SDD discussion on this list, but try to keep a tighter focus. The last active period (about 12 months ago) was an important brainstorm session, but didn't seem to be very effective at actually getting us to the goal of a workable standard. We decided this time to try working through some challenge cases - real or made-up instances of descriptive data that need to be accommodated in the standard. 
By agreeing first on a +/- complete set of challenges, then working through the challenge cases in order from simple to difficult, we should be able to bound our problem reasonably, keeping an overview of the territory while actually sinking our teeth into the nitty-gritty.

Attached is my first attempt at a set of challenge cases, presented in +/- this form to the Sydney meeting. The first challenge case has an exemplar; the others do not yet. This document, as it's worked up, will be placed on the TDWG web site for working reference. As the standard evolves, it too will be put up on the site, possibly with progressive status indicators for parts of the standard (e.g. working, proposed, normative - may we one day get to normative).

I suggest that we should first add to or modify the list of challenges. Propose a challenge (with exemplar) to add to the list.

Once we are happy with an approximate list of challenges (keeping in mind that others will become clear as we proceed, so there's no need to agonise over this step), we'll start with challenge 1. We'll throw up the challenge, give a week or so for contributors to propose data structures that can meet the challenge, then compare and discuss alternate solutions.

Gregor, I think, will shortly be posting a summary of the meeting discussions. We agreed, I believe, that the goal is to provide a standard that can adequately address the descriptive data requirements (i.e. be a superset) of all existing programs (e.g. Lucid, DELTA, DELTA Access, Biolink) but not be limited to existing programs. It should be able to function as an interchange standard, but should not be limited to that. Bob Morris agreed to provide shortly a discussion on interchange vs interoperability standards and the conflicts that may arise in trying to allow for both goals.

We agreed that XML will be the basis for the standard.

May the force be with us!
Cheers - k

From ldodds at INGENTA.COM Fri Nov 23 16:56:06 2001
From: ldodds at INGENTA.COM (Leigh Dodds)
Date: Fri, 23 Nov 2001 16:56:06
Subject: Morphological Data Representation
In-Reply-To:
Message-ID:

This nicely demonstrates the point that the syntax is only a way to encode a particular model. I'd previously cast the same basic Lepidoptera data into XML (with some additional data, such as images) using XDelta:

http://www.bath.ac.uk/~ccslrd/delta/lep.xml

I'm sure the following is wrong in places and/or overly simplistic, and deliberately so:

A Character has
- a Character Type (one of Ordered or Unordered or Real, or ...)
- a Character Identifier (a number)
- a Short Description (free text)
- a Long Description (free text)
- an Order Identifier (a number)
- a State Descriptor (free text)

A State has
- a State Value (free text)
- a State Identifier (a number)
- an Order Identifier

An Item has
- An Item Type (free text)
- An Item Identifier (a number)
- A Description (free text)
- One or more Characteristics (observed values of states for specific characters)

Characters describe a list of properties that some Type of Thing might have. States describe the list of (observed) values of that property for a specific Type of Thing (avoiding the word Class as it's an overloaded term). An Item is a specific instance of those properties, for some specific Thing.

What did I get wrong?

L.

From rousse at CCR.JUSSIEU.FR Fri Nov 23 16:00:44 2001
From: rousse at CCR.JUSSIEU.FR (Guillaume Rousse)
Date: Fri, 23 Nov 2001 16:00:44
Subject: Modelling and XML
In-Reply-To:
Message-ID:

Thus spoke Leigh Dodds:
[..]
> I'll also interject 2p/euro/cents at this point:
>
> I believe I'd tackle this problem firstly from a modelling perspective:
> can we agree upon, and define a model for the data that we're
> capturing. It seems like there is a conceptual model inherent in the
> data currently captured by DELTA and other formats, that is
> separate to the details of its syntax.
I wholeheartedly agree: let's first define a conceptual model, then discuss its implementation.

However, we must agree on the extent of the model: will it concern only concept description (aka characters) or also case description (aka items)? IMHO, only the first can be generalized; otherwise we'll have to validate the case description twice, against a generic model and against its concept.

To use an example: if I have a description of the characters of the Pociloporidae family and a description of the items of the Pociloporidae family, I'll have to make sure characters are really characters (validating against a generic character model), make sure items are really items (validating against a generic item model), and make sure Pociloporidae items are really Pociloporidae (validating items against characters). I would prefer to validate only characters against a generic character model, and validate items against characters, that is, to use a character description as a suitable model for item descriptions.

--
Guillaume Rousse
GPG key http://lis.snv.jussieu.fr/~rousse/gpgkey.html

From rousse at CCR.JUSSIEU.FR Fri Nov 23 15:07:30 2001
From: rousse at CCR.JUSSIEU.FR (Guillaume Rousse)
Date: Fri, 23 Nov 2001 15:07:30
Subject: Morphological Data Representation
In-Reply-To:
Message-ID:

Thus spoke Steve Shattuck:
[..]
> I've translated this into the XML file that is attached. (Even this fairly
> simple example is moderately large and I would recommend using an XML
> viewer such as Microsoft's XML Notepad when working with it - see
> http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnxml/html/xmlpaddownload.asp.)
I'm always surprised to see people recommend proprietary stuff when there are plenty of free-software equivalents:
http://freshmeat.net/search/?site=Freshmeat&q=xml+editor&section=projects
sourceforge.net also provides a complete list. vim and emacs also have XML support, and are available on many platforms.

> The basic structure of the attached file is:
> > > > - ANY NUMBER > > > > > > > - ANY NUMBER > > > > > - ANY NUMBER > > > > - ANY NUMBER > > > - ANY NUMBER > >

This is a mixed approach, whereas Gregor's proposal is to separate character descriptions from items. Moreover, it's closely related to DELTA (as is XDelta), whereas everyone seems to agree on building something from scratch.

> Note that I've treated everything as elements and haven't used attributes.
> Simplicity is the only reason for this and some elements would be better as
> attributes; these can be converted when the dust settles.

No, there is also a performance argument for attributes: as they make files less verbose, parsing is quicker. See the xerces-j/xalan-j FAQ. However, from a modelling point of view, the rule should be:
- use elements for what has real-world meaning
- use attributes for modelling artifacts (id, idrefs, etc.)

The default XSLT transformation enforces this by outputting element contents and ignoring attributes.

--
Guillaume Rousse
GPG key http://lis.snv.jussieu.fr/~rousse/gpgkey.html

From Steve at Fri Nov 23 14:33:06 2001
From: Steve at (Steve at )
Date: Fri, 23 Nov 2001 14:33:06
Subject: Morphological Data Representation
Message-ID:

Below and attached is a first attempt at representing simple, common DELTA-type data in an XML-based structure. I've used a selected set of characters and items from the Butterfly sample data on the DELTA web site. The DELTA-formatted data looks like this:

==== DELTA-Standard CHARS File ====

*SHOW: Lepidoptera demonstration characters. Revised 28-AUG-91.

*CHARACTER LIST

#1. main colour of inner part of front wing/
    1. white/
    2. cream/
    3. grey/
    4. brown/
    5. black/
    6. yellow/
    7. orange/
    8. blue/
    9. green/

#2. wings /
    1. with transparent areas/
    2. without transparent areas/

#3. length of front wing/
    mm/

#4. antennae / times length of front wing/

==== DELTA-Standard ITEMS File ====

*SHOW: Lepidoptera demonstration items. Revised 18-OCT-94.

*ITEM DESCRIPTIONS

# Antheraea/ 1,4 2,2 3,43-50 4,0.15-0.2
# Ethmia/ 1,2-4 2,2 3,11-14 4,0.6-0.65
# Graphium/ 1,1-2/9 2,2 3,29-33 4,0.45-0.5
# Hecatesia/ 1,4 2,1/2 3,11-14 4,0.8-0.9

==== DELTA-Standard SPECS File ====

*SHOW: Lepidoptera demonstration specifications. Revised 28-AUG-91.

*NUMBER OF CHARACTERS 4
*MAXIMUM NUMBER OF STATES 9
*MAXIMUM NUMBER OF ITEMS 4
*CHARACTER TYPES 3,RN 4,RN
*NUMBERS OF STATES 1,9

====

For these files, the DELTA-generated natural language would look something like this:

Antheraea
Main colour of inner part of front wing brown. Wings without transparent areas. Length of front wing 43-50 mm. Antennae 0.15-0.2 times length of front wing.

Ethmia
Main colour of inner part of front wing cream to brown. Wings without transparent areas. Length of front wing 11-14 mm. Antennae 0.6-0.65 times length of front wing.

Graphium
Main colour of inner part of front wing white to cream, or green. Wings without transparent areas. Length of front wing 29-33 mm. Antennae 0.45-0.5 times length of front wing.

Hecatesia
Main colour of inner part of front wing brown. Wings with transparent areas (small, translucent window), or without transparent areas. Length of front wing 11-14 mm. Antennae 0.8-0.9 times length of front wing.

====

Hand-generated natural language would be essentially the same except for the last item, where it might look more like this:

Hecatesia
Main colour of inner part of front wing brown. Wings with or without transparent areas (when present, forming a small window). Length of front wing 11-14 mm. Antennae 0.8-0.9 times length of front wing.

============================================

I've translated this into the XML file that is attached.
(Even this fairly simple example is moderately large and I would recommend using an XML viewer such as Microsoft's XML Notepad when working with it - see http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnxml/html/ xmlpaddownload.asp.) The basic structure of the attached file is: - ANY NUMBER - ANY NUMBER - ANY NUMBER - ANY NUMBER - ANY NUMBER Note that I've treated everything as elements and haven't used attributes. Simplicity is the only reason for this and some elements would be better as attributes; these can be converted when the dust settles. I've tried to generalise as much as possible and use only two main elements: and . Each character is assigned a that tells what it is (ordered multistate, unordered multistate, real, integer, etc). Similarly the item can be specified as taxon or specimen (or potentially something else). A couple of points are probably worth making: The and are used to support DELTA comments. This probably needs to be generalised further to support any number of alternate phrasings. is used for the units of numeric characters and isn't needed (?) for other character types - it's an attempt at keeping general. The element in is used to house natural language representations. Codes for the states (when needed) are placed in square brackets, these being translated during generation. As noted above, this may been to be generalised to support any number of phrasings. In , is used to hold numeric values, the being used to define what the number means (minimum, maximum, etc., rather than using placement in the attribute string as in the DELTA standard). This element won't (?) be needed for multistate characters. I think/hope the remainder is fairly clear. ** The Next Step ** I would suggest the following path from here: 1) Make sure the above representation makes sense for the data given. 2) Expand the above data to support LucID-specific requirements (without adding additional complexity). 
Once this is finished we can:
- Add additional DELTA features (dependencies, default values, etc.)
- Add more complex data sets and examples
- Add new features on our assorted "wish lists"

I look forward to comments and forward progress!

Thanks, Steve

Steve Shattuck
CSIRO Entomology
biolink at ento.csiro.au

Attachment: "Morphology 23 Nov 01.xml"

Ordered C1
  main colour of inner part of front wing
  main colour of inner part of front wing
  1
  white C1S1 1
  cream C1S2 2
  grey C1S3 3
  brown C1S4 4
  black C1S5 5
  yellow C1S6 6
  orange C1S7 7
  blue C1S8 8
  green C1S9 9

Unordered C2
  wings
  transparent areas on wings
  2
  with transparent areas C2S1 1
  without transparent areas C2S2 2

Real C3
  length of front wing
  length of front wing
  mm
  3
  minimum C3S1 1
  maximum C3S2 2

Real C4
  antennae
  length of antennae
  times length of front wing
  4
  minimum C4S1 1
  maximum C4S2 2

Taxon T1 Antheraea
  C1  [C1S4]  C1S4
  C2  [C2S2]  C2S2
  C3  43 to 50 mm  C3S1 43  C3S2 50
  C4  0.15 to 0.2 times length of front wing  C4S1 0.15  C4S2 0.2

Taxon T2 Ethmia
  C1  [C1S2] to [C1S4]  C1S2  C1S4
  C2  [C2S2]  C2S2
  C3  11 to 14 mm  C3S1 11  C3S2 14
  C4  0.6 to 0.65 times length of front wing  C4S1 0.6  C4S2 0.65

Taxon T3 Graphium
  C1  [C1S1] to [C1S2], or [C1S9]  C1S1  C1S2  C1S9
  C2  [C2S2]  C2S2
  C3  29 to 33 mm  C3S1 29  C3S2 33
  C4  0.45 to 0.50 times length of front wing  C4S1 0.45  C4S2 0.50

Taxon T4 Hecatesia
  C1  [C1S4]  C1S4
  C2  with or without transparent areas (when present, forming a small window)  C2S1  C2S2
  C3  11 to 14 mm  C3S1 11  C3S2 14
  C4  0.8 to 0.9 times length of front wing  C4S1 0.8  C4S2 0.9

From Steve at Thu Nov 22 10:41:55 2001
From: Steve at (Steve at )
Date: Thu, 22 Nov 2001 10:41:55
Subject: is there an "xml-include"
Message-ID:

There seem to be a number of problems being addressed in the current discussion:

1) Gregor's original post asked for a solution
("how do I do an include in XML") without posing a problem ("how do I maintain a separate global character list and link a number of independent taxon descriptions to it"). I would suggest that using an "include" is only one of a range of solutions to this problem. And if this isn't the problem then we should revisit it before spending too much time coming up with a solution. 2) Whether XML is hierarchical or relational is of little importance in practice. All major relational databases can suck in and spit out XML - it's a non-issue from an implementation standpoint (which is what we are ultimately concerned with here). IDs and RefIDs are primary and foreign keys - the mapping is one-to-one if you want to structure the XML that way. 3) Leigh is spot-on when he says "define a model for the data" as this is "separate to the details of its [the model's] syntax". The syntax (= XML model) will be useless if it doesn't manage the information that's important to us no matter how "rigorous" or "technically accurate" it is. 4) At the Sydney meeting I thought we agreed to restart this discussion with specific examples based on real-life characters (in the biological sense). This current thread would seem to be very much a continuation of the one started 2 years ago: 80% focus on solutions (largely round XML syntax) and 20% on specific problems (addressing business needs). I fear we'll end up the same place we were before. Thanks, Steve Steve Shattuck CSIRO Entomology biolink at ento.csiro.au P.S. - A solution to Gregor's problem would be to use unique identifiers (GUIDs) in his global character list. Any one using this list would then include these identifiers as part of taxon descriptions along with metadata (a citation) to find where the global list is housed. If you then want to combine separate description datasets you can compare the character identifiers: if they are the same then it's the same character, if they are different then it's not the same character. 
It's a simple, foolproof solution that's easy to implement technically (whether people will actually do this is a completely different matter ;-) ). P.P.S. - I'm well aware that I have the power to direct this discussion by actually doing what I suggest in No. 4 above. And I really, really intend to do it. I just have one more meeting to get through, and then another 10 emails, and then .... From ram at CS.UMB.EDU Wed Nov 21 20:32:23 2001 From: ram at CS.UMB.EDU (Robert A. (Bob) Morris) Date: Wed, 21 Nov 2001 20:32:23 Subject: is there an "xml-include" In-Reply-To: Message-ID: Steve Shattuck writes: > Date: Thu, 22 Nov 2001 10:41:55 +1100 >[...] > 4) At the Sydney meeting I thought we agreed to restart this discussion with > specific examples based on real-life characters (in the biological sense). > This current thread would seem to be very much a continuation of the one > started 2 years ago: 80% focus on solutions (largely round XML syntax) and > 20% on specific problems (addressing business needs). I fear we'll end up > the same place we were before. Amen. More explicitly, what was posed was to provide "data challenges", i.e. (as I understood) problems about characters that the poser feels important to model. Implicit (as I understood) in that discussion---and in almost all of this recent thread---is that a particular representation (e.g. XML) is NOT what is initially desired. If, indeed, my understandings were correct, and if a sufficiently inclusive set of challenges were identified, what would then be accomplished sets the stage exactly for the modeling Leigh urges. Further, this complies with one of the admonitions in Rob Atkinson's talk to first find out what you are trying to make a standard about, i.e.: XML is the answer. What is the question? 
Finally, I would paraphrase what I understood the subgroup meeting arrived at: The first job is to identify with these data challenges the strengths and the limitations of what DELTA-motivated applications can accomplish, rather than throw out the experience of DELTA and start over. Probably Gregor's meeting minutes will reveal whether that understanding is correct and perhaps how it came to be. --Bob Morris From trainor at UIC.EDU Wed Nov 21 18:22:06 2001 From: trainor at UIC.EDU (Douglas Trainor) Date: Wed, 21 Nov 2001 18:22:06 Subject: is there an "xml-include" Message-ID: Amen. I'm not a taxonomist, but I suggest that several folks on this list take some of their current character data and make an example out of it in XML and comment on the example. It need not be perfect. This could quickly educate others about the strengths and potential weaknesses of an approach with a specific examplar. XML will make it very nice to move data between different applications, or between component parts of a large application. I just want to write nifty plant/insect applications against the XML-wrapped character data. Going right off for a perfect Holy Grail has sabotaged other group projects in the past. One of the best examples is how Algol language folks got into bitter debates about when an object is a chair and when it's a table. douglas Steve Shattuck wrote: > There seems to be a number of problems being addressed in the current > discussion: > > 1) Gregor's original post asked for a solution ("how do I do an include in > XML") without posing a problem ("how do I maintain a separate global > character list and link a number of independent taxon descriptions to it"). > I would suggest that using an "include" is only one of a range of solutions > to this problem. And if this isn't the problem then we should revisit it > before spending too much time coming up with a solution. > > 2) Whether XML is hierarchical or relational is of little importance in > practice. 
All major relational databases can suck in and spit out XML - > it's a non-issue from an implementation standpoint (which is what we are > ultimately concerned with here). IDs and RefIDs are primary and foreign > keys - the mapping is one-to-one if you want to structure the XML that way. > > 3) Leigh is spot-on when he says "define a model for the data" as this is > "separate to the details of its [the model's] syntax". The syntax (= XML > model) will be useless if it doesn't manage the information that's important > to us no matter how "rigorous" or "technically accurate" it is. > > 4) At the Sydney meeting I thought we agreed to restart this discussion with > specific examples based on real-life characters (in the biological sense). > This current thread would seem to be very much a continuation of the one > started 2 years ago: 80% focus on solutions (largely round XML syntax) and > 20% on specific problems (addressing business needs). I fear we'll end up > the same place we were before. > > Thanks, Steve > > Steve Shattuck > CSIRO Entomology > biolink at ento.csiro.au > > P.S. - A solution to Gregor's problem would be to use unique identifiers > (GUIDs) in his global character list. Any one using this list would then > include these identifiers as part of taxon descriptions along with metadata > (a citation) to find where the global list is housed. If you then want to > combine separate description datasets you can compare the character > identifiers: if they are the same then it's the same character, if they are > different then it's not the same character. It's a simple, foolproof > solution that's easy to implement technically (whether people will actually > do this is a completely different matter ;-) ). > > P.P.S. - I'm well aware that I have the power to direct this discussion by > actually doing what I suggest in No. 4 above. And I really, really intend > to do it. I just have one more meeting to get through, and then another 10 > emails, and then .... 
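
The GUID idea in the quoted P.S. can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not something posted to the list: the record layout and the names `new_character`, `shared_characters` and `merge` are hypothetical, and the taxon scores are invented for the example. It shows that two characters with the same wording ("clear spots in wings" in bees and in butterflies) stay distinct unless they carry the same identifier.

```python
import uuid


def new_character(label):
    """Create an entry for a global character list (hypothetical layout)."""
    return {"guid": str(uuid.uuid4()), "label": label}


def shared_characters(dataset_a, dataset_b):
    """Return the GUIDs used in both datasets.

    Each dataset maps a taxon name to a dict of {character GUID: value}.
    Identity is decided by GUID alone, never by the character's wording.
    """
    guids_a = {g for scores in dataset_a.values() for g in scores}
    guids_b = {g for scores in dataset_b.values() for g in scores}
    return guids_a & guids_b


def merge(dataset_a, dataset_b):
    """Combine two datasets; scores for the same taxon are pooled by GUID."""
    merged = {}
    for dataset in (dataset_a, dataset_b):
        for taxon, scores in dataset.items():
            merged.setdefault(taxon, {}).update(scores)
    return merged


# Same wording, two independent characters (invented example data).
bee_spots = new_character("clear spots in wings")
butterfly_spots = new_character("clear spots in wings")

bees = {"Apis": {bee_spots["guid"]: "absent"}}
butterflies = {"Hecatesia": {butterfly_spots["guid"]: "present"}}

print(shared_characters(bees, butterflies))  # empty set: nothing is shared
print(sorted(merge(bees, butterflies)))
```

Whether projects adopt a shared list is, as the P.S. says, a separate matter; the sketch only shows that comparing identifiers is cheap once they exist.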
From ldodds at INGENTA.COM Wed Nov 21 09:26:16 2001 From: ldodds at INGENTA.COM (Leigh Dodds) Date: Wed, 21 Nov 2001 9:26:16 Subject: is there an "xml-include" In-Reply-To: <3BFA89D5.12601.925A3E5@localhost> Message-ID: > I think this comes closest. Can you point me to information on > XInclude? As you say, it seems unclear whether a validating parser, > say msxml 4.0 would be able to check such id/idrefs. http://www.w3.org/TR/xinclude XInclude in Java: http://www.ibiblio.org/xml/XInclude/ Cheers, L. From ldodds at INGENTA.COM Wed Nov 21 09:23:03 2001 From: ldodds at INGENTA.COM (Leigh Dodds) Date: Wed, 21 Nov 2001 9:23:03 Subject: Modelling and XML Message-ID: As an outside but interested observer, I'm keen to contribute to this discussion in any way that seems useful, e.g. technical help with XML. Judging by the recent discussion there is progress being made on this front. Is there any publicly available documentation? I'll also interject 2p/euro/cents at this point: I believe I'd tackle this problem firstly from a modelling perspective: can we agree upon, and define a model for the data that we're capturing? It seems like there is a conceptual model inherent in the data currently captured by DELTA and other formats, that is separate to the details of its syntax. Having agreed a particular model, a suitable syntax for expressing that model would be the logical next step. XML seems like a good fit for this, and, as I've noted elsewhere, can serialise graphs of information. Cheers, L.
-- Leigh Dodds, Research Group, Ingenta | "Pluralitas non est ponenda http://weblogs.userland.com/eclectic | sine necessitate" http://www.xml.com/pub/xmldeviant | -- William of Ockham From ldodds at INGENTA.COM Wed Nov 21 09:15:02 2001 From: ldodds at INGENTA.COM (Leigh Dodds) Date: Wed, 21 Nov 2001 9:15:02 Subject: is there an "xml-include" In-Reply-To: <951DC0D57185D311B88B00A0C9E9258D04AED6@MIDNIGHT> Message-ID: At risk of distracting from the discussion at hand: > If you feel that descriptive data can only be modeled with > relations, then you can never get XML to work for you. It is not relational, it is strictly > hierarcical. It does not and, by design, cannot support the relational model. Do not > confuse XML with a database. It is -not- designed to be a database. The few databases that > use XML are all hierarcical XML is designed to model the structure of > documents. If you want to create a database, use a database > language. XML is what you use in the report. Well strictly speaking XML can be used to model/serialise graph structures, so while it is hierarchical, and the syntax does require a single root, there's no reason why this can't become a graph (or other) structure once parsed. There's been a lot of activity (unsurprisingly) in serialising XML into and out of databases. In general it's not an easy problem to solve, but it can be done. Also, I'd have to disagree with the statement that 'the few databases that use XML are all hierarchical'. Current XML-aware databases fall into two categories: XML-Enabled Databases (i.e. relational databases that have added XML features) and Native XML Databases (i.e. those specifically designed to store XML). Databases in the first category do tend to be primarily relational in nature. Those in the second are a mixture: they model the XML document internally, and this might be achieved using a relational model. Cheers, L. 
-- Leigh Dodds, Research Group, Ingenta | "Pluralitas non est ponenda http://weblogs.userland.com/eclectic | sine necessitate" http://www.xml.com/pub/xmldeviant | -- William of Ockham From ram at CS.UMB.EDU Wed Nov 21 05:52:21 2001 From: ram at CS.UMB.EDU (Robert A. (Bob) Morris) Date: Wed, 21 Nov 2001 5:52:21 Subject: is there an "xml-include" In-Reply-To: Message-ID: Leigh Dodds writes: > Date: Wed, 21 Nov 2001 09:15:02 -0000 > From: Leigh Dodds > To: TDWG-SDD at usobi.org > Subject: Re: is there an "xml-include" > > At risk of distracting from the discussion at hand: > >[...] > Well strictly speaking XML can be used to model/serialise graph structures, > so while it is hierarchical, and the syntax does require a single root, > there's > no reason why this can't become a graph (or other) structure once parsed. > > There's been a lot of activity (unsurprisingly) in serialising XML into and > out of databases. In general it's not an easy problem to solve, but it > can be done. People interested in how easily it can always be done, but why it is hard to do it well in general, might consult S. Abiteboul, P. Buneman, D. Suciu, "Data on the Web. From Relations to Semistructured Data and XML", Morgan Kaufmann Publishers, 2000, especially pp. 172 ff. From rousse at CCR.JUSSIEU.FR Tue Nov 20 17:08:09 2001 From: rousse at CCR.JUSSIEU.FR (Guillaume Rousse) Date: Tue, 20 Nov 2001 17:08:09 Subject: XML: is there an "xml-include" In-Reply-To: <3BFA3BC7.8791.7F4B146@localhost> Message-ID: Ainsi parlait Gregor Hagedorn : > If character are defined with an id (numeric or character) in one > file, and the item descriptions use these ids through idref or > through xml-schema keyref means: > > How can a validating parser validate the schema, including keyrefs, > without having to include the entire character definition in each of > 1000s of taxon description xml files? Can the ids for idref/keyref be > declared to be in a separate file? 
> > I could not find a standard xml-include command. I know there is one > defined for dtds, but that is all I could find. > > Can the xml experts help? I'll try to summarize the discussion we had during the field trip; maybe it will be clearer for everyone. You want to have a descriptors file (the model), validated against a descriptors schema, and a taxa file, validated against a taxa schema. But you also want to ensure coherency between the model and the taxa, such as referencing the correct descriptors, or the correct descriptor states. One way would be to use XSLT to perform this additional coherency enforcement. You can include the descriptors file in the taxa file before validating it, you can search for aberrant XPath results, etc. The other way I would suggest would be to drop the static taxa schema and dynamically generate it from the descriptors file using XSLT. Then you'll have one-step validation, but no longer a generic descriptive model. BTW, the second approach is the one used by the IKBS project (http://www.univ-reunion.fr/ikbs), even if they don't use XML. -- Guillaume Rousse GPG key http://lis.snv.jussieu.fr/~rousse/gpgkey.html From G.Hagedorn at BBA.DE Tue Nov 20 16:50:29 2001 From: G.Hagedorn at BBA.DE (Gregor Hagedorn) Date: Tue, 20 Nov 2001 16:50:29 Subject: is there an "xml-include" In-Reply-To: Message-ID: > This is a means to separate out a schema into multiple files, > if I understand Gregor correctly he's asking for a way to test for > id/keyref uniqueness across multiple files? Yes, and data files, not schema/dtd. One xml data file defines ids, the other points to these ids. In the case of descriptive data, 1000s of item description files should all point to the same, centrally defined character definitions. > At a higher level there is XInclude which allows for including > content from one document into another. Not sure on > current support for this however. I think this comes closest.
Can you point me to information on XInclude? As you say, it seems unclear whether a validating parser, say msxml 4.0 would be able to check such id/idrefs. Gregor ---------------------------------------------------------- Gregor Hagedorn (G.Hagedorn at bba.de) Institute for Plant Virology, Microbiology, and Biosafety Federal Research Center for Agriculture and Forestry (BBA) Koenigin-Luise-Str. 19 Tel: +49-30-8304-2220 14195 Berlin, Germany Fax: +49-30-8304-2203 Often wrong but never in doubt! From G.Hagedorn at BBA.DE Tue Nov 20 16:45:29 2001 From: G.Hagedorn at BBA.DE (Gregor Hagedorn) Date: Tue, 20 Nov 2001 16:45:29 Subject: XML: is there an "xml-include" In-Reply-To: <15354.16715.272698.46624@u11.cs.umb.edu> Message-ID: > Yes. See > http://www.w3.org/TR/xmlschema-0/#SchemaInMultDocs I think this is not possible. If I understand correctly, you are referring to schema, not idref in one xml document referring to id defined in another document. The schema would define the fact that two attribute (say cid and cidref) have to be interpreted as key and keyref, but not the actual values. ---------------------------------------------------------- Gregor Hagedorn (G.Hagedorn at bba.de) Institute for Plant Virology, Microbiology, and Biosafety Federal Research Center for Agriculture and Forestry (BBA) Koenigin-Luise-Str. 19 Tel: +49-30-8304-2220 14195 Berlin, Germany Fax: +49-30-8304-2203 Often wrong but never in doubt! From kerrybarringer at BBG.ORG Tue Nov 20 16:36:36 2001 From: kerrybarringer at BBG.ORG (Barringer, Kerry) Date: Tue, 20 Nov 2001 16:36:36 Subject: is there an "xml-include" Message-ID: Gregor, We disagree and I'd like to get to the reason why because I am sure that both of us are not fully understanding what the other is saying. I agree that a character name is not necessarily unique across different taxa. A character is what a person defines it to be. 
But I don't understand what you mean when you say that the name cannot stand for the character, that only a code can. This seems absurd. I don't see the difference between my 'name' and your 'code', especially when your code is just a compact coding that refers to a name. In your haircolor example, I think you are misunderstanding the hierarchical nature of XML coding. Each character is part of another, and all are eventually part of a 'root', in our case probably a plant. So the character 'plant' has the characters 'stems', 'leaves', etc. 'Leaves' may contain the character 'trichomes', which may contain the character 'color'. If I understand what you wrote, your criticism applies to a relational model where the character 'haircolor' might be used for leaf trichomes, fine enations inside a fruit, or the body covering of mammals. It is impossible to define a character this way in an XML file because it is strictly hierarchical, not relational. If you feel that descriptive data can only be modeled with relations, then you can never get XML to work for you. It is not relational, it is strictly hierarchical. It does not and, by design, cannot support the relational model. Do not confuse XML with a database. It is -not- designed to be a database. The few databases that use XML are all hierarchical. XML is designed to model the structure of documents. If you want to create a database, use a database language. XML is what you use in the report. My feeling is that the hierarchical nature of XML might be suitable for plant descriptions. I know it does a good job coding taxonomic papers and taxonomic nomenclature. It is less suitable, though still workable, for specimen data, which is easier to model with relational tables. To use XML though, I think we must look outside the DELTA model of character coding. XML won't improve the DELTA model or fix its faults. We must also look beyond the database model. Plant descriptions are highly structured documents.
XML can model that structure. Finally, we have to get beyond the wordiness of XML files. They are designed to be that way. we can only live with it. Let me know if this makes sense and be sure to correct me if you feel I am being too stupid. all the best, Kerry -----Original Message----- From: Gregor Hagedorn [mailto:G.Hagedorn at BBA.DE] Sent: Tuesday, November 20, 2001 10:31 AM To: TDWG-SDD at USOBI.ORG Subject: Re: is there an "xml-include" Dear Kerry, I can not see how what you propose could work. The name of a character is not necessarily unique across different taxa, and often not even within a taxon. Further the character concept and its names (technical, laymen, English, German, etc.) are a 1:n relation. The name can not stand for the character, only a code can. I do not care whether this code is characters or numbers, but I believe it is a mistake to think if something is to take the code at face value and assume you know what are hairs and what is color. Hairs are quite different things in plants, animals, or fungi, and color needs information whether it is a code from a color comparison chart, or an undefined term like "red". I think I am sticking less with a DELTA storage optimization model, but with a relational information model, which is what I have been using for all my projects. The relational model allows language independence. How can you preserve that, without having unique codes that lead to the definition of a character? That does not mean, that a free text description in some language, say Chinese, may be present, in addition to the data. That is why I am thinking of attributes, not element data. Gregor ---------------------------------------------------------- Gregor Hagedorn (G.Hagedorn at bba.de) Institute for Plant Virology, Microbiology, and Biosafety Federal Research Center for Agriculture and Forestry (BBA) Koenigin-Luise-Str. 19 Tel: +49-30-8304-2220 14195 Berlin, Germany Fax: +49-30-8304-2203 Often wrong but never in doubt! 
From G.Hagedorn at BBA.DE Tue Nov 20 16:30:53 2001 From: G.Hagedorn at BBA.DE (Gregor Hagedorn) Date: Tue, 20 Nov 2001 16:30:53 Subject: is there an "xml-include" In-Reply-To: <951DC0D57185D311B88B00A0C9E9258D04AED4@MIDNIGHT> Message-ID: Dear Kerry, I cannot see how what you propose could work. The name of a character is not necessarily unique across different taxa, and often not even within a taxon. Further, the character concept and its names (technical, lay, English, German, etc.) stand in a 1:n relation. The name cannot stand for the character; only a code can. I do not care whether this code consists of letters or numbers, but I believe it is a mistake to take the code at face value and assume you know what hairs are and what color is. Hairs are quite different things in plants, animals, or fungi, and color needs information on whether it is a code from a color comparison chart or an undefined term like "red". I think I am sticking less with a DELTA storage optimization model than with a relational information model, which is what I have been using for all my projects. The relational model allows language independence. How can you preserve that without having unique codes that lead to the definition of a character? That does not mean that a free-text description in some language, say Chinese, cannot also be present in addition to the data. That is why I am thinking of attributes, not element data. Gregor ---------------------------------------------------------- Gregor Hagedorn (G.Hagedorn at bba.de) Institute for Plant Virology, Microbiology, and Biosafety Federal Research Center for Agriculture and Forestry (BBA) Koenigin-Luise-Str. 19 Tel: +49-30-8304-2220 14195 Berlin, Germany Fax: +49-30-8304-2203 Often wrong but never in doubt!
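Gregor's point about codes and language independence can be illustrated with a small sketch. Everything named here is an assumption made for illustration: the codes `c17`, `s1`, `s2`, the dictionary layout, and the `render` helper do not come from DELTA or from any schema discussed on this list. The description stores only codes, and the names in each language are looked up from the central character definition.

```python
# Hypothetical central character definitions: one code, many names (1:n).
CHARACTERS = {
    "c17": {
        "labels": {"en": "hair color", "de": "Haarfarbe"},
        "states": {
            "s1": {"en": "red", "de": "rot"},
            "s2": {"en": "white", "de": "weiss"},
        },
    },
}

# A taxon description holds codes only, so it is language independent.
DESCRIPTION = {"taxon": "Taxon 1", "scores": {"c17": "s1"}}


def render(description, lang):
    """Resolve character and state codes to names in the requested language."""
    lines = []
    for cid, sid in description["scores"].items():
        character = CHARACTERS[cid]
        label = character["labels"][lang]
        state = character["states"][sid][lang]
        lines.append(f"{label}: {state}")
    return lines


english = render(DESCRIPTION, "en")
german = render(DESCRIPTION, "de")
```

The same stored description yields an English and a German rendering, and a free-text description in any language could still be stored alongside the coded data.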
From heidorn at ALEXIA.LIS.UIUC.EDU Tue Nov 20 15:00:46 2001 From: heidorn at ALEXIA.LIS.UIUC.EDU (Bryan Heidorn) Date: Tue, 20 Nov 2001 15:00:46 Subject: XML: is there an "xml-include" In-Reply-To: <3BFA3BC7.8791.7F4B146@localhost> Message-ID: At 11:17 AM 11/20/01 +0100, you wrote: >If character are defined with an id (numeric or character) in one >file, and the item descriptions use these ids through idref or >through xml-schema keyref means: > >How can a validating parser validate the schema, including keyrefs, >without having to include the entire character definition in each of >1000s of taxon description xml files? Can the ids for idref/keyref be >declared to be in a separate file? The character definition for a class of taxa, including the list of valid states could be included as part of the character-schema. An "include" can be used to insert the character schema within a taxon schema. The character schema would allow the validator to enforce the limitations on any particular xml instance file. As pointed out by Kerry Barringer I do not believe that there is a need in XML to use an additional level of indirection to include the item definitions in another file. This will cost space but there is nothing to keep the XML from being stored in a relational database and constructed on the fly as needed. If we want to enforce relational normalization it might best be done in the underlying database not in the xml instantiation of a view of the data. It might be cleaner if the XML validator could do this but for the purposes of data interchange and even interoperability, I think it is still useful for the xml validator to be able to check the xml documents even without the relational constraints. What is lost is that if you are moving data from one database to another, a program might not be able to easily recognize data relations and the receiving database might inefficiently save the data in a different database structure from the originating database. 
Bryan Heidorn >I could not find a standard xml-include command. I know there is one >defined for dtds, but that is all I could find. > >Can the xml experts help? > >Gregor > >PS I will post minutes of the TDWG meeting within the next days >---------------------------------------------------------- >Gregor Hagedorn (G.Hagedorn at bba.de) >Institute for Plant Virology, Microbiology, and Biosafety >Federal Research Center for Agriculture and Forestry (BBA) >Koenigin-Luise-Str. 19 Tel: +49-30-8304-2220 >14195 Berlin, Germany Fax: +49-30-8304-2203 > >Often wrong but never in doubt! -- -------------------------------------------------------------------- P. Bryan Heidorn Graduate School of Library and Information Science pheidorn at uiuc.edu University of Illinois at Urbana-Champaign MC-493 (V)217/ 244-7792 Rm. 221, 501 East Daniel St., Champaign, IL 61820-6212 (F)217/ 244-3302 http://alexia.lis.uiuc.edu/~heidorn Calendar: http://calendar.yahoo.com/pbheidorn Visit the Biobrowser Web site at http://www.biobrowser.org From ldodds at INGENTA.COM Tue Nov 20 12:29:42 2001 From: ldodds at INGENTA.COM (Leigh Dodds) Date: Tue, 20 Nov 2001 12:29:42 Subject: is there an "xml-include" In-Reply-To: <15354.16715.272698.46624@u11.cs.umb.edu> Message-ID: This is a means to separate out a schema into multiple files, if I understand Gregor correctly he's asking for a way to test for id/keyref uniqueness across multiple files? If so, then you can include chunks of XML in other documents using the dtd entity mechanism -- which is what I believe he has already tried. I don't believe there's actually an equivalent for XML Schema: you still need a DTD. At a higher level there is XInclude which allows for including content from one document into another. Not sure on current support for this however. (Assuming I'm on the right track) You may want to consider using an additional validation mechanism here, such as XLinkIt [1] which can make assertions (e.g. 
id uniqueness) across multiple documents. Hope that helps, [1]. http://www.xlinkit.com/main.html > -----Original Message----- > From: TDWG - Structure of Descriptive Data [mailto:TDWG-SDD at usobi.org]On > Behalf Of Robert A. (Bob) Morris > Sent: 20 November 2001 11:41 > To: TDWG-SDD at usobi.org > Subject: XML: is there an "xml-include" > > > Yes. See > http://www.w3.org/TR/xmlschema-0/#SchemaInMultDocs > > > Gregor Hagedorn writes: > > Date: Tue, 20 Nov 2001 11:17:27 +0100 > > From: Gregor Hagedorn > > To: TDWG-SDD at usobi.org > > Subject: XML: is there an "xml-include" > > > > If character are defined with an id (numeric or character) in one > > file, and the item descriptions use these ids through idref or > > through xml-schema keyref means: > > > > How can a validating parser validate the schema, including keyrefs, > > without having to include the entire character definition in each of > > 1000s of taxon description xml files? Can the ids for idref/keyref be > > declared to be in a separate file? > > > > I could not find a standard xml-include command. I know there is one > > defined for dtds, but that is all I could find. > > > > Can the xml experts help? > > > > Gregor > > > > PS I will post minutes of the TDWG meeting within the next days > > ---------------------------------------------------------- > > Gregor Hagedorn (G.Hagedorn at bba.de) > > Institute for Plant Virology, Microbiology, and Biosafety > > Federal Research Center for Agriculture and Forestry (BBA) > > Koenigin-Luise-Str. 19 Tel: +49-30-8304-2220 > > 14195 Berlin, Germany Fax: +49-30-8304-2203 > > > > Often wrong but never in doubt! 
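The cross-document check discussed in this message can also be done in application code, outside the validating parser. The sketch below is neither XInclude nor XLinkIt; it is a minimal stand-in that uses Python's standard `xml.etree.ElementTree`. The element names (`characters`, `character`, `taxon`, `score`) are hypothetical, and the attribute names `cid` and `cidref` are borrowed from Gregor's example elsewhere in the thread. The two documents are inlined as strings so the example is self-contained; in practice they would be separate files.

```python
import xml.etree.ElementTree as ET

# Central character definitions (would normally live in one shared file).
CHARACTERS_XML = """<characters>
  <character cid="c1" name="leaf shape"/>
  <character cid="c2" name="petal color"/>
</characters>"""

# One of many taxon description files pointing at the shared definitions.
TAXON_XML = """<taxon name="Taxon 1">
  <score cidref="c1" value="ovate"/>
  <score cidref="c9" value="blue"/>
</taxon>"""


def defined_ids(characters_root):
    """Return the set of character ids, rejecting duplicate definitions."""
    ids = [el.get("cid") for el in characters_root.iter("character")]
    duplicates = {i for i in ids if ids.count(i) > 1}
    if duplicates:
        raise ValueError(f"duplicate character ids: {sorted(duplicates)}")
    return set(ids)


def dangling_refs(taxon_root, ids):
    """Return every cidref in a description that has no matching cid."""
    refs = {el.get("cidref") for el in taxon_root.iter("score")}
    return sorted(refs - ids)


ids = defined_ids(ET.fromstring(CHARACTERS_XML))
missing = dangling_refs(ET.fromstring(TAXON_XML), ids)
```

In this sample `c9` is referenced but never defined, so it is reported in `missing`. Each description file can be checked this way without copying the character definitions into it.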
> > From G.Hagedorn at BBA.DE Tue Nov 20 11:17:27 2001 From: G.Hagedorn at BBA.DE (Gregor Hagedorn) Date: Tue, 20 Nov 2001 11:17:27 Subject: XML: is there an "xml-include" In-Reply-To: <3BE6C22A.20968.A2ABC7@localhost> Message-ID: If character are defined with an id (numeric or character) in one file, and the item descriptions use these ids through idref or through xml-schema keyref means: How can a validating parser validate the schema, including keyrefs, without having to include the entire character definition in each of 1000s of taxon description xml files? Can the ids for idref/keyref be declared to be in a separate file? I could not find a standard xml-include command. I know there is one defined for dtds, but that is all I could find. Can the xml experts help? Gregor PS I will post minutes of the TDWG meeting within the next days ---------------------------------------------------------- Gregor Hagedorn (G.Hagedorn at bba.de) Institute for Plant Virology, Microbiology, and Biosafety Federal Research Center for Agriculture and Forestry (BBA) Koenigin-Luise-Str. 19 Tel: +49-30-8304-2220 14195 Berlin, Germany Fax: +49-30-8304-2203 Often wrong but never in doubt! From ram at CS.UMB.EDU Tue Nov 20 11:02:34 2001 From: ram at CS.UMB.EDU (Robert A. (Bob) Morris) Date: Tue, 20 Nov 2001 11:02:34 Subject: is there an "xml-include" In-Reply-To: Message-ID: Ah, you are correct. I read his request too quickly. Both Apache Cocoon and Microsoft .NET claim to support XInclude but I am not sure if there are any pure client-side implementations. Also, it is unclear at (another too quick) glance what the validation issues may be with either of these. Leigh Dodds writes: > Date: Tue, 20 Nov 2001 12:29:42 -0000 > From: Leigh Dodds > To: TDWG-SDD at usobi.org > Subject: Re: is there an "xml-include" > > This is a means to separate out a schema into multiple files, > if I understand Gregor correctly he's asking for a way to test for > id/keyref uniqueness across multiple files? 
> > If so, then you can include chunks of XML in other documents > using the dtd entity mechanism -- which is what I believe he > has already tried. I don't believe there's actually an equivalent > for XML Schema: you still need a DTD. > > At a higher level there is XInclude which allows for including > content from one document into another. Not sure on > current support for this however. > > (Assuming I'm on the right track) You may want to consider > using an additional validation mechanism here, such as XLinkIt [1] > which can make assertions (e.g. id uniqueness) across multiple > documents. > > Hope that helps, > > [1]. http://www.xlinkit.com/main.html > > > -----Original Message----- > > From: TDWG - Structure of Descriptive Data [mailto:TDWG-SDD at usobi.org]On > > Behalf Of Robert A. (Bob) Morris > > Sent: 20 November 2001 11:41 > > To: TDWG-SDD at usobi.org > > Subject: XML: is there an "xml-include" > > > > > > Yes. See > > http://www.w3.org/TR/xmlschema-0/#SchemaInMultDocs > > > > > > Gregor Hagedorn writes: > > > Date: Tue, 20 Nov 2001 11:17:27 +0100 > > > From: Gregor Hagedorn > > > To: TDWG-SDD at usobi.org > > > Subject: XML: is there an "xml-include" > > > > > > If character are defined with an id (numeric or character) in one > > > file, and the item descriptions use these ids through idref or > > > through xml-schema keyref means: > > > > > > How can a validating parser validate the schema, including keyrefs, > > > without having to include the entire character definition in each of > > > 1000s of taxon description xml files? Can the ids for idref/keyref be > > > declared to be in a separate file? > > > > > > I could not find a standard xml-include command. I know there is one > > > defined for dtds, but that is all I could find. > > > > > > Can the xml experts help? 
> > > > > > Gregor > > > > > > PS I will post minutes of the TDWG meeting within the next days > > > ---------------------------------------------------------- > > > Gregor Hagedorn (G.Hagedorn at bba.de) > > > Institute for Plant Virology, Microbiology, and Biosafety > > > Federal Research Center for Agriculture and Forestry (BBA) > > > Koenigin-Luise-Str. 19 Tel: +49-30-8304-2220 > > > 14195 Berlin, Germany Fax: +49-30-8304-2203 > > > > > > Often wrong but never in doubt! > > > > From kerrybarringer at BBG.ORG Tue Nov 20 09:44:53 2001 From: kerrybarringer at BBG.ORG (Barringer, Kerry) Date: Tue, 20 Nov 2001 9:44:53 Subject: is there an "xml-include" Message-ID: Gregor, My understanding of XML is that indeed, the character name and states should be written out fully in all descriptions. Characters and states should not be placed in a separate file as in DELTA. The 'efficiency' of the DELTA model is really a trade-off, optimizing storage over processing. The makers of the XML model deliberately chose not to optimize storage and to create wordy XML files to optimize processing. File storage is not viewed as being a limiting factor any more, as it was when DELTA was designed. There are a few ways to include data from other files and DTDs, but these are best used to create templates and to allow for differently structured DTDs in the same document. Say you have a DTD describing a taxonomic paper. That DTD can easily reference separate DTDs for taxonomic nomenclature, plant specimens, and taxon descriptions, combining the results, which can then be transformed into a full taxonomic treatment. I think it is a mistake to try to directly translate the DELTA data model into XML. The XML model is different. It is more flexible, more readable, and more easily processed. It can also model hierarchies, though its relational capabilities are poor. The focus should be on developing a good XML model of taxonomic data.
If necessary, it can be easily translated into DELTA format, or any other format, using XSL and can be processed by the available XML tools. with Best regards Kerry Barringer Brooklyn Botanic Garden -----Original Message----- From: Gregor Hagedorn [mailto:G.Hagedorn at BBA.DE] Sent: Tuesday, November 20, 2001 5:17 AM To: TDWG-SDD at USOBI.ORG Subject: XML: is there an "xml-include" If character are defined with an id (numeric or character) in one file, and the item descriptions use these ids through idref or through xml-schema keyref means: How can a validating parser validate the schema, including keyrefs, without having to include the entire character definition in each of 1000s of taxon description xml files? Can the ids for idref/keyref be declared to be in a separate file? I could not find a standard xml-include command. I know there is one defined for dtds, but that is all I could find. Can the xml experts help? Gregor PS I will post minutes of the TDWG meeting within the next days ---------------------------------------------------------- Gregor Hagedorn (G.Hagedorn at bba.de) Institute for Plant Virology, Microbiology, and Biosafety Federal Research Center for Agriculture and Forestry (BBA) Koenigin-Luise-Str. 19 Tel: +49-30-8304-2220 14195 Berlin, Germany Fax: +49-30-8304-2203 Often wrong but never in doubt! From ram at CS.UMB.EDU Tue Nov 20 06:40:59 2001 From: ram at CS.UMB.EDU (Robert A. (Bob) Morris) Date: Tue, 20 Nov 2001 6:40:59 Subject: XML: is there an "xml-include" In-Reply-To: <3BFA3BC7.8791.7F4B146@localhost> Message-ID: Yes. 
See http://www.w3.org/TR/xmlschema-0/#SchemaInMultDocs Gregor Hagedorn writes: > Date: Tue, 20 Nov 2001 11:17:27 +0100 > From: Gregor Hagedorn > To: TDWG-SDD at usobi.org > Subject: XML: is there an "xml-include" > > If character are defined with an id (numeric or character) in one > file, and the item descriptions use these ids through idref or > through xml-schema keyref means: > > How can a validating parser validate the schema, including keyrefs, > without having to include the entire character definition in each of > 1000s of taxon description xml files? Can the ids for idref/keyref be > declared to be in a separate file? > > I could not find a standard xml-include command. I know there is one > defined for dtds, but that is all I could find. > > Can the xml experts help? > > Gregor > > PS I will post minutes of the TDWG meeting within the next days > ---------------------------------------------------------- > Gregor Hagedorn (G.Hagedorn at bba.de) > Institute for Plant Virology, Microbiology, and Biosafety > Federal Research Center for Agriculture and Forestry (BBA) > Koenigin-Luise-Str. 19 Tel: +49-30-8304-2220 > 14195 Berlin, Germany Fax: +49-30-8304-2203 > > Often wrong but never in doubt! > From G.Hagedorn at BBA.DE Mon Nov 5 16:45:30 2001 From: G.Hagedorn at BBA.DE (Gregor Hagedorn) Date: Mon, 5 Nov 2001 16:45:30 Subject: Meeting in Sydney Message-ID: Bob had asked about a pre-meeting, which was difficult to organize because of two other pre-meetings, one about accessions. I propose to get some extra working time by meeting on Sunday after the TDWG meeting at about 13:00 for a joint lunch, open-ended for those who are staying anyway. I will be available during the afternoon for anybody who wants to join me. Depending on the interest, I will either try to organize a room, or we could just move into a café. Please contact me at the meeting if you are interested.
Gregor ---------------------------------------------------------- Gregor Hagedorn (G.Hagedorn at bba.de) Institute for Plant Virology, Microbiology, and Biosafety Federal Research Center for Agriculture and Forestry (BBA) Koenigin-Luise-Str. 19 Tel: +49-30-8304-2220 14195 Berlin, Germany Fax: +49-30-8304-2203 Often wrong but never in doubt! From G.Hagedorn at BBA.DE Mon Nov 5 16:31:05 2001 From: G.Hagedorn at BBA.DE (Gregor Hagedorn) Date: Mon, 5 Nov 2001 16:31:05 Subject: General Message-ID: Dear colleagues, this message is from the so-called convener of the sdd group, who feels very guilty about his incapacity to organize more and invest more work. As you probably know, the topic of the group is very close to my interest, being also the topic of my proposed thesis. However, in my current situation this seems to make it rather more difficult for me. The GLOPP project I am leading here in Germany is over demanding my resources, i.e. the project is in danger of failing. I can not keep the deadlines there, and I do not find the time to continue the thesis work. So SDD has become part of a big guilt complex. I would love to do more, but I am afraid I have not done enough for a long time now. So, I will be at the TDWG congress in Sydney, and I hope we have a working group discussion there. If anybody wants to take up the lead of the group, she or he will have my support for that task. I hope we can still use the time in Sydney for fruitful discussions! Best wishes Gregor ---------------------------------------------------------- Gregor Hagedorn (G.Hagedorn at bba.de) Institute for Plant Virology, Microbiology, and Biosafety Federal Research Center for Agriculture and Forestry (BBA) Koenigin-Luise-Str. 19 Tel: +49-30-8304-2220 14195 Berlin, Germany Fax: +49-30-8304-2203 Often wrong but never in doubt!