Re: HyperSpace shuttles and hyperbikes
Kevin,
I think the approach that Stan outlines, looking at specific relationships among entities, should make much of the complication seem a bit less of an obstacle. We need to better understand the properties of our "data objects" and the various ways they are associated. If we define our vision too narrowly, and here is perhaps where I would disagree with you and Stan, we will end up with a result that may not be as general and flexible as we might need later. We run the risk that it will be difficult to adapt to more complex relations. However, I do believe Stan is correct in that we need to first focus on the simplest relations among entities. This is probably a good idea for two reasons: 1) if we have started with the right notion of "fundamental objects", at least some of their interactions will be simple (also read general, ie less constrained), if we have evaluated them correctly, and 2) simple relations will likely be more efficiently implemented, and at the core of our "hyperspace shuttle" (hyperbike?) we will require efficiency, since the number of data sets and transformations is already large.
If I understand the rationale for engaging in this exercise, it is to permit users to search, select and "interoperate" with a variety of different data types and structures so that data can be mined and new insights gained. I believe we agree that we need more than a translator, but how much more, I believe, will remain unclear until we have a better handle on exactly what properties these various entities (features and the data structures used to represent them) actually have. This is not to say that studying carefully how specific translators may work (ie Lucid to DELTA or reverse or Lucid to NEXUS, etc.) will not be informative for elucidating the generalities that are required. On the contrary, the approach you suggest is quite a workable one and may well suffice for more limited ends. My earlier comments were directed at permitting the structure of the specifications to extend to more general data types than we might be able to currently envision. I reiterate that the general structure of the "standard descriptor language" should not restrict progress along the lines you suggest, only that it would be useful to adopt a general approach that could subsequently incorporate such extended functionality. I guess it stems from my need (hope) for mechanisms that will help me integrate loosely structured text (in a web-based environment) and more quantitative (operational) means of structuring data.
At the conceptual level, the notion of "character" is central to the discussion of what we mean by taxonomic information. I believe we need to have a recommendation for a specification that taxonomists/systematists can adopt to include the "universe" of such characters, not that all are equal or equally important, or utilized by a given implementation. The recommendation probably also needs to be simple enough that most can understand and use it. Nonetheless, there appear to be many different kinds of characters and here is where things start to get hard. Some, such as cladistic characters, will have a basis of comparison that maps the characters into specific disjoint character states. Others will be mensural or countable in nature, with a basis of comparison that defines an axis of variation and may or may not exhibit the property of disjunction (disjunction may be a property of sampling). In both cases they could be represented by a vector. However, we must understand the rules of accessing each element and whether or not specific relations can be drawn from manipulation across vector types. In the case of the former they can be represented by text or by tags (ie state A vs. B, or 0 and 1). The text used may be circumscribed via controlled vocabularies to permit analogy, if not direct comparison, with quantified characters (ie long vs short, round vs square, etc). Admittedly, this will be difficult because biologists have most often used "local" reference frames when using free text in this way, rather than any sort of "universally" applicable frame that we might be speaking about for the first time here. Consequently, "round" or "large" can have very different meanings. However, even here we can probably safely conclude that the dimensionality of primary interest is 3D, at least when attempting to characterize the size and shapes of features.
While the number of permutations may be extremely large, it may well be possible to circumscribe characters with relatively few "tags", if we can define common rules for the use of these "tags". Perhaps the tags might define a classification of characters (ie <qualitative character>, <quantitative character>). In the former one may have <cladistic characters> and <general qualitative>, and among these <cladistic characters> might be further subdivided into <two state> or <multi-state>. The latter may be tagged as <fully ordered> or <partially ordered>, whereas the former as well as the latter may also be tagged as <directed> or <undirected>. Similarly, cladistic characters will have other properties that could likewise be tagged: <name>, <study used>, <basis of comparison>, <textual description>, <date first proposed>, <representation of states (ie 0's and 1's or some other tagging scheme)>, some of which might have multiple entries, such as <name (perhaps different for two different investigators, but the same "character" nonetheless)> or <study used>. Not all tags might get used. However, if they are used, a standard must specify how they are to be used.
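A minimal sketch of how such nested tags might look in practice, assuming Python's standard xml.etree.ElementTree module. The element and attribute names (cladistic_character, two_state, ordering, polarity, study_used) and the example character are hypothetical, chosen only to mirror the tags listed above; no existing standard defines them.

```python
import xml.etree.ElementTree as ET


def tag_cladistic_character(name, states, ordering=None, directed=False, studies=()):
    """Wrap one cladistic character in nested, hypothetical descriptor tags."""
    root = ET.Element("qualitative_character")
    clad = ET.SubElement(root, "cladistic_character")
    ET.SubElement(clad, "name").text = name

    # Two-state vs multi-state follows from the number of states supplied.
    kind = "two_state" if len(states) == 2 else "multi_state"
    kind_el = ET.SubElement(clad, kind)
    if kind == "multi_state" and ordering is not None:
        # Assumed values: "fully_ordered" or "partially_ordered".
        kind_el.set("ordering", ordering)
    kind_el.set("polarity", "directed" if directed else "undirected")

    # Representation of states: a code (0, 1, ...) plus free text.
    for code, text in states:
        ET.SubElement(kind_el, "state", code=str(code)).text = text

    # A tag such as <study used> may have multiple entries.
    for study in studies:
        ET.SubElement(clad, "study_used").text = study
    return root


example = tag_cladistic_character(
    "pectoral fin shape",                      # hypothetical character
    [(0, "round"), (1, "pointed")],
    studies=["hypothetical study A", "hypothetical study B"],
)
xml_text = ET.tostring(example, encoding="unicode")
print(xml_text)
```

The point of the sketch is only that the nesting rules live in one place (the function), so that tags which are used are always used in the same way.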
Characters also have other properties such as scale. Although scale is most informative in a measured sense (Angstroms, microns, millimeters, meters, and nanograms, grams, kilograms, etc.), we would likely have to establish relations among a variety of different kinds of reference scales (ie molecular, subcellular, cellular, tissue, organ, organismal, ecosystem), which are variously defined in a broad array of contexts. Where possible, it would be useful to attempt to at least loosely interoperate across such reference frames, say for purposes of a broad search. Characters also have scope in that they are mapped onto organisms. Consequently, conventions are also necessary to circumscribe what we mean by taxon and how we deal with issues of rank (and perhaps newer philosophies of "rankless" classification that seek to deprecate the concept of rank). Although multiple classifications and synonyms make both kinds of mapping complex, "standardized" lists of names for molecules, structures, and taxa do exist that we can use to simplify some of the complexity. Characters also exist through time, but again absolute scales are often wanting and we will likely need to deal with the period of time in which a character exists (existed) in relation to relative reference frames. Obviously, in each case we require a means to specify the nature of the standardization, much as XML namespaces attempt to deal with the potential confounding of elements of the same "name". Here it would be helpful to have a "standard list of standard lists" so that we can begin to understand how ungainly such an approach might actually be. Some, such as the geological reference scale and the ITIS names list, are obvious candidates, but it would be nice to begin to assemble the URLs that we might have to visit to actually understand the details.
While not all implementations would need to utilize all "implicit" tags, the nestedness of a particular tagging scheme would require specification (standardization). Such an approach does lend itself nicely to parsing and to object-oriented definition. At the level of implementation, it would not be hard to imagine, for example, Lucid or Delta passing their own internal representation to a parser that uses the specification inherent in the internal representation to wrap characters with (XML) tags that could then be either translated to other data formats, or processed according to previously specified search criteria or a set of processing instructions that might perform another kind of association, data mining, or analysis. Similarly, it would be relatively easy to take range data from a given CAT scan or structured light range-sensor, or graphical data in gif or jpg files, and similarly tag (annotate) it so that particular features of the data could be given "equivalent" meaning for association with otherwise qualitative data. Admittedly, getting the machine to do the annotation remains, except in limited circumstances, a dream for the heavens, and I too would just settle for a "hyperbike" (with training wheels) at this stage in the development of "hyperspace shuttle architecture".
Stuart
Kevin Thiele wrote:
Dear Stuart,
I agree completely with everything you say, but it worries me all the same. You point out the complexity of descriptive data and the enormity of the task of completely capturing it. But we need to get something done, and I think we need some incremental stages.
Your suggestion as to maintaining threads of discussion is not unlike the way the list was running before it fell over. Some of the threads did indeed morph into monsters, others got lost and I think many people with them. I'd really like to try for a while keeping the discussion focused on the document with the proposed list of elements, to glean suggestions from people as to whether it's completely inadequate or what. At the same time, of course, I don't want to constrain people to run with this or with my suggested way of doing things. I may well be way off the track of what's possible or achievable. Working up the document may provide us with an incremental advance, or it may be that such incremental advances are not worth achieving and your suggestion for a great leap forward is the way to go. My way of looking at it is that if DELTA is a bicycle, I'm proposing a motor bike, and you're sketching out plans for a space shuttle. Maybe I'm not being visionary enough?
It seems to me that there's an old way of describing something, and many possible new ways. The old way is with a set of characters with values (states) applied to a set of taxa. This is the form of DELTA data, Lucid data, textual descriptions (in a way). Updating our standard for this way of describing is achievable now, I think.
New ways of describing something, such as with 3-D tomographic imaging etc, may well be the way of the future. But I'm not sure that we can have one descriptive standard that encompasses both old and new ways under one roof. This is why we need extensibility - can we take an incremental step along the lines that I'm proposing while allowing for the future brave new world? Or can we have a set of linked standards - one for describing in the boring old characters/states/taxa way, and others for the more space-shuttle ways that can be linked in as they develop?
Looking forward to responses
Cheers - k
At 06:07 PM 19/7/00 -0600, you wrote:
Kevin and colleagues,
As per our discussions at the US-Australia Workshop, I would again reiterate a few general observations with respect to the list and express my agreement with specific comments made by you, Bryan Heidorn, and Stan Blum. However, with your indulgence, I would also like to provide what may be a somewhat different perspective.
The focus on "requirements" for a descriptive data standard for taxonomy, as you and Stan emphasize, is a critical one, even though, as Bryan points out, there remain a number of issues that need to be dealt with that may not be fully accounted for in the draft standard you have kindly provided. I would agree that we need a mechanism (structure?) for subsequent discussion on the list to permit general, theoretical issues to be addressed, while simultaneously breaking down the practical realities of dealing with the complexity of the specific issues involved, which at times dictates useful "digression" into jargon-laden specifics that might be relevant for particular implementation issues that require vigorous discussion. I'm not sure at this stage whether it is possible, at least in my own mind, to distinguish structure from content, since dealing with existing structure and content may be necessary to define what we perceive are requirements. My own sense of the previous discussion is that there are a variety of perspectives as to what constitutes "descriptive data" and "requirements" in this context, as well as what are the specific priorities (aspects of "standards") that are necessary for specific applications (eg DELTA and LUCID) to intercommunicate in an application-neutral manner. However, mixing them into a common thread proved a bit overwhelming.
My own bias is for a better understanding of how we can construct such a "draft standard" so that it is open to considerable extension for the incorporation of meta-language descriptors for more esoteric data structures, while maintaining the flexible general framework needed to associate existing "character" data and also addressing the practical necessity of managing various "annotations" of qualitative characters. I believe this is important primarily because we ultimately want machines to do most of the translating among formats, with minimal loss of information or human intervention. I believe it is also important for the more difficult task that lies ahead of encoding means for machines to "feature extract" across a multiplicity of representations of character data.
As a taxonomist/morphologist I am constantly confronted with new data formats and widely different data sources. Virtually all are created in specific contexts and do not generally have a "web-wide" mechanism for associating their content. For example, it is difficult for me to determine if there exist data sets that encompass different "encodings" of information pertaining to specific structures for specific taxa. I need a dynamic mechanism that will permit me to become aware of data sets pertaining to, say, the pectoral fins of a particular scorpionfish, without having to know in advance that such data may exist in the form of 1) a collections record of a skeleton in a particular collection, 2) a published data set characterizing the measurements taken from a particular study, 3) a CAT scan of such a critter, 4) an archive containing the representation of specific character states used in a phylogenetic analysis, 5) numerous gif/jpg files of radiographs of specimens, 6) a text based description of the pectoral fins in a fossil, 7) the title of a paper describing the sensory innervation of the fin, or 8) a database of specific HOX genes involved in fin formation.
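To illustrate the kind of lookup described above, here is a minimal sketch in Python, assuming each data set carries a small metadata record naming its taxon, structure, and encoding. The record fields, the "example:" identifiers, and the sample entries are all hypothetical; nothing here reflects an existing registry or the draft document's element list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Wrapper:
    """Hypothetical metadata record wrapped around one data set."""
    taxon: str
    structure: str
    encoding: str   # eg "collection record", "CAT scan", "character matrix"
    location: str   # placeholder identifier, not a real address


def find_datasets(wrappers, taxon, structure):
    """Return every record about a structure in a taxon, whatever its encoding."""
    taxon, structure = taxon.lower(), structure.lower()
    return [
        w for w in wrappers
        if w.taxon.lower() == taxon and w.structure.lower() == structure
    ]


# Sample records, invented for illustration only.
registry = [
    Wrapper("Scorpaena sp.", "pectoral fin", "collection record", "example:rec-001"),
    Wrapper("Scorpaena sp.", "pectoral fin", "CAT scan", "example:scan-001"),
    Wrapper("Scorpaena sp.", "pectoral fin", "character matrix", "example:matrix-001"),
    Wrapper("Scorpaena sp.", "mandible", "radiograph (jpg)", "example:img-001"),
]

hits = find_datasets(registry, "scorpaena sp.", "Pectoral Fin")
for hit in hits:
    print(hit.encoding, hit.location)
```

Exact string matching is of course the weak point: it only works if taxon and structure names come from agreed lists, which is the harder part of the problem.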
Certainly, the Rich Attribution component of your document is a critical element for this, but I do not yet see how I can use this document to establish the "meta-data wrapper" needed to compile such a list, much less establish to what extent I can use such a text based "wrapper" to associate these disparate kinds of taxonomic data. How do I deal with data that are largely numeric in content or purely graphic (pixel encoded)? Nonetheless, I would agree that there is a need for a series of "collation rules" to establish scope at different hierarchical levels or for specific context-oriented activity. I would, for example, add several lower levels still in this context (including parts of specimens as described at the organ, tissue, cellular, subcellular, and molecular level). Of course, the difficulty here is that resolution and context may create data structures that are not entirely hierarchical, particularly for objects of composite origin or study. For the nervous system in chordates one can break down the system into units with respect to various elements that could in one sense be hierarchical (perhaps brain, spinal cord, ramus lateralis accessorius, neuron, motor unit, motor endplate, etc.). However, with respect to a physiological classification dealing with action at the level of specific neurons this classification scheme would not work since the nerve is composite and composed of both sensory and motor elements. Likewise, it would be difficult to place neuroendocrine components, specific neuropeptides, or developmental anlagen, such as placodes, however important, into a parallel hierarchy. Likewise, usefully descriptive properties could not be easily restricted to specific components. I found the discussion at the workshop regarding the use of directed acyclic graphs as a fundamental data structure most interesting, but I'm not sure that morphological descriptors, perhaps unlike gene products, are necessarily acyclic.
For example, a specific bone such as the mandible can be classified as an element of the visceral skeleton as well as a composite element containing both endochondral and dermal bone. If one looks early enough in development, one can't even recognize these anatomical distinctions, although they may exist at a molecular level. How should structures that change with development or function be tagged and associated? Would this not depend upon context? Nonetheless, following from your document, it might be a useful exercise to consider to what extent certain classes of morphological descriptors can be considered in such a graph-theoretical framework from which we might be able to establish certain constructs as useful in associating otherwise disparate, yet specific data (glossaries?). Trees certainly are a useful data structure for description of many morphological features, but not the only ones.
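A minimal sketch of the mandible example as a labelled directed graph, in Python. The class name, the relation labels ("element_of", "contains", "contributes_to"), and the choice to record both directions of a relation are assumptions made for illustration, not anything proposed at the workshop. It shows that one term can sit under several classifications at once, and that a graph stops being acyclic as soon as reciprocal relations are recorded.

```python
from collections import defaultdict


class DescriptorGraph:
    """Directed graph of anatomical terms with labelled links (hypothetical)."""

    def __init__(self):
        self.links = defaultdict(set)   # source -> {(relation, target), ...}

    def add(self, source, relation, target):
        self.links[source].add((relation, target))

    def reachable(self, term):
        """All terms reachable from `term` by following links of any kind."""
        seen, stack = set(), [term]
        while stack:
            for _, target in self.links.get(stack.pop(), ()):
                if target not in seen:
                    seen.add(target)
                    stack.append(target)
        return seen

    def has_cycle(self):
        """True if some term can reach itself, ie the graph is not acyclic."""
        return any(term in self.reachable(term) for term in list(self.links))


g = DescriptorGraph()
# The mandible belongs to more than one classification, so a tree is not enough.
g.add("mandible", "element_of", "visceral skeleton")
g.add("mandible", "contains", "endochondral bone")
g.add("mandible", "contains", "dermal bone")
cyclic_before = g.has_cycle()

# Recording the reciprocal relation as well introduces a cycle.
g.add("dermal bone", "contributes_to", "mandible")
cyclic_after = g.has_cycle()

print(cyclic_before, cyclic_after)
```

Whether such reciprocal links should be allowed, or resolved by context (developmental stage, function), is exactly the open question raised above.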
Consequently, it might be useful to break up the discussion into sub-discussions or threads for which specific requirements can be more readily circumscribed and for which the makings of a "meta-language" needed to search and assimilate alternate representations might be more quickly forthcoming. This is important because the universe of potentially different data structures for encoding character data is very large. There is no need for those interested primarily in DELTA - LUCID translations, or LUCID - PHYLIP, etc. transformations to be held up by more specific requirements concerning translations/annotations of more arcane data structures, even though some, like Bryan and I, may feel that transformations between "other kinds" of data structures must also be incorporated in a way that allows their potential richness to be exploited. However, achieving such extensibility will require the "standard descriptors" to be quite general (but not ambiguous) in construction.
Such an approach might permit us to generalize across a number of possibly highly specific topics and requirements that are not universally applicable and with which many of us are differentially fluent. This approach would be especially useful should we, at a later point, begin to use them to construct XML schemas or to outline what might be necessary using XSLT to transform them from one XML format to another. Since XML is promising as a data neutral specification language, we might want to maintain a separate "XML thread", and perhaps even various XML (alternative) implementation subthreads (Java XML APIs vs MS XML APIs vs "others"?, or DTDs vs Schemas, "elements" vs "attributes", etc.) that will influence how such a "standard" could currently be implemented. Although certainly I would agree that we do not want implementations to drive the standards, it is important to have an understanding of how potential implementations might affect the utility of the standards. It might be useful here to draw an analogy to the presentation made at the workshop by Sue Rhee in her discussion of the need for an "ontological" database for common annotation of gene function across molecular databases. Likewise, we need a generalized means of characterizing the "language" used to describe the various entries in different "glossaries" used to describe character data. The need for such "cross molecular" databases would not arise, except for specific implementation issues that are not presently adequately addressed. Likewise, your "External lexica" might be usefully encompassed in the concept of XML namespaces. Although I can't think of specific examples off the top of my head, some anatomical terms are used differently in different contexts ("veins" in animals and plants might be a simple example). We need to be able to distinguish the contexts. Perhaps this is what you mean by global versus local characters?
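A minimal sketch of how XML namespaces could keep the two senses of "vein" apart, using Python's standard xml.etree.ElementTree module. The namespace URIs under example.org and the prefixes "zoo" and "bot" are invented placeholders standing in for two external lexica; they do not refer to real vocabularies.

```python
import xml.etree.ElementTree as ET

# Hypothetical lexica, identified only by placeholder namespace URIs.
ANIMAL = "http://example.org/lexicon/animal-anatomy"
PLANT = "http://example.org/lexicon/plant-anatomy"

DOC = f"""
<descriptions xmlns:zoo="{ANIMAL}" xmlns:bot="{PLANT}">
  <zoo:vein>vessel carrying blood toward the heart</zoo:vein>
  <bot:vein>vascular bundle in a leaf</bot:vein>
</descriptions>
""".strip()

root = ET.fromstring(DOC)
ns = {"zoo": ANIMAL, "bot": PLANT}

# The same local name resolves to two distinct elements, one per context.
animal_vein = root.find("zoo:vein", ns)
plant_vein = root.find("bot:vein", ns)

print(animal_vein.tag)   # the tag carries the animal-anatomy namespace
print(plant_vein.tag)    # the tag carries the plant-anatomy namespace
```

The namespace only says which lexicon a term comes from; it does not by itself say how terms in different lexica relate, which would still need something like the collation rules discussed earlier.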
Hence, from my perspective it might be useful for the dialog to move forward along several separate, yet not entirely distinct threads, where folks with specific interests could provide input as they see fit, ignoring that which seems irrelevant. A few may even want to keep their thumbs in all the pies. In glancing over what has come before, we might consider as possible threads: 1) general theoretical perspectives on "taxonomic data", 2) one or more application specific threads (ie DELTA, LUCID, "phylogeny packages", NEXUS, others?), 3) issues pertaining to description and characterization of qualitative data, 4) issues pertaining to description and characterization of quantitative characters, 5) issues pertaining to text based description (semi-structured data), 6) issues pertaining to structured data (ie relational or object modeled data structures), and 7) meta-language requirements (headers, tagging architecture, XML etc.). No doubt you or others might be able to amend these or to add a few others from within which we might eventually reach consensus on assembly of a few key requirements that are general to all and from which interoperable implementation could proceed so as to be able to assess the usefulness of our work. Perhaps some threads could rule out discussion of "content" and others "structure". In any event, it might be useful to let natural selection act to allow the most productive threads to survive and "establish focus", while the others die out, without letting the whole wither because of the complexity and interconnection of the fundamental issues.
No doubt some of these threads will morph into monsters not anticipated. Consequently, to keep it all coordinated, there must be some general agreement/understanding to focus on the common requirements (GOALS) that we are trying to achieve. However, at this early stage, these might be largely implicit so as not to lock ourselves into unnecessarily narrow perspectives. For this to work, perhaps one or two "ring masters" or "virtual ushers (bouncers?)" are needed to keep the various performers and audience on cue, to summarize progress from time to time, and to remove, add, or combine threads at key moments (ie oversee and exert some "administrative" control over the various threads). This is important so that a specific set of useful general requirements is forthcoming in a timely fashion. I nominate you and Bryan (actually you guys nominated yourselves in Boston or was it unanimous proclamation?). Subsequent to such general and specific discussion, I believe we would then be in a better position to respond to specific requests for comments on documents, such as that you have put forth outlining draft standards.
Stuart
Kevin Thiele wrote:
Dear Colleagues,
you will all be aware that the SDD list fell over several months ago. My interpretation of this is that many of the taxonomists on the list were left behind, perhaps early on, by the energetic discussions over issues of data structuring (XML, schemas, RDF etc). Most of this was certainly way over my head. Things got too top-heavy, and attempts to structure the discussion using message tags didn't seem to provide much focus.
Recent discussions (at a meeting on US-Australian cooperation in bioinformatics in Washington, July 2000, attended by several SDD contributors) have again highlighted the great need for an SDD standard and shown that the lack of a new, inclusive standard is holding back progress on descriptive databasing and software design.
We need to restart the list with better focus. I'd like to suggest that the way forward is to entirely set aside (for the time being) any discussion of data structure and focus entirely on content (the requirements analysis) for a while. We should agree on an outline of the data that we need to capture, then pass this on to the computerheads to provide a best-practice structure for storing and managing this captured data.
The attached document was put up to the list shortly before it fell over. It's attached again here, slightly edited. ANY TAXONOMISTS STILL OUT THERE - please look at this. What data that you need to capture aren't handled here? Will this work? Is this the way to proceed?
I think that the document subsumes the data requirements of the DELTA and Lucid programs, plus a bit more particularly in the areas of data attribution and hierarchical nesting of treatments. The intention is that the elements in this list should provide a way of storing any data needed to describe the morphology or anatomy of any organism or taxon.
Note that this should be read merely as a list of data elements - the structure of the list does not imply a structure for the data file (XML or otherwise) used to store the data.
It may be the case that this document can be jointly modified to produce a final document, or we may need to start from scratch with another. Any ideas?
Cheers - k
Name: DDST Specifications.doc
DDST Specifications.doc Type: Microsoft Word Document
(application/msword)
Encoding: base64
Participants (1): Stuart G. Poss