goals

  1. Characterize, in an accessible and utilitarian manner, the secondary structure of rRNA molecules by using a comparative cross-taxon approach.
  2. Use these characterizations, where applicable, to generate phylogenies for the taxa in question.
  3. Provide tools to examine unanswered questions regarding the utility of using secondary structure to hypothesize homology.

overview

RNA secondary structure can be used as a criterion for hypothesizing homologous regions in multiple sequence alignments of RNA molecules. This process has traditionally been largely manual, though an increasing number of algorithmic approaches exist (see links). Different approaches incorporate structure (i.e. use structure to justify homology) to varying degrees. These range from dividing data into multiple blocks prior to analysis in POY (e.g. Giribet 2001, Terry and Whiting 2005), to re-coding regions delimited by structural motifs for parsimony analysis (e.g. ARC, INAASE), to producing full multiple-sequence alignments (e.g. Kjer 2004). Kjer (1995) proposed a method for annotating structural information in a multiple-sequence alignment, with the ultimate goal being the phylogenetic analysis of the end product. His proposals sought to strengthen homology statements among sequence data by rigorously (albeit manually) comparing and annotating data with reference to known (i.e. observed) published structural models. The method works to a large degree for two reasons. First, some secondary-structural motifs are conserved across much of life, and alignment of these regions is trivial (in the sense that a manual approach would produce an alignment not significantly different from one produced by a modern algorithmic approach). Second, as initially proposed, the alignment of hypervariable regions (wherein secondary structure is more conserved than primary sequence) is improved by information on RNA base pairing. Kjer (1995, 1997) also recommended excluding from phylogeny estimation those regions of an alignment in which nucleotide homology cannot be assigned with confidence.
These pages seek to critically review and update the methodology initially proposed by Kjer (1995), with the goal of easing the incorporation of structural information into homology hypotheses (not only for the proposals of Kjer, but also for the methods proposed by Wheeler (1996, 1999) for implementation in POY (Gladstein and Wheeler 1997)). Central to our methods is an initial structural hypothesis for a given molecule, one which can be overturned by any number of means (including, theoretically, the direct observation of the molecule). Our goal is both to allow users of POY to partition their data using information from a structural model, and to provide end users who are examining their data on a finer scale (e.g. using hypotheses of co-variation) with a means to manipulate and test their structural hypotheses; these goals are of course not mutually exclusive. We present scripts for parsing structural alignments, and display results and data from analyses in progress, both published and working.
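
As a rough illustration of partitioning data with a structural model, the sketch below cuts an alignment into contiguous stem and loop blocks. It is written in Python and is not one of the jRNA Perl scripts. The dot-bracket mask (`(` and `)` for paired columns, `.` for unpaired ones), the dictionary-of-strings alignment, and the function names `structural_blocks` and `partition` are all assumptions made for this example, not the input format described on these pages.

```python
from itertools import groupby


def structural_blocks(mask):
    """Split a dot-bracket mask into contiguous runs of paired/unpaired columns.

    Returns (label, start, end) tuples with 0-based, end-exclusive columns.
    The labels 'stem' and 'loop' are illustrative names only.
    """
    blocks, start = [], 0
    for is_paired, run in groupby(mask, key=lambda ch: ch in "()"):
        length = sum(1 for _ in run)
        blocks.append(("stem" if is_paired else "loop", start, start + length))
        start += length
    return blocks


def partition(alignment, mask, strip_gaps=True):
    """Cut every sequence into the blocks implied by the mask.

    alignment: dict mapping taxon name -> aligned sequence (hypothetical format).
    strip_gaps: drop '-' from each fragment, on the assumption that the
    blocks will be handed to a method that works on unaligned fragments.
    """
    for name, seq in alignment.items():
        if len(seq) != len(mask):
            raise ValueError(f"{name}: sequence and mask lengths differ")
    parts = []
    for label, a, b in structural_blocks(mask):
        block = {}
        for name, seq in alignment.items():
            piece = seq[a:b]
            block[name] = piece.replace("-", "") if strip_gaps else piece
        parts.append((label, block))
    return parts
```

Calling `partition({"t1": "GC-AGCA", "t2": "GCUUGC-"}, "((..)).")` yields four blocks in mask order; each block could then be written to its own file. Keeping the two strands of a helix as separate blocks is a simplification here; a real partitioning scheme might group them.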

There is no clear definition of what a secondary-structure based alignment is, i.e. what components it must or must not have, and it is unclear whether such a definition is needed (though perhaps it would simplify existing debates; cf. Kjer, 2004; Ogden, Whiting and Wheeler, 2005). Given this, we suggest that the premise of secondary structure-based alignment is simply to provide a set of homology hypotheses spanning the sampled taxa for the nucleotide sequence (region of DNA) under investigation. The process of deriving these hypotheses facilitates: 1) the assignment of positional nucleotide homologies (with comparative evidence); 2) the localization of ambiguously aligned regions within homologous positions that have structural and functional differences within the molecule (Gillespie 2004); 3) the appropriate accommodation of base composition and substitution rates that are often unique in stems, loops and ambiguously aligned regions; 4) the improvement of higher order structural models that exist on public databases (see links); and 5) the detection of investigator artifacts (sequencing errors, contamination, presence of pseudogenes or paralogous loci within multicopy gene families) (see Gillespie et al. 2005a). The information under alignment basics gives further theoretical background and a practical example highlighting the various methodologies involved.
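
One simple form of comparative evidence for a proposed base pair is how often the two columns actually form a pair across the sampled taxa, and with how many different pair types. The Python sketch below tallies this; it is an illustration, not part of the jRNA scripts. The function name `pair_support`, the input layout, and the decision to count G-U wobble pairs as canonical and to read `T` as `U` are assumptions of this example.

```python
# Watson-Crick pairs plus G-U wobble pairs (treated as canonical by assumption).
CANONICAL = {"AU", "UA", "GC", "CG", "GU", "UG"}


def pair_support(alignment, pairs):
    """Summarize base-pairing support for each hypothesized column pair.

    alignment: dict mapping taxon name -> aligned sequence (hypothetical format).
    pairs: iterable of (i, j) 0-based column indices proposed to pair.
    """
    report = []
    for i, j in pairs:
        observed = [
            (seq[i] + seq[j]).upper().replace("T", "U")
            for seq in alignment.values()
        ]
        # Skip taxa with a gap in either column rather than count them as mismatches.
        scored = [p for p in observed if "-" not in p]
        canonical = [p for p in scored if p in CANONICAL]
        report.append({
            "pair": (i, j),
            "fraction_canonical": len(canonical) / len(scored) if scored else 0.0,
            # More than one canonical pair type suggests compensatory change.
            "pair_types": sorted(set(canonical)),
        })
    return report
```

A column pair that stays canonical while its pair type changes among taxa is the pattern usually read as co-variation. A pair that is rarely canonical may indicate a misaligned region, a weak structural hypothesis, or one of the sequence artifacts listed above.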

The end alignment product (examples) is considered to be a model of molecular evolution across all taxa being analyzed. We have written Perl scripts that parse simply formatted alignments and return both input files for use in popular phylogenetic analysis programs and HTML-based reports demonstrating the characteristics of the given datasets. Several reports and statistics provided by the jRNA scripts help illustrate the robustness (or lack thereof) of given models. For phylogenetic analyses, the scripts incorporate information derived from knowledge of the secondary structure wherever possible. The various ways in which this is done, along with a sample input format, are available under analysis methods.
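
To make the idea of carrying structural information into an analysis file concrete, here is a minimal Python sketch that writes a NEXUS file with stem and loop character sets. It does not reproduce the jRNA Perl scripts or their input format. The dot-bracket mask, the function names `column_ranges` and `to_nexus`, and the choice of `datatype=dna` are assumptions; taxon names are assumed to contain no spaces or NEXUS punctuation.

```python
def column_ranges(columns):
    """Collapse sorted 1-based column numbers into NEXUS ranges such as '1-3 7'."""
    ranges, start, prev = [], None, None
    for c in columns:
        if start is None:
            start = prev = c
        elif c == prev + 1:
            prev = c
        else:
            ranges.append((start, prev))
            start = prev = c
    if start is not None:
        ranges.append((start, prev))
    return " ".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


def to_nexus(alignment, mask):
    """Return NEXUS text with a data block and stem/loop charsets.

    alignment: dict mapping taxon name -> aligned sequence (hypothetical format).
    mask: dot-bracket string; '(' and ')' mark paired columns (assumption).
    """
    nchar = len(mask)
    for name, seq in alignment.items():
        if len(seq) != nchar:
            raise ValueError(f"{name}: sequence and mask lengths differ")
    stems = [i + 1 for i, ch in enumerate(mask) if ch in "()"]
    loops = [i + 1 for i, ch in enumerate(mask) if ch not in "()"]
    width = max(len(name) for name in alignment)
    lines = [
        "#NEXUS",
        "begin data;",
        f"  dimensions ntax={len(alignment)} nchar={nchar};",
        "  format datatype=dna missing=? gap=-;",  # assumes DNA (T, not U)
        "  matrix",
    ]
    lines += [f"  {name.ljust(width)}  {seq}" for name, seq in alignment.items()]
    lines += ["  ;", "end;", "begin sets;"]
    if stems:
        lines.append(f"  charset stems = {column_ranges(stems)};")
    if loops:
        lines.append(f"  charset loops = {column_ranges(loops)};")
    lines.append("end;")
    return "\n".join(lines) + "\n"
```

With the character sets defined, a downstream program that reads NEXUS sets blocks can apply different models to, or exclude, stems and loops; how that is done depends on the program and is outside this sketch.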

Finally, it is also our goal to provide global alignments that address the utility of a given RNA model for phylogeny reconstruction of a specific taxon. These alignments will also provide templates for researchers collecting data for closely related lineages for which structural models have not yet been predicted.