eagle-i University of Pennsylvania

Kim Laboratory: Computational Evolutionary Biology


A key property of living objects is that each object, whether a protein, a cell, or a whole organism, has an associated generating process: a decoding process whereby stored information is converted into a complex, functioning biological object. For example, generating a protein involves translation and folding; generating an organism involves a cascade of gene regulatory and cell biological processes. We are interested in such bio-generative processes and in understanding their temporal control and architectural constraints.

Questions include how to infer the organizational structure of such generative processes from available data, how control processes evolve, and how generative dynamics, variability, and the final form interact to determine the evolution of the biological object. Two central projects in our lab use comparative transcriptome profiling of time series to uncover the architecture of temporal control in yeast, and computational analysis of non-coding RNAs to understand the evolution of sub-cellular processes in neurons.

Since 2007, Jim Eberwine (Pharm) and I have been engaged in multiple joint projects concerning the genomics of cell differentiation and cell diversity. Our labs collaborate on all kinds of projects in which we bounce ideas off each other, design and carry out experiments together, and design data analyses together. Many of the projects described below, especially those in neuroscience, are joint projects between our two labs.

In addition to these theoretical problems, we work on a wide range of collaborative projects and computational biology projects. Currently, these collaborations involve molecular control of neurons, functional prediction of sequence elements for genes involved in synaptic transmission, novel technologies for functional genomics, statistical analysis of whole-genome expression profiling, and software engineering of bioinformatics analysis platforms. We employ a variety of techniques, including discrete algorithms, simulations, statistical learning, dynamical systems and algebraic geometry, molecular biology, functional genomics, and single-cell genomics.





  • Genomic Locus Operation - DataBase ( Database )

    "Glo-DB is designed to perform position-based queries of genomic sequence annotations (features). It contains a query language that affords many different types of position searches via command line and graphical user interfaces, and incorporates various visualization tools. In this application, features are combined into sets, called "tracks," where a single track can contain features from any number of genomic sequences. For example, a track might contain all exons in a genome, the introns on a particular chromosome segment, etc. These feature sets can be loaded from different types of text files.

    Since features are just start and stop positions on a sequence, each feature can be viewed as a unique object located on the sequence or as a mask over the specified region of the sequence. Glo-DB's built-in operators will seamlessly manipulate features in either representation. For example, a user might be interested in the set of all exons on a chromosome that overlap with a specific set of genes on that same chromosome. In this case one track would contain the set of exons ("exon_track"), another the set of genes ("gene_track"). To find all overlapping features, the user would perform an "AND" operation on these two sets of features, returning a track containing the set of overlapping features ("exon_track AND gene_track"). If the user only wanted the exons in the output set, the genes could then be subtracted out ("((exon_track AND gene_track) sMINUS gene_track)"). Alternatively, a user could "subtract" the positions of the exons on a chromosome from the gene positions, to get a track containing a set of new features that represent the introns in the genes. Using tracks containing the exons ("exon_track") and genes ("gene_track"), the user would then negate the two ("gene_track - exon_track"), returning a set of new features encoding the positions within the genes not encoded by the exons. In the first example, the "set based" operators acted on the features as immutable position pairs allowing for the sets to be altered but not the features themselves. In the second example, the "binary" operator acted on the features as positions on the sequence, allowing for the features to be spliced and merged into new features."
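
    The two views of a feature described above (an immutable position pair versus a mask over the sequence) can be illustrated with a minimal sketch. This is not Glo-DB's query language or code; the function names `set_overlap` and `mask_subtract`, the inclusive coordinates, and the example tracks are assumptions made purely for illustration.

```python
from typing import List, Tuple

# A feature is modeled here as an inclusive (start, stop) pair on one sequence.
Feature = Tuple[int, int]


def overlaps(a: Feature, b: Feature) -> bool:
    """True if the two inclusive intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def set_overlap(track: List[Feature], other: List[Feature]) -> List[Feature]:
    """Set-style view: keep whole, unmodified features of `track`
    that overlap any feature in `other`."""
    return [f for f in track if any(overlaps(f, g) for g in other)]


def mask_subtract(track: List[Feature], other: List[Feature]) -> List[Feature]:
    """Mask-style view: remove the positions covered by `other` from each
    feature in `track`, which may split a feature into new features."""
    result = []
    for feature in track:
        pieces = [feature]
        for o_start, o_stop in sorted(other):
            next_pieces = []
            for p_start, p_stop in pieces:
                if not overlaps((p_start, p_stop), (o_start, o_stop)):
                    next_pieces.append((p_start, p_stop))
                    continue
                if p_start < o_start:  # part left of the subtracted region
                    next_pieces.append((p_start, o_start - 1))
                if p_stop > o_stop:  # part right of the subtracted region
                    next_pieces.append((o_stop + 1, p_stop))
            pieces = next_pieces
        result.extend(pieces)
    return result


# Hypothetical tracks: one gene and four exons, one of which lies outside it.
gene_track = [(100, 500)]
exon_track = [(100, 200), (300, 350), (450, 500), (900, 950)]

exons_in_genes = set_overlap(exon_track, gene_track)
introns = mask_subtract(gene_track, exon_track)
```

    In this sketch the set-style call returns the three exons untouched, while the mask-style call produces two new features covering the gene positions that no exon covers.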


  • Clustering conserved noncoding elements in prokaryotes algorithm ( Algorithmic software suite )

    "We adopted a two-step algorithm to cluster conserved noncoding elements in prokaryotes. First, we identified all homologous noncoding pairs using an all-against-all BLAST search; second, we used a single-linkage clustering algorithm to group homologous single positions together, followed by reordering and reassembly."

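    The single-linkage step can be sketched as finding connected components over the homologous pairs, here with a small union-find structure. This is only an illustration under stated assumptions: the pairs are taken as given (the BLAST search is not shown), whole element identifiers are grouped rather than single positions, the reordering and reassembly steps are omitted, and the identifiers such as `ncA` are made up.

```python
from collections import defaultdict


def single_linkage_clusters(pairs):
    """Group items so that any two items linked by a chain of pairs
    end up in the same cluster (single linkage / connected components)."""
    parent = {}

    def find(x):
        # Register unseen items as their own root, then walk to the root
        # while shortening the path (path halving).
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    groups = defaultdict(set)
    for item in list(parent):
        groups[find(item)].add(item)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


# Hypothetical homologous pairs, as might be derived from pairwise hits.
hits = [("ncA", "ncB"), ("ncB", "ncC"), ("ncD", "ncE")]
clusters = single_linkage_clusters(hits)
```

    Here `ncA` and `ncC` land in the same cluster even though they were never paired directly, which is the defining behavior of single linkage.
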
  • Crimson ( Algorithmic software suite )

    "Crimson facilitates the extraction of sub-trees from very large phylogenetic trees. Trees are loaded into a shared database and sampled according to schemes controlled by the user. Comprehensive graphical dialogs allow users to easily manage and query trees in the database. Queries can be stored in the database to be shared with other users and moved between databases. A command line interface enables users to write their own functions and scripts to manage the database, manipulate the trees and queries, and automate any of the built-in functions."

  • KimLabIDV ( Algorithmic software suite )

    "This program is built on the Shiny framework, a set of R packages, and a collection of scripts written by members of the Junhyong Kim Lab at the University of Pennsylvania. Its goal is to facilitate fast, interactive RNA-Seq data analysis and visualization. The current version of IDV supports routine RNA-Seq data analysis, including DESeq normalization, differential expression analysis, principal component analysis, pairwise correlation, hierarchical/K-means clustering, group classification, and plotting of gene heatmaps. It also supports report generation and program state sharing.

    KimLabIDV is provided as an R package and runs as a web application. It has been tested on Mac OSX and Windows."

  • NoFold ( Algorithmic software suite )

    "NoFold is an approach for characterizing and clustering RNA secondary structures without computational folding or alignment. It works by mapping each RNA sequence of interest to a structural feature space, where each coordinate within the space corresponds to the probabilistic similarity of the sequence to an empirically defined structure model (e.g. Rfam family covariance models). NoFold provides scripts for mapping sequences to this structure space, extracting any robust clusters that are formed, and annotating those clusters with structural and functional information."

  • Quasi-Periodic Feature Classifier algorithm for G protein-coupled receptors ( Algorithmic software component )

    "Quasi-Periodic Feature Classifier is an algorithm for identifying multi-transmembrane proteins from genomic databases with a specific application to identifying G protein-coupled receptors. The QFC algorithm uses concise statistical variables as the 'feature space' to characterize the quasi-periodic physico-chemical properties of multi-transmembrane proteins. For the case of identifying GPCRs, the variables are then used in a non-parametric linear discriminant function to separate GPCRs from non-GPCRs. The algorithm runs in time linearly proportional to the number of sequences."

  • RNA self containment ( Algorithmic software component )

    "RNA self containment is an implementation of the Self-Containment Index (SC), as described in the related publication. It measures the robustness of RNA structures to changes in the surrounding sequence context, which we hypothesize to be a hallmark of structural modularity. SC values range from 0.0 (no self containment) to 1.0 (completely self contained)."

    "Secondary structure prediction is performed using the Vienna RNA Package, provided by Ivo Hofacker, et al, at the Institute for Theoretical Chemistry of the University of Vienna."

  • rnasim: Simulating RNA evolution ( Algorithmic software suite )

    Simulated nucleotide sequences are widely used in theoretical and empirical molecular evolution studies. Conventional simulators generally use a fixed-parameter, time-homogeneous Markov model for sequence evolution. rnasim simulates RNA evolution on phylogenies based on the model described in the related publication. Briefly, rnasim uses the folding free energy of the secondary structure of an RNA, computed with the energy_of_struct() function in the Vienna RNA package, as a proxy for its phenotypic fitness, and simulates RNA macroevolution by a mutation-selection population genetics model. Because the two-step process is conditioned on an RNA and its mutant ensemble, we no longer have a global substitution matrix, nor do we explicitly assume one for this inhomogeneous stochastic process. The sequences generated by rnasim have greater statistical complexity than sequences generated by two standard simulators, ROSE and Seq-Gen, and are close to empirical sequences.

  • SNP Identification using Probability of Every Read ( Algorithmic software component )

    "Sniper is a Bayesian probabilistic model that enables SNP discovery in both unique and repetitive regions of a genome by utilizing the information from multiply-mapped sequence reads."

    "Sniper can perform all steps of analysis, including read map generation, organization of read maps into singly mapped and multiply mapped partitions, and SNP calling. Although Sniper is designed to use Bowtie for read alignment, any alignment program can be specified, as long as the read map output is stored in a SAM-formatted file."

  • TAGD ( Algorithmic software suite )

    "A phylogeny is a tree graph depicting the genealogical history of vertices of the tree. The vertices of the tree represent biological objects. The biological objects may be of type: whole organism, whole genomes, genes, etc. The vertices of a single tree always represent the same type of object. The leaf-vertices are degree one vertices that represent present day objects for which measurement data is available. Therefore, we assume that each leaf vertex has associated data consisting of a (genomic) string. See tutorial on phylogenies.

    The root of a phylogeny is a special vertex that represents the common ancestor of all vertices. Any non-leaf vertex is called an ancestral vertex. A rooted phylogenetic tree has directed edges, where each edge is directed along the path from the root to the leaves. We will call an edge directed out from an ancestral vertex a daughter edge, and the corresponding connected vertex a daughter vertex. For each leaf vertex there is a unique path from the root to the leaf, which we implicitly refer to as the path. Typical phylogenetic trees have two daughter edges, which we will call the Left and Right daughters.

    A phylogenetic tag, or tag for short, is a substring corresponding to a daughter edge, d1, of an ancestral vertex, P, such that the substring exists in the strings of all leaf vertices that are in the path of edge d1, and NOT in the leaf vertices that are in the path of the other edges, d2, ..., dk, from the ancestral vertex. That is, it is a substring that is uniquely present in the set of leaves that are daughters of one edge and NOT present in the set of leaves that are daughters of other edges.

    TagD is implemented as a command-line application. This application will generate tags for a specific branch of a tree, given the tree's structure as well as the sequence for all its taxa. A user interface has also been developed to provide useful interpretive and visualization tools. The user interface is provided as a plug-in for Mesquite."
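
    The tag definition above can be made concrete with a minimal sketch that is not TagD's implementation. It assumes the leaf strings under the chosen daughter edge and under the sibling edges are already known, and it searches only substrings of one fixed length `k`; the function names and the example sequences are hypothetical.

```python
def substrings_of_length(s, k):
    """All distinct substrings of length k in s."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def find_tags(target_leaves, other_leaves, k):
    """Length-k substrings present in every leaf string under the chosen
    daughter edge and absent from every leaf string under the other edges."""
    if not target_leaves:
        return set()
    shared = set.intersection(
        *(substrings_of_length(s, k) for s in target_leaves)
    )
    return {t for t in shared if not any(t in s for s in other_leaves)}


# Hypothetical leaf sequences under the Left and Right daughter edges.
left_leaves = ["ACGTTGCA", "TTACGTTG"]
right_leaves = ["ACGAAGCA", "GGACGAAG"]

tags = find_tags(left_leaves, right_leaves, 4)
```

    With these made-up sequences, the substrings shared by both Left leaves and missing from both Right leaves are the candidate tags for the Left daughter edge.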

  • VERSE ( Algorithmic software suite )

    "VERSE is designed for high-performance read summarization for next-generation sequencing. VERSE is 50x faster than HTSeq when computing the same gene counts. It introduces a novel, hierarchical assignment scheme, which allows simultaneous quantification of multiple feature types or annotation levels without repeatedly assigning reads. There is also a set of parameters the user can use to fine-tune the assignment logic. VERSE can be readily incorporated into any existing RNA-Seq analysis pipeline.

    VERSE is implemented in C. It is built on top of featureCounts. VERSE supports Mac OSX and Linux systems."
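
    One plausible reading of a hierarchical assignment scheme is sketched below: each read is tested against feature types in a fixed priority order and counted once, for the first type it overlaps. This is an assumption made for illustration and not VERSE's actual logic, parameters, or file formats; the annotation layout, the priority order, and the coordinates are all hypothetical.

```python
from collections import Counter


def assign_read(read, annotation, hierarchy):
    """Return (feature_type, feature_name) for the first feature type in
    `hierarchy` that has a feature overlapping the read, else None."""
    r_start, r_stop = read
    for feature_type in hierarchy:
        for name, start, stop in annotation.get(feature_type, []):
            if r_start <= stop and start <= r_stop:  # inclusive overlap
                return feature_type, name
    return None


# Hypothetical annotation: feature type -> list of (name, start, stop).
annotation = {
    "exon": [("geneA", 100, 200)],
    "intron": [("geneA", 201, 400)],
}
hierarchy = ["exon", "intron"]  # assumed priority order

# Hypothetical aligned reads as inclusive (start, stop) pairs.
reads = [(150, 180), (250, 280), (190, 220), (1000, 1030)]

counts = Counter()
for read in reads:
    hit = assign_read(read, annotation, hierarchy)
    counts[hit[0] if hit else "unassigned"] += 1
```

    Each read is visited once, yet the result holds counts for several feature types at the same time; the read spanning the exon-intron boundary is counted as exonic because "exon" comes first in the assumed order.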


Last updated: 2013-03-13T11:24:53.224-04:00
