eagle-i University of Pennsylvania

Computational Biology and Informatics Laboratory


The goal of our work is to help make sense of the enormous amount of biomedical data generated by high-throughput genomic approaches and synthesize them into something more than the sum of the parts. To that end, we are developing tools that enable researchers to mine and integrate data from a variety of different sources and types of experiments. In particular, we are applying these approaches to expand our understanding in the areas of diabetes and infectious disease. We model data with networks and reality with ontologies, in particular the Ontology for Biomedical Investigations (OBI).





  • Beta Cell Genomics ( Database )

    Beta Cell Genomics is part of the Beta Cell Biology Consortium. "The Beta Cell Biology Consortium (BCBC) is a team science initiative that was established by the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK). It was first funded in 2001 (RFA DK-01-014), and competitively continued in 2005 (RFAs DK-01-17, DK-01-18), and in 2009 (RFA DK-09-011)."

    "The genomics section provides searches and tools to explore detailed information about genes, transcripts, gene interactions, genomic regions, and functional genomics studies."

  • ErythronDB ( Database )

    "The ErythronDB project is the result of a collaboration between the Palis Lab (University of Rochester) and the Stoeckert Lab (University of Pennsylvania). The laboratory of Dr. Palis brings expertise in the cellular and molecular events underlying the ontogeny of hematopoiesis in the mammalian embryo. The laboratory of Dr. Stoeckert contributes state of the art computational and bioinformatics expertise.

    The Erythron Database is a resource dedicated to facilitating better understanding of the cellular and molecular underpinnings of mammalian erythropoiesis. The resource is built upon a searchable database of gene expression in murine primitive and definitive erythroid cells at progressive stages of maturation.

    ErythronDB allows users to identify sets of genes that are differentially expressed or exhibit similar levels or patterns of mRNA expression during erythrocyte development. Searches are also available for exploring networks of gene-interactions inferred from annotated expression data.

    All ErythronDB strategies can easily be refined and expanded using the Strategies interface. Register to save and share strategies with others. Registration also provides access to a Basket, which allows users to interactively assemble a list of genes of interest from search strategy results."

    Members of the Palis Lab in the Department of Pediatrics, University of Rochester Medical Center, who developed ErythronDB are Paul D. Kingsley, Ph.D.; Jenna M. Frame, M.S.; Timothy P. Bushnell, Ph.D.; Jeffrey Malik, Ph.D.; Kathleen E. McGrath, Ph.D.; and James Palis, M.D.

  • EuPathDB ( Database )

    "EuPathDB (formerly ApiDB) is an integrated database covering the eukaryotic pathogens in the genera Acanthamoeba, Anncaliia, Babesia, Crithidia, Cryptosporidium, Edhazardia, Eimeria, Encephalitozoon, Endotrypanum, Entamoeba, Enterocytozoon, Giardia, Gregarina, Hamiltosporidium, Leishmania, Nematocida, Neospora, Nosema, Plasmodium, Theileria, Toxoplasma, Trichomonas, Trypanosoma, Vavraia and Vittaforma. While each of these groups is supported by a taxon-specific database built upon the same infrastructure, the EuPathDB portal offers an entry point to all of these resources, and the opportunity to leverage orthology for searches across genera.

    The EuPathDB databases are funded by NIAID, which has strict guidelines as to which organisms we can support. EuPathDB is one of four NIH-funded bioinformatics resource centers. Mainly, we are to incorporate any eukaryotic pathogen that is deemed emerging or re-emerging (see list). We also have some organisms that are not on the list: these were added in the past at the behest of the community and with additional funding (such as TrichDB), are related to an emerging or re-emerging pathogen (such as Plasmodium or Neospora), or are trypanosomatids (funded as a pilot project by the Bill and Melinda Gates Foundation).

    The EuPathDB family of databases is expected to expand further over the coming years, incorporating new species including the Amoebozoa (Entamoeba, Acanthamoeba), Microsporidia, and additional Apicomplexans, Kinetoplastida and Diplomonads. Of course, we continue collecting data that are being deposited at an ever-increasing rate, including additional genome sequences, microarray data, probe-based hybridization and sequencing data (e.g. ChIP-chip and RNA-Seq), proteomics data, isolate data, phenotype information and metabolomic data."

    EuPathDB was developed jointly with Cristina Aurrecoechea (1), Dave Falke (1), Alan Gingle (3), Mark Heiges (1), Jessica C. Kissinger (1,2,5), Eileen T. Kraemer (4), Ganesh Srinivasamoorthy (1), Haiming Wang (1), Susanne Warrenfeltz (1), and Betsy Wenthe (1) from the (1) Center for Tropical & Emerging Global Diseases, University of Georgia; (2) Department of Genetics, University of Georgia; (3) Center for Applied Genetic Technologies, University of Georgia; (4) Department of Computer Science, University of Georgia; and (5) Institute of Bioinformatics, University of Georgia.

  • NIAGADS Genomics Database ( Database )

    "The NIAGADS GenomicsDB annotation resource provides a simple, but powerful, workspace to explore, analyze, and discover genes, SNPs, and genomics locations of interest or with special relevance to Alzheimer’s Disease."

  • OrthoMCL DB ( Database )

    Ortholog groups of protein sequences from 150 genomes, which together contain 1,398,546 protein sequences.

    "For each ortholog group, we provide basic information and other useful data about the group:
    a). Size of the group, in terms of both number of sequences and number of taxa.
    b). Sequence similarity info, indicating the degree of conservation within the group: % Match Pairs (percentage of all possible pairs within the group that are matched through BLAST under the default cutoff, and the rest of similarity info is calculated based on these matched pairs only), Average E-value, Average % Coverage, and Average % Identity.
    c). Phyletic profile, displaying #sequences from each species that belong to this ortholog group; black box indicates presence (with the number below the genome abbreviation representing #sequences) while white box stands for absence.
    d). Keywords: the most frequently occurring keywords in the annotations of the member sequences.
    e). Pfam domains: the most frequently occurring Pfam domains in the member sequences.
    f). Pfam domain architecture: useful to compare among group members and to identify outliers (due to evolution or sequencing/gene model errors).
    g). BioLayout graph, displaying the sequence similarity relationship between group members together with OrthoMCL edge information (in the SVG version of the graph).
    h). Multiple Sequence Alignment of the ortholog group."
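
    The similarity summary in item b) can be illustrated with a minimal Python sketch. It assumes the BLAST-matched pairs of a group are available as a dictionary keyed by pairs of sequence IDs; the function name, the field names and the sample values are hypothetical and are not taken from OrthoMCL DB. Average E-value is left out because the source does not say how it is averaged.

```python
from itertools import combinations


def group_similarity_stats(members, matched_pairs):
    """Summarize sequence similarity within one ortholog group.

    members: list of sequence IDs in the group.
    matched_pairs: dict mapping frozenset({id_a, id_b}) to a dict with
        'identity' and 'coverage' (both percentages) for every pair that
        was matched through BLAST under the chosen cutoff.
    """
    all_pairs = [frozenset(pair) for pair in combinations(members, 2)]
    matched = [matched_pairs[p] for p in all_pairs if p in matched_pairs]
    if not all_pairs or not matched:
        return {"pct_match_pairs": 0.0, "avg_identity": None, "avg_coverage": None}
    return {
        # Share of all possible pairs that were matched.
        "pct_match_pairs": 100.0 * len(matched) / len(all_pairs),
        # Averages are taken over the matched pairs only.
        "avg_identity": sum(m["identity"] for m in matched) / len(matched),
        "avg_coverage": sum(m["coverage"] for m in matched) / len(matched),
    }


# Hypothetical group of three sequences with two matched pairs.
members = ["seqA", "seqB", "seqC"]
matched_pairs = {
    frozenset({"seqA", "seqB"}): {"identity": 80.0, "coverage": 90.0},
    frozenset({"seqA", "seqC"}): {"identity": 60.0, "coverage": 70.0},
}
stats = group_similarity_stats(members, matched_pairs)
print(stats)
```

    In this toy group two of the three possible pairs are matched, so % Match Pairs is about 66.7, and the averages are computed over those two pairs only.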


  • AnnotCompute ( Algorithmic software suite )

    "AnnotCompute is a tool to identify similar functional genomics experiments (mainly microarray experiments) based on standardized annotations containing the MGED Ontology (MO) terms."

  • OrthoMCL ( Algorithmic software suite )

    "To distinguish functional redundancy from divergence, this method identifies “recent” paralogs to be included in ortholog groups as within-species BLAST hits that are reciprocally better than between-species hits. This approach is similar to INPARANOID, but differs primarily in the requirement that recent paralogs must be more similar to each other than to any sequence from other species. To resolve the many-to-many orthologous relationships inherent in comparisons across multiple genomes, OrthoMCL applies the Markov Cluster algorithm (MCL; Van Dongen 2000; http://micans.org/mcl/), which is based on probability and graph flow theory and allows simultaneous classification of global relationships in a similarity space. MCL simulates random walks on a graph using Markov matrices to determine the transition probabilities among nodes of the graph. The MCL algorithm has previously been exploited for clustering a large set of protein sequences, where it was found to be very fast and reliable in dealing with complicated domain structures (Enright et al. 2002). OrthoMCL generates clusters of proteins where each cluster consists of orthologs or “recent” paralogs from at least two species."

    OrthoMCL was developed jointly with Mark Heiges and Ryan Thibodeau from the Center for Tropical & Emerging Global Diseases, University of Georgia.
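
    The Markov Cluster step described in this entry can be sketched in a few lines of Python. This is a toy illustration of the expansion and inflation loop on a small dense matrix, not the MCL package or OrthoMCL's own code; the function names, the inflation value, the fixed iteration count and the example graph are all assumptions made for illustration.

```python
def normalize_columns(m):
    """Scale each column to sum to 1 so it holds transition probabilities."""
    n = len(m)
    col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def expand(m):
    """Expansion: square the matrix, i.e. take two random-walk steps."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, power):
    """Inflation: raise entries to a power and renormalize the columns,
    which strengthens strong transitions and weakens weak ones."""
    return normalize_columns([[value ** power for value in row] for row in m])


def mcl(similarity, inflation=2.0, iterations=50):
    """Cluster nodes of a weighted graph given as a symmetric matrix."""
    n = len(similarity)
    # Self-loops are added before normalizing, a common MCL convention.
    m = [[similarity[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        m = inflate(expand(m), inflation)
    # Each row that still carries flow lists the members of one cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return clusters


# Hypothetical similarity graph: two groups of three proteins with no
# similarity between the groups.
similarity = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
clusters = mcl(similarity)
print(sorted(sorted(c) for c in clusters))
```

    In practice the edge weights would come from BLAST similarity scores, and a sparse-matrix implementation would be needed for graphs of realistic size.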

  • Patterns from Gene Expression ( Algorithmic software suite )

    "PaGE can be used to produce sets of differentially expressed genes with confidence measures attached. These lists are generated using the False Discovery Rate method of controlling false positives.

    But PaGE is more than a differential expression analysis tool. PaGE is a tool to attach descriptive, dependable, and easily interpretable expression patterns to genes across multiple conditions, each represented by a set of replicated array experiments.

    The input consists of (replicated) intensities from a collection of array experiments from two or more conditions (or from a collection of direct comparisons on 2-channel arrays). The output consists of patterns, one for each row identifier in the data file.

    One condition is used as a reference to which the other types are compared. The length of a pattern equals the number of non-reference sample types. The symbols in the patterns are integers, where positive integers represent up-regulation as compared to the reference sample type and negative integers represent down-regulation.

    The patterns are based on the false discovery rates for each position in the pattern, so that the number of positive and negative symbols that appear in each position of the pattern is as descriptive as the data variability allows.

    The patterns generated are easily interpretable in that integers are used to represent different levels of up- or down-regulation as compared to the reference sample type."
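
    The pattern encoding described above can be illustrated with a short Python sketch. It assumes that a statistic comparing each non-reference condition to the reference is already computed, and that level cutoffs for each position are supplied; in PaGE those cutoffs would follow from the false discovery rate analysis, which is not reproduced here. All names and numbers below are hypothetical.

```python
def level(stat, up_cutoffs, down_cutoffs):
    """Map one statistic to a signed integer level.

    up_cutoffs: ascending positive thresholds for up-regulation levels.
    down_cutoffs: descending negative thresholds for down-regulation levels.
    """
    ups = sum(1 for cutoff in up_cutoffs if stat >= cutoff)
    if ups:
        return ups
    downs = sum(1 for cutoff in down_cutoffs if stat <= cutoff)
    return -downs


def pattern(stats_vs_reference, cutoffs_per_position):
    """Build one pattern: one integer per non-reference condition."""
    return tuple(
        level(stat, up, down)
        for stat, (up, down) in zip(stats_vs_reference, cutoffs_per_position)
    )


# Hypothetical statistics for one gene in three non-reference conditions,
# with the same illustrative cutoffs used at every position.
cutoffs = [([1.0, 2.0], [-1.0, -2.0])] * 3
gene_pattern = pattern([2.4, 0.3, -1.2], cutoffs)
print(gene_pattern)  # (2, 0, -1)
```

    The pattern has three positions because there are three non-reference conditions: strong up-regulation in the first, no call in the second, and mild down-regulation in the third.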

  • Significance Tester for the Accumulation of Reads ( Algorithmic software component )

    STAR was developed to identify regions enriched for a histone modification based on ChIP-Seq evidence, by identifying regions with a significant accumulation of reads.
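
    The idea of testing for a significant accumulation of reads can be sketched as follows. The source does not describe STAR's statistical model, so this example swaps in a simple Poisson background test on fixed windows purely as an assumption; the function names, the background rate, the threshold and the counts are hypothetical.

```python
import math


def poisson_sf(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam)."""
    term = math.exp(-lam)  # P(X = 0)
    cdf = 0.0
    for i in range(k):
        cdf += term
        term *= lam / (i + 1)  # P(X = i + 1) from P(X = i)
    return max(0.0, 1.0 - cdf)


def enriched_windows(counts, background_rate, alpha=1e-3):
    """Return indices of windows whose read count is unlikely under the
    background rate (no multiple-testing correction in this sketch)."""
    return [i for i, count in enumerate(counts)
            if poisson_sf(count, background_rate) < alpha]


# Hypothetical read counts per genomic window, with an assumed background
# of five reads per window.
counts = [4, 6, 5, 30, 28, 3, 5]
hits = enriched_windows(counts, background_rate=5.0)
print(hits)  # [3, 4]
```

    A real analysis would estimate the background from control data and correct for the number of windows tested.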

  • Strategies WDK ( Software )

    "The Strategies WDK is a framework for creating 'data mining' genomics websites. It is a layer on top of your relational database. Supported DBMS platforms are Oracle and PostgreSQL. It is schema independent, which means it does not require the data in your database to be in a particular form.

    Use the WDK to:
    • define a coarse-grained data model (based on the tables in your existing database) that specifies the searches the user can run and the kinds of results he or she can get
    • define a customized view of that data model using Java Server Pages

    The data model that you define in the WDK is a Data Transfer Object (DTO) layer. A DTO is an object that brings together data that may come from many tables in the database. It is good practice to provide to a web site or other high-level data consumers objects at a coarse granularity. In the WDK the DTOs are called records. For example, a Gene record may bring together data from many tables that contain information relating to a Gene. Records in the WDK are configured in XML.

    In sum, the WDK model lets you configure records and searches that return sets of them."
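
    The record idea described above, a coarse-grained object assembled from several tables plus searches that return sets of such objects, can be sketched in Python. This is not the WDK's API or its XML configuration: the WDK is configured in XML and rendered with Java Server Pages, whereas the table names, columns, class and functions below are hypothetical and use an in-memory SQLite database only to keep the example self-contained.

```python
import sqlite3
from dataclasses import dataclass, field


@dataclass
class GeneRecord:
    """Coarse-grained record that gathers data from several tables."""
    gene_id: str
    symbol: str
    terms: list = field(default_factory=list)
    expression: dict = field(default_factory=dict)


def load_gene_record(conn, gene_id):
    """Assemble one record from the gene, gene_term and expression tables."""
    row = conn.execute(
        "SELECT gene_id, symbol FROM gene WHERE gene_id = ?", (gene_id,)
    ).fetchone()
    if row is None:
        return None
    terms = [r[0] for r in conn.execute(
        "SELECT term FROM gene_term WHERE gene_id = ? ORDER BY term", (gene_id,))]
    expression = dict(conn.execute(
        "SELECT sample, level FROM expression WHERE gene_id = ?", (gene_id,)))
    return GeneRecord(row[0], row[1], terms, expression)


def search_by_term(conn, term):
    """A search returns a set of records, here as an ordered list."""
    ids = [r[0] for r in conn.execute(
        "SELECT gene_id FROM gene_term WHERE term = ? ORDER BY gene_id", (term,))]
    return [load_gene_record(conn, gene_id) for gene_id in ids]


# Hypothetical schema and data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gene (gene_id TEXT PRIMARY KEY, symbol TEXT);
CREATE TABLE gene_term (gene_id TEXT, term TEXT);
CREATE TABLE expression (gene_id TEXT, sample TEXT, level REAL);
INSERT INTO gene VALUES ('g1', 'geneA');
INSERT INTO gene VALUES ('g2', 'geneB');
INSERT INTO gene_term VALUES ('g1', 'term_x');
INSERT INTO gene_term VALUES ('g1', 'term_y');
INSERT INTO gene_term VALUES ('g2', 'term_x');
INSERT INTO expression VALUES ('g1', 'sample_1', 12.5);
INSERT INTO expression VALUES ('g1', 'sample_2', 0.25);
""")
record = load_gene_record(conn, "g1")
print(record)
```

    The page layer only ever sees GeneRecord objects, so the underlying tables can change without touching the views, which is the benefit of the coarse-grained layer the passage describes.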

    Strategies WDK was developed jointly with Cristina Aurrecoechea (1), Eileen T. Kraemer (2), Cary Pennington (1), and Jessica C. Kissinger (1,3) from the (1) Center for Tropical and Emerging Global Diseases, University of Georgia; (2) Department of Computer Science, University of Georgia; and (3) Department of Genetics, University of Georgia.


Last updated: 2014-10-16T16:39:46.478-04:00

Copyright © 2016 by the President and Fellows of Harvard College
The eagle-i Consortium is supported by NIH Grant #5U24RR029825-02 / Copyright 2016