eagle-i University of Pennsylvania

Bushman Laboratory


Research in the Bushman laboratory focuses on host-microbe interactions in health and disease, with particular focus on studies of 1) the human microbiome, 2) HIV pathogenesis, and 3) DNA integration in human gene therapy.

In recent years, our work has been driven increasingly by the remarkable new deep sequencing methods, which can produce more than 100 billion bases of DNA sequence information in a single instrument run.

For microbiome studies, this allows comprehensive analysis of microbial populations without reliance on culture-based methods, which can detect only a small fraction of all organisms present.

For studies of HIV replication, this allows analysis of complex viral populations or distributions of retroviral DNA integration sites in the human genome.

For gene therapy, this allows tracking of integrated vectors in gene-corrected subjects and molecular characterization of adverse events. Sample acquisition can sometimes be difficult in such projects, but bioinformatic analysis afterwards is almost always harder. We have been carrying out this type of study since 2002, when we showed that HIV DNA integration in human cells was favored in active transcription units, and over the years have built up partially automated software pipelines that allow efficient analysis of deep sequencing data.

Lab members and collaborators cover a range of specialties, including clinical researchers, molecular biologists, computational biologists, and statisticians.





  • BROCC ( Algorithmic software suite )

    BROCC is a flexible software pipeline for classifying single-cell eukaryotes in microbiome samples. It interfaces easily with the popular QIIME pipeline.

    BROCC classifies amplicons using BLAST searches against large and relatively uncurated databases. BROCC uses blastn, but output from other versions of BLAST such as blastx can be substituted. BROCC first filters input BLAST hits for sufficient coverage and identity to the query sequence.

    If a query sequence has too many hits below the preset coverage threshold (70% by default), or if BLAST returned no hit, it is not classified, and a message is written to the output file. BROCC then determines the identity and taxonomic hierarchy of each high-quality hit using a local, user-installed SQL database and NCBI’s EFetch tool.

    BROCC then votes on the quality-filtered BLAST hits, starting at the species level. At each level of the taxonomy, BROCC requires the taxon with the most votes to surpass a user-specified threshold for that level in order to accept it as a valid classification. If a sufficient majority is not reached, BROCC makes no classification for that level and moves to the next higher taxonomic level for another round of voting. BROCC filters are independently configurable at the genus and species levels, and another filter can be assigned for the remaining taxonomic levels.

    BROCC also contains a user-modifiable list of high-level and partial assignments in its configuration file. These assignments are ignored at lower taxonomic levels, where they are uninformative and can distort voting, but are included at higher levels. For example, a sequence read with only a kingdom-level assignment is excluded from voting until the kingdom level is reached, at which point its vote is counted toward the kingdom assignment. In cases where the proportion of high-level and partial assignments exceeds a given threshold (default 0.70), the query sequence is left unassigned and marked accordingly.
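
    As a rough illustration of the voting and generic-assignment rules described above, the following Python sketch walks up the taxonomy one rank at a time. It is not BROCC's actual code: the function name `classify`, the dictionary representation of hits, the three threshold parameters, and the choice to treat a hit without a species name as a high-level or partial assignment are all assumptions made for this example. Only the 0.70 default is taken from the description.

```python
from collections import Counter

# Standard ranks, ordered from most specific to most general.
RANKS = ["species", "genus", "family", "order",
         "class", "phylum", "kingdom", "domain"]


def classify(hits, species_min, genus_min, other_min, max_generic=0.70):
    """Vote on filtered BLAST hits (hypothetical representation).

    Each hit is a dict mapping rank -> taxon name. A rank missing from a
    hit means the hit is uninformative at that rank and casts no vote there.
    Returns (rank, taxon), or ("unassigned", None) / ("unclassified", None).
    """
    if not hits:
        return ("unclassified", None)

    # Assumption: a hit with no species name counts as high-level/partial.
    generic = sum(1 for hit in hits if "species" not in hit)
    if generic / len(hits) > max_generic:
        return ("unassigned", None)

    for rank in RANKS:
        votes = Counter(hit[rank] for hit in hits if rank in hit)
        if not votes:
            continue
        taxon, count = votes.most_common(1)[0]
        minimum = {"species": species_min, "genus": genus_min}.get(rank, other_min)
        # Assumption: the winning share is measured against votes cast at this rank.
        if count / sum(votes.values()) > minimum:
            return (rank, taxon)
    return ("unclassified", None)


example_hits = [
    {"species": "Candida albicans", "genus": "Candida", "kingdom": "Fungi"},
    {"species": "Candida tropicalis", "genus": "Candida", "kingdom": "Fungi"},
    {"species": "Candida albicans", "genus": "Candida", "kingdom": "Fungi"},
    {"kingdom": "Fungi"},  # kingdom-only hit: votes only at the kingdom rank
]
result = classify(example_hits, species_min=0.8, genus_min=0.6, other_min=0.6)
```

    With these illustrative thresholds the species vote fails (two of three votes), so the sketch falls back to the genus level. The kingdom-only hit casts no vote at either rank.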

    BROCC output comprises three files. The first contains classifications with standardized taxonomy (domain, kingdom, phylum, class, order, family, genus, species), and the second contains the complete NCBI taxonomy, which includes subtaxa, supertaxa, and unranked intermediate taxonomic levels. The third file contains a log of the voting record, including how many votes were cast, how many votes the winning taxon received, and how many generic classifications were ignored for each query sequence. This file also indicates those queries that were unclassified. Both taxonomy files are suitable for use in the QIIME pipeline (i.e., they are in the same format as the classifications output by the QIIME assign_taxonomy.py script).

    The BROCC program is implemented in Python version 2.7. It queries the NCBI taxonomy and requires local installations of SQL and BLAST.

  • Gene Overlapper ( Algorithmic software component )

    "The Gene Overlapper provides output from genome-wide surveys of host-cell genes linked to HIV infection and allows user-configured exploration of overlaps among studies."

    Overlap analysis and comparisons to random distributions are carried out using R. The p-values for overlaps between lists are generated by comparison to results of random simulation and by calculation based on the hypergeometric distribution.
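
    The hypergeometric calculation mentioned above can be sketched as follows. The Gene Overlapper itself uses R. This Python version is only an illustration, and the function name `overlap_p_value` and its parameter names are assumptions introduced here. It computes the probability of seeing at least the observed overlap when two gene lists are drawn independently from a common gene universe.

```python
from math import comb


def overlap_p_value(universe, size_a, size_b, overlap):
    """Upper-tail hypergeometric probability P(X >= overlap).

    universe: number of genes that could appear in either list
    size_a, size_b: sizes of the two gene lists
    overlap: observed number of genes shared by the lists
    """
    max_overlap = min(size_a, size_b)
    total = comb(universe, size_b)
    # Count the ways to draw list B so that it shares k genes with list A.
    tail = sum(
        comb(size_a, k) * comb(universe - size_a, size_b - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Two 3-gene lists from a 10-gene universe that overlap completely.
p_full_overlap = overlap_p_value(universe=10, size_a=3, size_b=3, overlap=3)
```

    The simulation-based p-value described in the same sentence would instead draw random lists of the same sizes and count how often the overlap reaches the observed value.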

  • hiAnnotator ( Algorithmic software component )

    "hiAnnotator contains set of functions which allow users to annotate a RangedData object with custom set of annotations. The basic philosophy of this package is to take two RangedData objects (query & subject) with common set of space (i.e. chromosomes) and return associated annotation per space and rows from the query matching space and rows from the subject (i.e. genes or cpg islands). The package comes with three types of annotation functions which calculates if a position from query is: within a feature, near a feature, or count features in defined window sizes. Moreover, one can utilize parallel backend for each annotation function to utilize multiple processors. In addition, the package is equipped with a wrapper function, which finds appropriate columns needed to make a RangedData object from a common data frame."

  • hiReadsProcessor ( Algorithmic software component )

    "hiReadsProcessor contains set of functions which allow users to process single-end LM-PCR sequence data coming out of the 454 sequencer. Given an excel file containing parameters for demultiplexing and sample metadata, the functions automate trimming of adaptors and identification of the genomic product. In addition, if IntSites MySQL database is setup, the sequence attrition is loaded into respective tables for post processing setup and analysis."

  • Optimized Iterative De Bruijn Graph Assembly ( Algorithmic software suite )

    OptItDBA is "a BASH script that will iteratively assemble a set of paired-end metagenomic reads."

    In each iteration of the optimized iterative de Bruijn graph assembly pipeline, "OptItDBA:

    (1) selects the optimal kmer,
    (2) generates a de Bruijn graph for the optimal kmer length,
    (3) removes the reads that map to the most highly abundant contigs from the dataset or reads that map to circular contigs, and
    (4) starts another iteration using all of the reads that do not map to those contigs.

    The loop ends when there are no highly abundant contigs meeting the criteria outlined below. At that point, all of the remaining reads will be assembled and mapped using the optimal values from the final iteration."

  • Polyafit ( Algorithmic software component )

    Polyafit is an R package that provides tools to fit data to a multivariate Polya distribution (Dirichlet-multinomial distribution).
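
    For readers unfamiliar with the distribution, the sketch below evaluates the Dirichlet-multinomial log-probability of a vector of counts in Python. It does not reproduce Polyafit's interface. The function name `dirichlet_multinomial_logpmf` is an assumption made for this example, and fitting would amount to choosing the `alpha` values that maximize the sum of this quantity over the observed samples.

```python
from math import exp, lgamma


def dirichlet_multinomial_logpmf(counts, alpha):
    """Log-probability of `counts` under a Dirichlet-multinomial with
    concentration parameters `alpha` (one positive value per category)."""
    n = sum(counts)
    a = sum(alpha)
    # Normalizing terms: n! * Gamma(a) / Gamma(n + a)
    result = lgamma(n + 1) + lgamma(a) - lgamma(n + a)
    # Per-category terms: Gamma(x + alpha) / (x! * Gamma(alpha))
    for x, al in zip(counts, alpha):
        result += lgamma(x + al) - lgamma(al) - lgamma(x + 1)
    return result


# With alpha = (1, 1) every split of two observations is equally likely.
p_one_each = exp(dirichlet_multinomial_logpmf([1, 1], [1.0, 1.0]))
```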

  • Protein Cassette Discovery ( Algorithmic software component )

    After clustering ORFs using UCLUST and comparing protein families to the Conserved Domain Database using rpsblast, protein families are grouped into cassettes. Cassettes are defined as multiple protein families that can be found together on contigs.

    More details are given in the related publication:
    "Each protein family was classified according to the list of contigs that encoded it. Next, all of the protein families were compared, seeing how many of those occurred on common contigs. A given pair of protein coding families was grouped into a cassette when the smaller of the two families was found on a shared contig at least 80% of the time. This process was performed iteratively, recalculating the overlap scores after each pair of protein families was merged together. In subsequent iterations, protein families could also merge in the same way with cassettes that formed earlier.

    If a pair of proteins formed a cassette found on multiple contigs, we expect shared ORFs to be in the same relative orientations. To calculate the consistency of orientation across contigs, we used a simple co-orientation score, calculated in the following way. Any two genes have four possible relative orientations. For every pair of protein clusters in a module, we calculate the proportion of contigs that contain the orientation found most commonly."
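
    The merging rule and the co-orientation score quoted above can be sketched in Python as follows. This is an illustration under stated assumptions, not the published implementation: the function names are hypothetical, the most strongly overlapping pair is merged first, and a merged cassette is assumed to inherit the union of its members' contigs. The publication excerpt does not specify either of the last two choices.

```python
from collections import Counter
from itertools import combinations


def overlap_score(contigs_a, contigs_b):
    """Fraction of the smaller group's contigs that are shared with the other."""
    smaller = min(len(contigs_a), len(contigs_b))
    return len(contigs_a & contigs_b) / smaller if smaller else 0.0


def build_cassettes(family_contigs, min_overlap=0.8):
    """Iteratively merge protein families (or cassettes) that co-occur.

    family_contigs: dict mapping family name -> set of contig identifiers.
    Returns a list of sets of family names, one set per cassette.
    """
    groups = [({name}, set(contigs)) for name, contigs in family_contigs.items()]
    while True:
        best = None
        # Scores are recalculated from scratch after every merge.
        for i, j in combinations(range(len(groups)), 2):
            score = overlap_score(groups[i][1], groups[j][1])
            if score >= min_overlap and (best is None or score > best[0]):
                best = (score, i, j)
        if best is None:
            break
        _, i, j = best
        # Assumption: the merged cassette keeps the union of contigs.
        merged = (groups[i][0] | groups[j][0], groups[i][1] | groups[j][1])
        groups = [g for k, g in enumerate(groups) if k not in (i, j)] + [merged]
    return [members for members, _ in groups if len(members) > 1]


def co_orientation_score(orientations):
    """Proportion of contigs showing the most common relative orientation.

    orientations: one label per contig for a given pair of families.
    """
    counts = Counter(orientations)
    return counts.most_common(1)[0][1] / len(orientations)


families = {
    "A": {"c1", "c2", "c3", "c4", "c5"},
    "B": {"c1", "c2", "c3", "c4"},
    "C": {"c9"},
    "D": {"c5", "c6"},
}
cassettes = build_cassettes(families)
```

    In this toy input, families A and B share every contig of the smaller family and merge, while D shares only half of its contigs with the resulting cassette and stays separate.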

  • QIIMER ( Algorithmic software component )

    "This package provides R functions for (1) reading QIIME output files and (2) creating figures from the resultant data frames."

  • Simple Levenshtein alignment and distance calculation ( Algorithmic software component )

    Simple Levenshtein alignment and distance calculation with ends-free and reduced homopolymer gap costs
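
    To show what these two modifications mean in practice, here is a minimal Python sketch of an edit distance with free end gaps and cheaper homopolymer gaps. It is not the component's own code. The function names, the default homopolymer gap cost of 0.5, and the choice to make only overhangs of the subject sequence free are assumptions made for this example.

```python
def gap_cost(seq, k, homopolymer_gap):
    """Cost of leaving seq[k] unaligned; cheaper if it repeats the previous base."""
    if k > 0 and seq[k] == seq[k - 1]:
        return homopolymer_gap
    return 1.0


def ends_free_distance(query, subject, homopolymer_gap=0.5):
    """Edit distance of `query` against the best-matching stretch of `subject`.

    Unaligned subject bases before and after the match are free (ends-free).
    Mismatches cost 1, ordinary gaps cost 1, and a gap opposite a base that
    repeats its predecessor costs `homopolymer_gap` (an assumed value).
    """
    # prev[j]: cheapest alignment of the query prefix ending at subject[:j].
    prev = [0.0] * (len(subject) + 1)  # leading subject overhang is free
    for i in range(1, len(query) + 1):
        cur = [prev[0] + gap_cost(query, i - 1, homopolymer_gap)]
        for j in range(1, len(subject) + 1):
            substitute = prev[j - 1] + (query[i - 1] != subject[j - 1])
            skip_query = prev[j] + gap_cost(query, i - 1, homopolymer_gap)
            skip_subject = cur[j - 1] + gap_cost(subject, j - 1, homopolymer_gap)
            cur.append(min(substitute, skip_query, skip_subject))
        prev = cur
    return min(prev)  # trailing subject overhang is free


exact_inside = ends_free_distance("ACGT", "TTACGTTT")
extra_repeat = ends_free_distance("ACGT", "ACGGT")
```

    Here `exact_inside` is 0 because the flanking bases of the subject are not penalized, and `extra_repeat` is 0.5 because the only difference is one extra G in a run of Gs.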

Last updated: 2013-04-02T12:41:01.718-04:00

Copyright © 2016 by the President and Fellows of Harvard College
The eagle-i Consortium is supported by NIH Grant #5U24RR029825-02 / Copyright 2016