Harvard School of Public Health
PH525x: Data Analysis for Genomics
Rafa Irizarry, PhD
Professor of Biostatistics
Department of Biostatistics
Harvard School of Public Health
Michael Love, PhD
Department of Biostatistics
Harvard School of Public Health
- PH207x: Health in Numbers: Quantitative Methods in Clinical and Public Health Research. This is another HarvardX course, which we recommend but which is not a strict prerequisite.
- Basic programming skills. We will assume that the students are familiar with very basic programming concepts (variables, functions).
- Familiarity with the R language. The course will use R in order to demonstrate data analyses. We will introduce software from the Bioconductor project, but will not cover a basic introduction to the R language. See below for online R resources. Students can try the Self-assessment included in Pre-Week in the Courseware. Note that this assessment is not counted toward the class.
- Software for Data Analysis: Programming with R (Statistics and Computing) by John M. Chambers (Springer)
- S Programming (Statistics and Computing) by Brian D. Ripley and William N. Venables (Springer)
- Programming with Data: A Guide to the S Language by John M. Chambers (Springer)
Online R resources:
- R reference card (PDF) by Tom Short (more can be found under Short Documents and Reference Cards here)
- Quick-R: quick online reference for data input, basic statistics and plots
- Thomas Girke's R & Bioconductor manuals
Course Description and Learning Objectives
The purpose of this course is to enable students to analyze and interpret data generated by modern genomics technology, specifically microarray data and next generation sequencing data. We will focus on applications common in public health and biomedical research: measuring gene expression differences between populations, associating genomic variants with disease, measuring epigenetic marks such as DNA methylation, and identifying transcription factor binding sites.
The course covers the statistical concepts needed to properly design the experiments and analyze the high-dimensional data produced by these technologies. These include exploratory data analysis, estimation, hypothesis testing, multiple comparison corrections, modeling, linear models, principal component analysis, clustering, and nonparametric and Bayesian techniques. Along the way, students will learn to analyze data using the R programming language and several packages from the Bioconductor project.
Currently, biomedical research groups around the world are producing more data than they can handle. The training and skills acquired by taking this course will be of significant practical use for these groups. The learning that will take place in this course will set them up for greater success in making biological discoveries and improving individual and population health.
Please note that we will not cover: population genetics, comparative genomics, sequence alignment algorithms, systems biology, genome assembly, Python, Perl.
Upon completion of this course, you should be able to:
- Parse the statistical methods section of a genomics paper
- Formulate and perform statistical tests on various types of genomic data
- Describe and account for known technical artifacts
- Explore patterns among samples and among genes in high-dimensional genomic data
You will be assessed at the end of each section (each week is divided into several sections). You are given one attempt per set of multiple choice questions and are then provided with feedback about the correct answer. You must score overall at or above 70% in order to pass the course and earn either the Honor Code or Verified Certificate of Achievement.
You may choose to audit this course or to earn either a Verified Certificate of Achievement or the edX Honor Code Certificate of Achievement. You can change your enrollment option on the course site, or get a refund (if applicable) before April 18, 2014 by contacting email@example.com and providing the email address used to register for the course.
- Audit: By auditing this course, you will have access to videos, labs, assessments and the discussion board and can participate as much, or as little, as you like. There is no penalty for registering for PH525x and not completing the assessments.
- Honor Code: An Honor Code Certificate of Achievement certifies that you have successfully completed a course, but does not verify your identity. Honor code certificates are currently free.
- Verified: A Verified Certificate of Achievement shows that you have successfully completed your edX course and verifies your identity through your photo and ID. Verified certificates are available for PH525x for $250.
All questions should be made on the discussion board under the course discussion topic: “Questions about the Course”. Given the high enrollment in the class, emails sent directly to the course teaching staff will not be answered.
Course content will be discussed on a weekly basis with the following schedule:
SESSION DETAILS BY WEEK
Week 1: Introduction (April 7-April 13)
This week we will take a brief look at the biology behind the data to be analyzed throughout the course. We will look at the genomic endpoints to be measured with high-throughput technology and how they fit into the central dogma. We will also give a review of the R programming language.
- Think critically about what should be measured in genomic studies and why.
- Read, compute and display several summary statistics and plots.
- Solve new problems by navigating the help system as well as online resources.
- What we measure and why
- R programming skills
- Probability distributions
- Exploratory data analysis
Week 2: Measurement technology (April 14-April 20)
This week we will explore a concrete example of genomics data analysis by looking at two major high-throughput technologies used for genomics measurement: microarrays and next generation sequencers. We will learn about exploratory data analysis, apply it to real data, and look at some background material on the measured output: gene expression.
- Explain the basics of how microarray and next generation sequencing technologies work.
- Enumerate several applications of the technologies in public health and biomedical sciences.
- Compare and contrast software for processing next generation sequencing data.
- Microarray technology
- Next generation sequencing technology
- Working with data in R
Week 3: Inference (April 21-April 27)
This week we will introduce basic statistical concepts related to inference, including populations, parameters, random sampling, estimation and hypothesis testing. We will explore how these concepts are useful for detecting differentially expressed genes. Finally, we will look at linear models and at how linear algebra simplifies their notation.
- State the purpose of and assumptions of statistical inference.
- Use Monte Carlo simulation in evaluating statistical estimators.
- Define the following core concepts: p-value, statistical power, Type I and II error.
- Write a linear model describing group effects using matrix notation.
- Use R formulas to create matrices which represent various experimental designs.
- Linear models
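As a rough illustration of the last two objectives, the sketch below builds a design matrix from an R formula and fits a two-group linear model in matrix form. The group labels, sample sizes, and expression values are made up for the example and are not course data.

```r
# Hypothetical two-group design: three control and three treated samples.
group <- factor(rep(c("control", "treated"), each = 3))

# model.matrix() expands the formula into the design matrix X of the
# linear model Y = X beta + error (intercept column plus a group indicator).
X <- model.matrix(~ group)
print(X)

# Simulated expression values for a single gene (assumed numbers, fixed seed).
set.seed(1)
y <- c(rnorm(3, mean = 5), rnorm(3, mean = 7))

# Least-squares estimate from the normal equations: (X'X)^{-1} X'y.
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y

# lm() fits the same model; its coefficients should agree with beta_hat.
fit <- lm(y ~ group)
print(cbind(matrix_form = as.numeric(beta_hat), lm = unname(coef(fit))))
```

The second coefficient estimates the difference in mean expression between the treated and control groups.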
Week 4: Background correction and normalization (April 28-May 4)
Genomics data is usually affected by systematic bias and unwanted variability. Without proper normalization, it can often be impossible to compare across samples. This week we will take a look at statistical models in the context of genomics data. In particular, we will explore how to describe and adjust for background noise and how to normalize the data.
- Describe the typical distribution and joint distribution of microarray data.
- Distinguish between several approaches to background adjustments.
- Identify the impact of background noise on microarray data.
- Explain why normalization is necessary for proper interpretation of genomics data.
- Describe the different normalization approaches for various microarray technologies.
- Enumerate special cases for which current normalization techniques are not appropriate.
Week 5: Distance, clustering, and prediction (May 5-May 11)
Batch effects are currently a major challenge in genomics data analysis. This week we will look at statistical concepts used in detecting batch effects via exploration, including distances, dimension reduction techniques, and clustering. We will also present basic concepts and techniques for prediction and machine learning.
- Describe how confounding variables might adversely affect a scientific study.
- Distinguish between various distance metrics for comparing samples and comparing genes.
- Describe the usefulness of clustering for identifying inherent groups in a dataset.
- Describe the basic goal or task in a prediction setting.
- Distance and clustering
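The idea of comparing samples by distance and then clustering them can be sketched with base R functions. The matrix below is simulated, and the shift applied to half of the samples is an assumption standing in for a real group or batch difference.

```r
# Simulated expression matrix: 50 genes (rows) by 6 samples (columns).
set.seed(1)
x <- matrix(rnorm(50 * 6), nrow = 50)
colnames(x) <- paste0("sample", 1:6)

# Assumed shift in the last three samples to mimic a group difference.
x[, 4:6] <- x[, 4:6] + 2

# dist() works on rows, so transpose to get Euclidean distances between samples.
d <- dist(t(x))

# Hierarchical clustering (complete linkage by default), cut into two groups.
hc <- hclust(d)
clusters <- cutree(hc, k = 2)
print(clusters)
```

Calling `plot(hc)` draws the dendrogram, which is a common first look at whether samples group by biology or by some unwanted factor.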
Week 6: Batch effects (May 12-May 18)
In this session, we will explore some useful statistical ideas for accounting for batch effects as well as looking at specific solutions for batch effects.
- Identify and remove batch effects in microarray data.
- Describe the usefulness of PCA in finding batch effects in high dimensional data.
- Statistical solutions to batch effects
- Applying batch effects solutions
Week 7: Advanced differential expression (May 19-May 25)
In this module we will look at advanced topics related to differential expression such as correcting for multiple comparisons, empirical Bayes approaches for stabilizing estimates, and the use of permutations to avoid parametric assumptions. We will also conduct gene set enrichment analysis, defined as performing differential expression analysis for groups of genes at a time.
- Use empirical Bayes methods in order to improve gene-level estimates and hypothesis tests.
- Define the false discovery rate, and explain why such a concept is useful for genomic scientists.
- Explain how gene sets provide an additional layer of interpretation on genomic data.
- Map various genomic and technology annotations to each other.
- Hierarchical modeling
- Multiple comparisons
- Gene set testing
- Gene and technology annotations
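To make the false discovery rate objective more concrete, here is a minimal sketch using base R's `p.adjust` with the Benjamini-Hochberg method. The p-values are simulated under an assumed mix of 900 null genes and 100 genes with true effects; they do not come from any course dataset.

```r
# Simulated p-values for 1000 genes (assumed mixture, fixed seed).
set.seed(1)
p_null <- runif(900)          # null genes: p-values uniform on [0, 1]
p_alt  <- rbeta(100, 1, 50)   # non-null genes: p-values concentrated near 0
p <- c(p_null, p_alt)

# Benjamini-Hochberg adjusted p-values, which control the false discovery rate.
q <- p.adjust(p, method = "BH")

# Compare the number of genes called at 0.05 before and after adjustment.
n_raw <- sum(p < 0.05)
n_fdr <- sum(q < 0.05)
print(c(raw = n_raw, fdr = n_fdr))
```

Thresholding raw p-values at 0.05 would be expected to flag about 45 of the 900 null genes by chance alone, which is the motivation for adjusting when thousands of genes are tested at once.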
Week 8: Advanced topics (May 26-June 1)
In this session, we will explore the main Bioconductor tools for analyzing next generation sequencing, including methods for analyzing these data and detecting systematic bias. You will have the opportunity to focus on one of four data analysis pipelines: 1) detecting genomic variants, 2) finding differential expression with RNA-seq, 3) finding differentially methylated regions, and 4) peak detection with ChIP-seq data.
- Enumerate what kind of systematic errors can arise in NGS data.
- Visualize genomic data such as annotated ranges or experiment data within Bioconductor and using web or stand-alone tools.
- Use Bioconductor packages to create a table of the counts of reads overlapping genes for different samples.
- Perform analyses using Bioconductor: variant calling, RNA-Seq analysis, DNA methylation analysis, ChIP-Seq analysis
- Manipulating NGS data using Bioconductor
- Genome variation
- RNA sequencing
- DNA methylation
- ChIP sequencing
All assignments must be completed before June 13, 2014 at 5:00 UTC.