NGS - RNAseq Analysis using R
Overview
RNAseq is a major application of high-throughput sequencing technologies (NGS), but researchers in biology often struggle with the data analysis. This module teaches you to use various R-based tools for exploring RNAseq data. It is designed for those with a biology background, not for computer scientists.
You will learn to combine tables of data and annotations and to find the expression levels for a gene or a subset of genes (e.g. all transcription factors). You will also perform differential expression analysis to identify genes that are over-expressed or under-expressed under a condition. Finally, you will learn to use Kallisto to map raw short read data (FASTQ) onto the annotated genes to find expression levels.
The module is taught live online in three sessions using a chat-based platform. Every week, we will send out class materials on the prior Tuesday so that you get enough time to go over the materials. The classes will run for two hours, and then you have an additional hour to practice the tools and solve problems.
Prerequisite
Basic familiarity with the concepts of vectors and data frames is needed. This is covered in R for Biology - Core Concepts.
Module Length
Three sessions of 2 hours each.
Sessions
First Session
1. Reading external CSV/Excel files into R and writing results back out.
2. Using the dplyr library (select, filter, mutate, arrange) to extract rows or columns from the file.
3. Joining multiple RNAseq data sets using dplyr and extracting information.
4. Generating summary statistics and plots of RNAseq data.
Second Session
1. Differential expression analysis (DESeq2, edgeR, limma-voom, sleuth).
2. Clustering and heatmap plots of differentially expressed genes.
Third Session
1. Introduction to using the Linux operating system.
2. Mapping of Illumina short read data (FASTQ) onto annotated genes using Kallisto.
Questions You Will Be Able to Answer after This Module
- You have an Excel file with RNAseq data from four samples along with gene IDs, and another file with gene annotation details. i) Find the expression level of gene Hox1a by merging the two tables. ii) Find the expression levels of all Hox genes. iii) Find the average expression level for all transcription factors. iv) Compare the expression of transcription factors and signaling genes by plotting histograms.
- You have an Excel file with RNAseq data from four samples along with gene IDs, and another file with gene annotation details. i) Find the differentially expressed genes in your experiment. ii) Plot heatmaps for those differentially expressed genes and cluster similarly expressed genes.
- You have short read data sets (FASTQ) from an NGS experiment and also the annotated genes from the organism. i) Generate an Excel or csv table with expression levels for the various genes.
Details
About RNAseq: RNAseq is a major application of high-throughput sequencing technologies (NGS). In these experiments, researchers extract RNA from different samples (e.g. heart versus brain, cancer versus normal, root versus plant, treatment versus control), convert it into cDNA and sequence it to find out which genes are differentially expressed.
Common experience: Biologists often struggle with data analysis for RNAseq experiments. The sources of frustration come in two flavors.
(i) NGS experiments produce massive data files, which the core facilities map onto available annotations before sending the biologists tables of expression data. Those expression tables need to be combined with other available information on the genes (e.g. annotations, data from previous experiments, data from other organisms) to find biologically meaningful information.
(ii) Researchers receive the massive short read data files themselves, and they need to complete the mapping step first and then everything in (i).
Excel versus R: Researchers with a biology background traditionally use Excel for data analysis, but R is far more productive than Excel for RNAseq data analysis involving multiple tables. In particular, when one needs to combine information from multiple spreadsheets to perform an analysis, using R saves a significant amount of time. As an example, researchers in genomics often have gene expression data in one table and the annotations in a different table, and they need to combine them. Such tasks are straightforward with the dplyr library in R.
This module shows you how to manipulate tables of data using R. On completion, you will be able to replace Excel or other spreadsheet programs with R and gain efficiency in your data analysis. You will load data from existing ".csv" or ".xls" files, complete the analysis and then save the results in another spreadsheet. You will also learn to use the dplyr library.
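As a rough illustration of the kind of table manipulation described above, here is a minimal sketch using dplyr. The two data frames, their column names (gene_id, gene_name, category, sample1 to sample4) and all values are made up for this example; in practice you would load your own tables with read.csv() or a similar function.

```r
library(dplyr)

# Toy expression table: made-up values for four samples (not real data).
expression <- data.frame(
  gene_id = c("g1", "g2", "g3", "g4"),
  sample1 = c(10, 0, 25, 40),
  sample2 = c(12, 1, 30, 38),
  sample3 = c(50, 0, 22, 41),
  sample4 = c(55, 2, 27, 39)
)

# Toy annotation table: hypothetical gene names and categories.
annotation <- data.frame(
  gene_id   = c("g1", "g2", "g3", "g4"),
  gene_name = c("Hox1a", "Hox2b", "Wnt3", "Actb"),
  category  = c("transcription factor", "transcription factor",
                "signaling", "other")
)

# Combine the two tables on the shared gene ID column.
combined <- inner_join(expression, annotation, by = "gene_id")

# Expression levels of a single gene.
hox1a <- filter(combined, gene_name == "Hox1a")

# Expression levels of all genes whose name starts with "Hox".
hox_genes <- filter(combined, grepl("^Hox", gene_name))

# Average each gene over the four samples, then average per category.
category_means <- combined %>%
  mutate(mean_expr = (sample1 + sample2 + sample3 + sample4) / 4) %>%
  group_by(category) %>%
  summarise(mean_expr = mean(mean_expr))

# Save the result as a csv file (written to a temporary folder here).
out_file <- file.path(tempdir(), "category_means.csv")
write.csv(category_means, out_file, row.names = FALSE)
```

The same pattern (join, filter, summarise, write) applies to larger tables; only the file names and column names change.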
Statistical Analysis of differentially expressed genes: Identifying differentially expressed genes is a key step in RNAseq data analysis. You will learn about differential expression analysis libraries (DESeq2, edgeR, sleuth) and about clustering that subset of genes.
Mapping: Finally, those who receive the raw data sets need to map them onto the annotated genes to determine the expression levels. This step used to be time-consuming, but not any more, thanks to tools like Kallisto. You will learn to use Kallisto to convert raw data into expression levels.
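To give a flavor of the last step, the sketch below turns Kallisto output into a csv table in R. The kallisto commands in the comments use placeholder file names. So that the example runs on its own, it first writes a small stand-in for Kallisto's abundance.tsv with made-up values; with real data you would skip that part and point read.delim() at the file in your own output folder.

```r
# Typical Kallisto commands, run in a Linux shell (file names are placeholders):
#   kallisto index -i transcripts.idx transcripts.fa
#   kallisto quant -i transcripts.idx -o sample1_out reads_1.fastq.gz reads_2.fastq.gz

# Stand-in for Kallisto's abundance.tsv so that this sketch is self-contained.
# The column names follow Kallisto's output; the values are made up.
out_dir <- file.path(tempdir(), "sample1_out")
dir.create(out_dir, showWarnings = FALSE)
toy_abundance <- data.frame(
  target_id  = c("tx1", "tx2", "tx3"),
  length     = c(1200, 800, 1500),
  eff_length = c(1021, 621, 1321),
  est_counts = c(100, 0, 250),
  tpm        = c(340000, 0, 660000)
)
write.table(toy_abundance, file.path(out_dir, "abundance.tsv"),
            sep = "\t", quote = FALSE, row.names = FALSE)

# Read the tab-separated abundance table.
abundance <- read.delim(file.path(out_dir, "abundance.tsv"))

# Keep the identifier and the two expression measures.
expression_table <- abundance[, c("target_id", "est_counts", "tpm")]

# Save as a csv file that can be opened in Excel.
csv_file <- file.path(out_dir, "expression_levels.csv")
write.csv(expression_table, csv_file, row.names = FALSE)
```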
Class Style
The classes will be conducted through online interactive chat sessions.
Testimonials
You can read the testimonials for our summer R classes here.
Cost
$125 (20% discount for premium members).
Register
Please sign up for the module at the following link. At present no payment is necessary to register.