Analyzing Tabular Biological Data in R
Overview
This module shows you how to manipulate tables of data using R. On completion, you will be able to replace Excel or other spreadsheet programs with R and gain efficiency in your data analysis. You will also learn about the ‘data frame’, a data structure used extensively in almost all R applications, and ‘dplyr’, a powerful and versatile R library.
Questions One Can Answer after This Module

You have an Excel file with RNA-seq data from four samples along with gene IDs, and another file with gene annotation details. i) Find the expression level of gene Hox1a by merging the two tables. ii) Find the expression levels of all Hox genes. iii) Find the average expression level of all transcription factors. iv) Compare the expression levels of transcription factors and signaling genes by plotting them as histograms.

Pokemon are comic characters. We will use them as an example to understand table processing, but the same methods can be used for any biological dataset as well. Load the pokemon.csv file from the given link and i) Find the heaviest and lightest Pokemon. ii) Compare the weights and heights of all Pokemon. Do you see any pattern? (Hint: the pattern becomes clear if you take the log of one axis.) iii) Combine the Pokemon types from the second table, and find the average weight of Pokemon of each type.

Load the csv file containing information for all bacterial genomes available from NCBI. i) Which bacterium has the largest genome? ii) Compare the GC contents of Proteobacteria, Firmicutes and Actinobacteria. Which group has the highest GC content?

Load the csv file containing the results of all international soccer matches since the 1850s. i) What are the worst losses of the Brazilian team? ii) Do teams win more frequently when they play at home than away? Decide using the history of all of Brazil’s matches.
Module Length
Three sessions of 2 hours each, or two sessions of 3 hours each.
Topics
 From vector to data frame in R,
 dplyr library commands: select, filter, mutate, arrange,
 Reading and writing data from Excel and csv files,
 Joining multiple tables using dplyr.
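To give a flavor of the first two topics, here is a minimal sketch that builds a data frame from plain vectors and then chains the four dplyr verbs. The gene names and expression values are made up for illustration, and the column names (sample1, sample2, mean_expr) are assumptions, not part of the course material.

```r
library(dplyr)

# Hypothetical toy data: plain vectors combined into a data frame
gene    <- c("geneA", "geneB", "geneC", "geneD")
sample1 <- c(12.5, 0.8, 45.0, 3.2)
sample2 <- c(10.1, 1.2, 50.3, 2.9)
expr <- data.frame(gene, sample1, sample2, stringsAsFactors = FALSE)

result <- expr %>%
  mutate(mean_expr = (sample1 + sample2) / 2) %>%  # add a computed column
  filter(mean_expr > 2) %>%                        # keep rows above a threshold
  arrange(desc(mean_expr)) %>%                     # sort, highest first
  select(gene, mean_expr)                          # keep only two columns

print(result)
```

Each verb takes a data frame and returns a data frame, which is why the steps can be chained with the pipe in the order you would describe them in words.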
First session

Introduction to data frames.

Reading and writing external CSV/Excel files.

Summary statistics.

Learning about inbuilt datasets in R.

Other data types: list, matrix, factor.

Linear algebra in R.
Second session

Introduction to dplyr library.

dplyr functions: select, filter, mutate, arrange.

Practice data analysis using dplyr.
Third session

Joining multiple tables using dplyr and extracting information.

Practice data analysis on multiple tables using dplyr.
Details
Researchers with a biology background are often introduced to R through Bioconductor, while they continue to use Excel for common spreadsheet-type operations. This creates a disconnect and also makes learning R difficult.
R is actually far more powerful than Excel or other spreadsheet programs for manipulating data tables. This is especially true when one needs to combine information from multiple spreadsheets to perform an analysis, where using R saves a significant amount of time. As an example, researchers in genomics often have gene expression data in one table and the annotations in a different table, and they need to combine them. Such tasks are straightforward with the dplyr library in R.
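As a rough illustration of the genomics case, the sketch below joins a small expression table to an annotation table on a shared gene ID and then answers two questions on the merged result. All values, the gene IDs, and the column names (gene_id, gene_name, category) are hypothetical. Only the gene name Hox1a comes from the questions listed earlier.

```r
library(dplyr)

# Hypothetical expression table: one row per gene ID
expression <- data.frame(
  gene_id = c("g1", "g2", "g3", "g4"),
  sample1 = c(5.0, 12.0, 0.5, 8.0),
  sample2 = c(6.0, 10.0, 0.7, 9.0),
  stringsAsFactors = FALSE
)

# Hypothetical annotation table: g4 is missing here, g5 has no expression data
annotation <- data.frame(
  gene_id   = c("g1", "g2", "g3", "g5"),
  gene_name = c("Hox1a", "Hox2b", "Sox9", "Wnt3"),
  category  = c("transcription factor", "transcription factor",
                "transcription factor", "signaling"),
  stringsAsFactors = FALSE
)

# Keep every expression row; unmatched annotations become NA
merged <- left_join(expression, annotation, by = "gene_id")

# All genes whose name starts with "Hox"
hox <- merged %>% filter(grepl("^Hox", gene_name))

# Average expression of transcription factors in each sample
tf_mean <- merged %>%
  filter(category == "transcription factor") %>%
  summarise(mean_sample1 = mean(sample1), mean_sample2 = mean(sample2))

print(hox)
print(tf_mean)
```

left_join is used here so that genes without an annotation stay in the table with NA in the annotation columns. inner_join would drop them instead, which may or may not be what an analysis needs.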
In this module you will load data from existing “.csv” or “.xls” files, complete the analysis, and then save the results in another spreadsheet. Along the way you will work with the ‘data frame’, a data structure used extensively in almost all R libraries, and with ‘dplyr’, a powerful and versatile R library.
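The load, analyze, save cycle might look like the following sketch, which uses only base R. To stay self-contained it first writes a small made-up table to a temporary file. The file paths, the names, and the weight_kg column are assumptions for illustration, not the course datasets.

```r
# Create a small hypothetical input file so the sketch runs on its own
input_path <- tempfile(fileext = ".csv")
toy <- data.frame(
  name      = c("alpha", "beta", "gamma"),
  weight_kg = c(6.9, 90.5, 0.1),
  stringsAsFactors = FALSE
)
write.csv(toy, input_path, row.names = FALSE)

# Load: read the csv file into a data frame
tbl <- read.csv(input_path, stringsAsFactors = FALSE)

# Analyze: summary statistics, then the heaviest and lightest rows
print(summary(tbl$weight_kg))
heaviest <- tbl[which.max(tbl$weight_kg), ]
lightest <- tbl[which.min(tbl$weight_kg), ]

# Save: write the result to another csv file
output_path <- tempfile(fileext = ".csv")
write.csv(rbind(heaviest, lightest), output_path, row.names = FALSE)

# Read the saved file back to confirm what was written
saved <- read.csv(output_path, stringsAsFactors = FALSE)
print(saved)
```

Excel files need an add-on package. One common option is read_excel from the readxl package, which also returns a data frame, so the analysis steps stay the same.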
Cost
$99 for premium members, $125 for others.
Class Style
The classes will be conducted through online interactive chat sessions.
Testimonials
You can read the testimonials for our summer R classes here.
Register
Please sign up for the module at the following link. At present, no payment is necessary to register.